COMPUTER ORGANIZATION AND ARCHITECTURE

V. Rajaraman
Honorary Professor, Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore

T. Radhakrishnan
Professor of Computer Science and Software Engineering, Concordia University, Montreal, Canada

New Delhi-110001 2007

Rs. 295.00

COMPUTER ORGANIZATION AND ARCHITECTURE
V. Rajaraman and T. Radhakrishnan

© 2007 by Prentice-Hall of India Private Limited, New Delhi. All rights reserved. No part of this book may be reproduced in any form, by mimeograph or any other means, without permission in writing from the publisher.

ISBN-978-81-203-3200-3

The export rights of this book are vested solely with the publisher.

Published by Asoke K. Ghosh, Prentice-Hall of India Private Limited, M-97, Connaught Circus, New Delhi-110001 and Printed by Rajkamal Electric Press, B-35/9, G.T. Karnal Road Industrial Area, Delhi-110033.

CONTENTS

Preface  xi

1. Computer Systems—A Perspective  1–13
   Learning Objectives  1
   1.1 Introduction  1
   1.2 A Programmer's View of a Computer System  3
   1.3 Hardware Designer's View of a Computer System  5
   1.4 Objectives of the Computer Architect  7
   1.5 Some Invariant Principles in Computer Design  9
   Summary  11
   Exercises  12

2. Data Representation  14–38
   Learning Objectives  14
   2.1 Introduction  14
   2.2 Numbering Systems  17
   2.3 Decimal to Binary Conversion  19
   2.4 Binary Coded Decimal Numbers  23
       2.4.1 Weighted Codes  25
       2.4.2 Self-Complementing Codes  25
       2.4.3 Cyclic Codes  26
       2.4.4 Error Detecting Codes  28
       2.4.5 Error Correcting Codes  29
   2.5 Hamming Code for Error Correction  30
   2.6 Alphanumeric Codes  32
       2.6.1 ASCII Code  33
       2.6.2 Indian Script Code for Information Interchange (ISCII)  34
   Summary  35
   Exercises  37

3. Basics of Digital Systems  39–97
   Learning Objectives  39
   3.1 Boolean Algebra  40
       3.1.1 Postulates of Boolean Algebra  40
       3.1.2 Basic Theorems of Boolean Algebra  41
       3.1.3 Duality Principle  42
       3.1.4 Theorems  42
   3.2 Boolean Functions and Truth Tables  43
       3.2.1 Canonical Forms for Boolean Functions  44
   3.3 Binary Operators and Logic Gates  46
   3.4 Simplifying Boolean Expressions  48
   3.5 Veitch–Karnaugh Map Method  50
       3.5.1 Four-Variable Karnaugh Map  54
   3.6 NAND and NOR Gates  60
   3.7 Design of Combinatorial Circuits with Multiplexers  64
   3.8 Programmable Logic Devices  70
       3.8.1 Realization with FPLAs  70
       3.8.2 Realization with PALs  72
   3.9 Sequential Switching Circuits  74
   3.10 A Basic Sequential Circuit  74
   3.11 Flip-Flops  77
   3.12 Counters  85
       3.12.1 A Binary Counter  85
       3.12.2 Synchronous Binary Counter  86
   3.13 Shift Registers  88
   Summary  92
   Exercises  96

4. Arithmetic and Logic Unit–I  98–147
   Learning Objectives  98
   4.1 Introduction  98
   4.2 Binary Addition  99
   4.3 Binary Subtraction  101
   4.4 Complement Representation of Numbers  103
   4.5 Addition/Subtraction of Numbers in 1's Complement Notation  104
   4.6 Addition/Subtraction of Numbers in Two's Complement Notation  107
   4.7 Binary Multiplication  109
   4.8 Multiplication of Signed Numbers  112
   4.9 Binary Division  113
   4.10 Integer Representation  116
   4.11 Floating Point Representation of Numbers  117
       4.11.1 Binary Floating Point Numbers  120
       4.11.2 IEEE Standard Floating Point Representation  123
   4.12 Floating Point Addition/Subtraction  127
       4.12.1 Floating Point Multiplication  129
       4.12.2 Floating Point Division  130
   4.13 Floating Point Arithmetic Operations  130
   4.14 Logic Circuits for Addition/Subtraction  132
       4.14.1 Half and Full-Adder Using Gates  133
       4.14.2 A Four-bit Adder  135
       4.14.3 MSI Arithmetic Logic Unit  139
   4.15 A Combinatorial Circuit for Multiplication  142
   Summary  143
   Exercises  145

5. Arithmetic Logic Unit–II  148–170
   Learning Objectives  148
   5.1 Introduction  148
   5.2 Algorithmic State Machine  149
   5.3 Algorithmic Representation of ASM Charts  157
   5.4 Designing Digital Systems Using ASM Chart  159
   5.5 Floating Point Adder  165
   Summary  168
   Exercises  169

6. Basic Computer Organization  171–204
   Learning Objectives  171
   6.1 Introduction  171
   6.2 Memory Organization of SMAC+  172
   6.3 Instruction and Data Representation of SMAC+  173
   6.4 Input/Output for SMAC+  177
   6.5 Instruction Set of SMAC+  177
       6.5.1 Instruction Set S1 of SMAC+  178
       6.5.2 Instruction Formats of SMAC+  178
   6.6 Assembling the Program into Machine Language Format  180
   6.7 Simulation of SMAC+  181
   6.8 Program Execution and Tracing  183
   6.9 Expanding the Instruction Set  185
   6.10 Vector Operations and Indexing  188
   6.11 Stacks  190
   6.12 Modular Organization and Developing Large Programs  193
   6.13 Enhanced Architecture—SMAC++  197
       6.13.1 Modifications in the Instruction Formats for SMAC++  199
   6.14 SMAC++ in a Nutshell  200
   Summary  201
   Exercises  202

7. Central Processing Unit  205–240
   Learning Objectives  205
   7.1 Introduction  205
   7.2 Operation Code Encoding and Decoding  207
   7.3 Instruction Set and Instruction Formats  210
       7.3.1 Instruction Set  211
       7.3.2 Instruction Format  212
   7.4 Addressing Modes  216
       7.4.1 Base Addressing  217
       7.4.2 Segment Addressing  218
       7.4.3 PC Relative Addressing  219
       7.4.4 Indirect Addressing  219
       7.4.5 How to Encode Various Addressing Modes  220
   7.5 Register Sets  221
   7.6 Clocks and Timing  223
   7.7 CPU Buses  226
   7.8 Dataflow, Data Paths and Microprogramming  229
   7.9 Control Flow  233
   7.10 Summary of CPU Organization  236
   Summary  238
   Exercises  239

8. Assembly Language Level View of Computer System  241–272
   Learning Objectives  241
   8.1 Introduction  241
   8.2 Registers and Memory  242
   8.3 Instructions and Data  244
   8.4 Creating a Small Program  246
   8.5 Allocating Memory for Data Storage  248
   8.6 Using the Debugger to Examine the Contents of Registers and Memory  250
   8.7 Hardware Features to Manipulate Arrays of Data  252
   8.8 Stacks and Subroutines  256
   8.9 Arithmetic Instructions  260
   8.10 Bit Oriented Instructions  261
   8.11 Input and Output  263
   8.12 Macros in Assembly Language  265
   8.13 Instruction Set View of Computer Organization  268
   8.14 Architecture and Instruction Set  269
   Summary  271
   Exercises  271

9. Memory Organization  273–296
   Learning Objectives  273
   9.1 Introduction  273
   9.2 Memory Parameters  274
   9.3 Semiconductor Memory Cell  276
       9.3.1 Dynamic Memory Cell  277
       9.3.2 Static Memory Cell  277
       9.3.3 Writing Data in Memory Cell  278
       9.3.4 Reading the Contents of Cell  279
   9.4 IC Chips for Organization of RAMs  280
   9.5 2D Organization of Semiconductor Memory  282
   9.6 2.5D Organization of Memory Systems  284
   9.7 Dynamic Random Access Memory  286
   9.8 Error Detection and Correction in Memories  289
   9.9 Read Only Memory  290
   9.10 Dual-Ported RAM  293
   Summary  294
   Exercises  295

10. Cache and Virtual Memory  297–331
   Learning Objectives  297
   10.1 Introduction  297
   10.2 Enhancing Speed and Capacity of Memories  298
   10.3 Program Behaviour and Locality Principle  299
   10.4 A Two-Level Hierarchy of Memories  301
   10.5 Cache Memory Organization  304
   10.6 Design and Performance of Cache Memory System  315
   10.7 Virtual Memory—Another Level in Hierarchy  318
       10.7.1 Address Translation  319
       10.7.2 How to Make Address Translation Faster  321
       10.7.3 Page Table Size  322
   10.8 Page Replacement Policies  323
       10.8.1 Page Fetching  326
       10.8.2 Page Size  326
   10.9 Combined Operation of Cache and Virtual Memory  327
   Summary  328
   Exercises  330

11. Input-Output Organization  332–380
   Learning Objectives  332
   11.1 Introduction  333
   11.2 Device Interfacing  334
   11.3 Overview of I/O Methods  336
   11.4 Program Controlled Data Transfer  338
   11.5 Interrupt Structures  340
       11.5.1 Single Level Interrupt Processing  341
       11.5.2 Handling Multiple Interrupts  343
   11.6 Interrupt Controlled Data Transfer  344
       11.6.1 Software Polling  344
       11.6.2 Bus Arbitration  345
       11.6.3 Daisy Chaining  346
       11.6.4 Vectored Interrupts  347
       11.6.5 Multiple Interrupt Lines  348
       11.6.6 VLSI Chip Interrupt Controller  349
       11.6.7 Programmable Peripheral Interface Unit  350
   11.7 DMA Based Data Transfer  351
   11.8 Input-Output (I/O) Processors  356
   11.9 Bus Structure  357
       11.9.1 Structure of a Bus  357
       11.9.2 Types of Bus  358
       11.9.3 Bus Transaction Type  358
       11.9.4 Timings of Bus Transactions  358
       11.9.5 Bus Arbitration  361
   11.10 Some Standard Buses  363
   11.11 Serial Data Communication  365
       11.11.1 Asynchronous Serial Data Communication  366
       11.11.2 Asynchronous Communication Interface Adapter (ACIA)  366
       11.11.3 Digital Modems  367
   11.12 Local Area Networks  369
       11.12.1 Ethernet Local Area Network—Bus Topology  369
       11.12.2 Ethernet Using Star Topology  373
       11.12.3 Wireless LAN  374
       11.12.4 Client-Server Computing Using LAN  375
   Summary  376
   Exercises  378

12. Advanced Processor Architectures  381–429
   Learning Objectives  381
   12.1 Introduction  381
   12.2 General Principles Governing the Design of Processor Architecture  382
       12.2.1 Main Determinants in Designing Processor Architecture  382
       12.2.2 General Principles  385
       12.2.3 Modern Methodology of Design  386
       12.2.4 Overall Performance of a Computer System  390
   12.3 History of Evolution of CPUs  391
   12.4 RISC Processors  395
   12.5 Pipelining  397
   12.6 Instruction Pipelining in RISC  400
   12.7 Delay in Pipeline Execution  403
       12.7.1 Delay due to Resource Constraints  403
       12.7.2 Delay due to Data Dependency  405
       12.7.3 Pipeline Delay due to Branch Instructions  407
       12.7.4 Hardware Modification to Reduce Delay due to Branches  408
       12.7.5 Software Method to Reduce Delay due to Branches  412
       12.7.6 Difficulties in Pipelining  414
   12.8 Superscalar Processors  416
   12.9 Very Long Instruction Word (VLIW) Processor  418
   12.10 Some Example Commercial Processors  419
       12.10.1 Power PC 620  420
       12.10.2 Pentium Processor  421
       12.10.3 IA-64 Processor Architecture  423
   Summary  425
   Exercises  426

13. Parallel Computers  430–473
   Learning Objectives  430
   13.1 Introduction  431
   13.2 Classification of Parallel Computers  432
       13.2.1 Flynn's Classification  432
       13.2.2 Coupling between Processing Elements  435
       13.2.3 Classification Based on Mode of Accessing Memory  435
       13.2.4 Classification Based on Grain Size  436
   13.3 Vector Computers  439
   13.4 Array Processors  441
   13.5 Shared Memory Parallel Computers  442
       13.5.1 Synchronization of Processes in Shared Memory Computers  442
       13.5.2 Shared Bus Architecture  446
       13.5.3 Cache Coherence in Shared Bus Multiprocessor  447
       13.5.4 State Transition Diagram for MESI Protocol  449
       13.5.5 A Commercial Shared Bus Parallel Computer  453
       13.5.6 Shared Memory Parallel Computer Using an Interconnection Network  454
   13.6 Distributed Shared Memory Parallel Computers  455
   13.7 Message Passing Parallel Computers  461
   13.8 Cluster of Workstations  465
   13.9 Comparison of Parallel Computers  466
   Summary  468
   Exercises  470

Appendix A: Decision Table Terminology  475–476
Appendix B: Preparation, Programming and Developing an Assembly Language Program  477–482
References  483–485
Index  487–493

PREFACE

This book is an elementary introduction to computer organization and architecture. It does not assume an extensive knowledge of electronics or mathematics. A student in the final year B.Sc. or second year B.Tech or an American sophomore would be able to follow the book. Knowledge of programming in Java or C would be useful to give the student a proper perspective to appreciate the development of the subject.

The book begins with an introduction to computer systems by examining the perspectives of users, application programmers, system programmers and hardware designers. It also presents the views of digital logic designers and architects of computer systems. This is followed by a chapter on the representation of data in digital systems. Number systems for numeric data and binary codes for representing decimal numbers and characters are presented.

The third chapter reviews the basics of digital systems, which makes the book self-contained. It has sufficient information for computer science students to appreciate the later chapters. It starts with a brief discussion of Boolean algebra and the design of combinatorial logic circuits using gates, MUXes and PALs. It then describes the basics of sequential circuits including various types of flip-flops and shift registers.

The next two chapters describe the design of the Arithmetic and Logic Unit of computers. Chapter 4 presents algorithms for addition, subtraction, multiplication and division of binary integers and floating point numbers. The IEEE 754 standard for floating point numbers is described. We also evolve the logic circuits for all arithmetic operations and describe the functions of MSI chips which perform arithmetic and logic operations. In Chapter 5 we present algorithmic state machines to express sequential algorithms relevant to the design of digital systems. We then present a simple version of a hardware description language and use it to describe arithmetic logic circuits. Specifically, we develop ASM charts for common arithmetic/logic operations and their implementation using sequential logic.

In Chapter 6 we discuss the logical organization of a small hypothetical computer called SMAC+ (small computer). An algorithm to simulate this machine is given, which can be converted to C programs by students. Machine language programs for SMAC+ are presented. The need for extra machine instructions and the organizational improvements required to conveniently solve a variety of problems on this machine is clearly brought out by writing machine language programs for this machine. The machine is slowly improved by enhancing its features such as addressing modes, indexing, registers, etc. While new features are added to the machine, the student could modify and enhance the simulator to enable him to write programs for the enhanced versions of the hypothetical computer. This provides a deeper understanding of organizational issues.

Chapter 7 discusses in greater detail and in more general terms the design of the central processing unit of computers. Various choices for instructions, instruction formats, addressing methods, bus structure, etc., are discussed in detail. Basics of micro-programming, data and control flows are also presented in this chapter.

We strongly feel that it is important for a student studying computer organization to understand how assembly language programs are written. An assembly language brings out the details of the structure of the CPU of specific real processors and gives an opportunity to students to write assembly programs and test them. We describe computer organization at the assembly language level using the Pentium as the running example, focusing mostly on its 32-bit structure. We write small assembly language programs using the NASM assembler. We use this assembler as it is open source and can be downloaded by teachers and students.

Chapter 9 is on memory organization. We describe Static and Dynamic Random Access memories and their organization as 2D and 2.5D structures. In this chapter we also describe Read Only Memories and dual-ported memories. The next chapter describes how the effective speed of memories can be increased by combining DRAMs used as main memories with SRAMs used as caches. Various methods of organizing caches are presented. This chapter also describes how the addressable memory size can be increased using the idea of virtual memory, which combines a large capacity disk with a smaller main random access memory.

In Chapter 11 we describe the techniques used to interface peripheral devices with the other units of the computer and the procedures used to efficiently transfer data between these devices and the main memory. This chapter also describes various standard buses. It concludes with a discussion of Local Area Networks that connect several computers located within a few hundred metres.

The next chapter is on Advanced Processor Architectures. In this chapter we give an overview of how processors have evolved from the sixties to the present. Based on this we have abstracted certain invariant principles which have prevailed over decades in designing processors. We then give the motivation for the development of Reduced Instruction Set Computers (RISC). This leads us to a discussion of pipelining and its use in RISC. We also describe the factors which upset pipelining and how they are alleviated using both hardware and software methods. The chapter concludes with a description of instruction level parallelism and how it is used in superscalar processors, VLIW and IA-64 architectures.

The book concludes with a chapter on Parallel Computers. In this chapter we describe how processors are interconnected to create a variety of parallel computers. We briefly describe vector computers and array processors. We then describe various methods of organizing shared memory parallel computers. This leads us to the problem of cache coherence and its solution. We then describe distributed shared memory parallel computers (commonly known as Non Uniform Memory Access—NUMA—computers) and message passing multicomputers, including clusters of workstations.

Prof. V. Rajaraman would like to thank the Director, Indian Institute of Science, Bangalore, for the facilities provided. Dr. T. Radhakrishnan would like to thank Concordia University, especially the Dean of the Faculty of Engineering and Computer Science, for facilitating his visits to Bangalore while writing this book. We would like to thank all our students, readers, reviewers and colleagues for the many constructive comments given by them, which have greatly improved the book. We would like to thank Ms. T. Mallika for an excellent job of typing and other secretarial assistance. It is our great pleasure to acknowledge the dedication and the untiring and endless support of Mrs. Dharma Rajaraman. She checked the manuscript and proofs with care, indexed and supported this project in every possible way. In short, she made this book writing project enjoyable and successful. As a matter of fact, mere words cannot express our grateful thanks to her.

V. RAJARAMAN
T. RADHAKRISHNAN

1
COMPUTER SYSTEMS—A PERSPECTIVE

LEARNING OBJECTIVES

In this chapter we will learn:

•  How to view a computer system from the perspectives of end users, system and application programmers, and hardware designers.
•  Block diagram view of a computer system.
•  The views of digital logic designers and those of the architects of computer systems.
•  How to view a computer system as a layered system with facilities provided by lower layers being used by higher layers.

1.1 INTRODUCTION

Computers are used in every walk of life to assist us in the various tasks we perform. Computers are more widely used now than ever before, not only by specialists but also by casual users for a variety of applications. Today, low-cost personal computers are available in plenty at homes and workplaces. Powerful supercomputers can be accessed from remote terminals through communication networks. The wide availability of the Internet has enhanced the use of computers for information sharing and communication. Because of the widespread use of computers, software developers nowadays consider the ease of use as the most important design criterion.

Today computers have become universal machines used by almost everyone. Home computers have proliferated and are used primarily for searching the World Wide Web for information on diverse matters such as health, investment and news, and to look for jobs, marriage partners, etc. Besides this, they are used for e-shopping, booking tickets by airlines, trains, hotels, etc. E-mail is another popular application among home users. Another use is to draft letters, keep diaries, etc. These users are not professionals. They are lay users and their main concern is ease of learning how to use diverse applications and use them in a simple convenient manner. Users simply interact with the application program viewing it as a 'black box' and use it through an appropriate graphical user interface.

Apart from individual users, organizations such as banks, hospitals, hotels, governments and industrial manufacturers use computer systems interactively to assist them in their functions. Banks use them to look up user accounts, print account statements, etc. Hospitals use them in patients' admission process, routine patient care, scheduling operation theatres, allotting rooms, etc. Hotels use them to check room reservation status, bill customers, and so on. There are also myriad applications in the "back offices" of the organizations for payroll, personnel management, billing and so on.

Thus, application designers have to cater to a wide range of requirements. Further, the complexity of application programs has increased enormously. Generally, many application programs consist of tens of thousands or even millions of lines of code. Such large software is developed by a team of developers to operate reliably, and is maintained and improved over long periods to be available with minimal breakdowns. Hence the system software and hardware of a computer system should have facilities to support such application software development. From the end users' perspective, the number of lines of code is irrelevant.

The hardware and the system software of a computer system are complex and they can be abstractly modelled to get a simplified view. A system designer uses the model for understanding and analyzing a system before embarking on the costly processes of its design and development. The kind of details that we abstract from the model, or conversely that we present in the model, depends on the purpose for which or for whom the model is created.

To solve problems using computers, we need two major entities: data—which suitably represent the relevant details of the problems to be solved in a form that can be manipulated or processed by computers, and algorithms—which systematically transform the input data using a step-by-step method and give the desired output. Data are structured collections consisting of primitive data elements such as integers, reals, Booleans, and/or ASCII characters. Representation and processing of the primitive data elements are supported by the computer hardware, and we need structured representation of data, called 'data structures', to make them amenable for computer processing.

We need a language to describe such algorithms. A language used for describing algorithms meant for execution on computers is known as a Programming Language. It is a formal language which is very different from a natural language such as English. Thus such a language has precisely defined syntax and unambiguous semantics. Every statement of a formal language is interpreted by the computer hardware in exactly one way. An algorithm expressed using one of the many programming languages is called a program.

Programming languages can be categorized into: (i) higher level languages like JAVA or C++, which are independent of the hardware, and (ii) assembly languages, which are specialized and dependent on the hardware of the computer.

A program consists of several statements, each of which performs an operation on some data. The operation is very primitive at the hardware level: add, subtract, compare, move data from one register to another, test if a particular bit is ON or OFF, etc. On the other hand, at the application level it can be quite complex. For example, when a computer system is used for reserving an airline ticket, it could look like this: "Reserve a one-way ticket from Bangalore to Montreal by Air France for next week Sunday." It is the responsibility of the application designer to develop a clear specification for all the relevant higher level operations pertinent to the domain of application.

Often a computer system is viewed as consisting of several layers, as depicted in Figure 1.1. Such a layered view provides a functional overview of a complex system such as a computer. Usually the higher layers depend upon the facilities provided by the lower layers. Each layer will be designed with careful consideration of the needs of the upper layers supported by it, as well as the objectives fulfilled by that layer and the constraints of the technology used for its implementation. In this introductory textbook, we will study the hardware layer and the assembly language layer in detail.

FIGURE 1.1 A layered view of a computer system.

1.2 A PROGRAMMER'S VIEW OF A COMPUTER SYSTEM

Consider the following program segment written in a higher level language to sum 64 numbers stored in an array and find their average.

    Total := 0
    For i = 1 to 64
        Total := Total + Marks(i)
    End
    Average := Total / 64

Let us suppose that a professor has 64 students in his class. He, as the end user of a computer software, issues a command: "Find the average mid-term marks of my class." If the software has the knowledge that this professor has 64 students in his class and their mid-term marks are stored in the vector Marks, the above program segment can perform this task. A programmer using the higher level language would focus more on the algorithm and the data structures needed to solve the problem.

Later in Chapter 8 we will study the details of NASM, the assembly language of the popular Pentium processor. The above program segment can be written in assembly language as below:

            mov  ecx, 64          ; initialize the ecx register to 64, the loop count
            mov  eax, 0           ; initialize the eax register to zero
            mov  esi, 0           ; initialize index 'i' to 0
            mov  ebx, marks       ; store the base address of the array in ebx
    addnext:
            add  eax, [ebx+esi]   ; add the i-th element to eax
            add  esi, 4           ; increment by 4 to get the next element
            loop addnext          ; this instruction uses the ecx register
            shr  eax, 6           ; a faster way to divide by 64 (2^6)

The assembly language level programmer, on the other hand, would have to focus not only on the algorithm and the data structures, but also on finer details such as which hardware registers of the processor to use, where to store the program and data in memory, and what instructions would efficiently implement the operations needed on the data.

A compiler, which is a software system, can perform the task of translating a given higher level language program into its assembly language equivalent automatically. For a given computer system, there are several compilers to facilitate the users to develop application programs in the languages that are suitable for them. People who develop such compilers are also called programmers. Their needs are relatively more specialized than the needs of the application programmers. Compilers are thus one important software resource of a computer system. If we add up the storage needed by all compilers, it is quite large.

In a computer system there are at least two kinds of memories: primary memory or Random Access Memory (RAM), and secondary memory, for example disks. Disks are much cheaper than RAMs and hence it is cost effective to have billions of bytes (gigabytes or GB) of disk storage in a computer. When large disk storage is available, several users of a computer system can keep their large data files in it. Thus, disk space is yet another resource that is shared by users. As the demand for storage is very high in a modern computer system, it should, therefore, be properly managed, allocated and de-allocated to users.

Apart from memory, another resource to be managed in a modern computer system is the "processor's time". A typical processor of today can execute billions of instructions per second. Such a high capacity can be shared among multiple tasks. Coordination of multiple tasks together solves a more complex problem. The manager of all such resources in a computer system is also a software, which is known as the operating system.

There are programmers of different kinds—operating system developers, compiler writers, application developers, and so on. Their needs and focus vary.

1.3 HARDWARE DESIGNER'S VIEW OF A COMPUTER SYSTEM

A block diagram model of the computer system as viewed by a hardware designer is shown in Figure 1.2. This classical model of a computer hardware system consists of five basic subsystems:

1. Input
2. Output
3. Memory
4. Processor
5. Control

These subsystems communicate with each other by transferring data and control signals among themselves, which are denoted by the lines connecting the blocks in Figure 1.2. In this figure the solid lines denote the data flow and the dotted lines denote the control flow.

FIGURE 1.2 Block diagram of a computer.

As a computer system is used to solve a problem, it is necessary to feed data relevant to the problem. This is done via the input subsystem. The results obtained by solving the problem must be sent to the person (or even a machine) who gave the problem. This is done by an output subsystem. The processor is required to transform the input data based on the primitive operations which the hardware is designed to perform. Let us call these primitive operations machine instructions. For ultimate execution, a program will be composed of such machine instructions. There is a one-to-one correspondence between the machine instructions and the assembly language instructions. The processing subsystem performs both arithmetic (ADD, SUBTRACT, MULTIPLY, DIVIDE) and logic (AND, OR, NOT) operations and thus it is also known as the Arithmetic and Logic Unit or ALU.

The memory subsystem is needed to store data as well as the program. The program is a step-by-step description of the algorithm which describes the process of transforming the input to the desired output. It is normally expressed in a language that the hardware can understand. The control subsystem is needed to correctly sequence the execution of various machine instructions. It should be noted that the steps of an algorithm, written in the form of a 'program', are also stored in the memory and fetched from memory one machine instruction at a time for execution. Instructions from this program are fetched by the control subsystem one after another in an orderly fashion, and the operations corresponding to each instruction are performed by the ALU. The control unit knows how to control and coordinate the various subsystems in executing a machine instruction. Sometimes we refer to the combination of the processing and control subsystems as the CPU. The CPU is made of hundreds of millions of electronic circuits that are suitably packaged into components. These components are used to build subsystems and they are controlled by a clock that runs at mega cycles per second (abbreviated MHz) or giga cycles per second (GHz). On the other hand, the I/O is made of electro-mechanical and electro-optical components and thus they operate only at several kilo cycles per second (kHz).

We note in Figure 1.2 that memory is the 'heart' of this hardware organization. A program is stored in memory and the data on which it operates is also stored in memory. To denote the fact that the machine instructions coming from the program stored in memory are used by the control unit, we have drawn a solid line between the memory and control blocks in Figure 1.2. We can redraw Figure 1.2 as shown in Figure 1.3. In this new organization, we have explicitly shown the communication path, in the form of a hardware bus, between the various blocks. A bus in the hardware is a 'bunch' of parallel wires and the associated hardware circuits which transport binary signals from source to destination. Two buses are shown in this figure and both of them are bi-directional. The I/O bus connects the input/output subsystems to memory, and the processor bus connects the CPU to memory.

Figure 1.3 brings out an important problem in computer design. We see that there are two entities which communicate with memory. Their speeds are very different (a ratio of 1 to 1000). These multiple entities may require the services of memory (READ or WRITE services) simultaneously. How should we deal with this speed mismatch? How do we resolve the conflict which may arise if both the subsystems attempt to simultaneously access the same memory location?

The computer hardware designers and the system software designers together should address these two questions, either at the hardware level or at the software level. We will see later in this book how these problems are satisfactorily solved.

FIGURE 1.3 Bus-based organization of a computer.

A question arises, "How is this hardware used by the software?" The organized collection of machine instructions is what we call a 'program'. The program embodies the method of solving a problem or the process that transforms the input data into the desired output result. The program is written in an appropriate programming language that the machine can interpret and execute. Normally, the program, developed by a software engineer, may have thousands or millions of instructions put together, and it is stored in the memory. Thus, we can distinguish between two clear stages. The first stage is the one in which the program is developed by a programmer or a team of software engineers using some software tools, and it is stored in the memory with the help of the input subsystem. The second stage is when the program is executed and the end user interacts with the computer and performs his or her intended tasks. This two-stage approach forms the basis for the "stored program concept" which we will explain in greater detail in the subsequent chapters.

1.4 OBJECTIVES OF THE COMPUTER ARCHITECT

In this section, we briefly explain the three terms: digital logic, computer organization and computer architecture. A computer system can be viewed as an abstraction consisting of several layers. Digital logic is primarily concerned with the hardware layer. Computer organization's primary concern encompasses both the hardware level and the computer language levels. Referring to Section 1.2, we note that the design of the processing unit is closely linked with the machine language instructions one would like to support. The machine language instructions to be supported depend primarily on the type of application programs one would like to run on the machine.

The complete set of instructions supported by a computer in its hardware is called the instruction set of that machine. The choice of instruction set is one of the main concerns of computer organization. This choice is governed by many factors such as the hardware-software trade-off and the compatibility of the instruction sets of the computer as it evolves from one generation to the other. The trade-off implies that a large number of instructions supported by complex hardware would simplify writing machine language programs, but it would also increase the complexity of the hardware and hence the cost.

Computer organization is also concerned with the organization of memory and its interconnection with the CPU and I/O subsystems. One of the major problems in designing memory is the speed mismatch between memory and CPU. Over the past decades, devices used to design CPUs have always been 10 times faster than those used in memory. This is primarily due to economics, in the sense that large memories are needed to support most applications and thus slower, cheaper devices are used to design them compared to the CPU. Thus in computer organization one uses the idea of designing a hierarchy of memories to reduce the speed mismatch. This is one of the major topics we will discuss. Lastly, the organization of I/O subsystems and their interconnection with memory and CPU is a major concern. Here the problems are the diversity of I/O devices, which range in speeds of 1 to 100, and their modes of operation which also vary widely. While some devices transfer data one byte at a time, others send streams of bytes in chunks of the order of 200. We will describe how to tackle this problem in this book.

A computer architect is in overall control of the design of a computer system. The primary objective of a computer architect is to meet the requirements of the end users of the system making optimal use of hardware components available at reasonable cost. We have described in some detail the end users' perspective. Specifically, we have seen that computers are now widely used by diverse users. We may compare the job of a computer architect with that of a building architect. A building architect's primary concern is planning a building, taking into account the needs of prospective occupants of the building. This is coupled with aesthetics of the look and feel of the building. While these are the major objectives, they have to be met most economically using the latest materials available for construction. The architect must be knowledgeable about the design of structures and be able to interact with structural engineers who will design the structure. A computer architect's job is somewhat similar. The computer architect must thus be conversant with computer organization. Going back to our analogy, a person dealing with computer organization is similar to a structural engineer. A computer architect has to be aware of both the hardware and software layers and their interactions. He has to pay special attention to deal with the speed mismatch between main memory and CPU as well as that between I/O systems and memory/CPU. The architect has to understand the prevailing technology and changing user requirements and accordingly tune various systems and their parameters to interoperate and create a balanced system. Specifically, an architect uses facilities provided by the operating system to solve the following problems related to memory systems:

•  Increase the effective speed of memory by combining a large low cost memory and several fast smaller memories to provide effective faster access to those data immediately required by the CPU.

•  Increase the effective capacity of memory by combining a semiconductor Random Access Memory (RAM) (whose capacity is small but whose speed is high) with a disk memory (whose capacity is 1000 times more than that of RAM but is 1000 times slower) to provide a "virtual memory" with the capacity of a disk and a speed approaching that of the RAM.

By combining the above two techniques an architect succeeds in providing the large capacity fast memory required by today's large applications. In doing this, the services of the operating system as well as the underlying hardware are made to cooperate by the computer system architect.

Another area of importance to computer architects is I/O systems. As we pointed out, I/O devices are diverse and slow. They could become a bottleneck to computation. Methods of alleviating this speed mismatch by both hardware and software are devised by architects. There are several such problems which arise and pose challenging issues to today's computer system architects.

In modern times, two other important developments have taken place. One of them is the Internet, where computers across the world are interconnected by a communication network. This interconnection has led to several important applications such as the World Wide Web, searching for information on the web, e-mail and e-commerce. This has been enabled by computer network architects who evolved several standards which make the interoperability of diverse computers possible. With such wide diversity and access, privacy and security become very important issues.

The second important development is the design of parallel computers. With developments in the fabrication of electronic components like the CPU, today people are able to fabricate multiple processors in a single component. Both for increasing the computational power and for availability under certain failures, parallel use of multiple computers has become common. Parallel computers operating cooperatively achieve the high speeds needed to solve highly complex problems. A challenge to an architect is to effectively use this power provided by the hardware to speed up problem solution. A further challenge arises in making several independent processors share a large memory system and access it in an orderly manner, both at the hardware and software levels.

1.5 SOME INVARIANT PRINCIPLES IN COMPUTER DESIGN

Computer systems are designed cooperatively by hardware designers, software designers and computer systems architects. A computer architect needs to understand the current requirements of the population of users, the need for effective use of the final system, and the prevailing technology.

These change quite rapidly. In the last five decades, there have been huge shifts in user requirements and in the prevailing technology of both hardware and software. In spite of these rapid changes there are some major principles which have remained invariant over a longer period. They are:

Upward compatibility: This is essential as software development is human intensive and thus expensive to develop and maintain. Thus, as new computer hardware is introduced, it is essential to make sure that software developed for earlier machines is able to run with little change on the new generation machines.

Software-hardware trade-off: It is possible in computers to obtain certain functionalities either by hardware or by software. Hardware is faster but more expensive, whereas software is cheaper but slow. There is, however, always a possibility of hardware-software trade-off. The trade-off between speed and cost changes as technology evolves. This is what distinguishes computer system design from other areas of engineering systems design.

Locality of reference of instruction and data: A computer stores a sequence of instructions in the memory in the order in which they are executed. Sequential thinking seems to be natural for humans and sequential programming has always been a dominant programming style. A number of data structures such as vectors, sequential files and strings are also linear and sequential. Thus, if one data item in a structure is referred to, it is most likely that its neighbours will be accessed next. In over five decades of evolution of computer systems, the locality of reference has remained an important invariant in design. The design of pipelined processors, cache memory and virtual memory are all dependent on this principle.

Parallelism in applications: There is inherent parallelism in many programs. For example, the two multiplications and the division in the expression (a*b - c/d + f*z) can all be carried out simultaneously if there are two multipliers and a divider which can work independently. Data parallelism occurs when the same instruction can be carried out on several data items, such as simultaneous processing of pixels in a picture or a video frame. Besides this, it is possible to overlap operations of the CPU, which is possible if different phases in the execution of an instruction can be overlapped when a sequence of instructions is carried out. Such parallelism is used in pipelined processors. Pipelined processors use temporal parallelism. We will describe this in great detail in this book.

Amdahl's Law: This law states the simple fact that when a system consists of many subsystems, in order to increase the speed of the system as a whole, all subsystems must be speeded up by the same amount. In other words, in order to decrease the overall time for solving an application, it is essential to increase uniformly the speed of all parts of the computing system. Thus computer designers have to pay particular attention to increasing the speed of the slowest subsystem, such as I/O, by appropriate architecture.
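Amdahl's law is often quoted in a quantitative form, which we state here for concreteness (this standard formulation is an addition of ours; the paragraph above states the law qualitatively). If a fraction f of the total execution time of a job is spent in a part of the system that is speeded up by a factor k, while the rest is left unchanged, the overall speedup is:

    Speedup = 1 / ((1 - f) + f/k)

For example, if the CPU accounts for 80% of the time (f = 0.8) and is made 10 times faster (k = 10), the overall speedup is 1/(0.2 + 0.08), which is about 3.6 and not 10, because the untouched 20% (for instance, I/O) now dominates. This is precisely why all subsystems must be speeded up together.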

The Pareto principle or 20-80 law: This law states that in a large number of systems, 20% of the system consumes 80% of the resources. It is thus important for a computer architect to identify this critical 20% of the system and spend maximum effort to optimize its operation.

During the early days of computers, experimental data on working computers were not easily available. Thus the design used mostly common sense and several hunches which experienced engineers had. This situation has now changed. A lot of experimental data on several computers using a variety of application programs have been gathered. Thus the actual performance of computers is available, allowing an architect to understand the critical parts of a system. The modern methodology of design consists of collecting a large number of typical application programs known as benchmark programs using which a proposed design will be assessed. Before committing hardware resources and building a system, it is simulated on an existing computer and various alternatives are tried to arrive at an optimal design.

SUMMARY

1. A computer system may be viewed from the perspective of end users, system and application programmers and hardware designers.
2. End users' primary concern is ease of use and good user interfaces which may be graphical, audio and even video.
3. The main concern of system and application programmers is the facilities provided by the hardware which ease their task and enable them to develop error-free maintainable programs.
4. Hardware designers' concern is to provide the best design with currently available technology at reasonable cost.
5. A complex system such as a computer is usually viewed as consisting of several layers. Higher layers depend on the services provided by the lower layers. Each layer may be independently designed.
6. We can view a computer system as consisting of six layers. The lowest to highest layers are respectively hardware, machine language, assembly language, high level language, operating system and application programs.
7. At the lowest level in the hierarchy of languages is machine language, which consists of instructions provided by the hardware of the computer.
8. Assembly language uses mnemonics for hardware instructions and symbolic representation of memory addresses. It is a one-to-one representation of machine language.
9. A computer hardware system consists of five subsystems. They are Input, Output, Memory, Central Processing Unit and Control Unit. In the rest of this book, we will describe each one of them in detail.

10. High level languages provide users with facilities to express algorithms with ease.
11. An operating system is used to effectively manage hardware and software resources of a computer system. It has facilities to allow orderly execution of several users' programs concurrently in a computer system.
12. Computer organization is primarily concerned with the design of the CPU, memory and I/O devices with the existing technology, and with finding appropriate methods of interconnecting them to operate as a coherent system to cost-effectively solve classes of applications of interest to users of the system.
13. The job of a computer architect is similar to that of a building architect. A computer architect must be abreast of the prevailing technology of CPU, memory and I/O systems.
14. The main problems in design are the need to pick an appropriate instruction set of a computer to conveniently solve a representative set of applications.
15. The main problem in memory design is to get the best performance by combining low-cost, large capacity slow memory with high-cost, high speed smaller memories.
16. The addressable storage is increased by combining high capacity, slow speed hard disks with high speed main memory made of semiconductors.
17. Another major concern is to organize and connect a diverse group of I/O devices with varying speeds (all low) to CPU/memory without lowering the speed of the overall system.
18. Six important principles, which should be used by computer system architects, are: upward compatibility, locality of reference of instruction and data, the possibility of software-hardware trade-off, parallelism inherent in most applications, Amdahl's law and the Pareto principle (or 20-80 law). These have remained invariant during the last five decades and are likely to remain the same.
19. Modern methodology of design uses typical input data from a class of application programs to optimize the performance of a proposed system. The proposed system is simulated on an existing computer and the benchmark programs are executed on the simulated system to thereby optimize the design.

EXERCISES

1. What is the reason we view a computer system as consisting of several layers? What are the different layers?
2. What does a hardware layer consist of? What are the functions of the control unit of a computer?

3. What is the difference between machine language and assembly language? What are the advantages of using assembly language instead of machine language?
4. When does one use a higher level language? When is assembly language preferred over a higher level language?
5. What is an operating system? Why is it needed? What are the functions of an operating system?
6. What is computer organization? How is it different from digital logic?
7. What are the major problems in computer organization? What are the important principles used to alleviate these problems?
8. What is the role of a computer architect? Compare the roles of a building architect and a computer architect.
9. What is the extra flexibility available to computer architects which is not available to a building architect?
10. What are the six major principles which are used by computer architects? Why are they important?
11. What do you understand by the term 'upward compatibility'? Why is it important?

2
DATA REPRESENTATION

LEARNING OBJECTIVES

In this chapter we will learn:

•  Why binary digits are used to represent, store and process data in computers.
•  How decimal numbers are converted to binary numbers and vice versa.
•  Why number systems other than binary are used to represent numbers.
•  How a decimal number is coded as a binary string and the different codes used and their features.
•  Coding methods used to represent characters such as English letters, mathematical symbols, etc.
•  Why redundant bits are introduced in codes and how they are used to detect and correct errors.

2.1 INTRODUCTION

A digital system accepts a stream of symbols, stores them, processes them according to precise rules and produces a stream of symbols at its output. At the simplest level, a digital processor may accept a single number at its input, perform an operation on it and produce another number at its output. For example, a processor to find the square of a one-digit number would fall in this category. At a more complex level, a large number of symbols may be processed using extensive rules.

As an example, consider a digital system to automatically print a book. Such a system should accept a large text or typewritten material. Given the number of letters which could be accommodated on a line (page width) and the rules for hyphenating a word, it should determine the space to be left between words on a line so that all lines are aligned on both the left and right hand sides of a page. The processor should also arrange lines into paragraphs and pages as directed by commands. Decisions to leave space for figures should be made. A multitude of such decisions are to be taken before a well laid out book is obtained. Such complex processing would require extensive special facilities such as a large amount of storage, electronic circuits to count and manipulate characters, and a printer which has a complete assortment of various sizes and styles of letters.

Regardless of the complexity of processing, there are some basic features which are common to all digital processing of information which enable us to treat the subject in a unified manner. These features are:

1. All streams of input symbols to a digital system are encoded with two distinct symbols, 0 (zero) and 1 (one). These symbols are known as binary digits or bits. Bits can be stored and processed reliably and inexpensively with currently available electronic circuits.
2. Instructions for manipulating symbols are to be precisely specified such that a machine can be built to execute each instruction. Complex manipulation instructions may be built using simple instructions. Examples of simple manipulation instructions are: add two bits, compare two bits and move one bit from one storage unit to another. The idea of building a complex instruction with a sequence of simple instructions is important in building digital computers.
3. A digital computer has a storage unit in which the symbols to be manipulated are stored. The encoded instructions for manipulating the symbols are also stored in the storage unit. A sequence of instructions for accomplishing a complex task may be stored in the storage unit and is called a program.
4. Bit manipulation instructions are realized by electronic circuits.

The logic design of digital computers and systems consists of implementing the four basic steps enumerated above, keeping in view engineering constraints such as the availability of processing elements, their cost, reliability, maintainability and ease of fabrication. At this stage, we should distinguish between the design of a general purpose digital computer and that of a specialized digital subsystem. Even though the four basic steps in design are common to both, the constraints which are peculiar to each of these lead to a difference in the philosophy of design.

A general purpose machine is designed to perform a variety of tasks. Each task requires the execution of a different sequence of processing rules. The processing rules to be followed vary widely. At the outset one may not be able to predict all the tasks he may like to do with a machine. A flexible design is thus required. This flexibility is achieved by carefully selecting the elementary operations to be implemented through electronic circuits. One may realize a complex operation by using various sequences of elementary operations, a program, which may be thought of as a macro operation. For example, one may realize a multiplication operation by repeated use of an addition operation. A set of macros could be used to perform more complex tasks. One can thus build up a hierarchy of programs, all stored in the computer's memory, which can be invoked by any user to perform a very complex task. A user need not work only with the elementary operations available as hardware functions. He can use the hierarchy of programs which constitute the software of a computer and which is an integral part of a general purpose digital computer.

It should be observed that it is possible to perform macro operations entirely by specially designed electronic circuits rather than by using programs. These electronic circuits are together called hardware. Thus, software can be replaced by hardware and vice versa. What basic tasks are to be performed by hardware and what are to be done by combined software and hardware is an engineering design decision which depends on cost versus speed requirements and other constraints prevailing at a given time. One of the purposes of this book is to bring out the hardware-software trade-off which is important in the design of general purpose computers.

This book deals with two aspects of digital computer design, namely, computer organization and computer architecture. There are three layers in computer design, as shown in Figure 2.1.

FIGURE 2.1 Layered view of computer design. (Top layer: Computer Architecture; middle layer: Computer Organization; bottom layer: Computer Logic.)

The bottom-most layer deals with digital circuits, which are used for arithmetic and logic operations. Computer organization primarily deals with combining the building blocks described in computer logic into a programmable computer system. Besides the arithmetic logic unit, it is also about designing memory and I/O systems and ensuring their cooperative operation to carry out a sequence of instructions. Computer architecture primarily deals with methods of alleviating the speed mismatch between CPU, memory and I/O units by a combination of hardware and software methods. It also deals with the interaction of the hardware with the operating system to ensure easy and optimal operation of a computer. In this book we will also be describing important hardware-software trade-offs to ensure optimal functioning of a computer.
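To make the idea of realizing a macro operation by a program of elementary operations concrete, here is a short C sketch (ours, not taken from the book; the function name multiply_by_addition is illustrative) that realizes multiplication by repeated use of addition, as suggested above. A real machine would use a faster circuit or algorithm; the point is only that a complex operation reduces to a sequence of simple ones.

    #include <stdio.h>

    /* Realize the macro operation "multiply" using only the elementary
       operation "add": add 'a' to an accumulator 'b' times. */
    unsigned int multiply_by_addition(unsigned int a, unsigned int b)
    {
        unsigned int product = 0;
        unsigned int i;
        for (i = 0; i < b; i++)
            product = product + a;   /* repeated addition */
        return product;
    }

    int main(void)
    {
        printf("6 x 7 = %u\n", multiply_by_addition(6, 7));  /* prints 42 */
        return 0;
    }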

This chapter discusses the representation of numbers and characters in digital systems. As was stated earlier in this section, in a digital system all data to be processed or stored are represented by strings of symbols, where each symbol, called a bit, is either a 0 or a 1. The primary reasons for choosing to represent all data using only zeros and ones are:

1. Physical devices used for operating on data in digital systems perform most reliably when operated in one out of two distinct states. For example, circuits designed with transistors operate with maximum reliability when used in the "two state", namely binary, mode.
2. Most devices, which are currently available to store data, do so by being in one out of two stable states. For example, magnetic discs store data by being magnetized in a specified direction or in an opposite direction.

We will present in the next section how numbers are represented using binary digits.

2.2 NUMBERING SYSTEMS

The most widely used number system is the positional system. In this system the position of the various digits indicates the significance to be attached to each digit. For example, the number 8072.443 is taken to mean:

    8 × 10^3 + 0 × 10^2 + 7 × 10^1 + 2 × 10^0 + 4 × 10^-1 + 4 × 10^-2 + 3 × 10^-3

where 10^3 is the 1000s position, 10^2 the 100s position, 10^1 the 10s position, 10^0 the units position, and 10^-1, 10^-2 and 10^-3 the 1/10th, 1/100th and 1/1000th positions respectively. In this notation the zero in the number 8072 is significant as it fixes the position, and consequently the weights, to be attached to 8 and 7. Thus, 872 does not equal 8072. An example of a non-positional number system is the Roman numeral system. This number system is quite complicated due to the absence of a symbol for zero.

Positional number systems have a radix or a base. In the decimal system the radix is 10. A number system with radix r will have r symbols and would be written as:

    a_n a_(n-1) a_(n-2) ... a_0 . a_(-1) a_(-2) ... a_(-m)

and would be interpreted to mean:

    a_n r^n + a_(n-1) r^(n-1) + ... + a_0 r^0 + a_(-1) r^(-1) + a_(-2) r^(-2) + ... + a_(-m) r^(-m)

The symbols a_n, a_(n-1), ..., a_(-m) used in the above representation should each be one of the r symbols allowed in the system. In the above representation, a_n is called the most significant digit of the number and a_(-m) (the last digit) is called the least significant digit.

In digital systems and computers, the number system used has a radix 2 and is called the binary system. In this system only two symbols, 0 and 1, are used. The symbol is called a bit, a shortened form of binary digit. A number in the binary system will be written as a sequence of 1s and 0s. For example, 1011.101 is a binary number and would mean:

    1 × 2^3 + 0 × 2^2 + 1 × 2^1 + 1 × 2^0 + 1 × 2^-1 + 0 × 2^-2 + 1 × 2^-3

The equivalent number in decimal is thus:

    8 + 0 + 2 + 1 + 1/2 + 0 + 1/8 = 11.625
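The positional interpretation above translates directly into a small program. The following C sketch (ours, for illustration; the function name value_in_radix is an assumption, not from the text) evaluates the integral part of a digit string in any radix from 2 to 16, using the equivalent Horner form of the expansion, value = value × r + digit:

    #include <stdio.h>

    /* Evaluate the integral part of a digit string in radix r (2 to 16).
       Digits 0-9 and upper-case A-F are assumed. */
    long value_in_radix(const char *digits, int r)
    {
        long value = 0;
        const char *p;
        for (p = digits; *p != '\0'; p++) {
            int d = (*p >= '0' && *p <= '9') ? (*p - '0') : (*p - 'A' + 10);
            value = value * r + d;   /* Horner form of a_n*r^n + ... + a_0 */
        }
        return value;
    }

    int main(void)
    {
        printf("%ld\n", value_in_radix("1011", 2));   /* prints 11   */
        printf("%ld\n", value_in_radix("8072", 10));  /* prints 8072 */
        return 0;
    }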

In digital systems and computers, the number system used has a radix 2 and is called the binary system. In this system only two symbols, namely, 0 and 1, are used. The symbol is called a bit, a shortened form for binary digit. A number in the binary system will be written as a sequence of 1s and 0s. For example, 1011.101 is a binary number and would mean:

1 × 2³ + 0 × 2² + 1 × 2¹ + 1 × 2⁰ + 1 × 2⁻¹ + 0 × 2⁻² + 1 × 2⁻³

The equivalent number in decimal is thus:

8 + 0 + 2 + 1 + 1/2 + 0 + 1/8 = 11.625

Table 2.1 gives the decimal numbers from 0 to 17 and their binary equivalents.

TABLE 2.1 Binary Equivalents of Decimal Numbers

    Decimal  Binary      Decimal  Binary      Decimal  Binary
    0        0           6        110         12       1100
    1        1           7        111         13       1101
    2        10          8        1000        14       1110
    3        11          9        1001        15       1111
    4        100         10       1010        16       10000
    5        101         11       1011        17       10001

It is seen that the length of binary numbers can become quite long and cumbersome for human use. The hexadecimal system (base 16) is thus often used to convert binary to a form requiring a lesser number of digits. The hexadecimal system uses the 16 symbols 0, 1, 2, …, 9, A, B, C, D, E, F. As its radix 16 is a power of 2, namely 2⁴, it is fairly simple to convert binary to hexadecimal and vice versa; each group of four bits has a hexadecimal equivalent. This is shown in Table 2.2. (One must contrast this with the conversion of binary to decimal.)

TABLE 2.2 Binary Numbers and Their Hexadecimal and Decimal Equivalents

    Binary  Hexadecimal  Decimal     Binary  Hexadecimal  Decimal
    0000    0            0           1000    8            8
    0001    1            1           1001    9            9
    0010    2            2           1010    A            10
    0011    3            3           1011    B            11
    0100    4            4           1100    C            12
    0101    5            5           1101    D            13
    0110    6            6           1110    E            14
    0111    7            7           1111    F            15

As illustrated in Example 2.1, one may convert a binary number to hexadecimal by grouping together successive four bits of the binary number starting with its least significant bit. These four-bit groups are then replaced by their hexadecimal equivalents.

EXAMPLE 2.1
Convert the following binary number to hexadecimal:

    Binary number:  101 1010 1011 0111
    Hexadecimal:      2    A    B    7

EXAMPLE 2.2
Convert the following binary number to hexadecimal:

    Binary number:  11 1011 0101 . 1101 11
    Hex number:         3  B  5  .  D  C

Observe that groups of four bits in the integral part of the binary number are formed starting from the right-most bit, as leading 0s here are not significant. On the other hand, bits in the fractional part are grouped from left to right, as the right-most bits of the fractional part are not significant. Because of the simplicity of binary to hexadecimal (abbreviated as Hex) conversion, when converting from binary to decimal it is often faster to first convert from binary to Hex and then convert the Hex to decimal. The decimal equivalent of (3B5.DC)Hex is (using Table 2.2):

3 × 16² + B × 16¹ + 5 × 16⁰ + D × 16⁻¹ + C × 16⁻²
    = 3 × 256 + 11 × 16 + 5 × 1 + 13/16 + 12/256
    = 768 + 176 + 5 + 13 × 16⁻¹ + 12 × 16⁻²
    = 949.859375
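The grouping procedure of Examples 2.1 and 2.2 is entirely mechanical and can be expressed as a short program. The following Python sketch is our own illustration (the function name is not from the book); it converts the integral part of a binary numeral to hexadecimal by padding it to a multiple of four bits and replacing each group by its Table 2.2 equivalent:

HEX_DIGITS = "0123456789ABCDEF"

def bin_to_hex(bits: str) -> str:
    # Pad on the left to a multiple of 4 bits; leading 0s are not significant.
    bits = bits.zfill((len(bits) + 3) // 4 * 4)
    # Replace each four-bit group by its hexadecimal equivalent.
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(HEX_DIGITS[int(g, 2)] for g in groups)

print(bin_to_hex("101101010110111"))   # prints 2AB7, matching Example 2.1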

2.3 DECIMAL TO BINARY CONVERSION

In addition to knowing how to convert binary numbers to decimal, it is also necessary to know the technique of changing a decimal number to a binary number. The method is based on the fact that a decimal number may be represented by:

d = aₙ2ⁿ + aₙ₋₁2ⁿ⁻¹ + … + a₁2¹ + a₀2⁰    (2.1)

If we divide d by 2, we obtain:

Quotient q = d/2 = aₙ2ⁿ⁻¹ + aₙ₋₁2ⁿ⁻² + … + a₁2⁰ and remainder r = a₀    (2.2)

Observe that a₀ is the least significant bit of the binary equivalent of d. Dividing the quotient by 2, we obtain:

q/2 = d/(2 × 2) = aₙ2ⁿ⁻² + aₙ₋₁2ⁿ⁻³ + … + a₂2⁰    (2.3)

and the remainder equals a₁. Thus, successive remainders obtained by division yield the bits of the binary number. Division is terminated when q = 0. The method used for decimal to binary conversion is expressed as Algorithm 2.1. The procedure is illustrated in Example 2.3.

EXAMPLE 2.3
Convert the decimal number 19 to binary.

    Division    Quotient    Remainder
    19 ÷ 2         9            1     Least significant bit
     9 ÷ 2         4            1
     4 ÷ 2         2            0
     2 ÷ 2         1            0
     1 ÷ 2         0            1     Most significant bit

Thus, 19 = 10011. Check: 10011 = 1 × 2⁴ + 0 × 2³ + 0 × 2² + 1 × 2¹ + 1 × 2⁰ = 16 + 2 + 1 = 19.

Decimal to Hex conversion is similar. In this case 16 is used as the divisor instead of 2. For example, the decimal number 949 is converted to Hex in Example 2.4.

EXAMPLE 2.4

    Division     Quotient    Remainder    Hex Equivalent
    949 ÷ 16        59           5              5
     59 ÷ 16         3          11              B
      3 ÷ 16         0           3              3

Thus, (949)₁₀ = (3B5)₁₆ = (3B5)Hex. Check: (3B5)₁₆ = 3 × 16² + 11 × 16¹ + 5 × 16⁰ = 768 + 176 + 5 = (949)₁₀.

Observe the notation used to represent the base of a number. The number 16 or Hex used outside the parentheses in (3B5)₁₆ indicates that 3B5 is to be interpreted as a number in base 16.

ALGORITHM 2.1 Decimal to binary conversion

var D: integer {D is the given decimal integer to be converted to binary};
    Q, R: integer {intermediate variables used in conversion};
    B: bitstring {B stores the binary equivalent of D};
begin {of algorithm}
    Input D;
    B := null; {null is a null string}
    if (D = 0) then begin B := 0; goto 10 end;
    while (D ≠ 0) do
    begin
        Q := D div 2; {D div 2 gives the integer quotient of D/2}
        R := D mod 2; {D mod 2 gives the remainder when D is divided by 2}
        B := Concatenate to left (R, B);
            {if R = 1 and B = 01 then Concatenate to left (R, B) yields B = 101}
        D := Q
    end;
10: Output B;
end {of algorithm}.

The procedure discussed above converts a decimal integer to binary. Decimal fractions may also be converted to binary, as we show next.
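Algorithm 2.1 translates directly into a modern language. The following Python sketch is our own rendering (the names are ours, not the book's); it follows the same successive-division scheme, with the string b playing the role of the bitstring variable B:

def decimal_to_binary(d: int) -> str:
    """Convert a non-negative decimal integer to a binary string
    by repeated division by 2, as in Algorithm 2.1."""
    if d == 0:
        return "0"
    b = ""                   # B := null
    while d != 0:
        q, r = divmod(d, 2)  # integer quotient and remainder of d/2
        b = str(r) + b       # concatenate the remainder to the left
        d = q
    return b

print(decimal_to_binary(19))   # prints 10011, as in Example 2.3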

The method is based on observing that a decimal fraction may be represented in the form shown in Equation (2.4). In order to find its binary equivalent, we have to find the coefficients a₋₁, a₋₂, …, a₋ₙ:

d = a₋₁2⁻¹ + a₋₂2⁻² + … + a₋ₙ2⁻ⁿ    (2.4)

Thus, if we multiply a fraction by 2, the integral part of the product (0 or 1) is the most significant bit of the binary equivalent of the fraction:

2 × d = a₋₁ + (a₋₂2⁻¹ + … + a₋ₙ2⁻ⁿ⁺¹) = a₋₁ + d₁,  d₁ < 1    (2.5)

The fractional part d₁ of the product may be multiplied again by 2 to obtain the next significant bit:

2 × d₁ = a₋₂ + (a₋₃2⁻¹ + … + a₋ₙ2⁻ⁿ⁺²) = a₋₂ + d₂,  d₂ < 1    (2.6)

The procedure is continued till the fractional part of the product is zero. The method is illustrated in Example 2.5.

EXAMPLE 2.5
Convert 0.859375 to binary.

    Decimal           Product    Binary
    0.859375 × 2  =  1.71875       1
    0.71875  × 2  =  1.4375        1
    0.4375   × 2  =  0.875         0
    0.875    × 2  =  1.75          1
    0.75     × 2  =  1.5           1
    0.5      × 2  =  1.0           1

Binary equivalent = 0.110111

Observe that a terminating decimal fraction might lead to a non-terminating binary fraction. Thus, in the algorithm below we have developed the binary fraction only up to a length of 8 bits. The method is given as Algorithm 2.2. This algorithm is used in Example 2.6 to convert 0.3 to binary.

ALGORITHM 2.2 Conversion of decimal fraction to binary fraction

var D: real {D is the decimal fraction to be converted to a binary fraction};
    P: real {P is an intermediate variable used during conversion};
    INTP: bit {integer part of P which can be either 0 or 1};
    B: bitstring {B is the binary fraction equivalent of D};
begin {of algorithm}
    Input D;
    B := null; {null is a null string}
    if (D = 0) then begin B := 0; goto 10 end;
    while (D ≠ 0) and (length (B) ≤ 9) do
        {length (B) returns the number of bits in B and we have limited it to 9}
    begin
        P := D * 2;
        INTP := Trunc (P); {Trunc (P) returns the integer part of P}
        B := Concatenate to right (INTP, B);
            {if INTP = 1 and B = 0.01 then Concatenate to right (INTP, B) yields 0.011}
        D := P – INTP
    end;
10: Output B;
end {of algorithm}.
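Again a short program makes the doubling procedure concrete. This Python sketch is our own rendering of Algorithm 2.2, with the same 8-bit cap; we use the standard decimal module so that an input such as 0.3 is held exactly in decimal rather than as a binary float:

from decimal import Decimal

def fraction_to_binary(d: Decimal, max_bits: int = 8) -> str:
    """Convert a decimal fraction 0 <= d < 1 to a binary fraction string
    by repeated multiplication by 2, as in Algorithm 2.2."""
    bits = ""
    while d != 0 and len(bits) < max_bits:
        d *= 2
        intp = int(d)        # integral part of the product: the next bit
        bits += str(intp)    # concatenate to the right
        d -= intp            # keep only the fractional part
    return "0." + (bits if bits else "0")

print(fraction_to_binary(Decimal("0.859375")))  # 0.110111 (Example 2.5)
print(fraction_to_binary(Decimal("0.3")))       # 0.01001100, the recurring pattern of Example 2.6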

EXAMPLE 2.6
Convert 0.3 to binary.

    Decimal      Product    Binary
    0.3 × 2  =  0.6            0
    0.6 × 2  =  1.2            1
    0.2 × 2  =  0.4            0
    0.4 × 2  =  0.8            0
    0.8 × 2  =  1.6            1
    0.6 × 2  =  1.2            1    (0.6 recurs beyond this point)
    0.2 × 2  =  0.4            0
    0.4 × 2  =  0.8            0

Thus, (0.3)₁₀ = 0.0100(1100) with the group 1100 recurring.

A similar procedure may be used to convert a decimal fraction to its Hex equivalent, as illustrated in Example 2.7. In this case the fraction is multiplied by 16 and the integral part of each product is the next Hex digit.

EXAMPLE 2.7

    Decimal       Product    Hex
    0.3 × 16  =  4.8           4
    0.8 × 16  =  12.8          C
    0.8 × 16  =  12.8          C    (recurring)

Thus, 0.3 = 0.4(C)Hex with C recurring.

2.4 BINARY CODED DECIMAL NUMBERS

We considered in the previous sections the methods of converting decimal numbers to binary form and vice versa. There is another method of representing decimal numbers using binary digits. This method is called binary coded decimal (BCD) representation. There are 10 symbols in the decimal system, namely, 0, 1, …, 9. Encoding is the procedure of representing each one of these 10 symbols by a unique string consisting of the two symbols, 0 and 1, of the binary system. It is further assumed that the same number of bits is used to represent any digit. The number of symbols which can be represented using n bits is 2ⁿ. Thus, in order to represent the 10 decimal digits we require at least four bits, as three bits will allow only 2³ = 8 possible distinct three-bit groups.

The method of encoding decimal numbers in binary is to make up a table of 10 unique four-bit groups and allocate one four-bit group to each decimal digit as shown in Table 2.3.

TABLE 2.3 Encoding Decimal Digits in Binary

    Decimal Digit    Binary Code
        0               0000
        1               0001
        2               0010
        3               0011
        4               0100
        5               0101
        6               0110
        7               0111
        8               1000
        9               1001

If we want to represent a decimal number, for example 15, using the code given in Table 2.3, we look up the table and get the binary code for 1 as 0001 and that for 5 as 0101, and code 15 by the binary code 0001 0101.

We must at this point distinguish carefully between encoding and conversion. 15 when converted to binary would be 1111, whereas when it is encoded each digit must be represented by a four-bit code and the encoding is 0001 0101. It should be observed that encoding requires more bits compared to conversion. On the average log₂ 10 = 3.32 bits per digit are required when decimal numbers are converted to binary; as compared with this, 4 bits per digit are needed in encoding. The ratio 4/3.32 is a measure of the extra bits (and consequently extra storage) required if an encoding is used. On the other hand, conversion of decimal to binary is slower compared to encoding. This is due to the fact that an algorithm involving successive division is needed for conversion, whereas encoding is by straightforward table look-up.

The slowness of conversion is not a serious problem in computations in which the volume of input/output is small. In business computers, where input/output dominates, BCD representation should be considered. In smaller digital systems such as desk calculators, digital clocks, etc., it is uneconomical to incorporate complex electronic circuits to convert decimal to binary and vice versa.
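The distinction between conversion and encoding is easy to see in code. The following Python sketch is our own illustration (the function name is not from the book); it encodes a decimal numeral digit by digit and contrasts the result with true conversion:

def to_bcd(number: str) -> str:
    """Encode a decimal numeral string in BCD (Table 2.3),
    four bits per digit, by table look-up."""
    return " ".join(format(int(d), "04b") for d in number)

print(to_bcd("15"))        # 0001 0101  (encoding)
print(format(15, "b"))     # 1111       (conversion)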

We need only 10 of the 16 possible four-bit groups for encoding decimal digits. There are about 3 × 10¹⁰ ways in which we can pick an ordered sequence of 10 out of the 16 four-bit groups (in other words, there are 16!/6! permutations of 10 items selected from 16). These many codes can thus be constructed. Fortunately, however, all these 3 × 10¹⁰ possible codes are not useful. Only a small number of them are used in practice and they are chosen from the viewpoint of ease in arithmetic, ease in coding, some error detection property, and any other property useful in a given application. The useful codes may be divided broadly into four classes:

1. Weighted codes
2. Self-complementing codes
3. Cyclic, Reflected or Gray codes
4. Error detecting and correcting codes

2.4.1 Weighted Codes

In a weighted code the decimal value of a code is the algebraic sum of the weights of those columns in which a 1 appears. In other words, d = Σ w(i)b(i), where the w(i)s are the weights and the b(i)s are either 0 or 1. The criterion in choosing weights is that we must be able to represent all the decimal digits from 0 through 9 using these weights. Three weighted codes are given in Table 2.4. The 8, 4, 2, 1 code uses the natural weights used in binary number representation; it is known as Natural Binary Coded Decimal or NBCD for short. The first 10 groups of four bits represent 0 through 9. The remaining six groups are unused and are illegal combinations. They may be used sometimes for error detection. In a weighted code we may have negative weights. Further, the same weight may be repeated twice as in the 2, 4, 2, 1 code.

TABLE 2.4 Examples of Weighted Codes

    Decimal    Weights        Weights*        Weights
    Digit      8 4 2 1        8 4 −2 −1       2 4 2 1
    0          0 0 0 0        0 0  0  0       0 0 0 0
    1          0 0 0 1        0 1  1  1       0 0 0 1
    2          0 0 1 0        0 1  1  0       0 0 1 0
    3          0 0 1 1        0 1  0  1       0 0 1 1
    4          0 1 0 0        0 1  0  0       0 1 0 0
    5          0 1 0 1        1 0  1  1       1 0 1 1
    6          0 1 1 0        1 0  1  0       1 1 0 0
    7          0 1 1 1        1 0  0  1       1 1 0 1
    8          1 0 0 0        1 0  0  0       1 1 1 0
    9          1 0 0 1        1 1  1  1       1 1 1 1

    *A minus sign indicates a negative weight.
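The defining equation d = Σ w(i)b(i) can be checked mechanically. A small Python sketch (our own illustration) decodes a four-bit group under a given weight vector and verifies one column of Table 2.4:

def weighted_value(weights, bits):
    """Decimal value of a code word: the algebraic sum of the weights
    of the columns in which a 1 appears (d = sum of w(i)*b(i))."""
    return sum(w * b for w, b in zip(weights, bits))

# The 8, 4, -2, -1 column of Table 2.4:
code_8_4_m2_m1 = [(0,0,0,0), (0,1,1,1), (0,1,1,0), (0,1,0,1), (0,1,0,0),
                  (1,0,1,1), (1,0,1,0), (1,0,0,1), (1,0,0,0), (1,1,1,1)]
for d, bits in enumerate(code_8_4_m2_m1):
    assert weighted_value([8, 4, -2, -1], bits) == d
print("all ten digits decode correctly")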

2.4.2 Self-Complementing Codes

If a code is constructed such that when we replace a 1 by a 0 and a 0 by a 1 in the four-bit code representation of a digit d we obtain the code for (9 − d), it is called a self-complementing code. A necessary condition for a self-complementing weighted code is that the sum of its weights should be 9. For example, the 2, 4, 2, 1 and the 8, 4, −2, −1 codes are self-complementing. Table 2.5 depicts a self-complementing code.

TABLE 2.5 A Self-Complementing Weighted Code (weights 2, 4, 2, 1)

    d    Code for d    Code for 9−d    9−d
    0    0 0 0 0       1 1 1 1          9
    1    0 0 0 1       1 1 1 0          8
    2    0 0 1 0       1 1 0 1          7
    3    0 0 1 1       1 1 0 0          6
    4    0 1 0 0       1 0 1 1          5
    5    1 0 1 1       0 1 0 0          4
    6    1 1 0 0       0 0 1 1          3
    7    1 1 0 1       0 0 1 0          2
    8    1 1 1 0       0 0 0 1          1
    9    1 1 1 1       0 0 0 0          0

2.4.3 Cyclic Codes

A special problem arises when we want to code a continuously varying analog signal into a digital form. An example of this would be the reading of the shaft angle of a rotating machine when it is in motion. One way of digitizing the shaft positions would be to attach a wiper with three brushes to the shaft and let this sweep a circular disc which has insulating and conducting parts to represent the binary 1 and 0. Figure 2.2 shows such a disc.

FIGURE 2.2 Digital encoding of a shaft position.

If the disc is coded such that two brushes simultaneously change from conducting to non-conducting segments (or vice versa), there is a possibility (due to misalignment, wearing out of brushes, etc.) that one brush may touch the non-conducting segment earlier than the other. This would give a wrong output for a short time and may not be allowed in some situations. So a coding technique is used such that not more than one bit varies from one code to the next. Such a code is called a Gray code, cyclic code or reflected code. In a cyclic code, each code group does not differ from its neighbour in more than one bit.

To formalize this concept we will define what is known as the Hamming distance, after its inventor R.W. Hamming. The Hamming distance between two equal length binary sequences of 1s and 0s is the number of positions in which they differ. For example, if A = 0 1 1 0 and B = 1 0 1 0, the Hamming distance between A and B is two, as they differ in their first and second positions counting from the left.
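Hamming distance is simple to compute. A Python sketch (our own illustration):

def hamming_distance(a: str, b: str) -> int:
    """Number of positions in which two equal-length bit strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("0110", "1010"))    # 2, as in the example above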

Hamming distance between two successive code groups in a cyclic code is unity. In other words, each code group is adjacent to the next in sequence. Consider the map or grid shown in Figure 2.3. Each square in this map represents a four-bit code. This map is coded to ensure that each square is at a unit distance from its adjacent square. For example, the square in the second row, third column represents the code 0 1 1 1. Squares in the nth column of the top row are adjacent to those in the same column in the bottom row. For example, the fourth square in the top row is 0 0 1 0 and the fourth square in the bottom row is 1 0 1 0, and the Hamming distance between them is unity. Similarly, squares in the first column are adjacent to the squares in the last column. Thus, in this map each square has four neighbours. The four neighbours of 0 0 0 0, for example, are 0 0 0 1, 0 0 1 0, 0 1 0 0 and 1 0 0 0.

FIGURE 2.3 A Karnaugh map depicting a cyclic code.

This map is called a Karnaugh map and can be used to construct a large number of cyclic codes. One such code can be constructed as shown in Figure 2.3. Start assigning the code to a decimal digit, say 0, and proceed as indicated by the arrow and the mark '°' in Figure 2.3. This gives the coding scheme shown in Table 2.6.

TABLE 2.6 A Cyclic Code

    Decimal Digit    Code
    0                0 0 0 0
    1                0 0 0 1
    2                0 0 1 1
    3                0 0 1 0
    4                0 1 1 0
    5                1 1 1 0
    6                1 0 1 0
    7                1 0 1 1
    8                1 0 0 1
    9                1 0 0 0

2.4.4 Error Detecting Codes

Extremely reliable storage and transmission of data between units is required in digital computing systems. When data is transmitted between units in the system some bits might be corrupted due to noise, dust, etc. For example, storage of data in magnetic tapes is prone to error due to an uneven magnetic surface. Thus, some method of detection of errors in codes is commonly used in digital systems. The main principle used in constructing error detecting codes is to use redundant bits in codes with the specific purpose of detecting errors. The method used is to ensure that the Hamming distance between any two codes in the set of codes is a pre-assigned minimum. If the minimum Hamming distance between any two codes is two, then if a single error occurs in one of the codes this can be detected, because the corrupted code will be different from any of the codes allowed in the group.

One common method of constructing such codes is the introduction of an extra bit in the code. For example, suppose 0 0 1 1 is the code for 3. We may introduce a fifth bit such that the total number of ones in this five-bit group is odd. This bit is called the parity bit. If in a code the total number of ones is not odd, then we can conclude that it has a single error. We cannot detect two errors, as the odd parity will be satisfied with two or any even number of errors. We can, however, detect any odd number of errors. We may also introduce an extra bit in the code to make the total number of ones in the code even. This is called an even parity bit.

An 8, 4, 2, 1 code with an added odd parity bit is shown in Table 2.7. A code with an even parity is also illustrated in Table 2.7. The reader should check that the minimum Hamming distance of each coding scheme is 2; that is, the minimum distance between any two 5-bit codes obtained by concatenating the 4 code bits with the parity bit in Table 2.7 is 2.
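Parity generation and checking are one-liners in software. A Python sketch (our own illustration, using even parity):

def with_even_parity(code: str) -> str:
    """Append a parity bit that makes the total number of 1s even."""
    return code + str(code.count("1") % 2)

def parity_ok(word: str) -> bool:
    """An even-parity word is valid when it contains an even number of 1s."""
    return word.count("1") % 2 == 0

w = with_even_parity("0011")   # code for 3 becomes '00110'
print(w, parity_ok(w))         # 00110 True
print(parity_ok("00100"))      # False: a single-bit error is detected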

TABLE 2.7 Illustrating Codes with Odd and Even Parity

    Digit   8 4 2 1 Code   Odd Parity Bit   Non-Weighted Code   Even Parity Bit
    0       0 0 0 0        1                1 1 0 0             0
    1       0 0 0 1        0                0 0 0 1             1
    2       0 0 1 0        0                0 0 1 0             1
    3       0 0 1 1        1                0 0 1 1             0
    4       0 1 0 0        0                0 1 0 0             1
    5       0 1 0 1        1                0 1 0 1             0
    6       0 1 1 0        1                0 1 1 0             0
    7       0 1 1 1        0                1 0 0 0             1
    8       1 0 0 0        0                1 0 0 1             0
    9       1 0 0 1        1                1 0 1 0             0

Parity checking codes, as these are called, have found widespread application because checking can be easily mechanized.

2.4.5 Error Correcting Codes

Coding schemes may be devised which not only detect errors, but also automatically correct them. Consider, for example, 36 bits recorded on a magnetic tape in six tracks with six bits along each track. Suppose that each code word has a seventh bit added as an odd parity bit. Further, suppose an extra code group is recorded on a seventh track and it is devised to give an odd parity on the columns. If any bit in the group is erroneously transmitted, it will cause simultaneous failure of parity on a row and a column. The row and the column automatically fix the position of the erroneous bit and thus it can be corrected. Table 2.8 illustrates this.

TABLE 2.8 Illustrating an Error Correcting Code (the sent data, six 7-bit code words with row parity bits and a column parity word; and the received data, in which one corrupted bit produces a parity failure in exactly one row and one column, locating the error).
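The row-and-column scheme is easy to demonstrate in a few lines. The following Python sketch is our own illustration; for brevity it uses even parity, and comparing the parities of the sent and received arrays stands in for checking the recorded parity bits:

def parity(bits):
    """0 if the number of 1s is even, 1 otherwise."""
    return sum(bits) % 2

def locate_single_error(sent, received):
    """A single flipped bit shows up as exactly one bad row and one bad column;
    their intersection is the position of the error."""
    bad_rows = [i for i, (s, r) in enumerate(zip(sent, received))
                if parity(s) != parity(r)]
    bad_cols = [j for j in range(len(sent[0]))
                if parity([row[j] for row in sent]) != parity([row[j] for row in received])]
    return bad_rows[0], bad_cols[0]

sent = [[0, 1, 1, 0],
        [1, 0, 0, 1],
        [0, 0, 1, 1]]
received = [row[:] for row in sent]
received[1][2] ^= 1                          # corrupt one bit in transit
print(locate_single_error(sent, received))   # (1, 2): row and column of the error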

2.5 HAMMING CODE FOR ERROR CORRECTION

In the last section we saw that by adding a single parity bit to a code we can detect a single error in the code. The error detection is possible as the addition of the parity bit creates a minimum Hamming distance of two between any two codes. The error cannot, however, be corrected. In order to correct a single error we should know where the error occurred in the composite code, namely, the sum of data bits and check bits. Hamming showed that by systematically introducing more parity bits in the code it is not only possible to detect an error, but also to find out where the error occurred and correct it. More than one error can also be detected and corrected by increasing the code length with more parity bits and thereby increasing the minimum Hamming distance between codes. Thus, a single error will not map a code into any one of the other legitimate codes in the group.

We will now examine how a Hamming code to detect and correct one error can be constructed. Suppose we want to add parity bits to the 8, 4, 2, 1 code to make it a single error correction code. With four data bits we need at least three parity bits, so that the parity bits can be used to find out the error position in the seven-bit code (including the parity bits). The code is constructed as follows: The individual bits in the seven-bit code are numbered from 1 to 7 as shown in Table 2.9. Bit positions 1, 2 and 4 are used as parity check bits and bits 3, 5, 6, 7 as data bits. The bit at position 1 is set so that it satisfies an even parity for bits 1, 3, 5, 7. Bit 2 is set to satisfy an even parity on bits 2, 3, 6, 7. Bit 4 is set so that it satisfies an even parity on bits 4, 5, 6, 7.

TABLE 2.9 A Single Error Correcting Hamming Code (Di: Data bits, Pi: Parity bits)

    Digit   P1  P2  D3  P4  D5  D6  D7    (bit positions 1 to 7)
    0       0   0   0   0   0   0   0
    1       1   1   0   1   0   0   1
    2       0   1   0   1   0   1   0
    3       1   0   0   0   0   1   1
    4       1   0   0   1   1   0   0
    5       0   1   0   0   1   0   1
    6       1   1   0   0   1   1   0
    7       0   0   0   1   1   1   1
    8       1   1   1   0   0   0   0
    9       0   0   1   1   0   0   1

When a code is received, the following procedure is used to detect and correct the error:
1. Check even parity on positions 1, 3, 5, 7. If it passes, then C1 = 0, else C1 = 1.
2. Check even parity on positions 2, 3, 6, 7. If it passes, then C2 = 0, else C2 = 1.

3. Check even parity on positions 4, 5, 6, 7. If it passes, then C4 = 0, else C4 = 1.

If C4C2C1 = 0 then there is no error in the code. Otherwise, the decimal equivalent of C4C2C1 gives the position of the incorrect bit and that bit is corrected. The bit positions which are checked by the individual parity bits may be derived by observing that the decimal equivalent of C4C2C1 should specify the position of the error in the code. Thus, by examining Table 2.10 we see that C1 will be 1 whenever bit 1, 3, 5 or 7 is incorrect, C2 will be 1 whenever bit 2, 3, 6 or 7 is incorrect, and C4 will be 1 whenever bit 4, 5, 6 or 7 is wrong. Thus, these bits are used as check bits for the corresponding group of bits.

TABLE 2.10 Illustrating Check Bit Construction in Hamming Code

    Error in bit number    C4   C2   C1
    No error               0    0    0
    1                      0    0    1
    2                      0    1    0
    3                      0    1    1
    4                      1    0    0
    5                      1    0    1
    6                      1    1    0
    7                      1    1    1

    (C1 checks bits 1, 3, 5, 7; C2 checks bits 2, 3, 6, 7; C4 checks bits 4, 5, 6, 7.)

Observe that the method of construction of the Hamming code illustrated above ensures that there is a minimum distance of three between any two codes in the group. Only a single error can be detected and corrected. If two errors occur, this can be detected as a parity or parities will fail; the errors cannot, however, be corrected.

EXAMPLE 2.8
Suppose the following code is received:

    Bit positions:   1 2 3 4 5 6 7
    Received code:   0 1 1 0 1 1 0

Even parity in positions 1, 3, 5, 7 passes. Thus, C1 = 0.
Even parity in positions 2, 3, 6, 7 fails. Thus, C2 = 1.
Even parity in positions 4, 5, 6, 7 passes. Thus, C4 = 0.

The position of the error is thus C4C2C1 = 010 = 2. The correct code is thus:

    0 0 1 0 1 1 0
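The whole check-and-correct cycle fits in a short program. The following Python sketch is our own illustration, using the book's bit numbering with parity bits at positions 1, 2 and 4:

def hamming_check(word):
    """word: list of 7 bits at positions 1..7 (indices 0..6).
    Returns (corrected word, error position or 0)."""
    c1 = (word[0] + word[2] + word[4] + word[6]) % 2   # checks bits 1, 3, 5, 7
    c2 = (word[1] + word[2] + word[5] + word[6]) % 2   # checks bits 2, 3, 6, 7
    c4 = (word[3] + word[4] + word[5] + word[6]) % 2   # checks bits 4, 5, 6, 7
    pos = 4 * c4 + 2 * c2 + c1                          # decimal value of C4C2C1
    if pos:
        word = word[:]
        word[pos - 1] ^= 1                              # flip the erroneous bit
    return word, pos

received = [0, 1, 1, 0, 1, 1, 0]                        # Example 2.8
corrected, pos = hamming_check(received)
print(pos, corrected)   # 2 [0, 0, 1, 0, 1, 1, 0]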

In general, if a code group has a minimum distance of L, then the following inequality holds:

(C + D) ≤ (L − 1)

where D is the number of errors detected and C is the number of errors corrected. (Remember that an error cannot be corrected unless it is first detected!) Thus, a distance of four code can correct at most one error and detect two errors.

The number of parity bits k required to detect and/or correct errors in codes with i data bits is given by the inequality:

(2ᵏ − 1) ≥ (i + k)

As the k parity bits must give the position of the error, the maximum decimal number which can be represented by k bits should at least equal the total length of the code, namely, the sum of data bits and check bits.
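The inequality above is easy to evaluate mechanically. A Python sketch (our own illustration) finds the smallest number of check bits k for a given number of data bits i:

def min_parity_bits(i: int) -> int:
    """Smallest k satisfying 2**k - 1 >= i + k (single error correction)."""
    k = 1
    while 2 ** k - 1 < i + k:
        k += 1
    return k

print(min_parity_bits(4))   # 3, as used above for the 8, 4, 2, 1 code
print(min_parity_bits(7))   # 4: a seven-bit ASCII character needs 4 check bits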

It should be remembered that Hamming codes are designed to guard against errors occurring at random positions in the code. Very often, while data is transmitted between different points, a burst of errors occurs due to reasons such as voltage fluctuation, lightning, etc. Errors in reading from tapes and discs may also often have bursts of error due to dust particles. Special coding methods for burst error correction are needed in such cases, but that is not discussed in this book.

2.6 ALPHANUMERIC CODES

In the previous sections we saw how numerical data is represented in computers. The versatility of modern computers arises due to their ability to process a variety of data types. Types of data may be classified as shown in Figure 2.4.

FIGURE 2.4 Types of data (text, picture, audio and video).

Textual data consists of alphabets, special characters and numbers when not used in calculations (e.g. a telephone number). This is the simplest data type. Historically, it was the first type of data processed by digital computers. Picture data are line drawings, photographs (both monochrome and colour), handwritten data, fingerprints, medical images, etc. They are two dimensional and time invariant. Audio data are sound waves such as speech and music. They are continuous and time varying signals. Video data are moving pictures such as those taken by movie cameras. Like audio, video data is also time varying. In this book we will describe the representation of numeric and textual data. We will not describe other data types and their representation.

As stated, textual data consists of alphabets and special characters besides decimal numbers. These are normally the 26 English letters, the 10 decimal digits and several special characters such as +, −, ×, /, $, etc. In order to code these with binary numbers, one needs a string of binary digits. In the older computers, the total number of characters was less than 64 and a string of six bits was used to code a character. In all current machines the number of characters has increased. Besides capital letters of the alphabet, the lower case letters are also used, and several mathematical symbols such as >, <, ≥, etc. have been introduced. This has made the six-bit byte inadequate to code all characters. New coding schemes use seven or eight bits to code a character. With seven bits we can code 128 characters, which is quite adequate. In order to ensure uniformity in coding characters, a standard seven-bit code, ASCII (American Standard Code for Information Interchange), has been evolved.

2.6.1 ASCII Code

The ASCII code is used to code two types of data. One type is the printable characters such as digits, letters and special characters. The other set is known as control characters, which represent coded data to control the operation of digital computers and are not printed (e.g. DEL, ESC). The ASCII code (in hexadecimal) is given as Table 2.11.

TABLE 2.11 ASCII Code for Characters (row: least significant Hex digit; column: most significant Hex digit)

    Hex  0    1    2    3   4   5   6   7
    0    NUL  DLE  SP   0   @   P   `   p
    1    SOH  DC1  !    1   A   Q   a   q
    2    STX  DC2  "    2   B   R   b   r
    3    ETX  DC3  #    3   C   S   c   s
    4    EOT  DC4  $    4   D   T   d   t
    5    ENQ  NAK  %    5   E   U   e   u
    6    ACK  SYN  &    6   F   V   f   v
    7    BEL  ETB  '    7   G   W   g   w
    8    BS   CAN  (    8   H   X   h   x
    9    HT   EM   )    9   I   Y   i   y
    A    LF   SUB  *    :   J   Z   j   z
    B    VT   ESC  +    ;   K   [   k   {
    C    FF   FS   ,    <   L   \   l   |
    D    CR   GS   -    =   M   ]   m   }
    E    SO   RS   .    >   N   ^   n   ~
    F    SI   US   /    ?   O   _   o   DEL

It may be observed from Table 2.11 that the codes for the English letters are in the same sequence as their lexical order, that is, the order in which they appear in a dictionary. The hexadecimal codes of A, B, C, D, E, …, Z are respectively 41, 42, 43, 44, 45, …, 5A. This choice of codes is useful for alphabetical sorting, searching, etc.

A parity bit may be added to the seven-bit ASCII character code to yield an eight-bit code. A group of eight bits is known as a byte. An abbreviation B is universally used for bytes. We will henceforth use B for byte. A byte is sufficient to represent a character or to represent two binary coded decimal digits. It is also possible to add redundant bits to a seven-bit ASCII code to make it an error correcting code. At least four check bits are required to detect and correct a single error. The construction of Hamming codes for characters is left as an exercise to the reader.

2.6.2 Indian Script Code for Information Interchange (ISCII)

ASCII has been standardized for English letters. It is necessary to have a standard coding scheme for the Indian scripts to use computers to process information using Indian languages. This has been done by the Indian Standards Organization, which has published a document IS: 13194-91 on this. The approach followed in the Indian Standard is to have a common code and keyboard for all the Indian scripts. The Indian Standard maintains the seven-bit code for English letters exactly as in ASCII and allows eight-bit code extensions for other scripts. The extension for Indian scripts starts from hexadecimal A1 and extends up to hexadecimal FA. Thus, English can co-exist with Indian scripts[14]. This standard conforms to International Standard ISO 2022: 1982 entitled "7-bit and 8-bit coded character set code extension technique".

The standard English keyboard of terminals is maintained for English. An overlay is designed on this keyboard for Indian scripts. An optimal keyboard overlay for all Indian scripts has been designed keeping in view the phonetic nature of Indian languages. The code is given in Table 2.12. More details regarding the code and the standard used in keyboard layout may be obtained from the reference given at the end of the book.

TABLE 2.12 ISCII Code for Devnagari Characters (the table assigns the Devnagari vowels, consonants and matras to the eight-bit codes A1 through FA; the Devnagari glyphs cannot be reproduced here).

Recently, a new coding scheme for characters called Unicode has been standardized specifically to accommodate a large number of symbols of languages other than English and mathematical symbols such as Þ. It uses 16 bits (two bytes) for each character. As 2¹⁶ = 65536, the number of different types of characters which can be coded in Unicode is enormous. Thus, virtually every character of every language in the world can be represented in this international standard code. The first 128 codes of Unicode are identical to ASCII. It is thus compatible with ASCII. Details of Unicode may be obtained from the website www.standard.com/unicode.htm.

SUMMARY

1. In a digital system all data is represented by a string of 0s and 1s called bits (binary digits). Binary representation is used because electronic circuits perform most reliably when processing data in one out of two distinct states.
2. Most physical devices which are available today to store data do so by being in either one of two stable states.
3. A number aₙ aₙ₋₁ aₙ₋₂ … a₁ a₀ . a₋₁ a₋₂ … a₋ₘ in a positional number system is interpreted as:

   aₙrⁿ + aₙ₋₁rⁿ⁻¹ + … + a₁r¹ + a₀r⁰ + a₋₁r⁻¹ + a₋₂r⁻² + … + a₋ₘr⁻ᵐ

   where r is called the radix of the system and a₀, a₁, …, are symbols chosen from a set of r symbols (0, 1, 2, …, r − 1).

4. In the binary number system the radix r is 2 and the symbols are 0 and 1. For the decimal system the radix r is 10 and the symbols are 0, 1, …, 9.
5. In the hexadecimal number system the radix r is 16 and the symbols are 0, 1, 2, …, 9, A, B, C, D, E, F.
6. A decimal integer is converted to its binary equivalent by dividing it and successive quotients by 2 until the last quotient is 0. The first remainder is the least significant bit of the binary equivalent and the last remainder is the most significant bit.
7. Conversion of a decimal integer to hexadecimal is similar, but instead of dividing by 2, we divide by 16.
8. A decimal fraction is converted to its binary equivalent by multiplying it by 2. The integer part of the first product is the most significant bit of the binary fraction. The fractional part of the product is multiplied by 2 again and the procedure is repeated. A terminating decimal fraction need not have a terminating binary fractional equivalent.
9. Conversion of a decimal fraction to hexadecimal is done by multiplying the fraction by 16 instead of 2.
10. A decimal number is encoded in binary by picking 10 out of the 16 four-bit strings and using one to code each digit.
11. The four important codes are: weighted codes, self-complementing codes, cyclic codes and error detecting/correcting codes.
12. In weighted codes each bit position has a weight: d = Σ w(i)b(i), where the w(i)s are the weights and the b(i)s are either 0 or 1. If the weights are 8, 4, 2 and 1, it is called Natural Binary Coded Decimal.
13. In self-complementing codes, the code of (9 − d) (where d is a decimal digit) is obtained by complementing each bit of the code for d.
14. The Hamming distance between two binary codes is the number of corresponding bits in which they differ. For example, the distance between 01010 and 11001 is 1 + 0 + 0 + 1 + 1 = 3.
15. In cyclic codes, the distance between successive codes is exactly 1.
16. An even/odd parity bit is a bit appended to a code group to make the total number of 1s in the code even/odd. For example, the even parity bit to be appended to code 01000 is 1 and the code with even parity is 010001.
17. Appending a parity bit to a group of codes makes the distance between the codes at least 2. A single error in any of the codes can be detected if parity fails. Hamming codes are used not only to detect, but also to correct errors.
18. The number of bits k required to detect/correct errors in codes with i data bits is given by the inequality (2ᵏ − 1) ≥ (i + k). The number of bits in each code group is i + k.

19. Characters, namely, English letters A, B, C, …, Z (both upper and lower case), numbers and special symbols such as @, +, etc. found in a standard keyboard, are represented by a standard code called ASCII (American Standard Code for Information Interchange). Each code consists of seven bits.
20. Besides the printable characters mentioned above, ASCII also has codes for non-printable control characters such as enter, escape, etc.
21. Characters of Indian languages are coded using a code called ISCII (Indian Script Code for Information Interchange). It is an eight-bit code and maintains the seven-bit code for English letters and other characters exactly as in ASCII.
22. Recently, a new coding scheme called Unicode has been standardized specifically to accommodate many languages such as Chinese, Japanese and many special characters such as the symbol Þ used in mathematics. ASCII is a subset of Unicode.

EXERCISES

1. Convert the following decimal numbers to binary:
   (a) 49.32  (b) 0.83  (c) 94.00625
2. Convert the following binary numbers to decimal and octal (base 8) forms:
   (a) 101101110  (b) 11011.0101  (c) 1.0111
3. Convert the following binary numbers to their equivalent decimal and hexadecimal (base 16) representation:
   (a) 101101.0101  (b) 1010.01  (c) 10.011101
4. Convert the following decimal numbers to base three and to base five:
   (a) 73  (b) 10.333  (c) 21.25
5. Obtain an algorithm to find all allowable weights for a weighted BCD code. Assume that all weights are positive integers. Use the algorithm to obtain all such sets of weights. (Observe that sets of weights such as 4, 3, 1, 1 and 3, 4, 1, 1 should be taken to be one set of weights.)

6. Obtain an algorithm to find all allowable weights for a weighted self-complementing BCD code. The weights may be positive or negative. Use the algorithm to obtain all such sets of weights. (Permutations of weights should not be taken to be distinct.)
7. Using a Karnaugh map, construct a cyclic code for decimal digits such that the total number of 1s in the code is minimum.
8. Code base five numbers in a cyclic code.
9. Base four numbers are coded by the following codes: 000, 011, 101, 110. What is the minimum distance of this code group? How many errors can be detected if this coding scheme is used? How many errors may be corrected?
10. Devise a single error correcting code for ASCII coded characters.
11. There are four types of nucleotides named A, C, G and T. A DNA molecule is made up of a linear sequence of any four nucleotides picked from this group. How many bits of information can be stored in a DNA molecule?
12. Estimate the number of words in a printed page and determine the number of bits required to encode the information in ASCII.
13. Decode the following ASCII text:
    1000010 1001100 1001100 1010111 1000101 1001100 1001001 1001100 1010011
14. Obtain an algorithm to detect and correct single errors in Hamming coded ASCII characters.
15. Devise a single error correcting Hamming code for decimal numbers represented in 8421 code.
16. How many parity bits are required for a double error correcting code for ASCII characters?
17. Decode the following ISCII coded text (in Hex) in Hindi:
    D0 A5 BA A5 D0 A5 CC A4 C6
18. Devise a single error correcting code for ISCII coded characters.
19. What are the major differences between ASCII and Unicode? What are the advantages and disadvantages of using Unicode instead of ASCII?

3

BASICS OF DIGITAL SYSTEMS

LEARNING OBJECTIVES

In this chapter we will learn:

• The basic postulates and theorems of Boolean algebra.
• Standard forms of representing Boolean functions of n variables.
• The need to minimize a given Boolean function to an equivalent expression which uses the minimum number of literals.
• A simple graphical method called the Veitch–Karnaugh method which is used to reduce Boolean expressions with four to five variables to an equivalent expression with the minimum number of literals.
• How to realize AND, OR, NOT gates with NAND or NOR gates and convert AND, OR, NOT based combinatorial circuits to ones using either only NAND gates or only NOR gates.
• The applications of multiplexers and demultiplexers.
• Programmable logic devices and how they are used to realize complex combinatorial circuits.
• The difference between combinatorial and sequential switching circuits.
• Various types of flip-flops.
• How to design registers with flip-flops.

In this chapter we will briefly review some aspects of digital logic. This is intended as review material to study the part of the book dealing with computer organization. For details of digital logic, the readers are urged to read Digital Logic and Computer Organization [31] written by the authors of this book.

3.1 BOOLEAN ALGEBRA

Switching algebra was developed by Shannon in 1938 for designing telephone switching systems. It is a subset (for two-valued variables) of a more general algebra known as Boolean algebra developed by Boole (1815–64). Switching algebra uses a limited number of operators to connect variables together to form expressions and makes it easy to build electronic circuits to implement these operators. It is thus necessary to learn this algebra as it is the appropriate modelling tool to design digital systems.

3.1.1 Postulates of Boolean Algebra

Boolean algebra is introduced by first defining a set of elements allowed in this algebra, a set of operators which operate with these elements as operands, and a set of axioms or postulates. Postulates are not proved but assumed to be true. The theorems of the algebra are derived from these postulates.

A set of elements is any collection of objects having a common property. If K is a set and a and b are objects, then the notation a, b ∈ K is used to indicate that a and b are members of the set K. The notation a ∉ K indicates that a is not a member of the set K.

A binary operator defined over a set K is a rule that assigns to each pair of elements taken from K another element in K. Let a, b be two elements belonging to K. Then * is a binary operator if for every a, b ∈ K, a * b = c and c ∈ K. If c ∉ K, then * is not a binary operator. A unary operator is a rule that assigns for any element belonging to K another element in K. For example, @ is a unary operator if for every d ∈ K, @d = h and h ∈ K.

For the formal definition of Boolean algebra we will use the postulates given by Huntington. Boolean algebra is defined on a set of elements K, together with two binary operators + and ·, and a unary operator – (a bar over the variable), for which the following postulates are satisfied:

Postulate 1: (a) An operator '+' is defined such that if c = a + b, then c ∈ K for every pair of elements a, b ∈ K.
(b) An operator '·' is defined such that if d = a · b, then d ∈ K for every pair of elements a, b ∈ K.

Postulate 2: (a) There exists an element 0 in K such that a + 0 = a for every element a ∈ K.
(b) There exists an element 1 in K such that a · 1 = a for every element a ∈ K.

Postulate 3: For a, b ∈ K, the following commutative laws hold:
(a) a + b = b + a
(b) a · b = b · a

Postulate 4: For a, b, c ∈ K, the following distributive laws hold:
(a) a · (b + c) = a · b + a · c
(b) a + (b · c) = (a + b) · (a + c)

Postulate 5: For every element a ∈ K, there exists an element ā (read a bar) such that
(a) a + ā = 1
(b) a · ā = 0
(Observe that – (bar) is a unary operator operating on the element a.)

Postulate 6: There are at least two elements a, b ∈ K such that a ≠ b.

The two binary operators '·' and '+' are known as the 'AND' operator and the 'OR' operator respectively. The unary operator is known as the 'NOT' operator. This algebra is useful in digital systems mainly because simple electronic circuits exist to implement the AND, OR and NOT operations. We will introduce these circuits later in this chapter. As we will be exclusively dealing with Switching algebra in this book, we will use the terms Boolean algebra and Switching algebra as synonyms.

3.1.2 Basic Theorems of Boolean Algebra

We will consider a special case of Boolean algebra with the following definitions:

Definition 1: The set K has two elements: 0 and 1.

Definition 2(a): The rules of operation with the operator '·' are given in Table 3.1(a).

Definition 2(b): The rules of operation with the operator '+' are given in Table 3.1(b).

TABLE 3.1(a) Rules for '·' Operator        TABLE 3.1(b) Rules for '+' Operator

    a   b   a · b                              a   b   a + b
    0   0     0                                0   0     0
    0   1     0                                0   1     1
    1   0     0                                1   0     1
    1   1     1                                1   1     1

Definition 3: The complement operation is defined as: 0̄ = 1, 1̄ = 0.

The student can verify that the above definition of the set K and the operators '+', '·' and complementing satisfy Huntington's postulates.

3.1.3 Duality Principle

Observe that in Section 3.1.1 the first five postulates of Huntington were listed in two parts (a) and (b). One part may be obtained from the other if '+' is interchanged with '·' and '0' is interchanged with '1'. This important property of Boolean algebra is known as the duality principle. This principle ensures that if a theorem is proved based on the postulates of the algebra, then a dual theorem obtained by interchanging '+' with '·' and '0' with '1' automatically holds and need not be proved separately.

3.1.4 Theorems

As Boolean algebra deals with a set consisting of only two elements, it is, in principle, possible to prove every theorem by exhaustive enumeration, that is, by considering all possible cases. Sometimes it is easier to prove a theorem using the postulates and some of the theorems proved earlier. Every theorem will have a dual due to the fact that the duality principle holds.

Theorem 1(a): a + a = a
Proof: When a = 0, 0 + 0 = 0 = a by Definition 2(b), and when a = 1, 1 + 1 = 1 = a by Definition 2(b). As a = 0 or a = 1 and in both cases a + a = a, the theorem holds.

Theorem 1(b): a · a = a (Dual theorem)
Theorem 2(a): a + 1 = 1
Theorem 2(b): a · 0 = 0 (Dual theorem)
Theorem 3(a): a + (a · b) = a
Theorem 3(b): a · (a + b) = a (Dual theorem)
Theorem 4: (ā)‾ = a
Theorem 5(a): a + (ā · b) = a + b
Theorem 5(b): a · (ā + b) = a · b (Dual theorem)
Theorem 6(a): (a + b)‾ = ā · b̄
Proof: Table 3.2 is used to show that the left-hand side equals the right-hand side for all values of a and b (columns 4 and 7 are identical).

TABLE 3.2 Proof of Theorem 6(a)

    a   b   a + b   (a + b)‾   ā   b̄   ā · b̄
    0   0     0        1       1   1     1
    0   1     1        0       1   0     0
    1   0     1        0       0   1     0
    1   1     1        0       0   0     0

Theorem 6(b): (a · b)‾ = ā + b̄ (Dual theorem)
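Exhaustive enumeration is mechanical, so it can be delegated to a program. A Python sketch (our own illustration, representing complement as 1 − x) proves Theorems 3, 4 and 6 over all cases:

for a in (0, 1):
    assert 1 - (1 - a) == a                       # Theorem 4: double complement
    for b in (0, 1):
        assert a | (a & b) == a                   # Theorem 3(a): absorption
        assert 1 - (a | b) == (1 - a) & (1 - b)   # Theorem 6(a)
        assert 1 - (a & b) == (1 - a) | (1 - b)   # Theorem 6(b), its dual
print("theorems verified for all values of a and b")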

Theorems 6(a) and (b) are important theorems and are very useful. They are known as DeMorgan's Laws. They can be extended to n variables as given below:

(a₁ + a₂ + a₃ + … + aₙ)‾ = ā₁ · ā₂ · ā₃ · … · āₙ
(a₁ · a₂ · a₃ · … · aₙ)‾ = ā₁ + ā₂ + ā₃ + … + āₙ

Theorem 7(a): a + (b + c) = (a + b) + c (Associative law)
This is proved by forming truth tables for the left-hand side and the right-hand side and showing that they are identical.

Theorem 7(b): a · (b · c) = (a · b) · c (Dual theorem)

All the important postulates and theorems of Boolean algebra are listed in Table 3.3. The student should become thoroughly conversant with this table in order to use the algebra effectively.

TABLE 3.3 Summary of Postulates and Theorems

    Reference     Primary                          Dual
    Postulate 2   a + 0 = a                        a · 1 = a
    Postulate 3   a + b = b + a                    a · b = b · a
    Postulate 4   a · (b + c) = a · b + a · c      a + (b · c) = (a + b) · (a + c)
    Postulate 5   a + ā = 1                        a · ā = 0
    Theorem 1     a + a = a                        a · a = a
    Theorem 2     a + 1 = 1                        a · 0 = 0
    Theorem 3     a + a · b = a                    a · (a + b) = a
    Theorem 4     (ā)‾ = a
    Theorem 5     a + ā · b = a + b                a · (ā + b) = a · b
    Theorem 6     (a + b)‾ = ā · b̄                 (a · b)‾ = ā + b̄
    Theorem 7     a + (b + c) = (a + b) + c        a · (b · c) = (a · b) · c

3.2 BOOLEAN FUNCTIONS AND TRUTH TABLES

A Boolean function is defined as follows:
1. A set of Boolean variables is taken as independent variables.
2. Another Boolean variable is assigned the role of a dependent variable.
3. A rule is formulated which assigns a value to the dependent variable for each set of values of the independent variables.

In mathematical notation we may write:

z = f(a, b, c)    (3.1)

where a, b, c are the independent variables, z is the dependent variable and the rule assigning values to z for each set of values of a, b, c is denoted by f. It is important to note that z can also assume only one of the two values, 0 or 1. It is feasible in Boolean algebra to define a function in this manner, as each independent variable can assume only one of the two values (0 or 1) and exhaustive enumeration is feasible.

EXAMPLE 3.1
For the function z = a + b̄ · c, the rule denoted by f is "NOT b first, AND with c next, OR with a". (In other words, form a + b̄ · c.) The value of z for each combination (or set) of values of a, b and c may also be given as a table. That is, for each one of the eight sets of values of a, b and c, a value of z is given. The truth table is given in Table 3.4.

TABLE 3.4 Truth Table for z = a + b̄ · c

    a   b   c   z
    0   0   0   0
    0   0   1   1
    0   1   0   0
    0   1   1   0
    1   0   0   1
    1   0   1   1
    1   1   0   1
    1   1   1   1

It is thus easy to obtain a truth table given a Boolean function. Some interesting questions we may ask are:
1. Given a truth table for a Boolean function, how can we obtain an expression to represent the dependent variable in terms of the independent variables?
2. Is this expression unique?
3. If it is not unique, are there any criteria which would allow us to choose one among them to represent the function?

3.2.1 Canonical Forms for Boolean Functions

The function of Table 3.4 may also be represented by the expression:

z = ā·b̄·c + a·b̄·c̄ + a·b̄·c + a·b·c̄ + a·b·c    (3.2)

where each term corresponds to a row of the table in which z = 1. The expression for z given above is said to be in the canonical (or standard) sum of products form. Each term in the expression is said to be a standard product or a minterm.
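The passage from an expression to its truth table, and back to a sum of minterms, is purely mechanical. A Python sketch (our own illustration) does both for z = a + b̄·c:

from itertools import product

f = lambda a, b, c: a | ((1 - b) & c)        # z = a + b'.c

minterms = []
for a, b, c in product((0, 1), repeat=3):    # the eight rows of the truth table
    z = f(a, b, c)
    print(a, b, c, z)
    if z == 1:
        minterms.append((a, b, c))

# The rows with z = 1 give the canonical sum of products:
print("minterms:", minterms)   # [(0,0,1), (1,0,0), (1,0,1), (1,1,0), (1,1,1)]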

The minterms and the notation used are illustrated in Table 3.5 for Boolean functions of three variables. The individual term is formed as follows: a variable with a 0 entry is complemented and a variable with a 1 entry is used as it is, and the variables are then ANDed. The decimal numbers designating the minterms, if expanded to their binary form, indicate the values of the independent variables.

TABLE 3.5 Illustrating Minterm Notation

    Independent Variables    Term       Minterm Notation    Decimal Form
    a   b   c                           Designation
    0   0   0                ā·b̄·c̄      m0                  0
    0   0   1                ā·b̄·c      m1                  1
    0   1   0                ā·b·c̄      m2                  2
    0   1   1                ā·b·c      m3                  3
    1   0   0                a·b̄·c̄      m4                  4
    1   0   1                a·b̄·c      m5                  5
    1   1   0                a·b·c̄      m6                  6
    1   1   1                a·b·c      m7                  7

Using this notation, z [Eq. (3.2)] may be expressed in a number of equivalent forms:

z = m1 + m4 + m5 + m6 + m7    (3.3)
z = Σ 1, 4, 5, 6, 7    (3.4)

The symbol Σ indicates that the OR of the minterms is to be taken. The last notation is quite convenient to use.

There is another standard form which may also be used to express Boolean functions. For the function of Table 3.4 the expression is:

z = (a + b + c) · (a + b̄ + c) · (a + b̄ + c̄)    (3.5)

This expression is said to be in the canonical (or standard) product of sums form. Each term in the expression is said to be a standard sum or a maxterm. The individual term is formed by taking the sum (or OR) of the independent variables: a variable with a 0 entry is used as it is and a variable with a 1 entry is complemented. The individual terms are then ANDed. A table similar to Table 3.5 may be formed to explain the maxterm notation. It is shown in Table 3.6. Using this notation z may be expressed as:

z = M0 · M2 · M3    (3.6)
z = Π 0, 2, 3    (3.7)

These decimal numbers, if expanded to their binary form, indicate the values of the independent variables.

TABLE 3.6 Maxterms for Three Variables

    Independent Variables    Term           Maxterm Notation    Decimal Form
    a   b   c                               Designation
    0   0   0                a + b + c      M0                  0
    0   0   1                a + b + c̄      M1                  1
    0   1   0                a + b̄ + c      M2                  2
    0   1   1                a + b̄ + c̄      M3                  3
    1   0   0                ā + b + c      M4                  4
    1   0   1                ā + b + c̄      M5                  5
    1   1   0                ā + b̄ + c      M6                  6
    1   1   1                ā + b̄ + c̄      M7                  7

To conclude this section we observe that there is no unique representation of truth tables in terms of Boolean expressions. These standard forms are convenient to obtain, but are not always useful for hardware implementation. Other equivalent forms are necessary for some practical implementations. This will be discussed later.

3.3 BINARY OPERATORS AND LOGIC GATES

So far we have considered three Boolean operators, namely the AND, OR and NOT operators. The reason we use these three operators is because all Boolean expressions may be realized using these three operators. There are many other binary operators which will now be introduced. With two Boolean variables as inputs, we know that a truth table has four rows. Corresponding to each row the output variable may be assigned a value 0 or 1. Thus, 16 distinct truth tables may be constructed with two variables as inputs, as shown in Table 3.7; that is, 16 Boolean functions of two variables may be formed. [In general for N input variables 2**(2**N) truth tables may be constructed. We have used ** to represent an exponentiation operation.]

The 16 functions of two variables are given in Table 3.8. Boolean expressions, which describe these operators in terms of the familiar AND, OR, NOT operators, are also shown in this table. By inspection of Table 3.7 we see that one may define eight different binary operators and one unary operator. Among these, AND and OR are binary operators. The binary operators are: AND (·), OR (+), NOR (↓), EXCLUSIVE-OR (⊕), EQUIVALENCE (≡), INHIBITION (/), IMPLICATION (⊃) and NAND (↑). The unary operator is COMPLEMENT (–).

TABLE 3.7 Truth Tables for 16 Functions of Two Variables

    x   y   f0  f1  f2  f3  f4  f5  f6  f7  f8  f9  f10 f11 f12 f13 f14 f15
    0   0   0   0   0   0   0   0   0   0   1   1   1   1   1   1   1   1
    0   1   0   0   0   0   1   1   1   1   0   0   0   0   1   1   1   1
    1   0   0   0   1   1   0   0   1   1   0   0   1   1   0   0   1   1
    1   1   0   1   0   1   0   1   0   1   0   1   0   1   0   1   0   1

TABLE 3.8 Sixteen Functions of Two Variables

    Function             Name            Symbol    Word description
    f0 = 0               NULL                      Always 0
    f1 = x · y           AND             x · y     x and y
    f2 = x · ȳ           INHIBITION      x/y       x and not y
    f3 = x                               x         Always x
    f4 = x̄ · y           INHIBITION      y/x       y and not x
    f5 = y                               y         Always y
    f6 = x·ȳ + x̄·y       EXCLUSIVE-OR    x ⊕ y     x or y but not both
    f7 = x + y           OR              x + y     x or y
    f8 = (x + y)‾        NOR             x ↓ y     not (x or y)
    f9 = x·y + x̄·ȳ       EQUIVALENCE     x ≡ y     x equals y
    f10 = ȳ              COMPLEMENT      ȳ         Not y
    f11 = x + ȳ          IMPLICATION     y ⊃ x     If y then x
    f12 = x̄              COMPLEMENT      x̄         Not x
    f13 = x̄ + y          IMPLICATION     x ⊃ y     If x then y
    f14 = (x · y)‾       NAND            x ↑ y     not (x and y)
    f15 = 1              IDENTITY                  Always 1

Four of the 16 functions shown in Table 3.8 (besides AND, OR and NOT) are normally useful in obtaining logic circuits. These are the Exclusive-OR, which is similar to OR but assigns a 0 when both x and y are 1 (this function is the strict "x or y but not both"); the EQUIVALENCE function, which gives a 1 if x and y are equal; the NOR function, which is the NOT of OR; and the NAND function, which is the NOT of AND.

The question which arises now is the feasibility of using these binary operators to construct logic circuits. This depends on the following factors:
1. The ability of the gate by itself, or in conjunction with a small set of other gates, to realize expressions for all Boolean functions.
2. The mathematical property of the operator such as commutativity and associativity.
3. The number of inputs that could be fed to the gate.
4. The feasibility and cost of making these gates with physical components.
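Factor 1, logical completeness, can be demonstrated concretely for the NAND operator. A Python sketch (our own illustration) builds NOT, AND and OR out of NAND alone and checks them exhaustively:

def nand(x, y):
    return 1 - (x & y)

NOT = lambda x: nand(x, x)                  # x NAND x = NOT x
AND = lambda x, y: NOT(nand(x, y))          # NOT(x NAND y) = x AND y
OR  = lambda x, y: nand(NOT(x), NOT(y))     # DeMorgan: x + y = (x'.y')'

for x in (0, 1):
    assert NOT(x) == 1 - x
    for y in (0, 1):
        assert AND(x, y) == (x & y) and OR(x, y) == (x | y)
print("NAND alone realizes NOT, AND and OR")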

Of the eight new binary operators introduced in Table 3.8, inhibition and implication are not commutative. The Exclusive-OR and EQUIVALENCE satisfy factors 1, 2 and 3 given above but are expensive to construct with presently available physical devices. The NAND and NOR satisfy all the four criteria and are now almost universally used in preference to AND, OR, NOT gates since the advent of integrated circuits. We will see later that AND, OR and NOT gates may be realized using only NAND or NOR gates. We will discuss their properties and use in the next chapter. In this chapter we will primarily use AND, OR and NOT gates as they are appropriate for simplifying Boolean functions.

In the development of computers, a number of devices have been used to realize these functions. Currently, electronic circuits known as Integrated Circuits (ICs) are used to implement these connectives. In implementation we first consider how we can represent the logical 1 and the logical 0 by electrical voltages. One convention is to represent 1 by a positive voltage of around 5V, which is called HI, and a 0 by 0V, called LO. This is called positive logic and is the most common representation. The symbols used for OR, AND and NOT gates with positive or negative logic convention are shown in Figure 3.1.

FIGURE 3.1 Symbols for AND, OR and NOT gates.

3.4 SIMPLIFYING BOOLEAN EXPRESSIONS

As we saw in the last section, the three logical connectives AND, OR and NOT are said to be logically complete, as any Boolean function can be realized using these three connectives. In Section 3.2 we used a Boolean expression [Eq. (3.2)] for the Truth Table 3.4 and it is reproduced below:

z = ā·b̄·c + a·b̄·c̄ + a·b̄·c + a·b·c̄ + a·b·c    (3.2)

Using AND, OR and NOT gates we may realize the Boolean expression for z, when the input variables a, b and c are given, as shown in Figure 3.2. Observe that this realization of z requires three NOT gates, five three-input AND gates and one five-input OR gate, a total of nine gates.

FIGURE 3.2 Gate realization of Eq. (3.2).

The expression for z may be written in an alternate form by reducing the number of terms and variables in Eq. (3.2) as follows:

z = (a + ā)·b̄·c + a·b̄·c̄ + a·b·(c̄ + c)      By Postulates 3(b), 4(a)
  = b̄·c + a·b̄·c̄ + a·b                       By Postulates 5(a), 2(b)
  = b̄·(c + c̄·a) + a·b                       By Postulates 3(b), 4(a)
  = b̄·(c + a) + a·b                         By Theorem 5(a)
  = b̄·c + b̄·a + a·b                         By Postulate 4(a)
  = b̄·c + a·(b̄ + b)                         By Postulates 3(b), 4(a)
  = b̄·c + a                                 By Postulates 5(a), 2(b)

This expression may be realized using AND, OR, NOT gates as shown in Figure 3.3. Observe that this realization uses only one NOT gate, one two-input AND gate and one two-input OR gate, a total of three gates as compared to nine gates used in the previous implementation.

FIGURE 3.3 A minimal gate realization of Eq. (3.2).
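That the three-gate form computes the same function is easy to confirm by enumeration. A Python sketch (our own illustration):

from itertools import product

original   = lambda a, b, c: (((1-a) & (1-b) & c) | (a & (1-b) & (1-c)) |
                              (a & (1-b) & c) | (a & b & (1-c)) | (a & b & c))
simplified = lambda a, b, c: ((1 - b) & c) | a     # z = b'.c + a

assert all(original(*v) == simplified(*v) for v in product((0, 1), repeat=3))
print("b'.c + a matches Eq. (3.2) on all eight inputs")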

This example illustrates the value of simplifying Boolean expressions. (The term literal will be used from now on to denote variables or their complements.) Because there is a close correlation between the number of literals in a Boolean expression and the total cost and reliability of the switching circuit derived from it, minimizing the number of literals is an acceptable criterion for optimization. By simplifying we mean optimizing some engineering criteria such as minimizing the number of components, maximizing the speed of operation and maximizing reliability. Unfortunately, there are no simple techniques of manipulating Boolean expressions to directly achieve such engineering goals. But simple methods do exist to manipulate Boolean expressions to minimize the number of variables and complements of variables appearing in them.

Almost any Boolean expression obtained by inspection from a truth table such as Table 3.4 will be excessively complex, and we must analyze such expressions with the objective of simplifying them. At the beginning of this section we reduced the number of literals in a Boolean expression using the postulates and theorems of Boolean algebra. This method, known as algebraic simplification, is not useful in practice as it is difficult to apply and it is not possible to guarantee that the reduced expression is minimal.

Two methods of systematic simplification of Boolean expressions which guarantee a simplified expression with a minimum number of literals have been derived in the literature. These are known as the Veitch–Karnaugh map method and the Quine–McCluskey chart method. The Veitch–Karnaugh method is easy to apply for Boolean expressions with four or fewer variables. The Quine–McCluskey method is suitable for any number of variables. Besides this, it is adaptable to an algorithmic formulation and thus computer programs have been written to implement it. We will discuss the Karnaugh map (abbreviated K-map), but the Quine–McCluskey method will not be discussed in this book.

3.5 VEITCH–KARNAUGH MAP METHOD

The Veitch–Karnaugh map is another method of representing a truth table. It is a diagram made up of a number of squares. Each square represents a minterm of the corresponding truth table.

the corresponding truth table. Thus, for functions of two variables the map will have four squares, for functions of three variables the map will have eight squares and for functions of four variables the map will have 16 squares. Each square in the map is labelled in such a way as to aid simplification of Boolean expressions by inspection of the map. The Karnaugh map for three variables is non-trivial and we will consider it first. The map with the minterm represented by each square is shown in Figure 3.4. Note that the minterms are not arranged in the map in their natural ascending sequence, but in a sequence corresponding to a cyclic code with unit Hamming distance between adjacent squares (see Section 2.4.3). The map in Figure 3.4(b) indicates the minterm (in terms of the literals) represented by each box and also the logic values of the triplet of variables. Observe that with this coding, adjacent squares in the map differ by only one variable, which is complemented in one box and appears as it is (that is, uncomplemented) in the next. This labelling of the boxes in the map directly aids simplification of Boolean expressions.

FIGURE 3.4 A three-variable Karnaugh map.

From Postulate 5 of Boolean algebra it is evident that if we OR the minterms in adjacent squares, they may be simplified to yield a term with one less literal. For example, if we OR m2 and m6 we obtain:

$m_2 + m_6 = \bar{a}\,b\,\bar{c} + a\,b\,\bar{c} = (\bar{a} + a)\,b\,\bar{c}$      By Postulate 4(a)
$\;\; = b\,\bar{c}$      By Postulate 5(a)

We will use this basic principle to simplify Boolean expressions.

EXAMPLE 3.2
Consider the Truth Table 3.9. The table is first mapped on to the Karnaugh map as shown in Figure 3.5.

TABLE 3.9 A Boolean Function for Simplification

        a   b   c   z
m0      0   0   0   0
m1      0   0   1   0
m2      0   1   0   1
m3      0   1   1   0
m4      1   0   0   0
m5      1   0   1   1
m6      1   1   0   1
m7      1   1   1   1

The mapping is straightforward. For each combination of a, b, c with z = 1, a 1 is entered in the Karnaugh map. Thus, for a b c = 0 1 0, 1 0 1, 1 1 0 and 1 1 1, we have 1s in the map. The next step is to identify adjacent 1s in the map. Two adjacent 1s may be combined to eliminate one literal. In this example, the terms $\bar{a}\,b\,\bar{c}$ and $a\,b\,\bar{c}$ are adjacent. Similarly, $a\,\bar{b}\,c$ and $a\,b\,c$ are adjacent. Thus, we have:

$z = (\bar{a}\,b\,\bar{c} + a\,b\,\bar{c}) + (a\,\bar{b}\,c + a\,b\,c) = (\bar{a} + a)\,b\,\bar{c} + a\,c\,(\bar{b} + b) = b\,\bar{c} + a\,c$

FIGURE 3.5 Simplification with three-variable maps.

The main merit of the Karnaugh map is the fact that terms which are logically adjacent, that is, minterms which differ in only one variable (such as $\bar{a}\,b\,\bar{c}$ and $a\,b\,\bar{c}$), are also physically adjacent in the map. One can thus, by inspection, pick terms to be combined to eliminate variables. This is true except for the first and last columns of the map. They are logically adjacent but not physically adjacent. To ensure that they are all physically adjacent, one may imagine the map to be wrapped on a cylinder with the first and last columns next to one another.
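The combining of adjacent minterms can also be carried out mechanically; this is essentially the first pass of the Quine–McCluskey method mentioned earlier. The following Python sketch finds every pair of minterms of Table 3.9 at Hamming distance 1 and confirms that the minimal cover read off the map, $z = b\,\bar{c} + a\,c$, reproduces the table:

```python
from itertools import product

MINTERMS = {2, 5, 6, 7}          # rows of Table 3.9 with z = 1 (a is the MSB)

def combine(t1, t2):
    """Merge two product terms written as bit strings ('-' marks an
    eliminated variable) if they differ in exactly one position,
    e.g. '010' and '110' combine to '-10', i.e. b.c'."""
    diff = [i for i in range(len(t1)) if t1[i] != t2[i]]
    if len(diff) == 1:
        return t1[:diff[0]] + '-' + t1[diff[0] + 1:]
    return None

terms = [format(m, '03b') for m in sorted(MINTERMS)]
pairs = {combine(p, q) for p in terms for q in terms} - {None}
print(sorted(pairs))             # ['-10', '1-1', '11-'] = b.c', a.c, a.b

# The minimal cover picks b.c' and a.c, as found on the map:
for a, b, c in product([0, 1], repeat=3):
    assert bool((b and not c) or (a and c)) == (a*4 + b*2 + c in MINTERMS)
```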

Thus, we see that the first and the fourth columns of the map are adjacent, and 1s in them may be combined.

EXAMPLE 3.3
Consider the K-map of Figure 3.6(a). Combining the four adjacent 1s, we have:

$z = (a\,b\,c + a\,b\,\bar{c}) + (\bar{a}\,b\,c + \bar{a}\,b\,\bar{c}) = a\,b + \bar{a}\,b$  (as $c + \bar{c} = 1$)
$\;\; = (a + \bar{a})\,b = b$

Observe from the marking on the map that, by inspection, the four boxes with 1 could have been replaced by b. Instead of getting the sum of products form, if we wish to obtain the product of sums form, we may use the 0s in the map of Figure 3.6(a) and obtain Figure 3.6(b), from which we get an expression for $\bar{z}$; z is then obtained using De Morgan's theorem.

FIGURE 3.6 K-map for a three-variable function.

EXAMPLE 3.4
Consider the Karnaugh map of Figure 3.5(b). We will illustrate the minimization of a function in which the first and last columns contain physically adjacent 1s. First we observe that the 1s in the first and second rows of the first column are adjacent. Next, since the first and the fourth columns are adjacent, the 1s in them may also be combined. The reduced expression for z [using Postulates 4(a) and 5(a)] is:

$z = \bar{a}\,\bar{c} + \bar{b}\,\bar{c}$    (3.8(i))

which, when realized with gates, leads to Figure 3.7. Factoring out $\bar{c}$, we may also write:

$z = (\bar{a} + \bar{b})\,\bar{c}$    (3.8(ii))

FIGURE 3.7 Gate realization of Eq. (i).

If the above form is implemented with gates, we obtain Figure 3.8, which uses one less AND gate when compared with Figure 3.7. Observe, however, that the second form could have been obtained from Eq. (3.8)(i) if we had recognized the possibility of factoring $\bar{c}$. Even though it is obvious in this example, in most complex examples the factoring might not be obvious. Thus, obtaining the product of sums form would be useful in cases where one wants to make sure of a minimal gate realization. It should be emphasized that both the forms, when expanded, should be identical, as they represent the same Boolean function.

FIGURE 3.8 A minimal gate realization of Eq. (ii).

3.5.1 Four-Variable Karnaugh Map

The Karnaugh map for Boolean functions of four variables is shown in Figure 3.9. The map on the left shows how the 16 minterms are represented in the map. The map on the right shows the coding used to indicate rows and columns in the map. Observe that adjacent squares differ by not more than one bit in their four-bit code representation. Further, the first and last columns are adjacent; the same is true for the top and bottom rows. One can imagine the map to be wrapped around on a surface in both directions to remember the logical neighbourhoods given above.

FIGURE 3.9 A four-variable K-map.

The minimization method used with the four-variable Karnaugh map is similar to that used to minimize three-variable functions. The combination of two adjacent squares eliminates one literal, the combination of four adjacent squares eliminates two literals and the combination of eight adjacent squares eliminates three literals. The primary advantage of using the Karnaugh map is that we can obtain the minimal expression corresponding to a Boolean function by inspection. A number of examples of combining 1s in the Karnaugh map to eliminate literals are shown in Figure 3.10. We illustrate the method using the maps given as Figure 3.10.

Consider the map of Figure 3.10(a). In this map there are two adjacent 1s in the third column, where a and b are both 1. Inspecting the rows (cd = 00 in row 1 and cd = 01 in the second row), d takes on both values 0 and 1 whereas c is constant at 0. Thus, d is eliminated and the resultant expression is $z = a\,b\,\bar{c}$.

Consider Figure 3.10(b). This has four adjacent 1s; thus, two variables will be eliminated. Inspecting the second and third rows, the variable d is constant at 1 whereas c takes on both values 0 and 1. Inspecting the second and third columns, the variable a takes on both values 0 and 1 whereas b is constant at 1. Thus, the two variables to be eliminated are a and c, and $z = b\,d$.

Consider Figure 3.10(f). In this case there is a group of eight adjacent 1s; thus, three variables will be eliminated. In the first and fourth columns, the variable b is constant at 0 and a takes on both values 0 and 1. Inspecting the rows, both c and d take on all possible values. Thus, the variables a, c and d are eliminated and $z = \bar{b}$.

FIGURE 3.10 Simplification with four-variable maps.

In Figure 3.10(c), in the square consisting of four adjacent 1s, b has the value 1 and d has the value 0, whereas a and c take on both values 0 and 1. Thus, a and c are eliminated and $z = b\,\bar{d}$.

In Figure 3.10(d), in the second row, there is a group of four adjacent 1s. The variables a and b take on all possible values whereas c is constant at 0 and d is constant at 1. Thus, $z = \bar{c}\,d$.

In Figure 3.10(e) there is a group of four adjacent 1s in which b is constant at 0 and d is constant at 0. The variable a has both values 0 and 1, and the variable c also assumes both values 0 and 1. Thus, a and c are eliminated, giving $z = \bar{b}\,\bar{d}$.

We may summarize the rules of reduction using the Karnaugh map as follows:

1. Look for 1s in the Karnaugh map which have no neighbouring 1s. They cannot be combined, and the minterms corresponding to them appear as they are in the final expression.
2. Make groups of two adjacent squares with 1s, or four adjacent 1s, or eight adjacent 1s. Each 1 must be included in the largest group possible to eliminate the maximum number of literals.
3. Any 1 may be included in more than one group. However, each group should have at least one 1 not included in any other group, and groupings of 1s should be made to minimize the total number of groups.
4. Starting with the smallest group, note the variables which are constant in the group (the variables which change assume both values 0 and 1 and are eliminated). If the constant value of a variable is a 1, it appears in its true form, and if it is a 0, it appears in its complement form. Form the product of all such variables. This is one term in the final sum of products.
5. Repeat this step and form a product for each group until all groups are exhausted.
6. After grouping the 1s, take the sum of the products to form the final expression.

A Boolean expression which cannot be further reduced is said to be in its minimal sum of products (or product of sums) form. Each term in the minimal expression is known as a prime implicant. Minimization is thus, in practice, the process of obtaining the minimal set of prime implicants which 'covers' all the minterms of the given Boolean function.

Incompletely specified functions: So far we have considered completely specified functions, namely, functions whose value is specified for all minterms. Quite often, we encounter situations in which some minterms cannot occur. We will illustrate this with an example.

EXAMPLE 3.5
A switching circuit is to be designed to generate an even parity bit P for decimal numbers expressed in the excess-3 code. The truth table is given in Table 3.10. Observe that six code groups are illegal. No parity bit need be generated in these cases, as these cases cannot occur. In other words, the value of P is immaterial for them. These are known as don't care conditions; P is marked with a φ in the truth table and in the Karnaugh map, indicating that it could be either 0 or 1. The φ entries may be taken as 1 or 0 in the Karnaugh map depending on whether they aid in eliminating literals. The Karnaugh map for the parity generator is shown in Figure 3.11.

TABLE 3.10 Even Parity Bit for Excess-3 Code

A   B   C   D   P
0   0   0   0   φ   (illegal code)
0   0   0   1   φ   (illegal code)
0   0   1   0   φ   (illegal code)
0   0   1   1   0
0   1   0   0   1
0   1   0   1   0
0   1   1   0   0
0   1   1   1   1
1   0   0   0   1
1   0   0   1   0
1   0   1   0   0
1   0   1   1   1
1   1   0   0   0
1   1   0   1   φ   (illegal code)
1   1   1   0   φ   (illegal code)
1   1   1   1   φ   (illegal code)

FIGURE 3.11 K-map for parity generator.

Observe that using the don't care conditions when they help in reducing literals, and ignoring them when they do not help, leads to the expression:

$P = \bar{A}\,\bar{C}\,\bar{D} + \bar{B}\,\bar{C}\,\bar{D} + B\,C\,D + A\,C\,D$

If we look at the Karnaugh map for $\bar{P}$, we find that it can use all the don't care conditions optimally and we obtain:

$\bar{P} = \bar{A}\,\bar{B} + A\,B + \bar{C}\,D + C\,\bar{D}$

and, using De Morgan's theorem,

$P = (A + B)\,(\bar{A} + \bar{B})\,(C + \bar{D})\,(\bar{C} + D)$

This expression is more economical to implement. The gate circuit is shown in Figure 3.12.

FIGURE 3.12 Gate realization of the parity generator.

The Karnaugh map method is a useful and powerful tool to minimize Boolean expressions. The construction of the map and its use depend on the coding scheme, which ensures that physically adjacent minterms in the map are also logically adjacent. This cyclic coding scheme is difficult to extend to a larger number of variables. Up to six-variable maps can be constructed and used with some success, but beyond that it is practically impossible to use this technique. A method developed by Quine and extended by McCluskey is applicable to any number of variables. It is an algorithmic method, and programs have been written to implement it on a computer.
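Since don't care entries make it easy to introduce errors, it is worth checking the two expressions for P against the definition of even parity on the ten legal codes. A small Python sketch, using the expressions as reconstructed above:

```python
from itertools import product

LEGAL = set(range(3, 13))        # excess-3 codes for digits 0-9: 0011 to 1100

def parity_sop(A, B, C, D):
    # Sum of products form read off the K-map using some don't cares
    return ((not A and not C and not D) or (not B and not C and not D) or
            (B and C and D) or (A and C and D))

def parity_pos(A, B, C, D):
    # Product of sums form obtained by minimizing the map of P-bar
    return (A or B) and (not A or not B) and (C or not D) and (not C or D)

for A, B, C, D in product([0, 1], repeat=4):
    if A * 8 + B * 4 + C * 2 + D in LEGAL:
        want = (A + B + C + D) % 2 == 1   # P makes the total count of 1s even
        assert bool(parity_sop(A, B, C, D)) == want
        assert bool(parity_pos(A, B, C, D)) == want
print("Both forms of P are correct for all ten legal excess-3 codes")
```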

The Quine–McCluskey procedure is similar to the Karnaugh map procedure. The similarity is that in both the methods, groups of minterms which will combine to eliminate the maximum number of literals are first isolated. From these groups, the smallest set of prime implicants which will cover all the minterms is selected. The method of eliminating literals is based on the repeated application of the theorem:

$a\,b + a\,\bar{b} = a$

If two minterms are to be combined using the above theorem, then their binary representations must differ in one and only one position. In other words, the Hamming distance between their binary representations should be 1. This idea is systematically applied in the Quine–McCluskey method. For details the readers may refer to the book Digital Logic and Computer Organization [31] by the authors of this book.

3.6 NAND AND NOR GATES

We saw in Section 3.2 that the NAND and NOR functions are complements of AND and OR respectively. Thus the symbols used are those of AND and OR with additional small circles (called inversion circles) at the input or output of these gates. This is illustrated in Figure 3.13. Note that there are two possible symbolic representations for each gate.

FIGURE 3.13 Symbols for NAND/NOR gates.

The NAND/NOR operators were defined earlier for only two inputs. These operators are not associative. In other words,

$\overline{\overline{a\,b}\;c} \neq \overline{a\;\overline{b\,c}}$    (3.9)

This is shown below by evaluating both sides of Eq. (3.9):

$\overline{\overline{a\,b}\;c} = a\,b + \bar{c}$    (3.10)
$\overline{a\;\overline{b\,c}} = \bar{a} + b\,c$    (3.11)

Thus, we run into difficulties when more than two input variables are involved. To overcome this difficulty, we define the multiple-input NAND/NOR gate as a complement of the corresponding multiple-input AND/OR. Using this modified definition (called Associative NAND/NOR, indicated by the symbols shown in Figure 3.14), the three-input gates realize:

NAND: $\overline{a\,b\,c}$    (3.12)
NOR: $\overline{a + b + c}$    (3.13)

FIGURE 3.14 Symbols for associative NAND/NOR gates.

The NAND and NOR gates are universal gates, since any Boolean expression can be implemented with these gates. To see this, we only have to show that the AND, OR and NOT gates can be implemented with NANDs alone or NORs alone. For the NAND gate this is shown as follows:

NOT: $\bar{a} = \overline{a \cdot 1} = \overline{a \cdot a}$    (3.14)
AND: $a\,b = \overline{\overline{a\,b}} = \overline{\overline{a\,b} \cdot 1}$    (3.15)
OR: $a + b = \overline{\bar{a}\,\bar{b}} = \overline{\overline{a \cdot 1} \cdot \overline{b \cdot 1}}$    (3.16)

These realizations are shown in Figure 3.15. Note that complementation is straightforward: we can either NAND a with 1 or connect the two inputs together, as shown. With this understanding, we need not show the two inputs of the gate when the unary operation is performed; by showing only a single input we can immediately recognize that we are just performing the NOT operation using the NAND gate. The AND operation using a NAND gate is also straightforward, as shown in Figure 3.15(b): the additional inversion needed is achieved using a second gate.

But the OR operation using the NAND gate is not so straightforward, as shown in Figure 3.15(c), where both the diagrams (i) and (ii) are equivalent. The first diagram needs to be interpreted using De Morgan's theorem, whereas in the second diagram the OR operation is more obvious. The symbol for the NAND gate which is based on that of OR is conveniently used to make the OR operation explicit. Thus, whenever an OR operation is contemplated using a NAND gate, this symbol is conveniently used. This will be further illustrated later.

FIGURE 3.15 Realization of NOT/AND/OR with NAND.

We may similarly realize AND, OR and NOT with NOR gates from the following equations:

NOT: $\bar{a} = \overline{a + 0} = \overline{a + a}$    (3.17), (3.18)
OR: $a + b = \overline{\overline{a + b}} = \overline{\overline{a + b} + 0}$    (3.19)
AND: $a\,b = \overline{\bar{a} + \bar{b}} = \overline{\overline{a + 0} + \overline{b + 0}}$    (3.20)

These realizations are shown in Figure 3.16. Performing the NOT/OR operations using the NOR gate is straightforward. In performing the AND operation we use three NOR gates, as shown in Figure 3.16(c).
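The universality argument of Eqs. (3.14) to (3.20) is easy to confirm by brute force. Below is a minimal Python sketch that builds NOT, AND and OR from a two-input NAND alone and checks them against the ordinary operators; the NOR case is entirely symmetrical:

```python
def nand(p, q):
    return int(not (p and q))

# NOT, AND and OR built from NAND alone, following Eqs. (3.14)-(3.16)
def NOT(a):    return nand(a, a)            # or equivalently nand(a, 1)
def AND(a, b): return NOT(nand(a, b))       # NAND followed by an inverter
def OR(a, b):  return nand(NOT(a), NOT(b))  # De Morgan: (a'.b')' = a + b

for a in (0, 1):
    assert NOT(a) == 1 - a
    for b in (0, 1):
        assert AND(a, b) == (a & b)
        assert OR(a, b) == (a | b)
print("NAND alone realizes NOT, AND and OR")
```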

Once again, the two diagrams shown there are equivalent; however, the second diagram is more convenient since it makes the operation more explicit.

FIGURE 3.16 Realization of NOT/OR/AND with NOR.

It is thus evident that all logical expressions can be realized using either NAND or NOR gates. In other words, they are universal gates. Thus, the variety of gates used to realize any combinatorial logic circuit is reduced to just one type of gate. This is of great value from the engineering point of view, as design, testing and realization of combinatorial circuits are all simplified when we have to deal with only one type of building block. This is the primary reason why NAND and NOR are the preferred gates in practice.

The Karnaugh maps discussed earlier give Boolean expressions using AND, OR and NOT connectives with a minimum number of literals. This, in turn, leads to a circuit with a minimum number of gates and with each gate having the minimum number of inputs. When discrete components were being used, the realization with the minimum number of gates also minimized the cost. This is not necessarily true when ICs are used. The cost is determined not by the number of gates but by the number and variety of ICs used. Since several gates are included in a single IC package, it may not cost more to use a few more gates, provided they are all in the same package.

3.7 DESIGN OF COMBINATORIAL CIRCUITS WITH MULTIPLEXERS

A multiplexer (abbreviated MUX) is used when a complex logic circuit is to be shared by a number of input signals. For example, consider the cyclic code to BCD converter. If cyclic codes are obtained from a number of different sources and the decoder is to be shared, then a multiplexer is used. To illustrate multiplexing, an arrangement using switches is shown in Figure 3.17. Note that the switches are mechanically coupled (also known as ganged switches). Since the switch has four positions, at any one time X can be only one out of the four channel inputs X1, X2, X3 or X4.

FIGURE 3.17 Multiplexer using mechanical switches.

We would like to obtain the electronic equivalent of this mechanical switch. To do this we must have two Boolean control inputs, S0 and S1, which can have four possible combinations. The multiplexing action is achieved by the Boolean expression:

$X = X_1\,\bar{S}_0\,\bar{S}_1 + X_2\,\bar{S}_0\,S_1 + X_3\,S_0\,\bar{S}_1 + X_4\,S_0\,S_1$    (3.21)

Note that this expression contains the four minterms of the two control variables. Only one of these can assume the value 1 at a time, and thus we select one out of the four inputs. The block diagram of the multiplexer is shown in Figure 3.18(a). The realization of a 4-input multiplexer using AND and OR gates is shown in Figure 3.18(b).

FIGURE 3.18 (a) Block diagram of a multiplexer and (b) gate circuit.

Although multiplexers are primarily designed for multiplexing operations, they may also be used for synthesizing Boolean functions. Multiplexers are available as standard ICs, and using such a standard circuit to realize Boolean functions is economical. If we connect the variables A and B to the control inputs of a multiplexer and connect the channel inputs I0, I1, I2, I3, the output Y is given by:

$Y = I_0\,\bar{A}\,\bar{B} + I_1\,\bar{A}\,B + I_2\,A\,\bar{B} + I_3\,A\,B$    (3.22)

By making I0, I1, I2, I3 selectively 1 or 0 we can make Y equal to any Boolean function of two variables. We can, in fact, realize with a MUX any Boolean function of three variables. Consider the Boolean function mapped on to the Karnaugh map of Figure 3.19. We have purposely chosen one of the worst examples, where the 1s have no neighbours. The Boolean expression is:

$Y = \bar{C}\,\bar{A}\,\bar{B} + C\,\bar{A}\,B + C\,A\,\bar{B} + \bar{C}\,A\,B$    (3.23)

If we make the I0 input of the 4-input multiplexer $\bar{C}$, I1 = I2 = C and I3 = $\bar{C}$, then Y will be realized (see Figure 3.20).

FIGURE 3.19 A Boolean function of three variables.
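A sketch in Python makes the trick concrete. The mux4 function below evaluates Eq. (3.22), and wiring C or its complement to the channel inputs (with the complement placements as reconstructed for Eq. (3.23) above) reproduces Y on all eight input combinations:

```python
from itertools import product

def mux4(i0, i1, i2, i3, a, b):
    # A 4-input multiplexer: the control pair (a, b) selects one channel (Eq. 3.22)
    return [i0, i1, i2, i3][a * 2 + b]

# Channel wiring for Eq. (3.23): I0 = C', I1 = I2 = C, I3 = C'
for a, b, c in product([0, 1], repeat=3):
    y = mux4(1 - c, c, c, 1 - c, a, b)
    assert y == 1 - (a ^ b ^ c)   # Y = 1 when an even number of A, B, C are 1
print("The MUX wiring realizes Y of Eq. (3.23)")
```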

FIGURE 3.20 Realization of the Boolean function of Figure 3.19 with a MUX.

Note that with a 4-input MUX we can have only four product terms in the sum of products form of a Boolean function. This is in fact enough, because if there are more than four 1s in a 3-variable K-map, we can combine them appropriately, as shown in the realization of the Boolean function mapped on the Karnaugh map of Figure 3.21. Observe that each of the columns in the K-map is a selection and routes I0, I1, I3 and I2 respectively. Thus, instead of the usual K-map reduction, we should inspect columns in the K-map to realize the Boolean function with a MUX optimally.

FIGURE 3.21 Realizing a three-variable function with a 4-input MUX.

A straightforward method to realize a Boolean function of four variables is to use an eight-input MUX. However, it is possible to realize any four-variable function with additional gates and a 4-input MUX. When attempting to realize a four-variable function with a four-input MUX, the choice of control variables is crucial. Six different choices, namely AB, AC, AD, BC, BD and CD, are possible. For each choice of control variables the K-map should be factored. The factorings are illustrated in Figure 3.23. Consider the K-map of Figure 3.22(a). Taking CD as the control variables for a MUX, we obtain the realization of Figure 3.22(b). Sometimes it is possible to realize a four-variable function with a single four-input MUX: the function given in Figure 3.24 is realized with a single 4-input MUX by recognizing that B and D should be picked as the control variables. Sometimes it is not possible to realize a four-variable function with a 4-input MUX alone; this is illustrated in Figure 3.25, which uses a MUX and an additional gate.

FIGURE 3.22 Realizing a four-variable Boolean function with a four-input MUX.
FIGURE 3.23 Factoring the K-map for MUX realization of a four-variable function.

FIGURE 3.24 MUX realization of a four-variable function.
FIGURE 3.25 Realizing with a MUX and a gate.

Demultiplexer: Demultiplexing is the reverse process. Data available on one line is steered to many lines based on the values of control variables. If, for example, we want to steer the signal on an input line to four output lines, then the steering may be controlled by using two control variables. This is illustrated in Figure 3.26. The Boolean expressions for the demultiplexer are:

$X = I\,\bar{C}_1\,\bar{C}_0$    (3.24)
$Y = I\,\bar{C}_1\,C_0$    (3.25)
$Z = I\,C_1\,\bar{C}_0$    (3.26)
$W = I\,C_1\,C_0$    (3.27)

FIGURE 3.26 A 4-output demultiplexer.

From the above expressions it is clear that X, Y, Z or W will be equal to I depending on the values of C0 and C1. In other words, a DEMUX takes a signal from a single input line and steers it to one of four outputs depending on the control bits.

Combining MUX and DEMUX: We saw from Figure 3.18 of a MUX that the output will be X1, X2, X3 or X4 depending on whether S0S1 is 00, 01, 10 or 11, i.e., the control inputs S0S1 steer one of the signals X1, X2, X3 or X4 to the output line. A DEMUX performs the reverse process. A MUX and a DEMUX can be combined as shown in Figure 3.27 to steer an input from one of four sources to one of four destinations. A MUX-DEMUX combination is thus used to steer a source bit to a destination via a one-bit bus.

FIGURE 3.27 A MUX-DEMUX combination.
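The one-bit bus idea can be sketched in a few lines of Python. The source and destination indices below are chosen to illustrate the selection mechanism; they correspond to the SA-to-DC example discussed next:

```python
def mux(sources, select):
    # Steer the selected source onto the single bus line
    return sources[select]

def demux(line, select, n=4):
    # Steer the bus line to the selected destination; other outputs stay 0
    outputs = [0] * n
    outputs[select] = line
    return outputs

sources = [1, 0, 1, 1]        # bits waiting at sources SA, SB, SC, SD
bus = mux(sources, 0)         # select source SA (S1S2 = 00)
print(demux(bus, 2))          # deliver to destination DC (D1D2 = 10): [0, 0, 1, 0]
```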

For example, if we want to send SA to destination DC, we set S1S2 to 00 and D1D2 to 10. Similarly, to send SB to DA we set S1S2 = 01 and D1D2 = 00. Thus, any one of the sources can be selected by S1S2 and any destination can be selected by D1D2. Multiple MUX-DEMUX pairs can be used to send bits via a parallel bus with multiple lines. MUXs thus also assist in building large combinational circuits.

3.8 PROGRAMMABLE LOGIC DEVICES

The examples of combinational circuits considered so far have a small number of input variables. Further, the number of outputs considered has been mostly one or two. In many real problems there are a large number of input and output variables, and the number of gates used to realize such circuits will become very large. A cost-effective method of realizing such circuits is to use Programmable Logic Devices (PLDs). PLDs are medium-scale integrated circuits which allow much larger circuits to be designed. PLDs fall into the following three categories:

1. Field Programmable Logic Array (FPLA)
2. Programmable Array Logic (PAL)
3. Programmable Read Only Memory (PROM)

All these devices consist of an array of AND gates followed by an array of OR gates. A block diagram of a PLD is shown in Figure 3.28. There are n input variables fed to an input buffer and inverters. Both the true and the complement forms of the variables are fed to an AND array. The maximum number of product terms which can be generated by the AND array with n variables is 2^n. If m Boolean functions are to be realized as a sum of at most k products, then an array of m OR gates, each with k inputs, is required.

FIGURE 3.28 Block diagram of PLD.

3.8.1 Realization with FPLAs

In FPLAs both the AND array and the OR array are programmable. By stating that an AND array is programmable we mean that product terms can be created as

required in an application, with each product term formed from any of the 2n input lines (the true and complement forms of the n variables). Similarly, programmability of an OR array means that the number of inputs to each OR gate in the array can be varied. The number of product terms depends on the size of the AND array; the total number of OR gates, however, depends on the size of the FPLA device. In PALs the AND array is programmable but the OR array is fixed. In PROMs the AND array is not programmable: all 2^n product terms are generated, and the OR array is programmable. A PAL, even though it is somewhat less flexible than an FPLA, is cheaper to fabricate and is widely available. It is thus currently preferred for realizing combinatorial logic circuits, and in most practical problems it is found to be adequate.

We will first consider FPLAs and clarify the notation we will be using. Consider the AND gate of Figure 3.29(a), which realizes a four-literal product of the variables A, B, C and D. Various methods of representing this product are shown in Figures 3.29(a), (b), (c) and (d). Figure 3.29(d) shows the preferred notation. Observe that a single input line is shown as an input to an AND gate, and a cross is shown on this line corresponding to each literal in the product term. A similar notation is used to indicate a programmable OR gate (see Figure 3.30).

FIGURE 3.29 Various notations used to describe a programmable AND gate.

FIGURE 3.30 Notation used to describe a programmable OR gate.

As an illustration, the FPLA realization of four Boolean expressions, each a sum of products of the four variables W, X, Y and Z, is shown in Figure 3.31.

FIGURE 3.31 PLA realization of 4 Boolean expressions.

3.8.2 Realization with PALs

The major difference between FPLAs and PALs is that in a PAL the OR array is fixed and the AND array is programmable. Thus, the number of product terms allowed in a combinational circuit is limited by the type of PAL. This is not a

major limitation in many cases. Non-programmability of the OR array allows realization of more gates in a package at a lower cost. Thus, PALs are very popular. In the manufacturers' catalogues, details of the various types of PALs (number of inputs, outputs, number of product terms allowed per OR gate in the array, etc.) are given. A user has to look at the catalogue before deciding which PAL is suitable for a given application. A popular series of PALs is the 20-pin series, in which there are a total of 20 pins for input and output.

Given the following functions, we will show how they are realized with a PAL:

W = Σ 0, 2, 4, 6, 8, 11
X = Σ 1, 3, 5, 7, 9

In this case, we need a four-input (variable), two-output PAL. As the maximum number of product terms needed is six, the selected PAL should provide this. Observe that X needs only five products; the sixth AND gate is marked with a cross to indicate that all fuses in that line are intact. The PAL realization is shown in Figure 3.32.

FIGURE 3.32 PAL programming of two sums of products.
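The behaviour of a programmed AND-OR plane is easy to model in software. The Python sketch below (the minterm list for W follows the reconstruction above) programs one AND line per minterm of W, with '-' standing for a variable whose fuses are blown on both its true and complement lines, and lets a fixed OR gate sum the lines:

```python
from itertools import product

def and_line(pattern):
    """One programmable AND line. pattern is a string over '0', '1', '-':
    '1' connects the true input, '0' the complemented input, and '-'
    leaves the variable out of the product term."""
    return lambda bits: all(p == '-' or p == b for p, b in zip(pattern, bits))

# One AND line per minterm of W = sum(0, 2, 4, 6, 8, 11): six product terms,
# matching the six AND gates the chosen PAL provides per output.
w_lines = [and_line(format(m, '04b')) for m in (0, 2, 4, 6, 8, 11)]

for a, b, c, d in product('01', repeat=4):
    bits = a + b + c + d
    w = any(line(bits) for line in w_lines)     # the fixed OR gate
    assert w == (int(bits, 2) in {0, 2, 4, 6, 8, 11})
print("The programmed AND-OR plane realizes W")
```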

3.9 SEQUENTIAL SWITCHING CIRCUITS

If we wish to have a circuit where the outputs depend upon the past inputs as well, then information regarding the past inputs must be stored. Thus, in this and subsequent sections we will consider another class of switching circuits that contain memory in one form or another. In this new class of switching circuits, some of the outputs of a practical combinatorial circuit, which appear after some time delay, are fed back as inputs, as shown in Figure 3.33. The outputs then depend not only on the present inputs, but also on the past outputs, which in turn were functions of the previous inputs. Thus, these are known as sequential switching circuits. The history of inputs, as a sequence of events in time, forms an important factor in determining the behaviour of these circuits. The word history is appropriately used to comprehensively refer to the past values or previous conditions which prevailed in the circuit. Remembering some conditions for a period of time and recalling these later means memory. An output, when delayed, can be thought of as stored for a period of time and made available at a later time; thus, these delays act as memories. Information storage is performed in what are usually called memory elements.

FIGURE 3.33 A model for sequential circuits.

3.10 A BASIC SEQUENTIAL CIRCUIT

Consider two NOR gates having two inputs each, connected as shown in Figure 3.34(a). One input of each of these NOR gates is fed from the output of the other NOR gate to form the feedback loops of Figure 3.34(a). The other two inputs, labelled R and S, become the primary inputs which can be controlled externally.

FIGURE 3.34 A simple sequential circuit.

To analyze the behaviour of this circuit, we should remember that any practical NOR gate cannot operate instantaneously; it will have a time delay between the application of a change in input and the appearance of the corresponding output. Thus, the circuit of Figure 3.34(a) is modelled by the equivalent circuit shown in Figure 3.34(b). The gates in the equivalent circuit are assumed to be ideal, with no delay whatsoever, and the delays have been shown as separate elements in the figure. The outputs of the ideal gates, x and y, become X and Y respectively after a time delay d. By inspection of the equivalent circuit, we can write the combinatorial relationships:

$x = \overline{R + Y}, \qquad y = \overline{S + X}$    (3.28)
$X(t) = x(t - d), \qquad Y(t) = y(t - d)$    (3.29)

Observe that x and y are combinatorial functions of the primary inputs R and S as well as the previous values of x and y, which are represented by X and Y respectively. The word 'previous' means d units of time earlier. We see that the outputs x and y follow changes in the inputs R, S, X and Y instantaneously. But we can make use of Eq. (3.29) to determine what happens after a time delay d by just saying "X becomes equal to x and Y becomes equal to y after the time delay". The stable states of the circuit are characterized by X = x and Y = y, when no further changes will take place. If at any instant of time X and Y are not equal to x and y, then the circuit is unstable and further changes must take place.

Now we represent x and y [Eq. (3.28)] in K-maps as shown in Figures 3.35(a) and (b) respectively. For convenience, the two maps are merged to form a single map as shown in Figure 3.35(c). In the K-map the stable states are easily identified by comparing the XY values of each row with the entries of xy in the same row of the map. The stable states are distinguished by circling them. We are aware of the fact that this K-map does not tell us anything whatsoever about the sequence of events in time.

FIGURE 3.35 K-maps for x and y of Eq. (3.28).

Next, we will examine the sequence of events which takes place if the circuit is in an unstable state at some point of time. Consider a situation where RS = 10 (fourth column) and XY = 11 (third row). From the map of Figure 3.35(c) we see that the value of xy at the intersection of the third row and fourth column is 00. Since XY assume the values of xy after a time delay, XY becomes 00. Since the RS value continues to be the same, we must focus on the fourth column but move to the first row. The entry in the map there gives xy = 01. After another time delay, XY becomes 01, shifting our attention to the second row. Here xy is also 01. This is a stable state and there will be no further changes. The sequence of events, called cycles (or transitions), leading to this stable state E is indicated by arrows in the K-map.

Let us now consider the situation where the primary inputs RS assume the values 00 (first column). The outputs XY can be in either one of the two stable states, 01 (A) or 10 (B). We will now show that the actual stable state reached depends upon the previous history of the inputs RS. The question that arises is: what were RS before they became 00? They could have been 01 or 10, from which, by a change of one variable, they could become 00. Let us first consider the input condition RS = 01 (second column) and assume that the circuit was in the stable state C. If we now make S = 0, so that RS = 00, we move to the first column, where we find the stable state B adjacent to C; the circuit settles in B. If, on the other hand, we had started with RS = 10 and the stable state E, and made RS = 00, the circuit will move to the stable state A. These transitions produced by changing the primary inputs are shown by the dotted arrows in the figure. This clearly establishes the fact that this sequential circuit exhibits memory, and this then is the essential difference between combinatorial and sequential circuits.

Thus, the general schematic representation of a sequential circuit given in Figure 3.33 is an abstraction in which the combinatorial part of the circuit is assumed to have no delay. The delay is shown separately in the feedback paths and is labelled as memory.

It can now be seen that the state reached for RS = 00 records whether the input just before was RS = 10 or RS = 01. There are two types of sequential circuits: synchronous and asynchronous. In asynchronous circuits, the inputs and outputs do not change at preassigned times, since the inherent delays are not rigidly controlled. The circuit considered above belongs to the asynchronous category. Synchronous sequential circuits do not depend upon unknown delays in the feedback path to give the memory function. Instead, they use bistable memory elements which can store a '1' or a '0'. Information regarding the outputs is stored in these memories. The transfer of information from outputs to memories is done only at preassigned discrete points of time in a systematic manner, by using a sequence of periodic pulses known as a clock.

The simple sequential circuit we analyzed (Figure 3.34) is known by the name RS latch. Its K-map [Figure 3.35(c)] describes its behaviour. With reference to the K-map, we note the following:

1. When the circuit is in the stable state D (where RS = 11), X and Y are both '0'. In all the other stable states, the outputs X and Y are complements of each other.
2. Starting from RS = 11, if we change RS to 01 or 10, we move to the left or right into the adjacent columns of the map, and the outcomes are predictable. If, however, we change both the variables, making RS = 00, we move to the first column of the map; since both the variables X and Y have to change, there is a critical race condition and the outcome is unpredictable. This leads to problems known as races.
3. Thus, if we disallow the condition RS = 11, we have a predictable behaviour for the circuit, and for the rest of the allowed combinations of R and S the outputs X and Y are complements of each other.

Because X and Y are related in this way, we rename the output X as Q, and Y automatically becomes $\bar{Q}$. When Q is '1', we say that a '1' is stored in the latch. When RS = 01, Q = 1, and when RS = 10, Q = 0. Making RS = 01 is conventionally called setting the latch, and making RS = 10 is called resetting the latch. Keeping RS = 00 leaves the contents of the latch undisturbed (Q retains its previous value), and this is referred to as operating it in the store condition or store mode. The behaviour of the latch is summarized in Table 3.11.

3.11 FLIP-FLOPS

Bistable memory elements used in sequential circuits to store binary information are called flip-flops. There are several types of flip-flops having different behaviour with respect to information storage and retrieval.
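The settling behaviour just described can be imitated in a few lines of Python by iterating the ideal-gate equations of Eq. (3.28) with a unit feedback delay until X = x and Y = y:

```python
def nor(p, q):
    return int(not (p or q))

def settle(R, S, X, Y, steps=10):
    """Iterate x = NOR(R, Y), y = NOR(S, X), feeding the outputs back
    after a unit delay, until a stable state (X = x, Y = y) is reached."""
    for _ in range(steps):
        x, y = nor(R, Y), nor(S, X)
        if (x, y) == (X, Y):
            break
        X, Y = x, y
    return X, Y

Q, Qb = settle(R=0, S=1, X=0, Y=1)     # RS = 01 sets the latch
Q, Qb = settle(R=0, S=0, X=Q, Y=Qb)    # RS = 00 stores the value
print(Q, Qb)                           # 1 0 : the latch remembers Q = 1
Q, Qb = settle(R=1, S=0, X=Q, Y=Qb)    # RS = 10 resets it
print(Q, Qb)                           # 0 1
```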

TABLE 3.11 Behaviour of RS Latch

R S    X = Q
1 0    0           Resets
0 1    1           Sets
0 0    0 or 1      Stores previous value
1 1    Disallowed

As it is, the circuit is asynchronous, i.e., the events are not controlled by a clock. Synchronous operation is obtained as shown in Figure 3.36(a). The inputs R and S are applied along with clock pulses to two AND gates. The outputs of these gates become the inputs of the latch. Normally, the outputs of the gates are low and hence the latch is in the store mode. When the clock pulse (CP) arrives at time tn, the primary inputs Rn and Sn of the nth interval are applied to the latch. After the pulse is gone, the (n + 1)th interval starts. The output Qn+1 during the (n + 1)th interval depends upon Rn, Sn as well as Qn, as shown in Table 3.12. Thus, it becomes possible to express Qn+1 as a Boolean function of Sn, Rn and Qn, given by:

$Q_{n+1} = S_n + \bar{R}_n\,Q_n$    (3.30)

This is called the characteristic equation of an RS latch.

FIGURE 3.36 Clocked RS flip-flop.

TABLE 3.12 Characteristic Table of RS Flip-Flop

Sn  Rn  Qn    Qn+1
0   0   0     0     Stores
0   1   0     0     Resets
1   0   0     1     Sets
0   0   1     1     Stores
0   1   1     0     Resets
1   0   1     1     Sets
1   1   -     Not allowed
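Eq. (3.30) can be checked row by row against Table 3.12; the short Python sketch below does exactly that:

```python
# Rows of Table 3.12 as (Sn, Rn, Qn) -> Qn+1
TABLE_3_12 = {(0, 0, 0): 0, (0, 1, 0): 0, (1, 0, 0): 1,
              (0, 0, 1): 1, (0, 1, 1): 0, (1, 0, 1): 1}

for (S, R, Q), Q_next in TABLE_3_12.items():
    assert int(S or (not R and Q)) == Q_next   # Eq. (3.30): Qn+1 = Sn + Rn'.Qn
print("Eq. (3.30) reproduces every row of Table 3.12")
```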

If we use the fact that Rn = Sn = 1 is disallowed, then the characteristic equation may equivalently be written as:

$Q_{n+1} = \bar{R}_n\,(S_n + Q_n)$

The reader should verify this using a K-map. We assume that the inputs Rn and Sn do not change within the pulse duration. The latch along with the AND gates (called loading gates) is called a clocked RS flip-flop, the symbol for which is shown in Figure 3.36(b).

In an RS flip-flop the condition RS = 11 is disallowed. The circuit of an RS flip-flop can, however, be modified so that even when RS = 11 the outputs X and Y remain complements of each other. Further, for this input condition, we can design the circuit in such a way that the output Qn+1 at tn+1 is the complement of Qn at tn. In this way, we make use of all the four combinations RS can assume to obtain useful transitions. Such a flip-flop is called a JK flip-flop, and it is customary to label the primary inputs by the letters J and K.

How can we make the flip-flop complement its content when RS = 11? Complementing its content means that we have to know what is contained in it, so that we can bring about the appropriate transition. If a '1' is stored in the flip-flop, we must reset it; on the other hand, if a '0' is stored in it, we must set it. It is now clear that we have to load the outputs into their own inputs to complement them, as shown in Figure 3.37. Note that this feedback is such that if X = 1, the AND gate numbered one gets enabled, so that R = 1 and S = 0 during the clock pulse; since Y = 0, gate number two gets disabled. Thus, for JK = 11 the content of the flip-flop gets complemented.

FIGURE 3.37 JK flip-flop.

Using a technique similar to that used for analyzing an RS flip-flop, we can obtain an excitation map of this circuit (Figure 3.38) (assume that CP = 1). It is seen from the map that, in all its stable states, X and Y are complements of each other.

FIGURE 3.38 K-map for explaining JK flip-flop action.

However, when JK = 11 the state is unstable if CP continues to be '1', because the complementation process will go on endlessly so long as CP is '1'. To ensure that complementation takes place only once, it is necessary that the width of the clock pulse is less than the delay in the latch, so that at the end of the narrow clock pulse only one transition has occurred. In many commercially available ICs, a new pulse of appropriate duration is generated internally in the IC, so that the user does not have to worry about the pulse width. This is achieved by differentiating the back edge of the clock or by a charge control mechanism. When such a provision is incorporated, the flip-flop is called a charge controlled JK flip-flop. In such flip-flops, however, one gets another restriction, on the speed of the back edge of the clock: the minimum rate of fall of the back edge of the clock pulse is specified by the manufacturer.

The problem of pulse width control in a JK flip-flop is easily avoided by the arrangement shown in Figure 3.39(a). Here we have two sets of memory elements in tandem, with a new set of switches (S3, S4) interposed between them. One set of memories is called the Master and the other the Slave, as shown in Figure 3.39. The switches S3, S4 are complements of S1, S2; by this we mean that when S1 and S2 are closed, S3 and S4 are open, and vice versa. This is done by using the clock pulse (CP) to operate S1, S2 and $\overline{CP}$ [Figure 3.39(b)] to operate S3, S4.

The operation of this new arrangement is as follows. When the clock is low in the nth interval, S3 and S4 are closed and S1, S2 remain open. Information from the master stays connected to the slave; the outputs Wn, Zn of the nth interval are available, but they are not connected to the master memory. When the clock goes high, first the switches S3, S4 are opened and then S1, S2 are closed. This lets the information regarding Wn, Zn move into the master, while the slave still holds the earlier values Wn–1, Zn–1. At the trailing edge of the external clock pulse, S1, S2 are opened first and then S3, S4 are closed. This updates the contents of the slave memory from the master.

Such flip-flops are called master-slave flip-flops. Note that the new values of the outputs are thus formed at the trailing edge of the clock pulse. We see that the duration of the clock pulse is immaterial from the point of view of the circuit operation.

FIGURE 3.39 Sequential circuit with master-slave memory.

To make such an approach to the design of sequential circuits possible, we put two flip-flops into a single IC package with appropriate gates, so that a composite memory is obtained.

Master-slave JK flip-flop: A master-slave JK flip-flop with two NOR latches is shown in Figure 3.40. Note that the input loading gates are operated with the clock pulse CP and the loading gates in between the latches are operated by $\overline{CP}$.

FIGURE 3.40 Master-slave flip-flop.

Since the pulse width is not critical any more, it becomes possible even to use square wave clocks, which have a duty cycle of unity. In order to make the flip-flop complement its own contents when JK = 11, the output of the slave is fed back to the master as shown in the figure. It is easy to see that the operation of the circuit does not depend upon the clock pulse duration. Since the pulse width can be comparable to the period between the pulses, we should now carefully decide as to which interval the pulse duration belongs. Since the transitions take place at the trailing edge of the pulse, the pulse duration is included in the previous interval, as shown in Figure 3.39(b).

The behaviour of this flip-flop is described in the characteristic table (Table 3.13). Using this table and expressing Qn+1 as a Boolean expression of Jn, Kn and Qn, we get:

$Q_{n+1} = J_n\,\bar{Q}_n + \bar{K}_n\,Q_n$    (3.31)

This may be verified by using a K-map of Table 3.13. Equation (3.31) is the characteristic equation of the master-slave JK flip-flop.

TABLE 3.13 Behaviour of JK Flip-Flop

Jn  Kn  Qn    Qn+1
0   0   0     0     Stores
0   0   1     1     Stores
0   1   0     0     Resets
0   1   1     0     Resets
1   0   0     1     Sets
1   0   1     1     Sets
1   1   0     1     Complements
1   1   1     0     Complements

With reference to Figure 3.40, note that there is provision to set and reset the slave latch asynchronously, using the set direct (SD) and clear direct (CD) inputs. Normally, both the inputs SD and CD are kept low (0). SD = 1 and CD = 0 sets the slave latch, and SD = 0 and CD = 1 resets it. Remember that SD and CD should never be made '1' simultaneously. The use of these inputs in entering asynchronous data into the flip-flop will be discussed later.

D flip-flop: The flip-flop shown in Figure 3.41 is known as a D flip-flop. It is a modification of the RS flip-flop in which the combination RS = 11 is disallowed. A simple way to disallow RS = 11 is to impose the restriction that R and S will always be complements of each other (R = $\bar{S}$). The flip-flop then has only one input, called the D input. The characteristic equation of the D flip-flop is:

$Q_{n+1} = D_n$    (3.32)

Although the operation is not dependent on the pulse width, the change of state takes place soon after the leading edge of the clock pulse. If trailing-edge triggering is needed, it can be obtained using a JK flip-flop as shown in Figure 3.42(a).

FIGURE 3.41 D flip-flop.

TABLE 3.14 Characteristic Table of D Flip-Flop

Dn  Qn    Qn+1 = Dn
0   0     0
0   1     0
1   0     1
1   1     1

T flip-flop: A T flip-flop has the characteristic behaviour given in Table 3.15. This flip-flop has a single input, marked T in the figure. When T = 0, the content of the flip-flop is unaffected (store mode); when T = 1, it toggles. The symbol T stands for toggling, which is a popular way of referring to complementation. The characteristic equation of the T flip-flop is:

$Q_{n+1} = T_n \oplus Q_n$    (3.33)

TABLE 3.15 Behaviour of T Flip-Flop

Tn  Qn    Qn+1 = Tn ⊕ Qn
0   0     0     Stores
0   1     1     Stores
1   0     1     Complements
1   1     0     Complements

A trailing-edge triggered T flip-flop is readily obtained by connecting together the J and K inputs of a JK flip-flop, as shown in Figure 3.42(b).
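The characteristic equations (3.31) to (3.33) lend themselves to a compact simulation. The Python sketch below steps a JK flip-flop through a set, a store, two toggles and a reset, printing the Q sequence 1 1 0 1 0:

```python
def jk(Q, J, K):
    # Eq. (3.31): Qn+1 = J.Q' + K'.Q
    return int((J and not Q) or (not K and Q))

def d(Q, D):
    return D                  # Eq. (3.32)

def t(Q, T):
    return Q ^ T              # Eq. (3.33)

Q = 0
for J, K in [(1, 0), (0, 0), (1, 1), (1, 1), (0, 1)]:
    Q = jk(Q, J, K)
    print(Q, end=' ')         # 1 1 0 1 0: set, store, toggle, toggle, reset
print()
```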

FIGURE 3.42 Conversion of a JK flip-flop into (a) a D flip-flop and (b) a T flip-flop.

The NAND latch: Hitherto we have considered latches using NOR gates. We can make a latch using NAND gates also, as shown in Figure 3.43(a), which is equivalent to the NOR latch of Figure 3.43(c). This is shown using Figure 3.43(b) as an intermediary. In the CMOS technology the basic circuit is the NAND gate; hence, in this technology all flip-flops are based on NAND gates. The arrangement of a master-slave JK flip-flop using NAND gates alone is shown in Figure 3.44.

FIGURE 3.43 NAND latch compared with NOR latch.
FIGURE 3.44 Clocked master-slave JK flip-flop.

For ready reference, we consolidate the characteristic equations of all the flip-flops in Table 3.16.

TABLE 3.16 Characteristic Equations of Flip-Flops

Flip-flop type                        Characteristic equation
RS latch (RS = 11 not allowed)        $Q_{n+1} = S_n + \bar{R}_n\,Q_n$
JK flip-flop (master-slave)           $Q_{n+1} = J_n\,\bar{Q}_n + \bar{K}_n\,Q_n$
D flip-flop                           $Q_{n+1} = D_n$
T flip-flop                           $Q_{n+1} = T_n \oplus Q_n$

3.12 COUNTERS

3.12.1 A Binary Counter

A counter is a sequential circuit consisting of a set of flip-flops which go through a sequence of states on the application of clock pulses. Counters form important building blocks in digital systems. They are used for counting the occurrence of events, for frequency division and for generating timing sequences to control operations in a digital system. This section illustrates how counters are constructed using flip-flops.

Binary counters: Consider four JK flip-flops interconnected as shown in Figure 3.45. Note that the J and K terminals of each flip-flop have been connected together, thereby making them T flip-flops. The first stage is fed by the clock; the output of the first stage is fed into the CP input of the second stage, and so on. Let us assume that all the flip-flops change state at the trailing edge of the pulses applied to their CP inputs. Since the T inputs are maintained at logical 1, each flip-flop will complement for every negative transition of the pulse applied to its CP input. Observe that one output pulse is obtained from the first flip-flop for every two input pulses. Thus, the frequency of the output Q1 is half the frequency of the clock, and the first stage may be thought of as a binary counter or as a frequency divider which divides by a factor of 2. By cascading a number of such flip-flops, we can count by any power of two. The circuit of Figure 3.45 gives one output pulse for every 16 input clock pulses and is hence called a modulo-16 counter. Observe also that the contents of the flip-flops follow the natural binary counting sequence, namely from 0000 to 1111, returning to 0000.

There will be a small delay between the clock input and the output of the first flip-flop. The output of the first flip-flop ripples through the other flip-flops in tandem. For instance, when the count is to change from 0111 to 1000, the trailing edge of the pulse from flip-flop I triggers flip-flop II; this in turn triggers flip-flop III, and

the output of flip-flop III then triggers flip-flop IV. Thus, the last flip-flop changes state only after the cumulative delay of four flip-flops. This counter is therefore called a ripple carry counter.

FIGURE 3.45 A four-stage ripple counter.

3.12.2 Synchronous Binary Counter

In the last section we showed how a ripple carry counter works. A counter in which all the flip-flops change state simultaneously at the trailing edge of a clock pulse is called a synchronous counter. We will first analyze a given synchronous counter and understand its behaviour. Figure 3.46 is a synchronous counter. Observe that all the flip-flops are connected to a common clock; thus, all of them change simultaneously, and the output is obtained after the delay of a single flip-flop. The inputs to the four flip-flops are:

$T_1 = 1,\quad T_2 = Q_1,\quad T_3 = Q_1\,Q_2 \quad\text{and}\quad T_4 = Q_1\,Q_2\,Q_3$    (3.34)

Q1 will toggle at the trailing edge of each clock pulse, as shown in Figure 3.46. As Q1 is the input T2 of flip-flop II, that flip-flop will toggle at each trailing edge of Q1. As T3 = Q1·Q2, flip-flop III will toggle when Q1 and Q2 are both 1 and make a transition from 1 to 0, as seen in Figure 3.46. Following the same argument, Q4 will toggle when Q1, Q2 and Q3 are all 1 and make a transition to 0. The outputs Q1, Q2, Q3, Q4 are as shown in Figure 3.46. Using these waveforms, Table 3.17 is obtained, where the first row corresponds to the first clock pulse, the second row to the second clock pulse, and so on. We see from this table that the outputs of the four flip-flops follow the natural binary counting sequence: they start from 0000, go to 1111 and return to 0000. It is thus a modulo-16 counter.

FIGURE 3.46 Synchronous counter.

TABLE 3.17 Natural Binary Counting Sequence

Q4  Q3  Q2  Q1
0   0   0   0
0   0   0   1
0   0   1   0
0   0   1   1
0   1   0   0
0   1   0   1
0   1   1   0
0   1   1   1
1   0   0   0
1   0   0   1
1   0   1   0
1   0   1   1
1   1   0   0
1   1   0   1
1   1   1   0
1   1   1   1
0   0   0   0
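A direct simulation of Eq. (3.34) shows the sequence of Table 3.17 being generated. In this Python sketch all four T flip-flops are updated together, as in the synchronous circuit of Figure 3.46:

```python
def synchronous_count(n_pulses):
    """Modulo-16 synchronous counter with T1 = 1, T2 = Q1,
    T3 = Q1.Q2 and T4 = Q1.Q2.Q3 (Eq. 3.34)."""
    q1 = q2 = q3 = q4 = 0
    for _ in range(n_pulses):
        t1, t2, t3, t4 = 1, q1, q1 & q2, q1 & q2 & q3
        # All flip-flops toggle simultaneously at the trailing edge
        q1, q2, q3, q4 = q1 ^ t1, q2 ^ t2, q3 ^ t3, q4 ^ t4
        print(q4, q3, q2, q1)

synchronous_count(16)   # prints 0 0 0 1 up to 1 1 1 1, then wraps to 0 0 0 0
```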

3.13 SHIFT REGISTERS

A register is a unit which can store a string of bits. In this section we will illustrate how a register can be constructed with flip-flops, and how its contents are shifted using control signals and gates associated with the flip-flops. Generally, D flip-flops are used to construct registers. Recall that the behaviour of a D flip-flop is given by Qn+1 = Dn (see Table 3.14). We assume that a D flip-flop changes state at the trailing edge of a clock pulse applied to it.

Consider the four-bit register of Figure 3.47. The output of the first flip-flop is connected to the input of the second flip-flop. Thus, we get Yn+1 = Xn; in other words, the data moves from flip-flop X to Y at the occurrence of the trailing edge of the clock. Similarly, data moves from Y to Z and from Z to W. Simultaneously, data also moves in from the serial data input terminal into X, and the data contained in W is lost.

FIGURE 3.47 Shift register using D flip-flops.

The method of interconnecting JK flip-flops to form a shift register is shown in Figure 3.48. Often it is necessary to enter data into all the flip-flops of the register simultaneously; this is called parallel data transfer. It can be done independent of the clock (asynchronously), or the data may be constrained to enter only at the occurrence of the trailing edge of the clock pulse (synchronously). Asynchronous data entry is indicated in Figure 3.48, where the data is entered when the parallel load enable signal is '1'. In a number of algorithms it is required to enter bits into the register and shift its contents to the left or right by a specified number of bits. In order to shift the contents of the register left or right under the control of signals, the arrangement shown in Figure 3.49 is used.
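The serial shifting of Figure 3.47 can be modelled in a few lines. In this Python sketch each clock moves every bit one stage along, with a new serial bit entering X and W's old bit being lost:

```python
def shift_right(reg, serial_in):
    """One clock of the register of Figure 3.47: X, Y, Z, W each copy
    their left neighbour at the trailing edge; W's old bit is lost."""
    x, y, z, w = reg
    return (serial_in, x, y, z)

reg = (0, 0, 0, 0)              # (X, Y, Z, W)
for bit in [1, 0, 1, 1]:        # serial input, one bit per clock pulse
    reg = shift_right(reg, bit)
    print(reg)
# after four clocks reg == (1, 1, 0, 1): the first bit entered has reached W
```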

FIGURE 3.48 Shift register using JK flip-flops.

When the SHR signal is '1', data is transferred from flip-flop 1 to 2, from 2 to 3, etc. As each flip-flop gets triggered at the trailing edge of the clock pulse, there is no ambiguity in the transfer of data. In other words, we can simultaneously write fresh data into a flip-flop and read its previous content. This becomes possible since the new data does not get into the flip-flop till the trailing edge of the clock pulse arrives, by which time the old content is safely transferred to the next flip-flop. Shifting left is achieved by making the SHL signal '1'. In this case, the content of flip-flop 3 is transferred to 2 and that of 2 into 1 (not shown in Figure 3.49). New data can also be entered into flip-flop 3 via the terminals marked in Figure 3.49.

FIGURE 3.49 Left-right shift register.

Such a shift register can be an integrated circuit in a single chip. A block diagram of such a shift register is given as Figure 3.50.

Controlled shift register: Controlled shift registers can be made using MUXs and JK flip-flops. Table 3.18 gives the functions to be performed by this shift register.

FIGURE 3.50 A controlled shift register.

TABLE 3.18 Control Inputs to a Shift Register

X  Y   Function
0  0   Load register with a0, a1, a2, a3
0  1   Shift right
1  0   Shift left
1  1   Complement contents of register

FIGURE 3.51 Realization of controlled shift register with MUXs.

A MUX output may be connected to each of the flip-flop inputs. This will route the appropriate MUX inputs to the four flip-flops depending on the control inputs. Referring to Figure 3.51, when XY = 00 the I0 inputs of the MUXs, namely a0, a1, a2 and a3,

will be routed as the inputs to the four flip-flops of the shift register. If the Enable input is 1, then at the trailing edge of the clock pulse a0 will be stored as Q0, a1 as Q1, a2 as Q2 and a3 as Q3. Similarly, when XY = 01, we have Q0 ← Q1, Q1 ← Q2, Q2 ← Q3 and Q3 ← SHR. The reader can deduce how the other control inputs route data to the flip-flops.

Controlled binary counter: We will now design a controlled binary counter which has the functions defined in Table 3.19.

TABLE 3.19 A Controlled Counter

X  Y   Function
0  0   Stop counting
0  1   Count up
1  0   Count down
1  1   Complement state

The block diagram of the counter to be designed is shown as Figure 3.52.

FIGURE 3.52 Block diagram of a controlled binary counter.

The state diagram of the counter for counting up and down may be drawn, and by its inspection we may deduce that for counting up:

$J_0 = K_0 = 1,\quad J_1 = K_1 = Q_0,\quad J_2 = K_2 = Q_0\,Q_1 \quad\text{and}\quad J_3 = K_3 = Q_0\,Q_1\,Q_2$    (3.35)

For counting down, the flip-flop equations are:

$J_0 = K_0 = 1,\quad J_1 = K_1 = \bar{Q}_0,\quad J_2 = K_2 = \bar{Q}_0\,\bar{Q}_1 \quad\text{and}\quad J_3 = K_3 = \bar{Q}_0\,\bar{Q}_1\,\bar{Q}_2$    (3.36)
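Eqs. (3.35) and (3.36) can be exercised directly. The following Python sketch applies the JK characteristic equation to all four stages, counting up three steps and back down two (q0 is the least significant bit):

```python
def jk(q, j, k):
    return int((j and not q) or (not k and q))   # Eq. (3.31)

def count(q, up=True):
    """One clock pulse of the controlled counter, per Eqs. (3.35)/(3.36)."""
    q0, q1, q2 = q[0], q[1], q[2]
    s0, s1, s2 = (q0, q1, q2) if up else (1 - q0, 1 - q1, 1 - q2)
    j = (1, s0, s0 and s1, s0 and s1 and s2)
    return tuple(jk(bit, jn, jn) for bit, jn in zip(q, j))

q = (0, 0, 0, 0)                 # (q0, q1, q2, q3), q0 is the LSB
for _ in range(3):
    q = count(q, up=True)        # counts 1, 2, 3
for _ in range(2):
    q = count(q, up=False)       # counts back down: 2, 1
print(q)                         # (1, 0, 0, 0), i.e. the count is 1
```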

Using MUXs, the counter may be implemented as shown in Figure 3.53.

FIGURE 3.53 MUX realization of a controlled counter.

SUMMARY

1. A subset of a general algebra known as Boolean Algebra is called Switching Algebra. It is useful for designing digital switching circuits.
2. Switching algebra deals with a set of elements (0, 1) together with two binary operators known as AND (·) and OR (+) and a unary operator called NOT (shown as a bar over a variable).
3. A set of postulates defines the functions of the operators and the result of applying these operators to binary variables (i.e., variables which can have the value 0 or 1). Using these postulates, a set of theorems can be proved. Table 3.3 gives a summary of all the postulates and theorems of Boolean Algebra.
4. A Boolean function f(a, b, c) of three variables a, b, c can be defined using a truth table. A truth table lists the value of the dependent variable for all possible values of the independent variables a, b, c. As each independent variable can have only one of the two values (either 0 or 1), there can be only eight possible combinations of values of a, b and c, starting with 000 and ending with 111. Thus, the truth table has eight rows, and each row gives one combination of values of a, b, c and the corresponding value of f(a, b, c).
5. A Boolean expression corresponding to any truth table representing z = f(a1, a2, ..., an) can be obtained by inspecting the column corresponding to the dependent variable z, row by row. A term in the Boolean

expression corresponding to each row where z = 1 is obtained by applying AND to all the independent variables a1, a2, …, an in the truth table. The independent variable with a 0 entry appears in the complement form and that with a 1 entry appears as it is. For example, a row with an = 0, an–1 = 1, …, a3 = 1, a2 = 0, a1 = 1, a0 = 0 and z = 1 will have the term ān · an–1 ⋯ a3 · ā2 · a1 · ā0 in the Boolean expression for z. The expressions of all the rows with z = 1 are ORed to get the final expression for z.
6. The expression obtained as explained above is said to be in a standard sum of products form. Each term in the expression is called a minterm. A notation m0, m1, …, is used to represent minterms starting from the row with all 0s to the row with all 1s. A Boolean expression for z = f(a0, a1, …) represented by a truth table with 1s in rows 0, 3, 5, 6, 7, …, 2^n – 1 is represented in the standard form as:

    z = m0 + m3 + m5 + m6 + m7 + … + m(2^n – 1) = Σ 0, 3, 5, 6, 7, …, 2^n – 1

7. Another equivalent form for expressing the same truth table is obtained by examining the rows with 0 entries for z. The term for the corresponding row is obtained by ORing the independent variables. A variable with a 0 entry is used as it is and a variable with a 1 entry is complemented. The AND operation is then applied to the individual terms. A Boolean expression for z = f(a0, a1, …) represented with 0s in rows 1, 2, 4, 6, 8, …, 2^n – 2 is represented in this form as:

    z = M1 · M2 · M4 · M6 · M8 ⋯ M(2^n – 2) = Π 1, 2, 4, 6, 8, …, (2^n – 2)

   This form is called the maxterm form.
8. With two Boolean variables, 16 possible truth tables can be formed. Each truth table represents a Boolean function. Of these, eight can be used to represent binary operators. Of these eight operators, two operators called NAND (NOT of AND) and NOR (NOT of OR) can be used to implement all Boolean functions and are called universal operators.
9. Boolean expressions in the standard form are not optimal for realization with gates. One commonly used criterion of optimization is minimizing the number of literals in the Boolean expression, which also usually minimizes the number of gates used in realization.
10. There are two methods of minimizing the number of literals in Boolean expressions. Both of them repeatedly use the theorem a·b + a·b̄ = a, which results in the elimination of one variable.
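The mechanical nature of the sum of products construction in items 5 and 6 makes it easy to automate. The short Python sketch below (ours, not from the book) prints the standard sum of products expression for a list of minterm numbers; for plain-text output a trailing apostrophe stands in for the complement bar:

    # Build the standard sum of products form z = Sigma(minterms).
    def sum_of_products(minterms, nvars):
        terms = []
        for m in minterms:
            lits = []
            for i in reversed(range(nvars)):              # most significant variable first
                bit = (m >> i) & 1
                lits.append(f"a{i}" if bit else f"a{i}'")  # complement the 0-entries
            terms.append(".".join(lits))
        return " + ".join(terms)

    print(sum_of_products([0, 3], 2))    # prints: a1'.a0' + a1.a0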

11. A method called the Veitch–Karnaugh map method is a graphical aid to identify pairs of terms of the type (a·b + a·b̄) which allows the elimination of one literal.
12. The Veitch–Karnaugh map is a diagram made up of a number of squares, each square representing a minterm of the corresponding truth table. A map for three variables has eight squares and a map for four variables has 16 squares.
13. The labelling of the boxes is such that physically adjacent boxes are also logically adjacent, i.e., they differ in only one variable which appears in true form in one box and in its complement form in the adjacent box.
14. Whenever the value of the dependent variable is 1, it is entered in the box corresponding to this minterm. By inspecting the 1 entries in the boxes it is possible to identify the possibility of combining minterms to eliminate one or more variables.
15. The Karnaugh map is useful to minimize Boolean functions of up to five variables. As it depends on pattern recognition in two-dimensional maps, an algorithmic method is needed for functions of more variables. The Quine–McCluskey method of minimizing Boolean expressions is an algorithmic method which can be automated using a program.
16. NAND and NOR gates are universal gates in the sense that AND, OR and NOT operators can be realized with only NAND gates or only NOR gates.
17. The universal characteristics of NAND/NOR gates make them the preferred gates for designing combinatorial circuits as variety is reduced, with consequent advantages in design and maintenance.
18. A multiplexer (abbreviated MUX) is a combinatorial circuit which steers one out of 2^n input signals to a single output line depending on the value of n control variables.
19. MUXs can be used to realize any combinatorial expression. A MUX with n control variables can realize all the minterms of n + 1 variables. Thus, for example, if A and B are the control variables and the four inputs are C or C̄, we can realize any sum of the minterms A·B·C, A·B·C̄, A·B̄·C, A·B̄·C̄, Ā·B·C, Ā·B·C̄, Ā·B̄·C and Ā·B̄·C̄.
20. A demultiplexer (abbreviated DEMUX) is a device which does the reverse of a MUX. It steers data on a single input line to one of 2^n output lines depending on n control variables. A MUX–DEMUX combination can be used to steer a source bit to a destination via a one-bit bus.
21. A Programmable Logic Device (PLD) consists of an array of AND gates feeding an array of OR gates. The inputs to the AND array have both true and complemented forms of the input variables. If both the AND and OR arrays have fusible links to select gates it is called a Programmable Logic Array (PLA). PLAs can realize any sum of products.

22. A PAL (Programmable Array Logic) has a fixed OR array and a programmable AND array with fusible links. Given a set of Boolean expressions in a sum of products form, it is easy to program a PAL after selecting an appropriate PAL. Realizing with a PAL is simple and does not require any minimization.
23. The outputs of an ideal combinatorial circuit depend only on the current inputs and not the past inputs. The outputs of sequential circuits depend not only on the current inputs but also on the past history of inputs. Thus, they have memory.
24. Sequential circuits can be modelled at a circuit level by an ideal combinatorial circuit some of whose outputs are fed back after a delay as secondary inputs to the combinatorial circuit. The delay acts as memory.
25. An asynchronous sequential circuit is one whose inputs and outputs do not change at pre-assigned times. In other words, they spontaneously change their state whenever appropriate inputs are applied. They are inherently faster but difficult to design. They exhibit what is known as a race condition, making the output state unpredictable.
26. Synchronous circuits change state at predictable times as they are driven by a periodic sequence of pulses known as a clock. Clocks determine when the system can change state. Thus, most practical circuits are synchronous circuits.
27. A simple sequential circuit which exhibits memory is an RS flip-flop which is made using two NOR gates (see Figure 3.34).
28. There are four types of flip-flops, all based on the clocked RS flip-flop. They are the JK, JK master-slave, D and T flip-flops.
29. The behaviour of a flip-flop is determined by what is known as its characteristic equation. The characteristic equation gives the state of the flip-flop (i.e., its output) at clock time (n + 1) given its output (or state) at the nth clock time.
30. The characteristic equations of RS, JK, D and T flip-flops are respectively:

    Qn+1 = Sn + R̄n · Qn      (R = S = 1 not allowed)
    Qn+1 = Jn · Q̄n + K̄n · Qn
    Qn+1 = Dn
    Qn+1 = Tn ⊕ Qn
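The four characteristic equations in item 30 translate directly into code. Here is a small Python sketch (ours, not from the book) in which each function returns Qn+1 from the inputs and Qn, with all values 0 or 1:

    def rs(S, R, Q):
        assert not (R == 1 and S == 1)        # R = S = 1 not allowed
        return S | ((1 - R) & Q)              # Qn+1 = Sn + Rn'.Qn

    def jk(J, K, Q):
        return (J & (1 - Q)) | ((1 - K) & Q)  # Qn+1 = Jn.Qn' + Kn'.Qn

    def d(D, Q):
        return D                              # Qn+1 = Dn

    def t(T, Q):
        return T ^ Q                          # Qn+1 = Tn XOR Qn

    print(jk(1, 1, 0), jk(1, 1, 1))           # J = K = 1 toggles: prints 1 0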

31. A counter is a sequential circuit whose state changes to a predefined next state with the application of a clock pulse and returns to the starting state after the application of a finite number of clock pulses.
32. A commonly used sequential circuit is called a register. A register stores a string of bits and is made up of a set of interconnected D flip-flops. An initial value can be loaded into all the flip-flops simultaneously (initiation step). A string of bits can also be serially fed to a register. When bits are serially fed and the contents of the register shifted, these bits replace the values stored earlier. The contents of a register can be shifted left or right. Shift registers of various sizes and controls are available as IC chips.
33. Registers are common building blocks of many digital systems such as calculators and computers.

EXERCISES

1. Show that the three definitions in Section 3.1 satisfy the postulates of Boolean algebra.
2. Prove De Morgan's law for 'n' variables.
3. Using the postulates of Boolean algebra prove the following:
   (i) x·y + x̄·z + y·z = x·y + x̄·z
   (ii) (x + y)·(x̄ + z)·(y + z) = (x + y)·(x̄ + z)
4. The exclusive OR operation is defined for two variables as a ⊕ b = ā·b + a·b̄. Prove the associative law for the exclusive OR operation.
5. Simplify the following Boolean functions using the theorems of Boolean algebra:
   (i) f = x·y + x̄·z + x·ȳ·z + x̄·y + y·z + x̄·z̄
   (ii) f = (x + y)·(x̄·z + z̄)·(ȳ + x̄·z)
6. Simplify the following Boolean functions:
   f1 (a, b, c, d) = Σ (0, 6, 7, 13)
   f2 (a, b, c, d) = Σ (1, 5, 11, 13)
   f3 (w, x, y, z) = Σ (2, 4, 6, 13)
7. Repeat the problems in Exercise 6 using Karnaugh maps and verify your results.
8. Using Karnaugh maps simplify the following Boolean functions:
   (i) f (a, b, c, d) = Σ (0, 1, 11, 15) + Σ (2, 8) (the second list being don't care terms)
   (ii) f (a, b, c, d) = Σ (0, 3, 6, 7, 10, 14, 15)
9. Express the following in canonical sum and canonical product form:
   (i) …
   (ii) f (w, x, y, z) = Π (3, 5, 9, 11, 15)
10. A single communication line between a computer building and another remote building is to be used by 16 terminal users located in the remote building. Assuming that each terminal has a buffer register, design a MUX–DEMUX system which will allow the 16 terminal users to time share the communication line and use the computer.

11. Repeat Exercise 8 using MUXs. Repeat with PALs and PLAs.
12. An Op-code decoder takes as input a set of bits and places a 1 on any one of a set of output lines. Is a MUX or a DEMUX appropriate to realize it? Realize the circuit for a four-bit Op-code.
13. Design a logic circuit using DEMUXs which will accept a seven-bit ASCII code for a character and energize one of 128 solenoids to activate a type bar corresponding to the appropriate character.
14. Design a shift register which has one input, one output, one shift pulse input and a control input. If the control input is a 1, shift left once, else shift right once, for each shift pulse.
15. Design a controlled four-bit register which can be controlled to perform the following four functions: Load, Clear, 1's complement, 2's complement.
16. Modify an asynchronous RS flip-flop appropriately so that when R and S are both 1, the flip-flop is set.
17. Design an NBCD counter using an MSI counter chip to (i) Load, (ii) 9's complement contents, (iii) Count up and (iv) Count down.

4
ARITHMETIC AND LOGIC UNIT–I

LEARNING OBJECTIVES

In this chapter we will learn:

• Algorithms for addition and subtraction.
• How to add/subtract numbers represented using one's and two's complement notation.
• Algorithms for multiplication and division.
• Big endian and little endian representation of integers.
• How to store real numbers using floating point representation.
• Details of the IEEE 754 standard for representing real numbers.
• Algorithms to add/subtract/multiply/divide floating point numbers.
• How to design logic circuits to add/subtract/multiply/divide integers and real numbers, and the functions of MSI chips which perform arithmetic and logic operations.

4.1 INTRODUCTION

One of the important components of all computers is an Arithmetic Logic Unit (ALU). As the name implies, this unit is used to carry out all the four arithmetic operations of add, subtract, multiply and divide, and logic operations such as AND,

OR, NOT and XOR on bit strings. This unit is also used to compare both numbers and bit strings and obtain the result of the comparison, to be used by the control unit to alter the sequence of operations in a program.

Arithmetic operations are carried out both on integers and reals. While integers are simple to deal with, some interesting problems of representation arise when a large range of real numbers with adequate precision is to be stored and processed. We thus have to discuss the trade-offs between various methods of representing real numbers. In this chapter we will be mainly concerned with the logical design of the ALU of computers. In order to design an ALU we should first understand the algorithms for binary arithmetic operations. We will thus first examine the representation and algorithms for arithmetic operations and then realize them with logic circuits.

Once we decide on the methods of representing numbers and the algorithms to process them, the next problem is that of designing logic circuits to implement these operations. There are two distinct methods of realizing logic circuits to perform arithmetic/logic operations. One of them is using combinatorial circuits and has two advantages: (i) ease of realization and (ii) potentially higher speed of operation of these circuits. On the negative side, these logic circuits use a large number of gates and are expensive to realize. The other method is using sequential systems for realization. In such systems a single logic circuit (such as an adder) is used repeatedly on one bit or a group of bits to realize addition/subtraction. For example, to perform arithmetic on two 32-bit operands we can design a four-bit unit and use it sequentially eight times with appropriate clocking. This type of realization is economical in the use of hardware but is inherently slower compared to combinatorial realization. There are significant differences between the methodology used for combinatorial design and sequential design. A designer thus has to pick the appropriate realization based on a cost/time trade-off. The decision will also be determined by the need for speed in a given situation. This chapter will describe the combinatorial realization of ALUs.

4.2 BINARY ADDITION

Counting is a form of addition as successive numbers, while counting, are obtained by adding 1. In decimal addition, we start with 0 and by successively adding 1 reach 9. After 9, as the base of the system is 10 (ten) and as there are no further symbols in the system, we count 10 (one followed by zero). The 1 represents a carry to the tens position in the positional system. Similarly, in the binary system the count progresses as follows: 0, 1, 10, 11, 100, 101, 110, 111, …

Using the above idea, we may obtain Table 4.1 to represent the addition of binary numbers. Table 4.1 is known as a 'half-adder' table.

TABLE 4.1 A Half-Adder Table

    a   b   Sum   Carry
    0   0    0      0
    0   1    1      0
    1   0    1      0
    1   1    0      1

We will now give an example of binary addition.

EXAMPLE 4.1

    Carry      1 1 0 0 0
    Augend     0 1 1 0 1
    Addend     0 1 1 1 0
    Sum        1 1 0 1 1

We see from the example that while adding two binary numbers, we have to add three bits: the carry bit and the bits of the two numbers being added. An addition table showing the values of sum and carry with three bits as inputs is developed in Table 4.2. Table 4.2 is known as the 'full-adder' table in contrast with Table 4.1.

TABLE 4.2 A Full-Adder Table

    a (augend)   b (addend)   Carry   Sum   Carry to next position
        0            0          0      0             0
        0            0          1      1             0
        0            1          0      1             0
        0            1          1      0             1
        1            0          0      1             0
        1            0          1      0             1
        1            1          0      0             1
        1            1          1      1             1
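The full-adder table can be exercised in software. The following Python sketch (ours, not from the book) implements one column of Table 4.2 and chains columns into a ripple-carry adder:

    # One column of Table 4.2 plus a ripple-carry chain of such columns.
    def full_add(a, b, carry_in):
        s = a ^ b ^ carry_in                                  # Sum column
        carry_out = (a & b) | (a & carry_in) | (b & carry_in) # Carry to next position
        return s, carry_out

    def ripple_add(x_bits, y_bits):
        """x_bits, y_bits: lists of bits, least significant bit first."""
        carry, out = 0, []
        for a, b in zip(x_bits, y_bits):
            s, carry = full_add(a, b, carry)
            out.append(s)
        return out + [carry]

    # 01101 (13) + 01110 (14) = 11011 (27); bits are listed LSB first.
    print(ripple_add([1, 0, 1, 1, 0], [0, 1, 1, 1, 0]))   # [1, 1, 0, 1, 1]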

4.3 BINARY SUBTRACTION

Binary subtraction is a special case of addition. In fact, the addition of a negative number to a positive number is subtraction. If a number x with a positive sign is to be added to a number y with a negative sign, it is equivalent to subtracting y from x. Thus, before discussing subtraction we should discuss how negative numbers may be represented in the binary system.

One method of representing the signs of numbers is to use an extra bit at the left end of the number. By convention a zero is used to represent the + sign and a one to represent the – sign. Thus, +5 is represented by 0,101 and –7 is represented by 1,111. A comma is used to separate the sign bit from the number. (Please note that we use the comma only for readability. It is ignored in computation and data storage.)

In order to subtract two binary numbers we may use a subtract table similar to the one used for addition. We will obtain a subtract table after looking at a few examples.

EXAMPLE 4.2 Subtracting single bits:
    0 – 0 = 0,  1 – 0 = 1,  1 – 1 = 0,  and 0 – 1 = 1 with a borrow of 1 from the next position.

EXAMPLE 4.3

    Borrow        1 1 1
    Minuend      1 0 1 0
    Subtrahend   0 1 1 1
    Difference   0 0 1 1

EXAMPLE 4.4

    Borrow        1 0 0
    Minuend      1 0 1 1
    Subtrahend   0 1 1 0
    Difference   0 1 0 1

Tables similar to the half-adder table and full-adder table may be obtained using the above examples. The tables are given as Table 4.3 and Table 4.4.

TABLE 4.3 Half Subtractor Table

    A   B   Difference   Borrow
    0   0       0          0
    0   1       1          1
    1   0       1          0
    1   1       0          0

TABLE 4.4 Full Subtractor Table

    A (Minuend)   B (Subtrahend)   Borrow   Difference   Borrow to next position
         0              0             0          0                 0
         0              0             1          1                 1
         0              1             0          1                 1
         0              1             1          0                 1
         1              0             0          1                 0
         1              0             1          0                 0
         1              1             0          0                 0
         1              1             1          1                 1

The sign and magnitude of the result of an add or a subtract operation, when numbers are represented with a sign and a magnitude part, may be summarized as shown in Table 4.5. In this table, x and y represent the two operands. The variable x is taken to be the first number (augend/minuend) in addition or subtraction and the variable y as the second number (addend/subtrahend). The magnitudes of x and y are represented by m(x) and m(y) respectively and their signs by s(x) and s(y). In Table 4.5, s̄(y) means the complement of s(y). In other words, if s(y) = 1, s̄(y) = 0 and if s(y) = 0, s̄(y) = 1.

TABLE 4.5 Add/Subtract Rules

    Conditions                 Rule 1     Rule 2    Rule 3    Rule 4     Rule 5     Rule 6
    Is s(x) = s(y)?              No         No        No       Yes        Yes        Yes
    Operation?                Subtract     Add       Add     Subtract   Subtract    Add
    Is m(x) ≥ m(y)?              —          No        Yes       No        Yes         —

    Actions
    Sign of Result =            s(x)       s(y)      s(x)      s̄(y)      s(x)       s(x)
    m(Result) = m(x) + m(y)      X          —         —         —          —          X
    m(Result) = m(y) – m(x)      —          X         —         X          —          —
    m(Result) = m(x) – m(y)      —          —         X         —          X          —

The table used above (Table 4.5) is known as a decision table. It lists in a tabular form the conditions to be tested and the actions to be taken based on the results of the tests. Each column to the right of the vertical double line is called a decision rule. An 'X' against an action indicates that the particular action is to be carried out. A dash against a condition indicates that the outcome of testing the condition is irrelevant. A dash entry for an action indicates that the corresponding action need not be carried out. (A reader not familiar with decision table notation should read Appendix A.) For example, the fifth rule in Table 4.5 is interpreted as: If s(x) = s(y), and if the operation is subtract, and if m(x) ≥ m(y), then the sign of the result is s(x), and the magnitude of the result equals the magnitude of x minus the magnitude of y.

It is seen from the table that when numbers are represented in the sign magnitude form we must have separate procedures or algorithms to add and subtract. It will be advantageous if another convention could be evolved for representing positive and negative numbers which allows us to use one basic algorithm for both addition and subtraction. The advantage will be that a single basic electronic circuit could then be used to implement addition as well as subtraction. Two conventions for representing negative numbers which allow this are the one's complement representation and the two's complement representation of numbers.

4.4 COMPLEMENT REPRESENTATION OF NUMBERS

In the one's complement system, the representation of positive numbers is identical to that used in the sign magnitude system. The convention used to represent negative numbers is different. The number –5, for example, is represented by 1,010. This representation is obtained by replacing every 1 by a 0 and every 0 by a 1 in the binary representation of the number +5 (0,101). The process of replacing a 1 by a 0 and a 0 by a 1 is known as bit complementing. The bit appearing in front of the comma is the sign bit. In general, for an n bit number x (excluding the sign bit) the one's complement is given by (2^n – 1 – x). Thus, –14, for example, is represented by 1111 – 1110 = 0001 excluding the sign bit.

The two's complement representation of a negative number is obtained by adding a 1 to the one's complement representation of that number. Thus, the two's complement representation of –5, for example, is 1,011. In general, for an n bit number x, the two's complement is given by (2^n – x). Another method of obtaining the two's complement of a binary number is to scan the number from right to left and complement all the bits appearing after the first appearance of a 1. Thus, the two's complement of 0010, for example, is 1110. It is left to the student to understand why this is true.
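Both complements follow directly from the formulas above. A small Python sketch (ours, not from the book):

    # n-bit complements: 1's complement = 2**n - 1 - x, 2's complement = 2**n - x.
    def ones_complement(x, n):
        return (2**n - 1) - x

    def twos_complement(x, n):
        return (2**n - x) % (2**n)   # the modulo keeps the complement of 0 at 0

    n = 4
    print(format(ones_complement(0b0101, n), "04b"))  # 1010, i.e. -5 in 1's complement
    print(format(twos_complement(0b0101, n), "04b"))  # 1011, i.e. -5 in 2's complement
    print(format(twos_complement(0b0010, n), "04b"))  # 1110, matching the scan rule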

The complement notation may be used to represent negative numbers in any number base. If the radix of a number system is r then one may define an (r – 1)'s complement and an r's complement. Thus, for the representation of negative decimal numbers one may use the 9's complement or the 10's complement representation. For example, the 9's complement of the decimal number 3459 is 9999 – 3459 = 6540, whereas its 10's complement is 6541.

Table 4.6 depicts the three methods of representing negative numbers. We would like to emphasize again that the comma used in the complement representation of binary numbers is intended only for easy readability and is not relevant in either storing the number or in computation.

TABLE 4.6 Three Methods of Representing Negative Numbers

    Sign and magnitude   Sign and magnitude   One's complement   Two's complement
    (binary)             (decimal)            (decimal)          (decimal)
    0000                    +0                   +0                 +0
    0001                    +1                   +1                 +1
    0010                    +2                   +2                 +2
    0011                    +3                   +3                 +3
    0100                    +4                   +4                 +4
    0101                    +5                   +5                 +5
    0110                    +6                   +6                 +6
    0111                    +7                   +7                 +7
    1000                    –0                   –7                 –8
    1001                    –1                   –6                 –7
    1010                    –2                   –5                 –6
    1011                    –3                   –4                 –5
    1100                    –4                   –3                 –4
    1101                    –5                   –2                 –3
    1110                    –6                   –1                 –2
    1111                    –7                   –0                 –1

4.5 ADDITION/SUBTRACTION OF NUMBERS IN 1'S COMPLEMENT NOTATION

We will first illustrate with some examples the addition and subtraction rules for binary numbers represented in the 1's complement notation. We will assume that the two numbers are four bits long.

If a number y is to be subtracted from another number x, it is equivalent to adding –y to +x. In one's complement arithmetic –y is represented by its one's complement and it is added to x. Thus, the subtraction operation is replaced by the addition of a complement. No separate electronic circuit is needed for subtraction in this case as addition and complementing circuits are sufficient. This is the main reason why the complement notation is used.

When positive numbers are added, the two binary numbers may be added including the sign bits and the result will be correct. This assumes that for 4 bit numbers the sum is less than or equal to 15. If the sum exceeds 15 this rule gives an incorrect answer as may be seen from Example 4.6.

EXAMPLE 4.5
    +3  =   0.0011
    +7  =   0.0111
    +10 =   0.1010

EXAMPLE 4.6
    +8      0.1000
    +9      0.1001
            1.0001    Incorrect

In one's complement arithmetic if the two numbers including the sign bits are added and an overflow (beyond the sign bit) is obtained, then the overflow bit is removed and added to the result. Adding the overflow bit to the least significant bit of the sum is called an end around carry. This leads to the right answer with the right sign. This is illustrated in Examples 4.7 and 4.8.

EXAMPLE 4.7
    +5       0.0101
    –3       1.1100
            10.0001
                  1    (end around carry)
    +2       0.0010

EXAMPLE 4.8
    +15      0.1111
    –2       1.1101
            10.1100
                  1    (end around carry)
    +13      0.1101

If a positive number is added to a negative number and no overflow is observed then the result is negative, and it is in one's complement form. The answer is correct as it is. Examples 4.9 and 4.10 illustrate this.

EXAMPLE 4.9
    –5      1.1010
    +3      0.0011
    –2      1.1101

EXAMPLE 4.10
    –8      1.0111
    +8      0.1000
    –0      1.1111

If two negative numbers are added then an overflow results and the overflow bit is added to the answer. The final answer has the right sign. We again assume that the magnitude of the sum (for four-bit operands) is less than or equal to 15 for the rule to work correctly.

EXAMPLE 4.11
    –5       1.1010
    –8       1.0111
            11.0001
                  1    (end around carry)
    –13      1.0010

EXAMPLE 4.12
    –8       1.0111
    –9       1.0110
            10.1101
                  1    (end around carry)
             0.1110    Incorrect

Observe that if the signs of the augend and addend are the same and after addition the sign of the sum is not the same, then the result is incorrect. The rule observed through the examples considered so far is summarized using two linked decision tables [see Table 4.7]. In these tables it is assumed that x and y are the operands and z is the result and that negative numbers are represented in their 1's complement form.

TABLE 4.7 Decision Table for One's Complement Addition

    T1: s(x) = s(y)                        Y        N        Y         N
        Operation                         Add      Add    Subtract  Subtract
        Complement y                       —        —        X         X
        Add numbers including sign bit     X        X        X         X
        Add carry (if any) to the sum z    X        X        X         X
        Declare z as answer                —        X        X         —
        Go to T2                           X        —        —         X
        Stop                               —        X        X         —

    T2: s(z) = s(x)                        Y        N
        Error (Result out of range)        —        X
        Declare z as answer                X        —
        Stop                               X        X

In the one's complement representation of numbers, there are two possible representations of zero. The representation of +0 is 0.0000 and that of –0 is 1.1111. We saw the result of adding +8 and –8 in Example 4.10, which gave –0. The rationale behind the rules given in this section may be derived using the definition of the 1's complement of a number x and systematically working out all the cases. This is left as an exercise to the student.

4.6 ADDITION/SUBTRACTION OF NUMBERS IN TWO'S COMPLEMENT NOTATION

When positive numbers are added, the situation is identical to the one discussed for 1's complement notation in the last section. The bit in the sign bit position will be the correct sign bit after addition. When one of the numbers is positive and the other is negative, the answer could be either positive or negative. As before, the sign bit will be treated as though it is a part of the number. The rule is to add the two numbers and ignore the overflow if any. If the answer is negative it will be in the 2's complement form. This is illustrated in Examples 4.13 and 4.14.

EXAMPLE 4.13
    +5       0.0101
    –3       1.1101
    +2      10.0010
             ↑ Ignore

EXAMPLE 4.14
    –5       1.1011
    +3       0.0011
    –2       1.1110

When two negative numbers in two's complement notation are added, an overflow bit will result and may be discarded. The sign bit will be correct if the sum is within the allowed range of the answer. For four-bit operands the magnitude of the answer should be ≤ 15. If the answer is outside the permitted range, the sign bit would become 0, indicating an error condition. This is illustrated in Examples 4.15 and 4.16.

EXAMPLE 4.15
    –5       1.1011
    –3       1.1101
    –8      11.1000
             ↑ Ignore

EXAMPLE 4.16
    –8       1.1000
    –9       1.0111
            10.1111    Incorrect
             ↑ Ignore

When two positive numbers are added and the sum is outside the allowed range of numbers, the sign bit would become 1, indicating an error condition. This is illustrated in Example 4.17.

EXAMPLE 4.17
    +8       0.1000
    +10      0.1010
             1.0010    Incorrect
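The rule just illustrated, add including the sign bit, ignore the overflow out of the sign bit, and flag an error when the sign of the result differs from the common sign of the operands, is easy to express in code. A small Python sketch (ours, not from the book):

    def add_2s_complement(x, y, n=5):
        """x, y are n-bit 2's complement patterns (sign bit included)."""
        z = (x + y) % (2**n)                  # ignore the carry out of the sign bit
        sx, sy, sz = x >> (n - 1), y >> (n - 1), z >> (n - 1)
        overflow = (sx == sy) and (sz != sx)  # the situation of Examples 4.16 and 4.17
        return z, overflow

    neg5, pos3 = 0b11011, 0b00011
    z, err = add_2s_complement(neg5, pos3)
    print(format(z, "05b"), err)              # 11110 False, i.e. -2 as in Example 4.14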

From these examples we can derive the two linked decision tables [Table 4.8] for the addition/subtraction of two's complement binary numbers. In this table s(x) and s(y) are the sign bits of the two operands x and y. The sign bit of the result is represented by s(z). In Table 4.8 (T1), the first and the fourth columns depict the situations when the result of the operation can possibly become too large to store. In these cases we use T2 to decide if the answer is correct or incorrect due to overflow. In the situations shown in columns 2 and 3 the result cannot become too large to store and z is the answer.

TABLE 4.8 Decision Table for Add/Subtract Operation of Numbers in 2's Complement Notation

    T1: s(x) = s(y)                        Y        N        Y         N
        Operation                         Add      Add    Subtract  Subtract
        Take 2's complement of y           —        —        X         X
        Add x and y including sign bit     X        X        X         X
        Ignore overflow                    X        X        X         X
        Sum z is answer                    —        X        X         —
        Go to T2                           X        —        —         X
        Stop                               —        X        X         —

    T2: s(z) = s(x)                        Y        N
        z is answer                        X        —
        Error (Result out of range)        —        X
        Stop                               X        X

The simplicity of the rules for adding or subtracting numbers in the 2's complement notation has made this the preferred method in a large number of computers.

4.7 BINARY MULTIPLICATION

Binary multiplication is nothing but successive addition. Thus, the multiplication of 5 by 4, for example, is achieved by adding 5 to itself 4 times. This basic idea is refined to implement binary multiplication. The method used to multiply two signed binary numbers depends on the method used to represent negative numbers. If negative numbers are represented in the

sign magnitude form, then the method of successive addition is easily implemented. If negative numbers are represented in the one's or two's complement form then multiplication becomes complicated. In this section we will discuss the multiplication of numbers in sign magnitude notation.

EXAMPLE 4.18 Consider the following long-hand multiplication method.

    Multiplicand           1 1 0 1
    Multiplier             1 0 1 1
                           1 1 0 1
                         1 1 0 1
                       0 0 0 0
                     1 1 0 1
    Product          1 0 0 0 1 1 1 1

From this long-hand or 'paper and pencil' method of multiplying observe the following:
1. We need to preserve the multiplicand as it is added repeatedly to the partial products.
2. After each bit of the multiplier is used to develop a partial product, that bit is not used again. It may thus be erased or discarded.
3. We need to preserve the partial product and shift it.
4. The number of bits in each new partial product is one bit more than the previous one. The maximum number of bits in the final product equals the sum of the number of bits in the multiplier and the multiplicand.

The method used in Example 4.18 may be summarized as follows:
Step 1: Examine the least significant bit of the multiplier. If it is a 1, copy the multiplicand and call it the first partial product. If the least significant bit is a zero, then enter zero as the first partial product. Preserve (or store) the partial product.
Step 2: Examine the bit to the left of the bit examined last. If it is a 1, do Step 3. Else do Step 4.
Step 3: Add the multiplicand to the previously stored partial product after shifting the partial product one bit to the right. This sum becomes the new partial product. Go to Step 5.
Step 4: Get the new partial product by shifting the previous partial product one bit to the right.
Step 5: Repeat Steps 2 to 4 till all the bits in the multiplier have been considered. The final value obtained for the partial product is the product of the multiplicand and the multiplier.
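Before moving to the register level implementation, here is a compact Python sketch (ours, not from the book) of the same shift-and-add idea. For readability it shifts the multiplicand left instead of shifting the partial product right, which is arithmetically equivalent:

    def multiply(multiplicand, multiplier, n=4):
        partial = 0
        for i in range(n):                     # examine multiplier bits, LSB first
            if (multiplier >> i) & 1:
                partial += multiplicand << i   # add the appropriately shifted multiplicand
        return partial

    print(bin(multiply(0b1101, 0b1011)))       # 0b10001111, i.e. 13 x 11 = 143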

. SIGNMQ : bit {These are sign bits of MD. NPLUSN1 = 9. It is N1 = N + 1 bits long}.18) is shown in Figure 4. It is N bits long}.1 Registers for multiplication. it may be discarded and that the length of partial product grows from n to 2n in increments of one bit.NPLUSN1] of bit {ACCMQ is obtained by concatenating            MQ to the right of ACC}. SIGNACC.N1] of bit {ACC is the accumulator where partial products are developed. Using three registers. we need an ‘n bit’ register to store the multiplier and a ‘2n bit’ register to store the final product and the intermediate partial products. . ACC and MQ     respectively}. . ACC : array [1. an algorithm for multiplication may be evolved using the ideas presented in the long-hand multiplication method. We may reduce the length of the product register by remembering that after each bit of the multiplier is used to develop a partial product. In order to multiply we need three registers.N] of bit {MD is N bits long. a multiplier-quotient register of n bits to store the multiplier (or the quotient during division). Assuming an ‘n bit’ multiplier and an ‘n bit’ multiplicand. COUNT : integer {used as counter}.Arithmetic and Logic Unit–I 111 These observations aid us to develop an appropriate algorithm for implementation in a digital system. and an accumulator in which partial products are added and stored.1. Var MD : array [1. MQ : array [1. ACCMQ : array [1. S 0 1 1 0 1 Multiplicand register (MD) S 0 0 0 0 0 0 S 0 1 0 1 1 Accumulator S: Sign MQ-register FIGURE 4. .N] of bit {MQ is the multiplier-quotient register in which the multiplier is stored. We thus implement multiplication using three registers: a multiplicand register which can store n bits. For n = 4 the configuration of the registers and their initial contents (for the case of Example 4. This register needs (n + 1) bits as the intermediate sum of two n bit numbers could be (n + 1) bits long. SIGNMD. Multiplicand stored in MD}. The accumulator and the MQ register are physically joined so that their contents can be shifted together either left or right. . Procedure to multiply two N bit numbers Const N = 4. ALGORITHM 4.1. N1 = 5.

begin {of algorithm}
  Input SIGNMD, SIGNMQ, MD, MQ;
  ACC := 0;
  SIGNACC := SIGNMD ⊕ SIGNMQ;
  ACCMQ := Concatenate to right MQ to ACC;
  for COUNT := 1 to N do
  begin
    if MQ[N] = 1 then ACC := ACC + MD {MQ[N] is the least significant bit of MQ};
    Shift-Right ACCMQ
  end;
  Output SIGNACC, ACCMQ
end {of algorithm}.

4.8 MULTIPLICATION OF SIGNED NUMBERS

In the last section we discussed a method of multiplying two binary numbers represented in sign magnitude notation. If one of the numbers, the multiplicand, is negative and is represented in the 2's complement form then multiplication can be carried out by the same algorithm presented in the last section with careful attention to the sign bits. This is illustrated in Example 4.19.

EXAMPLE 4.19

              1 0 0 1 1      (–13) Multiplicand
              0 0 1 1 0      (+6) Multiplier
    0 0 0 0 0 0 0 0 0 0      First partial product
    1 1 1 1 1 0 0 1 1 0      Second partial product
    1 1 1 0 1 1 0 0 1 0      Third partial product
    1 1 1 0 1 1 0 0 1 0      Fourth partial product
    1 1 1 0 1 1 0 0 1 0      Last partial product
                             (–78) Final answer in 2's complement form

Observe that the leading bits of the negative partial products are made 1. This is known as sign bit extension. This is necessary if we remember that the leading zeros of a positive number become ones when its two's complement is taken. For instance, +5 = 00000101 and –5 = 11111011. Similarly, the 2's complement of 0,001001110 (+78) is 1,110110010 (–78).

If the multiplier is negative and the multiplicand is positive we can interchange them and carry out the same algorithm, or we can complement the multiplier and carry out the algorithm. If both are negative then both can be complemented and we can use the algorithm used for positive operands.

4.9 BINARY DIVISION

Binary division may also be implemented by following a procedure similar to that used in long-hand division with appropriate modifications. We will discuss in this section a method for dividing integers represented using the sign magnitude notation. It is called the restoring method for division. Another method called the non-restoring method is also popular. We will not discuss that method, leaving it as an exercise to the student. This section illustrates division by the long-hand method used for decimal numbers, which can then be extended to binary numbers.

EXAMPLE 4.20 Long-hand division of decimal integers:

We will explain the procedure for dividing by considering a three-digit dividend and a three-digit divisor. Let the dividend be 721 and the divisor be 025. In order to divide 721 by 25 we first see if 25 will 'go into' the first digit 7 of the dividend. In other words, we subtract 25 from 7 and see if the answer is negative or positive. In this case the answer is –18 and is negative. Thus, 25 does not 'go into' 7. To proceed further we have to 'restore' the negative answer –18 back to the original dividend. We do this by adding 25 and get back 7. The first digit of the quotient is thus zero. The next digit of the dividend, namely 2, is now appended to 7. We now examine whether 25 can go into 72 and, if so, how many times. We see that 72 – 25 is positive. Thus, 25 can go into 72 at least once. Next we try if (2 × 25) can go into 72. As 72 – (2 × 25) = 22 is positive we see that 25 can go into 72 at least twice. We next try 72 – (3 × 25) and see that the answer is –3 and negative. Thus, 25 can go into 72 only two times. The quotient digit is 2. We restore the remainder back to –3 + 25 = 22. The last digit 1 of the dividend is appended to 22 giving 221. Repeating the same step, finding out how many times 25 will go into 221, we see that it can go eight times. Thus, the quotient digit is 8. As no more digits are left in the dividend, the division process ends. The quotient is thus 028 and the remainder is 21. The division process is illustrated below:

    Divisor 025 ) 721 ( 028  Quotient
                    7
                  –25
                  –18
                  +25    ← Restoring step
                   72
                  –75
                  –03
                  +25    ← Restoring step
                  221
                 –225
                 –004
                 +025    ← Restoring step
                   21    ← Remainder

Binary division is similar and in fact simpler as we have only to check whether the divisor will go into the dividend or not. The question of how many times the divisor will go into the dividend is irrelevant as the quotient bit can be either 1 or 0. Binary division is illustrated with Example 4.21.

EXAMPLE 4.21 Long-hand division of binary integers: We illustrate below the division of 1011 by 11.

    Divisor 11 ) 1 0 1 1 ( 0 0 1 1  Quotient
                 1
               – 1 1      Borrow → Restore; quotient bit 0
                 1 0
               – 1 1      Borrow → Restore; quotient bit 0
                 1 0 1
               –   1 1
                 0 1 0    No borrow; quotient bit 1
                 1 0 1
               –   1 1
                 0 1 0    No borrow; quotient bit 1
                 1 0      ← Remainder

From the long-hand division presented above, we see that:
1. The dividend bits are used starting from the most significant bit.
2. Once a bit is used to develop a quotient bit it is not needed again. As each quotient bit is developed, the corresponding most significant bit of the dividend may be discarded. The bit to its right is appended to the remainder for developing the next quotient bit.
3. The divisor is to be preserved as it is to be successively subtracted from the dividend.

The method used in Example 4.21 is expressed as the following step-by-step procedure:
Step 1: Let y be the most significant bit of the dividend.
Step 2: Subtract the divisor from y.
Step 3: If a borrow is generated in the subtraction then add the divisor to the remainder to restore the dividend. The quotient bit is 0. Else the quotient bit is 1.
Step 4: Append the next significant bit of the dividend to the remainder.
Step 5: Repeat Steps 2, 3 and 4 four times (as the dividend in this example is four bits long).
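Steps 1 to 5 map directly onto a few lines of code. A minimal Python sketch of restoring division on magnitudes (ours, not from the book):

    def restoring_divide(dividend_bits, divisor):
        """dividend_bits: list of bits, most significant first; divisor: an integer."""
        remainder, quotient = 0, []
        for bit in dividend_bits:
            remainder = (remainder << 1) | bit   # append the next dividend bit (Step 4)
            remainder -= divisor                 # trial subtraction (Step 2)
            if remainder < 0:
                remainder += divisor             # restore; quotient bit 0 (Step 3)
                quotient.append(0)
            else:
                quotient.append(1)
        return quotient, remainder

    # 1011 (11) divided by 11 (3): quotient 0011 (3), remainder 10 (2)
    print(restoring_divide([1, 0, 1, 1], 3))     # ([0, 0, 1, 1], 2)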

We may thus use three registers again as was done for multiplication: an N bit register to store the divisor, an (N + 1) bit accumulator in which the remainders are developed and from which the divisor is subtracted in each step, and an N bit quotient register (see Figure 4.2). The accumulator and the MQ register may again be physically joined so that their contents can be shifted together. Algorithm 4.2 gives the details of the implementation of division. Observe that we have used 2's complement addition instead of subtraction.

FIGURE 4.2 Registers for division. (S: Sign. The divisor is in MD (N bits); the accumulator is N + 1 bits with leftmost bit ACC[1]; MQ[N] is the least significant bit of the N bit MQ-register.)

ALGORITHM 4.2. Procedure to divide two N bit numbers

Const N = 4; N1 = 5; NPLUSN1 = 9 {N1 = N + 1};
Var
  MD : array [1..N] of bit {MD is N bits long. Divisor is stored in it};
  MQ : array [1..N] of bit {MQ stores the quotient during division};
  ACC : array [1..N1] of bit {ACC is the accumulator where the remainders are developed};
  ACCMQ : array [1..NPLUSN1] of bit {ACCMQ is obtained by concatenating MQ to the right of ACC};
  SIGNMD, SIGNACC, SIGNMQ : bit {These are the sign bits of MD, ACC, MQ respectively};
  COUNT : integer {used as counter};
begin {of algorithm}
  Input SIGNMD, SIGNMQ, MD, MQ {Initially the dividend is stored in MQ};
  ACC := 0;
  SIGNMQ := SIGNMD ⊕ SIGNMQ;
  ACCMQ := Concatenate to right MQ to ACC;
  for COUNT := 1 to N do
  begin
    Shift-Left ACCMQ;
    ACC := ACC – MD {This may be achieved by adding the 2's complement of MD to ACC};
    if ACC[1] = 1 then
      begin MQ[N] := 0; ACC := ACC + MD end
    else MQ[N] := 1
  end;
  Output SIGNMQ, ACC, MQ {ACC has the remainder after division. MQ has the quotient}
end {of algorithm}.

We have traced Algorithm 4.2 using the data divisor 11 and dividend 1011 below:

    Initial state                      ACC = 0 0 0 0 0    MQ = 1 0 1 1    MD = 0 0 0 1 1

    COUNT = 1   Shift-Left ACCMQ       ACC = 0 0 0 0 1    MQ = 0 1 1 0
                ACC := ACC – MD        ACC = 1 1 1 1 0
                ACC[1] = 1, set MQ[N] = 0 and restore:
                                       ACC = 0 0 0 0 1    MQ = 0 1 1 0

    COUNT = 2   Shift-Left ACCMQ       ACC = 0 0 0 1 0    MQ = 1 1 0 0
                ACC := ACC – MD        ACC = 1 1 1 1 1
                ACC[1] = 1, set MQ[N] = 0 and restore:
                                       ACC = 0 0 0 1 0    MQ = 1 1 0 0

    COUNT = 3   Shift-Left ACCMQ       ACC = 0 0 1 0 1    MQ = 1 0 0 0
                ACC := ACC – MD        ACC = 0 0 0 1 0
                ACC[1] = 0, set MQ[N] = 1:
                                       ACC = 0 0 0 1 0    MQ = 1 0 0 1

    COUNT = 4   Shift-Left ACCMQ       ACC = 0 0 1 0 1    MQ = 0 0 1 0
                ACC := ACC – MD        ACC = 0 0 0 1 0
                ACC[1] = 0, set MQ[N] = 1:
                                       ACC = 0 0 0 1 0    MQ = 0 0 1 1
                                       Remainder          Quotient

4.10 INTEGER REPRESENTATION

Normally, integers are represented using 16 bits, 32 bits or 64 bits in most computers. Each time such an integer is loaded into the CPU, multiple bytes are transferred. In some CPUs, the most significant byte is stored in the numerically lowest memory address. This representation is called a big-endian representation. Thus, when the four bytes of a 32-bit integer are taken to the CPU they will be as shown in Figure 4.3.

FIGURE 4.3 Big-endian representation of a 32-bit integer. (Byte address x holds the most significant byte and byte address x + 3 the least significant byte.)
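The big endian byte order, and the little endian order described next, can be inspected directly in software. A small Python sketch (ours, not from the book) using the standard struct module:

    # The same 32-bit integer laid out in the two byte orders.
    import struct

    n = 0x11223344
    print(struct.pack(">I", n).hex())   # '11223344' - most significant byte first (big endian)
    print(struct.pack("<I", n).hex())   # '44332211' - least significant byte first (little endian)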

In some other CPUs (e.g., the Intel 80x86 series) the least significant byte is stored in the numerically lowest memory address. This is called a little-endian representation.

FIGURE 4.4 Little-endian representation of a 32-bit integer. (The least significant byte is stored at the numerically lowest byte address.)

There is no specific advantage of one method over the other. It is, however, important for a systems programmer to know which method is used by the hardware of a given computer. Further, due to this non-uniformity, portability of programs between machines of two different manufacturers becomes unnecessarily difficult.

4.11 FLOATING POINT REPRESENTATION OF NUMBERS

A number with a fractional part (commonly known in the computer literature as a real number) may be stored and represented in several ways in a digital system. Consider, for example, the real decimal number 172.354. It may be written as:
1. 172.354
2. .172354 × 10^3

Assume that a register is available which can store six digits and a sign bit. One way of storing the above number in the register is to imagine that the register is split into two parts: one part containing the integer portion of the number and the other the fractional portion. Figure 4.5 illustrates this representation. The decimal point is assumed to exist between the two parts of the register and should be remembered by the user while manipulating the number in the register.

FIGURE 4.5 Real number representation. (Sign +, integer part 172, fractional part 354, with an assumed decimal point between them.)

Hardware implementation of arithmetic operations is simple if this representation is used. The two parts of the number may be treated independently. After the operation is performed, one may transfer any carry or borrow generated in the fractional part to the integer part. The practical difficulty in using this scheme is the need for the user to keep track of the decimal point location and significant digits. Further, the range of numbers that could be represented using this notation is limited. With the register of Figure 4.5 the range is ± 999.999 to ± 000.001.

We will now consider another method of storing real numbers using the register of Figure 4.5. Consider the second form of 172.354 at the beginning of the section. In this form the number is written as a fraction multiplied by a power of 10. The fractional part is known as the mantissa (also known as the significand) and the power of 10 multiplying the fraction is known as the exponent. This is known as the floating point representation. If other number systems are used, for example octal, a number would be represented by a fraction in that base multiplied by a power of 8 (e.g., .127 × 8^4). If a number in this form is to be stored in a register with a capacity of six digits and a sign, then we should divide the register again into two parts: one part to hold the mantissa and another part to hold the exponent. If we arbitrarily allot two digits for the exponent and four digits for the mantissa, we may store the number in the register as shown in Figure 4.6.

FIGURE 4.6 Real number with mantissa and exponent. (Sign +, mantissa 1723 with an assumed decimal point on its left, exponent 03.)

As indicated in Figure 4.6 the most significant four digits of the number are stored in the mantissa part. The last two digits are truncated and thus lost. Two problems arise in doing this. First, all the six digits in the number cannot be stored in the register as two digits in the register have been taken out to store the exponent. This is the penalty for having a separate exponent. Second, as only one sign bit is available it can be used with the mantissa only. We have to devise some other way to indicate the sign of the exponent. The gain in using a separate exponent is the increase in the range of numbers we can represent. If the exponent has a separate sign then the range of numbers will be 10^–99 to 10^99.

If we do not have the facility to store a separate exponent sign, then in order to be able to store both positive and negative exponents it would be necessary to split the range of the exponent, namely 00 to 99, into two parts. If we shift the origin to 50, we may interpret all exponents greater than 50 as positive and all

exponents less than 50 as negative. Thus, the range of the exponent will be –50 to +49. (A stored exponent of 00 represents –50, 50 represents 0 and 99 represents +49.) Exponents expressed in this notation are said to be in the excess 50 form. Using this notation, .1723 × 10^3 may be stored in the register as shown in Figure 4.7.

FIGURE 4.7 Representation of .1723 × 10^3 with excess 50 exponent. (Sign +, mantissa 1723, exponent 53.)

Assume that we want to store the number –.001234 × 10^–5. If we blindly follow the procedure mentioned in the previous paragraphs we may store it as shown in Figure 4.8.

FIGURE 4.8 Improper storage of –0.001234 × 10^–5. (Sign –, mantissa 0012, exponent 45.)

A little thought shows that the information conveyed by the leading zeros in the mantissa may be included in the exponent. The number may be written as –.1234 × 10^–7 and stored as shown in Figure 4.9, thereby preserving all the significant digits in the mantissa. This technique is called normalization.

FIGURE 4.9 Normalized storage of –0.001234 × 10^–5 as –.1234 × 10^–7. (Sign –, mantissa 1234, exponent 43.)

In the normalized floating point representation, the most significant digit of the mantissa is non-zero. Normalization is universally used in digital systems. Using the normalized floating point representation of real numbers, the range of numbers we can represent in a six-digit register is: maximum magnitude .9999 × 10^+49 and minimum magnitude .1000 × 10^–50. This should be compared with the magnitude range representable with a fixed (assumed) decimal point, which is 999.999 to 000.001.

Another question which arises in the normalized floating point representation is how to represent zero. If all the digits in the mantissa are zero then one may conclude that the number is zero. In actual computation, due to the rounding of numbers which arises because of the finite number of digits in the mantissa, it would be preferable to call a very small number, not exactly zero, a zero. In the excess 50 representation of the exponent, the largest negative exponent is represented by 0. This suggests that zero may be represented by all zeros for the mantissa and the largest negative number as the exponent. Thus, a zero will have both mantissa and exponent equal to zero. This is desirable as it will simplify the circuitry to test for zero.

Thus, using a given six-digit register with one sign bit, we can store a real decimal number (i.e., a decimal number with a fractional part) in two ways. One is to assume a decimal point to be fixed at a particular point in the number. This is called fixed point representation. The main advantage of the first method is that all the available digits are used to represent the number. The disadvantage is that the range of numbers which can be represented is severely limited. The other is to divide the available digits into two parts: a part called the mantissa and the other called the exponent. The second method, called floating point representation of real numbers, increases the range of numbers which can be represented by using an exponent. It, however, reduces the precision of numbers which can be represented as part of the available digits is used to store the exponent. On balance, the floating point representation of real numbers with a normalized mantissa is the preferred method, as the representation of a larger range of numbers is more important in practical computation. The price paid, namely the loss of two significant digits, is well worth it. Further, the use of normalized floating point numbers requires some specific rules to be followed when arithmetic operations are performed with such numbers. We will discuss these rules later in this chapter.

4.11.1 Binary Floating Point Numbers

We saw in the last section that given a fixed length register to store numbers we may use either a fixed point or a floating point representation. We will reject a fixed point representation as the range of real numbers representable using this method is limited. Binary floating point numbers may be represented using a similar idea. If we extend the idea in the simplest possible way to binary numbers, then a binary floating point number would be represented by:

    mantissa × 2^exponent

where the mantissa would be a binary fraction with a non-zero leading bit. Suppose a 32-bit register is available. We will now examine the methods which we may use to store real numbers in it. With a floating point representation we have to decide the following:

1. The number of bits to be used for the mantissa.
2. The number of bits used to represent the exponent.
3. Whether to use an excess representation for the exponent.
4. Whether to use a base other than 2 for the exponent.

The number of bits to be used for the mantissa is determined by the number of significant decimal digits required in computation. From experience in numerical computation it is found necessary to have at least seven significant decimal digits in most practical problems. The number of bits required to represent seven significant decimal digits is approximately 23 (1 decimal digit = log2 10 bits). If one bit is allocated for the sign of the mantissa, then eight bits are left for the exponent.

If we decide to allocate one bit of the exponent for its sign then seven bits are left for the magnitude of the exponent. The minimum exponent would be –127 and the maximum would be 128. In the normalized floating point mode the largest magnitude number which may be stored is:

    0.1111 … 1 × 2^1111111 = (1 – 2^–23) × 2^127 ≈ 10^38
    |← 23 bits →|

The minimum magnitude number is:

    0.100000 … 0 × 2^–1111111 = 0.5 × 10^–38
    |← 23 bits →|

If we use an excess representation for the exponent, exponents will not use any separate sign; the range of the binary exponent would be 0 to 255. We thus do not gain exponent range. The main gain in using this representation is the unique representation of zero. A zero will be represented by all 0 bits for the mantissa as well as the exponent. Further, it is easy to compare two floating point numbers by first comparing the bits of the exponent.

EXAMPLE 4.22 Represent –(0.625) decimal in binary floating point form.

Solution
–(0.625) = –(0.101) in binary = –0.101 × 2^0
Therefore, normalized mantissa = 0.101 and exponent = 0. As excess representation is being used for the exponent, the stored exponent = 0 + 127 = 127 = 01111111.
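The representation just described, a sign bit, an excess 127 exponent and a 23-bit normalized mantissa with its leading 1 stored explicitly (no hidden bit yet), can be sketched in Python as follows (ours, not from the book):

    def encode(x):
        """Encode a nonzero real in sign / excess 127 exponent / 23-bit mantissa form."""
        sign = 1 if x < 0 else 0
        x = abs(x)
        exponent = 0
        while x >= 1:                # normalize so that 0.5 <= x < 1
            x /= 2
            exponent += 1
        while 0 < x < 0.5:
            x *= 2
            exponent -= 1
        mantissa = int(x * 2**23)    # 23-bit fraction, truncated
        return sign, exponent + 127, mantissa

    s, e, m = encode(-0.625)
    print(s, format(e, "08b"), format(m, "023b"))
    # 1 01111111 10100000000000000000000, matching Example 4.22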

EXAMPLE 4.23 Represent (4.21875) decimal in binary floating point form.

Solution
(4.21875) = 100.00111 = 0.10000111 × 2^3
Normalized mantissa = 0.10000111 and exponent = 3. As excess representation is being used for the exponent, the stored exponent = 3 + 127 = 130 = 10000010.

Another way in which the exponent bits may be interpreted is to assume a base other than 2 for the exponent. If we assume a base of 16 for the exponent, the following representation would be valid:

    0.00010100 … 1 × 16^1111111
    |← 23 bits →|

When 16 is used as the exponent base the most significant hexadecimal digit of the mantissa should not be zero. Thus, we can lose up to three significant bits in the mantissa. If the exponent is increased by 1 the mantissa is to be shifted left by one hexadecimal digit, namely four bits. In contrast to this, when 2 is used as the exponent base, increasing the exponent by 1 will lead to shifting the mantissa left by one bit position. The penalty of a base 16 exponent is thus some loss of significance. Allowing one bit for the exponent sign, the maximum number represented would be:

    0.111 … 1 × 16^127 ≈ 10^153
    |← 23 bits →|

The range obtained by this representation is considerably larger compared to a base 2 representation of exponents.

A compromise may be made and the mantissa bits may be increased by 1 to 24 bits. This would leave seven bits for the exponent. This idea was used in the floating point representation in the IBM 370 series machines. The exact format used is:

    Exponent: Base 16. Excess 64 representation. Bits 24 to 30. Bit 31 is the sign bit.
    Mantissa: 24 bits (Bits 0 to 23).
    Representation of zero: All mantissa and exponent bits zero.

With this format the maximum magnitude number is:

    0.111 … 1 × 16^63 ≈ 10^76
    |← 24 bits →|

and the minimum magnitude number is:

    0.10000 … 0 × 16^–64
    |← 24 bits →|

This range is considered reasonably good.

EXAMPLE 4.24 Represent the following binary floating point number in IBM floating point format: 1101011010.11010101101111

Solution
The number in normalized form is:

    .110101101011010101101111 × 2^10 = 0.00110101101011010101101111 × 2^12 = 0.00110101101011010101101111 × 16^3

As excess representation is being used for the exponent, (stored exponent – 64) = 3, i.e., the stored exponent = 67 = 1000011.

4.11.2 IEEE Standard Floating Point Representation

The Institution of Electrical and Electronics Engineers, U.S.A. (IEEE) formulated a standard known as the IEEE 754 floating point standard [13] in the '80s for representing floating point numbers in computers and performing arithmetic operations with floating point operands. Since its announcement most computers use this standard. The floating point co-processors designed for microcomputers also use this standard. Such a standard is necessary to ensure the portability of programs between different computers. In other words, a program using floating point numbers executed on some machine A will give the same result if run on machine B, provided both machines A and B use the same standard representation of floating point numbers. Table 4.9 summarizes the various choices in floating point representation of binary numbers.

TABLE 4.9 Binary Floating Point Number Representations

    Mantissa                                    Exponent
    1. Sign magnitude notation                  1. Excess notation
    2. Binary point on left                     2. Exponent base binary or hexadecimal
    3. Leading bit non-zero
    4. 2's complement for negative mantissa

Most computers have separate hardware units to perform floating point arithmetic and integer arithmetic. The integer arithmetic unit is faster. One of the aims of the IEEE 754 standard was to allow integer units to perform some aspects of floating point arithmetic. Therefore, operations such as comparison of numbers, increment and decrement are done using the integer unit. Thus, many decisions were taken in the standard to facilitate this objective. These decisions were:
1. To have the sign bit of the mantissa as the most significant bit.
2. The exponent bits are placed before the mantissa bits. When two floating point numbers of the same sign are compared, only the magnitudes of the exponents need be compared first. This simplifies the sorting of floating point numbers.

3. Comparison of exponents with different signs poses a challenge. To simplify this, a biased exponent representation is used. The most negative exponent is all 0s and the most positive all 1s. (In single precision the stored exponent thus runs from 0 through 127 to 255.)
4. IEEE 754 uses a normalized mantissa representation. In this representation the most significant bit of the mantissa must always be 1. As this is implied, it is assumed to be on the left of the (virtual) decimal point of the mantissa. The term significand is used instead of the term mantissa in the IEEE standard. Thus, the general form of the representation is:

    (–1)^Sign (1.0 + mantissa) × 2^(Exponent – bias)

Using these general ideas, the IEEE 754 standard describes three formats for representing real numbers. They are:
1. Single precision
2. Double precision
3. Extended precision

Single precision real number representation: The format for single precision uses 32 bits to represent floating point numbers. The distribution of bits is shown in Figure 4.10.

FIGURE 4.10 IEEE Single precision floating point representation. (Bit 31: sign s; bits 30 to 23: exponent; bits 22 to 0: significand.)

The sign bit is 0 for positive numbers and 1 for negative numbers. The exponent uses the excess 127 format (also called the biased format). The IEEE 32-bit standard assumes a virtual 24th bit in the significand which is always 1. This assumption is valid as all floating point numbers are normalized and the leading bit should be 1. Thus, in the IEEE format the significand is 24 bits long: 23 bits stored in the word and an implied 1 as the 24th or most significant bit. We thus get 24 bits for the significand, increasing the precision of the mantissa.

The number is thus interpreted as:

    (–1)^s × (1 + s22·2^–1 + s21·2^–2 + … + s0·2^–23) × 2^(exponent – 127)

where s22, s21, …, s0 are the bits at the 22nd, 21st, …, 0th bit positions of the significand. Thus, the range of numbers is ± (1 + 1 – 2^–24) × 2^–127 to ± (1 + 1 – 2^–24) × 2^128, which is approximately ± 2.0 × 10^±38.

EXAMPLE 4.25

Represent -(.625)10 in IEEE 754 floating point format.

-(0.625)10 = -(0.101)2 = -1.01 × 2^-1

The sign in IEEE format is 1. The exponent is found from (exponent - 127) = -1; thus, exponent = 126. The floating point representation is thus:

Sign = 1, Exponent (bits 30 to 23) = 01111110, Significand (bits 22 to 0) = 0100...0

That is, bit 22 = 0, bit 21 = 1 and bits 20 to 0 are all 0.

Double Precision Representation: Double precision representation for floating point numbers uses 64 bits. The format is given in Figure 4.11.

FIGURE 4.11 IEEE Double precision floating point representation. (First word: sign in bit 31, exponent in bits 30 to 20 (11 bits), significand in bits 19 to 0 (20 bits); second word: 32 more significand bits.)

The number of significand bits of a double precision number is (1 + 20 + 32), where 1 is the implied most significant bit in normalized mode. This gives approximately 16 decimal digits of precision. The exponent range is 2^-1023 to 2^1024, which is approximately 10^±308.

EXAMPLE 4.26

Repeat +0.625 in double precision.

Note that the power of two in the normalized number is 2^-1, which is expressed as (exponent - 1023) = -1.

Therefore exponent = 1022 = 01111111110

In double precision the number 0.625 will thus be represented with sign 0, exponent 01111111110 and significand 0100...0.
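Both examples can be verified on any machine that stores numbers in IEEE 754 form; a small sketch (ours) using Python's struct module:

    import struct

    # Single precision: -0.625 gives sign 1, exponent 126, significand 0100...0
    bits32 = int.from_bytes(struct.pack('>f', -0.625), 'big')
    print(f"{bits32:032b}")   # 1 01111110 01000000000000000000000

    # Double precision: +0.625 gives exponent 1022 = 01111111110
    bits64 = int.from_bytes(struct.pack('>d', 0.625), 'big')
    print(f"{bits64:064b}")   # 0 01111111110 0100...0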

Extended Real: This format uses 10 bytes (80 bits) to represent floating point numbers. Its main purpose is the storage of intermediate results obtained during computations, to shield the final result from the effects of rounding errors and underflow/overflow during computation. It is usually not used to store numbers in memory. The format is as follows:

1. The bits are numbered from 0 to 79, with bit 79 as the most significant bit. The sign of the number is the 79th bit.
2. The exponent bits are 78 to 64 (15 bits). The exponent is represented using the biased format; the bias value is (2^14 - 1) = 16383.
3. The significand bits are 63 to 0.
4. In this format there is no hidden 1 in the significand; it is physically present as part of the significand.
5. The exponent range is thus 10^±4932, and the precision of numbers is around 19 digits, compared with 16 digits for double precision.

Besides this, the IEEE 754 floating point standard also specifies some bit patterns for cases such as an exact 0 and numbers too large or too small to be accommodated in the number of bits available in a word. An exact value of 0 is represented by 0 for the exponent and 0 for the mantissa. An exponent of 255 and a zero mantissa is used to represent infinity. An exponent of 255 and a mantissa not equal to zero is used to represent exception conditions, which occur when one attempts to divide 0 by 0 or subtract infinity from infinity; this is called NaN (not a number). This is summarized in Table 4.10.

TABLE 4.10 IEEE 754 Floating Point Representation (s is the sign bit)

(i) Single Precision (32 bits)
Exponent e (width = 8 bits)    Significand f (width = 23 bits)    Value
1 ≤ e ≤ 254                    any bit pattern                    (-1)^s 2^(e-127) (1.f)
0                              0                                  (-1)^s 0
255                            0                                  (-1)^s ∞
255                            ≠ 0                                NaN

(ii) Double Precision (64 bits)
Exponent e (width = 11 bits)   Significand f (width = 52 bits)    Value
1 ≤ e ≤ 2046                   any bit pattern                    (-1)^s 2^(e-1023) (1.f)
0                              0                                  (-1)^s 0
2047                           0                                  (-1)^s ∞
2047                           ≠ 0                                NaN

(iii) Extended Real (80 bits)
Exponent e (width = 15 bits)   Significand f (width = 64 bits)    Value
1 ≤ e ≤ 32766                  any bit pattern                    (-1)^s 2^(e-16383) (0.f)
0                              0                                  (-1)^s 0
32767                          0                                  (-1)^s ∞
32767                          ≠ 0                                NaN
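The special bit patterns of Table 4.10 can be observed directly; a short sketch (ours) extracting the three single precision fields:

    import struct

    def fields32(x):
        b = int.from_bytes(struct.pack('>f', x), 'big')
        return (b >> 31) & 1, (b >> 23) & 0xFF, b & 0x7FFFFF

    print(fields32(0.0))            # (0, 0, 0): exact zero
    print(fields32(float('inf')))   # (0, 255, 0): infinity
    print(fields32(float('nan')))   # (0, 255, f != 0): NaN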

Similar to the big endian and little endian representations of integers, the same situation arises when floating point numbers are stored in byte addressable machines. Single precision reals are stored in four bytes and double precision reals in eight bytes. If the most significant bits are stored in the smallest byte address, it is called the big endian representation; when the least significant bits are stored in the smallest byte address, it is called the little endian representation. A little endian single precision representation is laid out as follows:

Byte address:   X+3                 X+2   X+1   X
Contents:       Sign and exponent   ...   ...   Least significant bits of significand

Besides the representation of floating point numbers, the IEEE standard also specifies extra bits, called guard bits, to be carried during calculations using floating point numbers. This is to preserve the significance of the results. Rounding rules have also been specified in the standard. For greater detail the reader is referred to reference [8] given at the end of this book.

4.12 FLOATING POINT ADDITION/SUBTRACTION

If two numbers represented in floating point notation are to be added or subtracted, the exponents of the two operands must first be made equal; in doing this, one operand would need shifting. This is done to preserve the significance of the results. The points will be clarified by considering a few examples. Decimal numbers are represented using the format given in Section 4.11 (Figure 4.6): a four-digit mantissa followed by a two-digit exponent in excess 50 form.

EXAMPLE 4.27

             Number           Register form
Operand 1    .1234 × 10^-3    123447
Operand 2    .4568 × 10^-3    456847
Addition     .5802 × 10^-3    580247

(The exponent -3 is stored in excess 50 form as -3 + 50 = 47.)
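The two byte orderings described above can be seen by packing the same number both ways (a sketch of ours):

    import struct

    x = -0.625   # 0xBF200000 in IEEE 754 single precision
    print(struct.pack('>f', x).hex())  # 'bf200000': big endian, MSBs at lowest address
    print(struct.pack('<f', x).hex())  # '000020bf': little endian, LSBs at lowest address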

EXAMPLE 4.28

             Number           Register form
Operand 1    .1234 × 10^3     123453
Operand 2    .4568 × 10^2     456852

Addition: shift the mantissa of operand 2 right by one place (to make the exponents equal) and add:

123453 + 045653, giving the sum .1690 × 10^3, register form 169053

EXAMPLE 4.29

             Number           Register form
Operand 1    .1234 × 10^3     123453
Operand 2    .4568 × 10^8     456858

Shift operand 1 five places right and add. In this case the first operand is too small compared to the second operand and is in effect not added. The answer is .4568 × 10^8, register form 456858.

EXAMPLE 4.30

             Number           Register form
Operand 1    .1234 × 10^3     123453
Operand 2    .9234 × 10^3     923453
Add          1.0468 × 10^3

The mantissa of the sum overflows. Shift the result right one digit and increment the exponent by 1. The result is .1046 × 10^4, register form 104654.

EXAMPLE 4.31

Operand 1    +.4568 × 10^8    +456858
Operand 2    -.1234 × 10^8    -123458

Add the nine's complement of the mantissa of operand 2 to the mantissa of operand 1:

4568 + 8765 = 1 3333  (overflow 1)

If there is an overflow, add 1 to the result: 3333 + 1 = 3334, and declare the result positive.

Answer: +.3334 × 10^8, register form +333458
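The register-form arithmetic of Examples 4.27 to 4.30 can be sketched in software. The following Python function is ours (it handles positive operands only, leaving out the nine's complement treatment of Examples 4.31 and 4.32), using a four-digit mantissa and a two-digit excess 50 exponent:

    def fp_add(m1, e1, m2, e2):
        """Add .m1 x 10^e1 and .m2 x 10^e2; mantissas are 4-digit integers,
        exponents are true exponents (stored in excess 50 form)."""
        if e1 < e2:                        # keep the larger exponent in operand 1
            m1, e1, m2, e2 = m2, e2, m1, e1
        m2 //= 10 ** min(e1 - e2, 5)       # shift the smaller operand right
        s, e = m1 + m2, e1
        if s >= 10000:                     # mantissa overflow: shift right once
            s //= 10
            e += 1
        while 0 < s < 1000:                # normalize: shift left, reduce exponent
            s *= 10
            e -= 1
        if e + 50 > 99: return "overflow"
        if e + 50 < 0:  return "underflow"
        return f"{s:04d}{e + 50:02d}"      # register form: mantissa, then exponent

    print(fp_add(1234, -3, 4568, -3))   # Example 4.27: 580247
    print(fp_add(1234, 3, 4568, 2))     # Example 4.28: 169053
    print(fp_add(1234, 3, 9234, 3))     # Example 4.30: 104654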

EXAMPLE 4.32

Operand 1    +.4568 × 10^8    +456858
Operand 2    -.4566 × 10^8    -456658

Add the 9's complement of the mantissa of operand 2 to the mantissa of operand 1:

4568 + 5433 = 1 0001  (overflow 1);  adding the overflow, 0001 + 1 = 0002

The result in the register is .0002 × 10^8. In this case the answer has three leading zeros. The mantissa is shifted left until the most significant digit is non-zero, here by three places; for each left shift the exponent is reduced by 1. The result is .2000 × 10^5, register form 200055.

EXAMPLE 4.33

Add          Number           Register form
Operand 1    .4568 × 10^49    456899
Operand 2    .8268 × 10^49    826899
Sum          1.2836 × 10^49 = .1283 × 10^50

The sum requires a stored exponent of 50 + 50 = 100. If the stored exponent exceeds 99, overflow is declared: the sum exceeds the largest number which can be stored, and this is called an overflow condition.

EXAMPLE 4.34

Add          Number            Register form
Operand 1    +.4568 × 10^-50   +456800
Operand 2    -.4500 × 10^-50   -450000
Sum          .0068 × 10^-50 = .6800 × 10^-52

In this case the answer is smaller than the smallest number which can be stored: the exponent -52 would require a negative stored exponent (-52 + 50 = -2). This is called an underflow condition.

If there is a hardware feature in the machine for floating point addition and subtraction, it should account for all the above conditions.

4.12.1 Floating Point Multiplication

Floating point multiplication is relatively simple. In this case the exponents are added and the mantissas multiplied. As each exponent carries an excess of 50, the added exponents carry an excess of 50 extra, which is subtracted. If the resulting exponent exceeds 99, overflow is declared. As both mantissas are normalized fractions less than 1, the product mantissa cannot exceed 1. The product mantissa, however, can have leading zeros. It is shifted left until the most significant digit is non-zero; for each left shift the exponent is reduced by 1.

4.12.2 Floating Point Division

In division the divisor mantissa should be larger than the dividend mantissa, to ensure that the quotient is less than 1. If this is not true, a divide stop will occur. To ensure that division does not stop, we may shift the dividend right by one digit and add 1 to its exponent before the beginning of division. (This is a blind method, and the ease of dividing is gained at the expense of one significant digit.) The dividend is placed in the accumulator and the divisor in the MD register; division of the mantissa parts then proceeds as usual. The quotient is shifted left till the most significant digit is non-zero. The fractional part of the answer is stored in MQ at the end of the division. The exponent part of the answer equals (ACC exponent - MD exponent - number of left shifts + 50). Fifty is added because the excess 50 carried in the ACC and MD exponents cancels in the subtraction operation.

4.13 FLOATING POINT ARITHMETIC OPERATIONS

We saw in the last section how arithmetic operations are performed on decimal floating point numbers. The same procedures can be followed when both exponent and mantissa are represented in binary form. We first define the notation that will be used in this section:

Operand 1:  x: sx, ex, mx
Operand 2:  y: sy, ey, my
Result:     z: sz, ez, mz

where sx is the sign of x, ex the exponent of x in excess 127 format, and mx the significand of x. Similarly, sy is the sign of y, ey its exponent and my its significand.

To multiply, we multiply the two significands of the numbers and add the exponents. We should take care of exponent overflow/underflow and normalize the significand of the product. Similarly, in division, if we divide x by y we divide the significand of x by that of y and subtract the exponent of y from that of x. Any exponent overflow/underflow has to be taken into account and the significand normalized. Multiplication and division are thus straightforward: after the product (or quotient) is obtained, an adjustment of the exponent, with normalization of the significand, may be necessary.

ALGORITHM 4.3. Multiplication of two floating point numbers

Step 1: Read x, y
Step 2: if mx or my = 0 then z = 0, exit
Step 3: ez = ex + ey - 127 /* Excess 127 used for exponents */
Step 4: if ez > 255 then overflow error, exit
Step 5: if ez < 0 then underflow error, exit
Step 6: sz = sx ⊕ sy
Step 7: mz = mx * my /* multiplication */
Step 8: if most significant bit of mz = 0 then {left shift mz; ez ← ez - 1} /* normalization */
Step 9: if ez < 0 then underflow error, exit
Step 10: Result z = sz, ez, mz

In the above algorithm we have not rounded the result. Most machines carry at least two bits beyond the allowed number of bits in mz. If these two bits are 11, a 1 is added to the least significant bit of mz. If they are 10, a 1 is added only if the least significant bit of mz is 1; the idea is to round the number to an even number. If they are 00 or 01, nothing is done.

The division operation is similar. We give the algorithm for division as Algorithm 4.4.

ALGORITHM 4.4. Division of two floating point numbers

Step 1: Read x, y /* we need x/y */
Step 2: if mx = 0 then z = 0, exit
Step 3: if my = 0 then z ← ∞, exit
Step 4: if mx > my then {shift right mx; ex ← ex + 1}
Step 5: ez = ex - ey + 127 /* Excess 127 used for exponents */
Step 6: if ez > 255 then overflow error, exit
Step 7: if ez < 0 then underflow error, exit
Step 8: sz = sx ⊕ sy
Step 9: mz = mx/my /* division */
Step 10: if most significant bit of mz = 0 then {left shift mz; ez ← ez - 1 /* normalization */; if ez < 0 then underflow error, exit}
Step 11: Result z = sz, ez, mz
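Algorithm 4.3 translates almost line by line into software. A minimal sketch (ours; significands are modelled as fractions in [0.5, 1) rather than bit registers):

    def fp_multiply(sx, ex, mx, sy, ey, my):
        """Algorithm 4.3: s is 0 or 1, e is an excess 127 integer exponent,
        m is a normalized fractional significand in [0.5, 1)."""
        if mx == 0 or my == 0:
            return (0, 0, 0.0)                      # Step 2: z = 0
        ez = ex + ey - 127                          # Step 3: one bias cancels
        if not (0 <= ez <= 255):
            raise ArithmeticError("exponent overflow/underflow")
        sz = sx ^ sy                                # Step 6: sign of result
        mz = mx * my                                # Step 7: product in [0.25, 1)
        if mz < 0.5:                                # Step 8: MSB zero, normalize
            mz, ez = mz * 2, ez - 1
            if ez < 0:
                raise ArithmeticError("exponent underflow")
        return (sz, ez, mz)

    # (0.75 * 2^2) * (0.5 * 2^1) = 3 * 1 = 3 = 0.75 * 2^2
    print(fp_multiply(0, 127 + 2, 0.75, 0, 127 + 1, 0.5))  # (0, 129, 0.75)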

Floating point addition/subtraction: Floating point addition and subtraction will be considered together, as the addition of a positive number to a negative number is actually subtraction. We will again assume that the operands are in sign magnitude form, with a normalized significand (most significant bit 1) and an excess 127 representation for the exponent. As before, the two operands are x = sx, ex, mx and y = sy, ey, my, and the result is z = sz, ez, mz, where s is the sign, e the exponent and m the significand. The algorithm is developed as Algorithm 4.5.

ALGORITHM 4.5. Addition/Subtraction of floating point numbers

Step 1: Read x = sx, ex, mx and y = sy, ey, my
Step 2: if the operation is subtract then complement sy
Step 3: if y = 0 then z = sx, ex, mx, exit
Step 4: if x = 0 then z = sy, ey, my, exit
Step 5: /* The following steps ensure the exponents of the two operands are equal before add/subtract */
        ed = ex - ey
        if ed ≥ 0 then {right shift my by ed bits; ez = ex}
        else {right shift mx by |ed| bits; ez = ey}
Step 6: sz, mz = (sx, mx + sy, my) /* add signed significands */
Step 7: if mz = 0 then z = 0, exit
Step 8: if mz overflows by 1 bit then {right shift mz; ez ← ez + 1; if ez > 255 then overflow error, exit}
Step 9: if most significant bit of mz = 0 then repeat {left shift mz by 1 bit; ez ← ez - 1; if ez < 0 then report underflow error, exit} until most significant bit of mz = 1
Step 10: Result z = sz, ez, mz
End of algorithm

We have discussed a number of methods of performing arithmetic operations in digital systems, with a view to achieving simplicity and economy in implementing them with electronic circuits. We will now discuss the actual logic design of subsystems to perform these operations in the succeeding sections.

4.14 LOGIC CIRCUITS FOR ADDITION/SUBTRACTION

So far in this chapter we have examined methods of representing integers and real numbers, and appropriate algorithms to perform the four basic operations of addition, subtraction, multiplication and division. These are realized as hardware units which form part of the Arithmetic Logic Unit (ALU) of a computer. In this section we will develop logic circuits to perform addition and subtraction. We will first present the basic circuits to illustrate how it is done; nowadays these are integrated as part of a Medium Scale Integrated (MSI) circuit. We will also examine one of these MSI chips and what it provides.
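Before turning to circuits, Algorithm 4.5 can be checked in software in the same style as the multiplication sketch above (again ours, with fractional significands standing in for the registers):

    def fp_addsub(sx, ex, mx, sy, ey, my, subtract=False):
        """Algorithm 4.5: sign magnitude operands, significands in [0.5, 1),
        excess 127 exponents."""
        if subtract:
            sy ^= 1                           # Step 2: subtraction flips sy
        if my == 0: return (sx, ex, mx)       # Steps 3 and 4
        if mx == 0: return (sy, ey, my)
        ed = ex - ey                          # Step 5: equalize exponents
        if ed >= 0:
            my, ez = my / 2 ** ed, ex
        else:
            mx, ez = mx / 2 ** -ed, ey
        v = (-mx if sx else mx) + (-my if sy else my)   # Step 6: signed add
        if v == 0: return (0, 0, 0.0)                   # Step 7
        sz, mz = (1, -v) if v < 0 else (0, v)
        if mz >= 1.0:                         # Step 8: significand overflow
            mz, ez = mz / 2, ez + 1
            if ez > 255: raise ArithmeticError("exponent overflow")
        while mz < 0.5:                       # Step 9: normalize
            mz, ez = mz * 2, ez - 1
            if ez < 0: raise ArithmeticError("exponent underflow")
        return (sz, ez, mz)

    # (0.75 * 2^2) - (0.5 * 2^2) = 3 - 2 = 1 = 0.5 * 2^1
    print(fp_addsub(0, 129, 0.75, 0, 129, 0.5, subtract=True))  # (0, 128, 0.5)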

4.14.1 Half and Full-Adder Using Gates

A half-adder: The truth table for a half-adder was given as Table 4.1. It is reproduced here as Table 4.11.

TABLE 4.11 A Half-Adder Truth Table

x  y  |  s  c
0  0  |  0  0
0  1  |  1  0
1  0  |  1  0
1  1  |  0  1

The Boolean expressions for the sum s and carry c are given by

s = x̄·y + x·ȳ,   c = x·y     (4.1)

Figure 4.12 shows realizations of the half-adder: (a) directly from Eq. (4.1) using gates, (b) using NAND gates, and (c) using an XOR gate and an AND gate, where Sum = x ⊕ y and Carry = x·y.

FIGURE 4.12 Half-adder using NAND and XOR gates.
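Equation (4.1) is small enough to check exhaustively; a two-line sketch (ours):

    def half_adder(x, y):
        """Sum = x XOR y, Carry = x AND y, per Eq. (4.1)."""
        return x ^ y, x & y

    for x in (0, 1):                 # reproduces Table 4.11
        for y in (0, 1):
            print(x, y, *half_adder(x, y))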

Using NAND gates directly we obtain the circuit of Figure 4.12(a); it uses seven NAND gates. However, given the Boolean expression, there is no straightforward and simple method of obtaining the realization of Figure 4.12(b), which uses only 5 NAND gates; with some ingenuity we can obtain it. The half-adder expressions of Eq. (4.1) can also be written as Eq. (4.2):

s = x ⊕ y,   c = x·y     (4.2)

where ⊕ is the exclusive OR operator.

A full-adder: The truth table of a full-adder is shown in Table 4.12.

TABLE 4.12 Truth Table of a Full-Adder

x  y  z  |  s′  c′
0  0  0  |  0   0
0  0  1  |  1   0
0  1  0  |  1   0
0  1  1  |  0   1
1  0  0  |  1   0
1  0  1  |  0   1
1  1  0  |  0   1
1  1  1  |  1   1

From the table we obtain Boolean expressions for the sum s′ and carry c′ as:

s′ = z̄·(x̄·y + x·ȳ) + z·(x̄·ȳ + x·y)     (4.3)
c′ = x·y + y·z + x·z     (4.4)

which can also be written as

s′ = x ⊕ y ⊕ z     (4.5)
c′ = x·y + z·(x̄·y + x·ȳ)     (4.6)

If we write s = x̄·y + x·ȳ, the half-adder sum, these become

s′ = z̄·s + z·s̄ = z ⊕ s,   c′ = x·y + z·s     (4.7)

The expression for c′ may be verified by inspection of the Karnaugh map of Figure 4.13. There are two methods of realizing these expressions: first, by using gates to realize s′ and c′ directly (i.e., a full adder); second, by using two half-adders to obtain a full-adder. We will illustrate the second method. Using the expressions above and the half-adder expressions, we obtain the circuit for the full-adder shown in Figure 4.14, which may also be drawn as a block diagram.
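The two half-adder construction can likewise be checked exhaustively (a sketch of ours):

    def half_adder(x, y):
        return x ^ y, x & y          # as in the previous sketch

    def full_adder(x, y, z):
        """Full-adder from two half-adders plus an OR gate, as in Figure 4.14."""
        s1, c1 = half_adder(x, y)    # first half-adder: augend and addend
        s2, c2 = half_adder(s1, z)   # second half-adder: partial sum and carry-in
        return s2, c1 | c2           # sum s', carry c'

    for x in (0, 1):                 # reproduces Table 4.12
        for y in (0, 1):
            for z in (0, 1):
                print(x, y, z, *full_adder(x, y, z))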

FIGURE 4.13 K-maps for the full-adder. For s′ (columns xy = 00, 01, 11, 10): row z = 0 is 0 1 0 1, row z = 1 is 1 0 1 0. For c′: row z = 0 is 0 0 1 0, row z = 1 is 0 1 1 1.

FIGURE 4.14 Full-adder (two half-adders: the first adds augend x and addend y; the second adds the partial sum and the carry-in z; an OR gate combines the two carries to give c′).

4.14.2 A Four-bit Adder

In the last subsection we saw how a full-adder can be constructed using two half-adders. We obtained a full-adder with a one-bit augend, a one-bit addend and a carry (from the previous addition) bit, giving a sum and a carry to be carried to the next bit position. We can use four such full-adders to add two four-bit numbers, as shown in Figure 4.15. Observe that this is a combinatorial circuit. It is called a ripple carry adder, as a carry generated in the least significant bit position ripples through the successive stages. In the worst case, for example when 0001 is added to 1111, the carry generated in the least significant bit position will be carried to the most significant bit position; the answer will not be ready till the carry propagates through all four stages. The propagation delay per stage is 3tp, where tp is the gate delay per level of gating.

Thus, the total delay for four stages will be 12tp. The case we considered is the worst case. There are other situations where the carry does not propagate beyond a bit position; for example, if we add 0001 to 1001, the carry generated in the least significant bit is added to the next significant bit and there are no more carries.

FIGURE 4.15 A four-stage ripple carry adder.

We can speed up the addition process if, at any bit position i, the sum can be computed using only the ai and bi values in that position and the carry from the (i-1)th stage, namely ci, and not the carries of any of the previous stages. We can build such an adder, whose carry propagation delay is fixed regardless of the value of n, provided we are willing to construct a fairly complicated combinatorial circuit to generate the final carry bit. This is called a carry look ahead adder. In other words, for an n stage adder one should be able to obtain the carry out of the final stage, cn+1, using only the values of ai and bi for i = 1 to n. The method of designing such a combinatorial circuit will now be explained.

The sum si of stage i of an n stage adder can be expressed as follows:

si = āi·b̄i·ci + āi·bi·c̄i + ai·b̄i·c̄i + ai·bi·ci = ai ⊕ bi ⊕ ci     (4.8)
ci+1 = ai·bi + ai·ci + bi·ci     (4.9)

If we call ai·bi = gi and (ai + bi) = pi, we can rewrite Eq. (4.9) for the carry out of the ith stage as Eq. (4.10):

ci+1 = gi + pi·ci     (4.10)

To see how we can generate the carry with only a combinatorial circuit, we can express the carries of stages 1, 2, 3, etc. as follows:

c1 = g0 (as c0 = 0)     (4.12)
c2 = g1 + p1·c1 = g1 + p1·g0 = f2(a1, b1, a0, b0)     (4.13)
c3 = g2 + p2·c2 = g2 + p2·g1 + p2·p1·g0 = f3(a2, b2, a1, b1, a0, b0)     (4.14)

c4 = g3 + p3·c3 = g3 + p3·g2 + p3·p2·g1 + p3·p2·p1·g0 = f4(a3, b3, a2, b2, a1, b1, a0, b0)     (4.15)

and in general,

cn+1 = gn + pn·gn-1 + pn·pn-1·gn-2 + ... + pn·pn-1·...·p1·g0     (4.16)

Thus, we see that the final carry can be generated using a combinatorial circuit. It requires only two levels of gating, and thus the delay in generating the carry is only two gate delays. The circuit of a four-bit carry look ahead adder is given in Figure 4.16. The blocks marked f2, f3 and f4 are the carry look ahead combinatorial circuits.

FIGURE 4.16 Structure of a carry look ahead 4-stage adder.
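Equations (4.8) to (4.16) can be exercised in software; a sketch (ours, with bit lists standing in for wires):

    def cla_4bit(a, b, c0=0):
        """Carry look ahead addition of 4-bit operands given as bit lists
        [x0..x3], least significant bit first; g = a AND b, p = a OR b."""
        g = [ai & bi for ai, bi in zip(a, b)]       # generate terms
        p = [ai | bi for ai, bi in zip(a, b)]       # propagate terms
        c = [c0]
        for i in range(4):                          # c_{i+1} = g_i + p_i.c_i;
            c.append(g[i] | (p[i] & c[i]))          # expanded out, each carry is
        s = [a[i] ^ b[i] ^ c[i] for i in range(4)]  # two gate levels in hardware
        return s, c[4]

    # 0111 + 0011 = 1010 (bit lists are least significant bit first)
    print(cla_4bit([1, 1, 1, 0], [1, 1, 0, 0]))     # ([0, 1, 0, 1], 0)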

Observe from Figure 4.16 that the delay in generating s0 and s1 is two gate delays each. The gate delay for generating s2 and s3 is, however, 3tp, as the combinatorial circuits f2 and f3 each use two levels of gating, leading to a delay of 2tp in generating their carries. The carry c4 has only two levels of gating, generated by f4; if we call a gate delay tp, its total delay is 2tp. Observe that the maximum delay is 3tp and is not a function of the number of bits to be added.

Developments in integrated circuit technology have made it possible to put the entire combinatorial circuit of Figure 4.16 in one chip. Such a chip is available commercially. It is called a 4-bit carry look ahead full adder chip (also known as the 74283 chip). It accepts a 4-bit addend, a 4-bit augend and the carry in, and generates a 4-bit sum and a carry to the next stage, as shown in Figure 4.17. The maximum gate delay of the chip is 3tp for obtaining the sums, as was explained earlier.

FIGURE 4.17 A 4-bit carry look ahead full-adder chip (inputs c0, a0-a3, b0-b3; outputs s0-s3 and c4).

We can obtain a 16-bit adder by cascading four such chips, as shown in Figure 4.18. The symbol 4 in the figure indicates that 4 bits are carried on 4 parallel lines which are inputs/outputs of the chip. When we cascade four chips, the time delay in generating the final carry c16 will be 4 × 2tp = 8tp, which is much smaller than the delay of 48tp if 16 full-adders are cascaded to add two 16-bit numbers.

FIGURE 4.18 A 16-bit hybrid adder.

There is no need to consider subtractors separately, as we saw that subtraction is nothing but taking the 2's complement of the subtrahend and adding it to the minuend. Thus, no separate chips are made, and the same unit is used as an adder/subtractor using the rules explained in Section 4.6.

4.14.3 MSI Arithmetic Logic Unit

An MSI arithmetic logic unit (ALU) is a combinatorial circuit, integrated in a single medium scale integrated circuit, which can perform one of a set of specified arithmetic or logical operations on a pair of n-bit operands. A typical MSI ALU has two four-bit operands and five bits for operation selection; thus, 32 different operations can be performed on a pair of four-bit operand inputs. In Figure 4.19 we give the logic symbol of the model SN74S181, a four-bit ALU chip of Texas Instruments.

FIGURE 4.19 Schematic of ALU SN74S181 of Texas Instruments.

The input operands are a0, a1, a2, a3 and b0, b1, b2, b3, together with cn (the carry in). The operation to be performed is selected by the bits s0, s1, s2, s3 and m. The outputs f0, f1, f2, f3 are the result of performing the specified operation on the input operands. The other outputs are cn+4, which is the carry out, and g and p, the generate and propagate outputs of addition, corresponding to the g and p of the carry look ahead adder explained in the last section. They are useful if a number of these chips are to be cascaded to add, for example, 16-bit operands: they can be used in a carry look ahead adder circuit to reduce the propagation delay. For 16-bit addition we use four of these chips and one carry look ahead chip. The data sheets specify a 4-bit add time of 11 ns if one of these is used, and 18 ns if two are used with ripple carry from the first 4-bit chip to the second chip; this configuration produces sum and carry in 19 ns.

In this MSI circuit observe that the inputs are a0-a3 and b0-b3 and the outputs are f0-f3. When logic operations are selected, the operations are carried out simultaneously (bit-wise) on a0-a3 and b0-b3, and the carry input is ignored. The carry input, carry output, g and p are complemented, assuming active high is 1 (the convention we have used throughout this book). Another output, a = b, is asserted if a = b.

Table 4.13 gives the functions performed by the SN74S181 ALU, adapted from the Texas Instruments data sheet. In the table, a and b represent the two 4-bit input operands and f represents the 4-bit output. Observe that m = 0 selects arithmetic operations and m = 1 selects logic operations; thus there are a total of 32 operations. Observe also that all the 16 binary operations on two Boolean operands (explained in Chapter 3) are available.

TABLE 4.13 Functions Performed by ALU SN74S181

s3 s2 s1 s0    m = 0 (Arithmetic)                  m = 1 (Logical)
0  0  0  0     f = a plus cn                       f = ā
0  0  0  1     f = (a + b) plus cn                 f = (a + b)‾
0  0  1  0     f = (a + b̄) plus cn                 f = ā·b
0  0  1  1     f = 0 minus c̄n                      f = 0000
0  1  0  0     f = a plus (a·b̄) plus cn            f = (a·b)‾
0  1  0  1     f = (a + b) plus (a·b̄) plus cn      f = b̄
0  1  1  0     f = a minus b minus c̄n              f = a ⊕ b
0  1  1  1     f = (a·b̄) minus c̄n                  f = a·b̄
1  0  0  0     f = a plus (a·b) plus cn            f = ā + b
1  0  0  1     f = a plus b plus cn                f = (a ⊕ b)‾
1  0  1  0     f = (a + b̄) plus (a·b) plus cn      f = b
1  0  1  1     f = (a·b) minus c̄n                  f = a·b
1  1  0  0     f = a plus a plus cn                f = 1111
1  1  0  1     f = (a + b) plus a plus cn          f = a + b̄
1  1  1  0     f = (a + b̄) plus a plus cn          f = a + b
1  1  1  1     f = a minus c̄n                      f = a

With m = 0, to perform two's complement addition we select the appropriate code from Table 4.13 for a plus b plus cn (cn will normally be 0). To perform two's complement subtraction, we select the code s3-s0 = 0110, which gives a minus b minus c̄n, with cn = 1, as cn acts as the complement of the borrow during subtraction.

The other operations, such as a plus cn and a minus c̄n, are useful to increment (with cn = 1) and decrement (with cn = 0) an input operand. There are many more arithmetic operations in the table; these come as a bonus from intermediate variables, and many of them are rarely used.

There is also a simpler MSI ALU with only eight operations, shown in Figure 4.20 and Table 4.14.

FIGURE 4.20 A simpler MSI ALU chip SN74S381 (inputs: operands a0-a3, b0-b3, carry in cn, selection bits s0-s2; outputs: f0-f3, g and p).

TABLE 4.14 Functions Performed by ALU SN74S381

s2 s1 s0    Function
0  0  0     f = 0000
0  0  1     f = b minus a minus 1 plus cn
0  1  0     f = a minus b minus 1 plus cn
0  1  1     f = a plus b plus cn
1  0  0     f = a ⊕ b
1  0  1     f = a + b
1  1  0     f = a·b
1  1  1     f = 1111
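A behavioural sketch of such a function-select ALU (ours; it models only a few of the 32 codes, with 4-bit wrap-around arithmetic) may make the selection mechanism concrete:

    def alu(a, b, cn, m, s):
        """A few SN74S181-style codes; a, b are 4-bit integers.
        Returns (f, carry out)."""
        if m == 1:                                 # logic mode: bit-wise, no carry
            f = {0b0110: a ^ b,                    # f = a XOR b
                 0b1011: a & b,                    # f = a AND b
                 0b1110: a | b}[s]                 # f = a OR b
            return f & 0xF, 0
        if s == 0b1001:                            # f = a plus b plus cn
            t = a + b + cn
        elif s == 0b0110:                          # f = a minus b minus ~cn
            t = a + ((~b) & 0xF) + cn              # 2's complement: 1's comp + cn
        else:
            raise NotImplementedError("code not modelled in this sketch")
        return t & 0xF, t >> 4

    print(alu(0b0111, 0b0011, 1, 0, 0b0110))       # 7 - 3 = 4: (4, 1)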

4.15 A COMBINATORIAL CIRCUIT FOR MULTIPLICATION

The algorithm for multiplication was presented in Section 4.7 as a primarily sequential one. Multiplication is, however, not inherently sequential, as all the partial products can be added by combinatorial adders. Remember that an AND gate is a multiplier of two 1-bit operands. This is illustrated in Figure 4.21 for the 4 × 4 multiplication of a3, a2, a1, a0 by b3, b2, b1, b0. The final product is computed as follows:

                          b0a3  b0a2  b0a1  b0a0    First partial product
                    b1a3  b1a2  b1a1  b1a0          Second partial product
              b2a3  b2a2  b2a1  b2a0                Third partial product
        b3a3  b3a2  b3a1  b3a0                      Fourth partial product
  p7    p6    p5    p4    p3    p2    p1    p0      Final product

FIGURE 4.21 Multiplication of two 4-bit operands.

The product bits are:

p0 = b0a0
p1 = b0a1 + b1a0
p2 = cp1 + b0a2 + b1a1 + b2a0
p3 = cp2 + b0a3 + b1a2 + b2a1 + b3a0
p4 = cp3 + b1a3 + b2a2 + b3a1
p5 = cp4 + b2a3 + b3a2
p6 = cp5 + b3a3
p7 = cp6

where cpi denotes the carries generated while computing pi. The equations given above can be implemented by AND gates (the one-bit multipliers) and full adders, as shown in Figure 4.22. A total of 16 AND gates and 12 full adders are used by this multiplier. In general, for (n × n)-bit multiplication, n² AND gates and n(n - 1) full-adders are needed. For multiplying two 8-bit operands we need 64 AND gates and 56 full-adders. This is very large, and thus sequential multipliers which implement Algorithm 4.1 are often used instead; in that case the same adder is used repeatedly, which is economical but slower. We will discuss sequential implementation in the next chapter. Observe the regularity of the structure of this multiplier; such regular structures are easy to implement as an integrated circuit.
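The column-by-column formation of the product can be simulated directly (a sketch of ours; the AND of two bits plays the role of the one-bit multiplier):

    def comb_multiply(a, b):
        """4 x 4 multiplication: AND gates form the partial products
        b_j.a_i, which are summed column by column with carries."""
        pp = [[(b >> j & 1) & (a >> i & 1) for i in range(4)] for j in range(4)]
        p, carry = [], 0
        for col in range(8):             # column k collects all b_j.a_i with i+j = k
            t = carry + sum(pp[j][col - j] for j in range(4) if 0 <= col - j < 4)
            p.append(t & 1)              # product bit p_k
            carry = t >> 1               # carries into the next columns
        return p                         # p[0] = p0 (LSB) ... p[7] = p7

    print(comb_multiply(0b1011, 0b1101))  # 11 * 13 = 143 = 10001111: [1,1,1,1,0,0,0,1]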

FIGURE 4.22 A combinatorial multiplier to multiply two 4-bit numbers.

We have implicitly assumed in this section that the two binary numbers to be multiplied are represented in sign magnitude form.

SUMMARY

1. An adder which adds two bits and finds the sum and carry (if any) is called a half-adder.
2. An adder which adds two bits and the carry bit (if any) propagated by the addition of the bits to its right, finding the sum and carry bit (if any), is called a full-adder.
3. Subtraction can be performed by adding the one's or two's complement of the subtrahend to the minuend. How to represent negative numbers using the one's and two's complement systems is described in Section 4.4.

4. Decision tables can be used to summarize the rules to add/subtract two operands represented in one's or two's complement notation. Tables 4.7 and 4.8 are the appropriate tables.
5. Multiplication of two binary numbers represented in sign magnitude notation is simple. The sign of the result is the exclusive OR of the signs of the two operands. The magnitudes are multiplied using the long-hand multiplication method; Algorithm 4.1 gives the details. It can be implemented using a multiplicand register whose length equals the number of bits n in the multiplicand, and a register called ACC.MQ of length 2n + 1 to store the product. The n least significant bits of ACC.MQ are used to store the multiplier.
6. Division of two numbers is similar to long-hand division. Just as in multiplication, it can be implemented using an n-bit register to store the divisor and a (2n + 1)-bit register whose n least significant bits store the dividend and, later, the quotient. Algorithm 4.2 gives the details.
7. Integers are normally represented using 16, 32 or 64 bits in most computers. When integers are loaded into the CPU from memory, multiple bytes are transferred. If the most significant bits are stored in the lowest memory address it is called the big-endian representation, and if the least significant bits are stored in the lowest memory address it is called the little-endian representation.
8. Real or floating point numbers are stored in computers using a representation with a fractional mantissa (also called significand) and an integer exponent. The mantissa and exponent may have independent signs. The fractional mantissa is normalized to make its most significant bit 1, adjusting the exponent.
9. If 32 bits are used to store floating point numbers, IEEE has standardized the number of bits to be used for the significand and the exponent: 23 bits represent the significand, one bit its sign, and eight bits the exponent. The standard assumes a virtual 24th bit as the most significant bit of the significand, which is always 1 (a normalization requirement). A set of bits stored in this format is interpreted as

(-1)^s × (1 + s22 2^-1 + s21 2^-2 + ... + s0 2^-23) × 2^(exponent - 127)

where s22, s21, etc. are the 22nd, 21st, etc. bits of the significand.
10. There is no separate sign for the exponent. It uses the excess 127 format, in which all exponents greater than 127 are treated as positive and all exponents less than 127 are treated as negative. The range of numbers is ±2.0 × 2^-127 to ±2.0 × 2^128.
11. Standards for double precision (64 bits) and extended real (80 bits) numbers have also been defined by IEEE. Details are given in Table 4.10.
12. Arithmetic with floating point operands is straightforward. To multiply, the mantissas are multiplied and the exponents added; the mantissa of the product is normalized, adjusting the exponent. If the exponent exceeds the maximum size of the exponent, the result is declared incorrect due to exponent overflow. If the exponent is smaller than the smallest exponent which can be stored, then also the result is declared incorrect (exponent underflow).

13. To divide, the mantissa of the dividend is divided by that of the divisor and the exponents are subtracted. The resulting mantissa is normalized, adjusting the exponent. Overflow or underflow of exponents may occur (similar to multiplication), in which case an error is declared.
14. To add/subtract, the exponents of the two operands should be equal. If they are not, the mantissa of the number with the smaller exponent is shifted right by the number of bits needed to make its exponent equal to the larger exponent. The mantissas are then added/subtracted and the result is normalized. If the exponent lies outside the allowed range, the result is declared incorrect. Detailed algorithms for the floating point arithmetic operations are given in Section 4.13.
15. Combinatorial circuits may be designed to realize add/subtract operations. The starting points are the appropriate truth tables; from these, logic circuits may be designed using the techniques discussed in Chapter 3.
16. A full-adder circuit has as inputs two one-bit operands and the carry (if any) from the bits to its right. If two operands, each 4 bits long, are to be added, 4 full-adders may be cascaded to construct a 4-bit adder. This is called a ripple carry adder, as a carry (if any) generated by adding the lesser significant bits ripples through the more significant bits. The time to add n bits will equal (in the worst case) 3ntp, where tp is the gate delay per level of gating.
17. To reduce this delay, an adder called a carry look ahead adder may be designed. It uses a complex combinatorial circuit to generate the final carry simultaneously with the sum bits. For an n bit adder the total delay is only 3tp and is independent of n. However, the combinatorial circuit to generate the carry is very complex, and the circuits become more complex and expensive as n increases.
18. Integrated circuit chips have been designed to realize a carry look ahead adder.
19. Medium-scale integrated circuits are available to realize arithmetic (add/subtract) and logic operations on n bit operands. In the text we have illustrated the functions of two such chips (ALU SN74S181 and SN74S381), which take two 4-bit operands as inputs.
20. Multiplication/division may be performed using either sequential circuits or combinatorial circuits.
21. In Section 4.15 we have illustrated how a combinatorial circuit can be designed for multiplying two 4-bit operands.

EXERCISES

1. Find the sum and difference of the following pairs of binary numbers:
(i) 111.01 and 110.111  (ii) 11.11 and 111.01  (iii) 110.01 and 10.01

2. Repeat Exercise 1 using 1's and 2's complement representation of numbers.
3. Multiply the following binary numbers using the algorithm developed in Section 4.7. Show all steps in your computation.
(i) 1110 × 0111  (ii) 101 × 010  (iii) 101110 × 101011  (iv) 101010 × 01010
4. Subtract the following decimal numbers using (a) 9's complement representation and (b) 10's complement representation:
(i) 23 - 12  (ii) 17 - 6  (iii) 23 - 29
5. Multiply the following numbers represented in two's complement form. Use first the method of Section 4.8; repeat using the Booth coding method. Show all steps in the calculation.
(i) 0011011 × 1011011  (ii) 110011 × 001100  (iii) 101101 × 110011  (iv) 011011 × 011101
6. Divide the following binary numbers using Algorithm 4.2. Show all steps in the calculation.
(i) 110/111  (ii) 0011/1011
7. A digital system is to store binary numbers in a 16-bit register. Assuming five bits are reserved for the exponent and 11 bits for the mantissa and sign, how would the following decimal numbers be represented in this system? (Assume a normalized floating point representation is used.)
(i) 183.654  (ii) 75 × 10^3
8. A binary computer uses 36-bit registers to store numbers. Eight bits are used for the exponent, and the exponent is represented in excess 128 form. Find the approximate range of decimal numbers handled by this computer.
9. Repeat Exercise 8 assuming the exponent is represented using a hexadecimal base.
10. A binary computer uses 48-bit registers to store numbers. If an exponent range of at least ±999 and 8 significant decimal digits are needed for the mantissa, find an appropriate scheme to represent floating point numbers in this system.
11. Represent the following decimal numbers in a 32-bit word in: (i) normalized floating point mode in excess 128 form, (ii) IBM floating point format, (iii) IEEE 754 format.
(a) 2500  (b) .25  (c) 35 × 10^-3
12. Represent the following numbers in double precision IEEE format (64 bits):
(i) 0.25  (ii) -12.4  (iii) .00048

13. Modify Algorithm 4.5 for addition/subtraction assuming the IEEE 754 floating point standard for 32-bit operands.
14. Algorithm 4.3 assumes that the most significant bit of the product should be 1 for normalization. Modify the algorithm using the IEEE 754 floating point standard for 32-bit operands.
15. Modify Algorithm 4.4 for division using the IEEE 754 floating point standard for 64-bit operands.
16. Obtain a combinatorial circuit to multiply two numbers, each of which is 3 bits long and has 1 bit for its sign. The output should have the right sign and magnitude.
17. Repeat Exercise 16 for division of two 3-bit operands.
18. Obtain a combinatorial circuit using a PAL to realize a 4-bit carry look ahead adder.
19. Obtain a PAL realization of the ALU SN74S181 function for the operation code s3, s2, s1, s0 = 1, 0, 0, 1 and m = 0. (Realize row 10 of Table 4.13 with m = 0.)
20. Obtain a MUX realization of ALU SN74S381 (Table 4.14) for row 4.
21. Repeat Exercise 19 for row 7 of Table 4.13.
22. Repeat Exercise 20 for row 3 of Table 4.14.

ARITHMETIC LOGIC UNIT–II

LEARNING OBJECTIVES

In this chapter we will learn:

- How to evolve Algorithmic State Machine (ASM) charts to express sequential algorithms relevant to the design of digital systems.
- How to develop ASM charts for common arithmetic/logic operations for their sequential implementation.
- How to express ASM charts in a Hardware Description Language (HDL).
- How to design logic circuits for common arithmetic/logic operations using ASM charts and the corresponding HDL descriptions.

5.1 INTRODUCTION

In the last chapter we explained various algorithms for performing arithmetic and logic operations in a computer, and how these operations can be realized using combinatorial circuits. Such circuits perform the operations fast but are expensive in terms of the hardware resources used. In this chapter we will explain how arithmetic and logic operations can be realized using sequential logic circuits. This method of realization is economical in its use of hardware but takes a longer time to perform the specified operation. Which of these two alternatives is appropriate depends on the specific application, and a designer has to use his judgement.

In general, digital systems (including ALUs) are built using building blocks which include registers, counters, adders, multiplexers, decoders, flip-flops, logic gates and clocks. The general methodology of design follows the steps enumerated below:

1. Given the requirements specification, an algorithm is evolved to perform the intended task.
2. From the algorithm, the size and number of registers needed to store data are determined, and the kinds of operational units, such as adders and comparators, which are necessary to perform the required processing are specified.
3. For each step of the selected algorithm, the details of performing that step using the operational units are worked out. The operations that could be performed simultaneously (i.e., in parallel), so as to minimize the overall time required to complete the steps, are identified.
4. The timing signals which will control the sequencing of the different steps of the algorithm are generated, and the controller to generate the timing signals is synthesized.
5. The design is completed and the system is tested with appropriately selected test inputs.

In this chapter we will illustrate the design of some simple digital systems (in particular ALUs) following the above steps. Algorithms for digital data processing will be developed and presented using a chart known as an Algorithmic State Machine (ASM) chart.

5.2 ALGORITHMIC STATE MACHINE

The design of a digital system can be divided into two distinct parts. One part is to design a data processor to process the data stored in registers, and the other is to design a controller to correctly sequence the operations to be performed by the data processor. A general block diagram of a digital system is shown in Figure 5.1. It consists of a data processor which transforms given inputs into outputs, and a controller which issues control commands to the data processor and receives status information from it.

FIGURE 5.1 Generalized block diagram of a digital system (external inputs feed the controller; the controller sends control commands to the data processor and receives status information; the data processor transforms inputs into outputs).

The sequence of operations carried out by the data processor is controlled by a controller. The design of sequential circuits studied in Chapter 3 is primarily useful for designing the control signals. We will introduce another design method, known as the Algorithmic State Machine chart, which assists in developing digital systems. An ASM chart is similar to the flow charts used to describe data processing algorithms. It is, however, interpreted differently: it specifies which operations are carried out simultaneously in a given state during one clock interval, and when one advances to the next state. In clocked synchronous systems, all operations are controlled by a central clock.

An ASM chart is drawn using the three types of boxes shown in Figure 5.2. The rectangular box is known as the state box. Inside the box, the operations to be carried out on registers when the system is in this state are given; all these operations are carried out simultaneously. The box may also contain the names of outputs which will be initiated during this state. The state box is given a label, written at its top left-hand corner. Each state box is also assigned a binary code (normally during hardware realization), written at its upper right-hand corner. For example, the state box shown in Figure 5.2(a) is labelled T2 with binary code 010. Its operations are clearing a register A to 0 and incrementing register B (INCR B); a signal LOAD is initiated during this state.

The diamond shaped box [Figure 5.2(b)] specifies a decision to be taken, and is called a decision box. The condition to be tested is written inside the box. If it is true one path is taken; if it is false the other path is taken. Usually the condition to be tested is the contents of a flip-flop or the presence of a signal. In Figure 5.2(b) the condition is the status of flip-flop F. The two paths are indicated as 1 and 0, which correspond to the contents of the flip-flop: when F = 1 exit A is taken, and when F = 0 exit B is taken.

The third type of box used in an ASM chart is called a condition output box, and is drawn as a rectangle with rounded sides [Figure 5.2(c)]. Inside the condition output box, operations to be carried out on registers in the digital system are specified. These operations are carried out when the condition output box is reached, based on a decision taken in a preceding decision box; the input to a condition output box must come from the output of a decision box.

FIGURE 5.2 Symbols used in an algorithmic state machine: (a) state box, (b) decision box, (c) condition output box.

Observe the dotted rectangle enclosing a state box and the boxes connected to its exit, up to the next state box. This is called a block of an ASM chart. Each block in an ASM chart describes the events which take place during one clock interval, and all the operations in a block are carried out simultaneously during that interval.

Consider the ASM chart of Figure 5.3; it uses all the symbols defined in Figure 5.2. Observe that there are four blocks in the chart. In state T1, the operations S ← 0, Q ← 0, the initiation of a START signal, and the setting of L ← 0 if P = 1 or E ← 1 if P = 0, are all performed in one clock interval [see Figure 5.4(a)]. The end of this clock transfers the system to state T2 or T3, depending on the value of P: if P = 0, the next state is T2, else it is T3. In state T2, A is incremented and X is decremented during one clock interval (i.e., carried out simultaneously).

The ASM chart is a generalization of the state transition diagrams used in designing sequential circuits. Sometimes a state transition diagram obtained from the ASM chart is useful in designing the control logic of the digital system. The state transition diagram corresponding to the ASM chart of Figure 5.3 is shown in Figure 5.4(b).

FIGURE 5.3 An ASM chart.

FIGURE 5.4 State transition and operations during a clock interval: (a) the operations performed in state T1 during clock 1 (S ← 0, Q ← 0; E ← 1 if P = 0, L ← 0 if P = 1), leading to state T2 or T3 at clock 2; (b) the corresponding state transition diagram (T1 = 001, T2 = 010, T3 = 011, T4 = 100).

We will now consider an example to illustrate how an ASM chart is obtained from a problem specification.

EXAMPLE 5.1

Two 4-bit registers X and Y are loaded with values. On a start signal, X and Y are compared. If X > Y, a flip-flop A is to be set to 1. If X < Y, a flip-flop B is to be set to 1. If X = Y, a flip-flop E is to be set to 1. This logic system is called a comparator.

The method is to load X and Y, clear all the flip-flops and give a start signal. (It is assumed that the bits of X and Y are respectively X3 X2 X1 X0 and Y3 Y2 Y1 Y0.) On start, the most significant bits of X and Y, namely X3 and Y3, are compared. If X3 = 1 and Y3 = 0, then X > Y. If X3 = 0 and Y3 = 1, then X < Y. If X3 = Y3 = 1 or X3 = Y3 = 0, then it is not possible to decide at this stage whether X > Y, X = Y or X < Y. We shift both X and Y left by one bit and compare the bits moved into X3 and Y3. At some stage, if X3 and Y3 are not equal, we can find out which is larger. If after three shifts there is no difference, then X = Y. This method is specified as an ASM chart in Figure 5.5.

Observe that in state T0 of the ASM chart, the X and Y registers are loaded with inputs, and the flip-flops A, B and E are set to 0. C is a 2-bit counter used to count the number of shifts of registers X and Y; it is also set to 0. All this takes place in one clock pulse interval, and at the end of the interval the system transits to state T1.

Observe that the state box labelled T1 has no operations specified inside it. This does not mean that nothing is done in state T1: in this state the most significant bits of the X and Y registers are compared. Based on this comparison, flip-flop A is set to 1 if X > Y, and B is set to 1 if X < Y. To show that the operation of comparison is done in state T1, all these operations are enclosed in a dotted box. This is allowed in ASM charts.

If the most significant bits of X and Y are equal, then the other bits of X and Y should be checked. To do this, the system transits to state T2 at the end of the clock interval. In state T2, registers X and Y are shifted left by one bit and the counter C is incremented, as one comparison has been completed. If all four bits have been compared, C will have 11 stored in it. This is checked by the condition box (see Figure 5.5). If C1·C0 = 0, then more bits of X and Y are still to be examined; the system goes back to state T1 and the most significant bits are compared again. If C1·C0 = 1, then all four bits have been compared and found equal; E is set to 1 and the algorithm terminates.

FIGURE 5.5 ASM chart for a comparator.
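The behaviour specified by the chart can be checked with a quick simulation (a sketch of ours):

    def comparator(x, y):
        """Simulate the comparator ASM of Figure 5.5 on 4-bit x and y.
        Returns the flip-flops (A, B, E)."""
        for _ in range(4):                    # states T1/T2; C counts the shifts
            x3, y3 = (x >> 3) & 1, (y >> 3) & 1
            if x3 != y3:                      # decided in T1: set A or B
                return (1, 0, 0) if x3 else (0, 1, 0)
            x, y = (x << 1) & 0xF, (y << 1) & 0xF   # T2: SHL X, SHL Y, INCR C
        return (0, 0, 1)                      # all bits equal: E = 1

    print(comparator(0b1010, 0b0111))   # X > Y: (1, 0, 0)
    print(comparator(0b0101, 0b0101))   # X = Y: (0, 0, 1)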

EXAMPLE 5.2

We will now obtain an ASM chart for a serial adder. It is required to add a 4-bit number stored in a register A to another 4-bit number stored in a register B. The ASM chart for this adder is given in Figure 5.6.

Observe that in state T0 an externally generated signal S is tested. If S is 1, the numbers to be added are loaded into registers A and B, and the carry flip-flop C and the counter Q are cleared. All the operations indicated in state box T0 are carried out during one clock pulse. In the next clock pulse, state box T1 is executed. Q is a 3-bit counter, and Q2 equals 1 at the end of four pulses; T1 is executed repeatedly until Q2 = 1. Thus, T1 is executed four times, adding the four bits of the operands A and B. As soon as the counting is over, the signal S is reset to 0 and control returns to T0 and waits. When S is set, another addition starts.

FIGURE 5.6 ASM chart for a 4-bit serial adder. (In T1 the registers A and B are shifted right, A3 ← C ⊕ (A0 ⊕ B0), C ← A0·B0 + A0·C + B0·C, and Q is incremented; the loop repeats until Q2 = 1.)

EXAMPLE 5.3

Binary Multiplier: Let us consider the design of a binary multiplier. The numbers to be multiplied are assumed to be represented in sign magnitude form, with one bit for the sign and four bits for the magnitude. We need two registers to hold the multiplier and the multiplicand with their signs, and a long register of five bits to hold the unsigned partial sums obtained during multiplication. We call the registers:

MQ   :  Multiplier register
SMQ  :  Sign of MQ
MD   :  Multiplicand register
SMD  :  Sign of MD
AC   :  Accumulator register (holds partial sums)
R    :  AC and MQ registers concatenated (holds partial products)
SR   :  Sign bit of the result

The result is stored with the four least significant bits in MQ, the four most significant bits in AC, and the sign bit in SR. The multiplication procedure itself is shown in the ASM chart of Figure 5.8. It is based on the long-hand multiplication method explained in Algorithm 4.1. The reader should observe the following three facts about this method:

1. The steps are repeated four times, where four is the length of the register, which is the same as the length of the multiplier or multiplicand.
2. For every step we need the multiplicand and one bit of the multiplier (taken from right to left) to generate the partial product. After that bit of the multiplier is used, we will not require it again. In each step we add the four bits of MD to AC. This partial sum can be four bits or five bits long; thus, AC must be five bits long.
3. The right shift by 1 bit of AC and MQ together is essential to mimic the long-hand multiplication method. This suggests that AC concatenated with MQ could be a shift register whose contents are shifted right after every step. This is why we give a name, R, to AC concatenated with MQ.

A schematic of the registers used by the multiplier system is shown in Figure 5.7.

FIGURE 5.7 Schematic of a four-bit multiplier. (AC, 5 bits including the carry bit of the partial sum, and MQ, 4 bits, form the shift register R; a 4-bit parallel adder adds MD to AC when MQ0 = 1; the signals SM, Busy and the clock CP control the system, with SR, SMQ and SMD holding the signs.)

The above facts are taken into account in obtaining the ASM chart of the multiplier subsystem shown in Figure 5.8. The start multiplication signal SM is supplied by an external event, and the clock input is used to generate the timing signals. A three-bit counter C is used to count the iterations.

In the ASM chart, in state T0, when SM becomes 1 the operands and their signs are loaded, the counter is initialized and multiplication is started. A flip-flop BUSY is set to 1 to indicate that the multiplier is busy. In state T1 the least significant bit of MQ, namely MQ0, is checked. If it is 1, the contents of AC and MD are added by a parallel adder (i.e., one which adds all the bits simultaneously in one clock cycle) and the result is stored in AC. If it is 0, nothing is done.

FIGURE 5.8 ASM chart for multiplier. (T0: on SM = 1, load MD, MQ, SMD, SMQ; AC ← 0; C ← 0; Busy ← 1. T1: if MQ0 = 1 then AC ← AC + MD. T2: SHR R; INCR C; if C2 = 1 then SR ← SMQ ⊕ SMD, SM ← 0, Busy ← 0, else return to T1.)

The clock period must be chosen in such a way that the slowest operation is completed within this period. In this example the parallel addition will take the maximum time; the clock period must therefore be larger than this.

In state T2 the bits of AC and the bits of MQ (excluding its sign bit SMQ) are considered as a single unit called the R register. Observe that R is not a new register, but a name used for convenience of description. R is shifted right by one bit. Thus, the least significant bit of AC is shifted into the most significant bit of MQ, and the least significant bit of MQ (which has already been used to multiply) is lost. As one partial product has been computed, the counter C is incremented by 1. When C = 4, C2, the most significant bit of C, becomes 1 and the multiplication is over. The sign bit of the result, SR, is obtained by exclusive ORing the sign bits of MQ and MD, and SM and BUSY are reset to 0.

5.3 ALGORITHMIC REPRESENTATION OF ASM CHARTS

The ASM chart is similar to a flow chart and assists in developing and understanding digital systems. Its main disadvantage is that it is time consuming to draw, and for large systems it becomes cumbersome. Thus, usually a hardware-oriented programming language, which has facilities to express the specific features of hardware algorithms, is used. Such a language is known as a hardware description language. Several such languages have been defined over the years and many have been standardized. Among them, two languages known as VHDL (Very High Speed Integrated Circuit Hardware Description Language) and Verilog are very popular; these languages are complex. Another advantage of a language with strict syntax and semantic rules is that it is possible to write a translator for a computer to translate it and obtain a schematic of the corresponding digital system, or even the layout of an integrated circuit to realize it.

We will use a simple informal language to describe ASM charts. We use labels to represent the states in the ASM chart, and an if then else structure to represent the decision boxes. In the if then else construct, the operations which are carried out when the predicate is true/false are enclosed in curly brackets. Besides this, we provide methods of representing operations carried out simultaneously in one clock cycle and those which are sequential: all the operations enclosed in square brackets take place simultaneously during one clock period, and after completing the operations in a square bracket the system automatically transits to the next block. The individual operations are separated by semicolons. Comments are enclosed by /* */.

The language has features to declare flip-flops, registers, counters, etc., which are the building blocks of digital systems; such declarations specify the storage elements used in the digital system. The notation MD [3 .. 0] indicates that MD is a 4-bit register with least significant bit MD0 and most significant bit MD3. The declaration counter C [2 .. 0] declares that C is a 3-bit counter. Operations such as shift left (SHL) and shift right (SHR) are allowed with registers; increment (INCR), decrement (DECR) and clear (CLR) are valid for counters. The declaration of a concatenated register uses the symbol dot (.) to indicate that the registers AC [4 .. 0] and MQ [3 .. 0] may be treated as joined together into one unit called R. Concatenated registers may be shifted left or right; in such a case the system shifts the bits belonging to the individual registers which make up the concatenated register.

Using these basic features of hardware description, we express the ASM chart of Figure 5.8 as the algorithm shown in Figure 5.9.

Declarations
Registers: MD [3 .. 0], MQ [3 .. 0], AC [4 .. 0]
Flip-flops: SM, BUSY, SR, SMD, SMQ /* SMD stores the sign of the multiplicand, SMQ the sign of the multiplier and SR the sign of the result */
Counter: C [2 .. 0]
Concatenated register: R [8 .. 0] := AC [4 .. 0] . MQ [3 .. 0]

Procedure
T0: [if SM then {MD ← input; MQ ← input; SMD ← input; SMQ ← input;
     AC ← 0; C ← 0; SR ← 0; BUSY ← 1} else T0]
T1: [if MQ0 then AC ← AC + MD /* This is a parallel addition. After completion the system goes to T2 automatically */]
T2: [SHR R; INCR C;
     if C2 then {SR ← SMQ ⊕ SMD; SM ← 0; BUSY ← 0; exit} else T1]

FIGURE 5.9 ASM chart of multiplier expressed in a hardware description language.

Observe the declarations made at the beginning of the algorithm, and note that in the concatenated register R7 = AC3 and R0 = MQ0. The algorithm of Figure 5.9 is a straightforward representation of the procedure depicted in the ASM chart of Figure 5.8. This algorithm is very useful in synthesizing the logic diagram of the multiplier; we will do this later in this chapter. We now express the ASM chart of Figure 5.5 (the comparator) as an algorithm in Figure 5.10.

Declarations
Registers: X [3 .. 0], Y [3 .. 0]
Counter: C [1 .. 0]
Flip-flops: A, B, E

Procedure
T0: [X ← input; Y ← input; A ← 0; B ← 0; E ← 0; C ← 0]
T1: [if X3 then {if Y3 then T2 else A ← 1, exit}
     else {if Y3 then B ← 1, exit else T2}]
T2: [INCR C; SHL X; SHL Y;
     if C1·C0 then E ← 1, exit else T1]

FIGURE 5.10 Algorithm for a comparator.
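Either procedure can be exercised in software before synthesis. A sketch (ours) of the multiplier of Figure 5.9:

    def multiply(md, mq, smd, smq):
        """Simulate Figure 5.9: 4-bit sign magnitude multiplication.
        AC.MQ acts as the 9-bit shift register R."""
        ac = 0
        for _ in range(4):                 # T1/T2 repeated until C2 = 1
            if mq & 1:                     # T1: if MQ0 then AC <- AC + MD
                ac += md                   # parallel addition, 5-bit result
            r = ((ac << 4) | mq) >> 1      # T2: SHR R
            ac, mq = r >> 4, r & 0xF
        return smd ^ smq, (ac << 4) | mq   # SR = SMD xor SMQ, 8-bit product

    print(multiply(0b1011, 0b1101, 0, 1))  # 11 * 13 = 143, negative: (1, 143)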

The ASM chart for the 4-bit serial adder (Figure 5.6) is expressed as an algorithm in Figure 5.11.

Declarations
Registers: A [3 .. 0], B [3 .. 0]
Counter: Q [2 .. 0]
Flip-flops: C, S

Procedure
T0: [if S then {A ← input; B ← input; C ← 0; Q ← 0} else T0]
T1: [SHR A; SHR B; A3 ← C ⊕ (A0 ⊕ B0);
     C ← A0·B0 + A0·C + B0·C; INCR Q;
     if Q2 then {S ← 0; T0} else T1]

FIGURE 5.11 Algorithm for a 4-bit serial adder.
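A similar software check (ours) of the serial adder of Figure 5.11:

    def serial_add(a, b):
        """Add 4-bit a and b one bit per clock, per Figure 5.11.
        The sum replaces A; the final carry stays in flip-flop C."""
        c = 0
        for _ in range(4):                        # T1 executed four times
            a0, b0 = a & 1, b & 1
            s = c ^ (a0 ^ b0)                     # A3 <- C xor (A0 xor B0)
            c = (a0 & b0) | (a0 & c) | (b0 & c)   # C <- A0.B0 + A0.C + B0.C
            a = (a >> 1) | (s << 3)               # SHR A, sum bit enters at A3
            b >>= 1                               # SHR B
        return a, c

    print(serial_add(0b1011, 0b0110))   # 11 + 6 = 17: sum bits 0001, carry 1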

T0 X Input T1 T2 CP X3 X2 X1 X0 SHL Y3 Y2 Y1 Y0 SHL Y Input Counter C1 C0 X3 . The entire system is driven by a single clock. In Figure 5.13 Data processor block diagram for a comparator. B and E are indicated. flip-flops. Y3 J K CK B B Data processor C1 . indicated in the decision boxes are controlled by both the state in which the operations take place and the path from the condition box. the operations on the registers X and Y. the counter C and the flip-flops A. Y3 CK A K Count up 0 J A X3 . C0 J K CK C E FIGURE 5. This is shown in Figure 5. The operations on registers indicated in the state boxes in ASM are controlled by the controller output corresponding to that state. The operations on the registers.13. which is the algorithm corresponding to the ASM chart. .10. etc.. 4.160 Computer Organization and Architecture 3.

5 and the equivalent algorithm of Figure 5.15. T1 and T2 by the binary tuples 00. . We need two flip-flops P and Q to represent states T0. The controller is synthesized using the method explained in Chapter 4 of [31]. The controller outputs are T0. In other words.14 State transition diagram to design the controller of a comparator. T1. Q P 0 0 00 1 01 T0 11 T1 T2 1 (a) CP X3 ≈ Y 3 C1 .14.Arithmetic Logic Unit–II 161 5. C0 K J CK P P T1 Q T2 Q T0 J K CK (b) FIGURE 5. C0 FIGURE 5.10. C0 is true. We obtain the block diagram for the controller shown in Figure 5. T2. T1. They are the three states of the algorithmic state machine. In the chart we have coded the states T0.15 Synthesized controller for comparator. These state transitions are represented by the state transition diagram of Figure 5.10) we see that T2 to T1 transition happens when C1 .10 we see that the transition from T0 to T1 is unconditional. The controller is synthesized as follows. Y3 + X3 ¹ Y3 is true. 11 respectively. This is used to synthesize the controller. 01. The transitions from state T0 to T1 as well as from state T1 to T2 are governed by the ASM algorithm represented by the ASM chart of Figure 5. T2. By inspection of the algorithm of Figure 5. T0 00 X3 ≈ Y3 11 T2 T1 01 C1 . The transition from T1 to T2 takes place either when both X3 and Y3 are true or when both X3 and Y3 are false. the transition from T1 to T2 takes place when X3 . Again by inspection of the algorithm (Figure 5.

During this state.16 A 4-bit serial adder. T0 Input A T1 CP A3 A2 A1 A0 Input B SHR B3 B2 B1 B0 SHR CLR Q2 Q1 Q0 INCR J K CK C Full adder Start Add J K CK S FIGURE 5. The counter will have three bits as four clock pulses are needed for adding four bits and the fifth clock pulse indicates end of addition. the next step is to design the controller to generate the control signals.16. resetting the system to start another add operation. By inspection of the algorithm we see that the synthesis of the data processing part of the algorithm is straightforward and is shown in Figure 5. The carry bit (if any) will be in the carry flipflop and the four bits of the sum will replace the augend bits in the A register. T0 is to be sent to 0 during this phase. clears the carry flip-flop and the counter. The algorithm for the adder corresponding to this ASM chart was given in Figure 5. Once the data processor is designed. The addition operation takes place when T1 = 1. The sum will have four or five bits.11.162 Computer Organization and Architecture A 4-bit serial adder As a second example we will design a 4-bit serial adder whose ASM chart is given in Figure 5.6. Observe that a signal T0 = 1 corresponding to state T0 loads A and B registers. In order to do this we observe that an external start . Observe also that a full adder is used to add a pair of bits and carry (if any) during each clock pulse. T1 = 0.

The data processor without the control paths is obtained using the declarations. SM the output of a flip-flop which is set . As there are only two states we need only one flip-flop to synthesize the controller. Observe that the flip-flop P (which is the controller) is set when S = 1 making T0 = 1 which loads registers A and B and clears flip-flops C and Q2. the adder will be in idle state. A flip-flop S has as its J input ‘Start Add’ signal which initiates addition process. namely.17(a)].8. A 4-bit multiplier We first refer to the schematic of a 4-bit multiplier (Figure 5. FIGURE 5. Q2 becomes 1 and T1 = 1. The state transition chart is very simple and is shown in Figure 5.17(b).Arithmetic Logic Unit–II 163 signal will set flip-flop S.9. This signal is used to reset flip-flop S. This is shown in Figure 5. The end of the add operation is indicated when Q2 (the most significant bit of the counter) becomes 1. When Q2 is cleared Q2 becomes zero. The signal clears flip-flop P making T1 = 1 which initiates the addition process [see Figure 5.17(c). This is also needed as the second control input.7). When addition of four bits is completed. The controller of this adder has only two outputs T0 and T1 [see Figure 5. This flip-flop output is needed as an input to the controller. The ASM chart showing the sequence of operation is given in Figure 5.17 Design of a controller for a serial adder. Till the next ‘Start Add’ signal. Using these we have to synthesize the multiplier. The corresponding algorithm is given in Figure 5.17(c)]. The controller has two inputs.

SMQ to store the sign of the multiplier. the predicates which cause transitions between states in the multiplication algorithm (see Figure 5.164 Computer Organization and Architecture when ‘Start Multiply’ signal is given (Figure 5. The controller is synthesized using the state-transition diagram shown .18 A four-bit serial multiplier. SMD that of the multiplicand and SR that of the result. The adder in this implementation is a parallel adder which adds the multiplicand bits with the partial product stored in ACC . When flip-flop SM is set and the multiplication operation starts in the T0 state.7). This flipflop is cleared at the end of multiplication operation which is indicated when C2 bit of the counter becomes 1. another flip-flop called BUSY flip-flop is set to 1 to indicate that multiplication is in progress.18) and C2. There are three more flip-flops. MQ (concatenated) register. The controller has two inputs SM and C2. T1 and T2. The signs SMQ and SMD are loaded and SR cleared during T0. The outputs are T0.9). SM C2 CP CLR J K SR Controller T0 T1 T2 5 CLR Load 5 SHR ACC MQ 4 Parallel Adder CP J SMD CP J SMQ Load Initialise C CP C2 C1 C0 INCR 4 MD ADD CP 4 CP Start multiply J K CK SM J K CK Busy FIGURE 5. In the diagram for data processor (Figure 5. parallel transfer of bits from registers is shown by putting a slash across the data path and putting a number 4 adjoining it to show that there are four lines running in parallel.

00]. During T2 the sign of the result is set and the SM and BUSY flip-flops are cleared. multiplication and division. SM = 0 SM = 1 T1 01 T0 00 C2 = 0 C2 = 1 11 T2 FIGURE 5. MY Result: Z = SZ . EZ ... These are named: Operand 1: X = SX .18. MZ the signficands of operand 1. EX . In Chapter 4 we evolved algorithms for floating point addition. EY .Arithmetic Logic Unit–II 165 in Figure 5..0]. We assume that the significand of the two operands is 23 bits long and is normalized with the most significant bit being 1.18. Further. operand 2 and result respectively. MQ.00] = MX [22.4 is left as an exercise to the student. This is left as an exercise to the student. Observe that during T0. MY. SY. EY. We saw that the algorithms for multiplication and division are straightforward whereas that for addition/subtraction is more complicated due to the need to match exponents of the two operands (before proceeding with the addition/subtraction of the two significands). X[31] = SX. The exponent is eight bits long and represented in excess 128 notation. MZ We remind the reader that (.0] = SX . . SZ are the sign bits.. EZ the exponents and MX.) is the concatenation operator.5 given in Chapter 4 as the basis to obtain an algorithm for floating point add/subtract operation in our hardware description language.. EX. X[30…23] = EX. For example.0]. Developing the detailed logic diagram from the algorithm using the method presented in Section 5. EX [7. the operands are loaded and flip-flops cleared. During T1 addition takes place and result stored in ACC .19.5 FLOATING POINT ADDER Floating point arithmetic is nowadays a standard feature in general purpose computers. We then give a block diagram for add/subtract unit. SX. subtraction. MX [22. and X [22. 5. The multiplier logic diagram including the control paths is shown in Figure 5. In this section we will use Algorithm 4.19 State transition graph of the controller of the multiplier of Figure 5. We assume two 32-bit registers to store the two operands and a 32-bit register to store the result. Observe that T1 to T2 transition is automatic and takes place with clock. The significand has its own sign. All exponents >128 are positive. X[31. MX Operand 2: Y = SY . This controller uses this flip-flop and its synthesis is straightforward using the procedure explained in [31].

0].. Z = SZ ..*/ ERROR /* Set to 1 if overflow or underflow error */ Counter: C[7. The algorithm is given in Figure 5. BUSY ¬ 0. EX. SZ. BUSY ¬ 0. SA ¬ 0.0]. T0] T3: [TED ¬ EX – EY] /*Subtract exponents parallel subtractor */ T4: if (TED > 0) then [{ SHR MY by TED bits. EY.0) ¬ MY – MX. TEXP[8. BUSY ¬ 0. T0}   else T11 T9: [if TEMP[22] = 0 then {SHL TEMP. EX[7.. T0}] /* we have not shown details of shifting register. BUSY ¬ 0. TEMP [23. TED and TEXP store intermediate results */ Flip-flops: SX. MY. Y = SY. EY[7..      TED ¬ 0. TEXP ¬ 0.0]. We assume sign magnitude add/subtract and also assume that the significands are added or subtracted using a parallel adder/subtractor. Addition of signed significands can now proceed */ T5: [if (OP = 0) then S(Y) Ž S(Y) ] T6: [if (SX = SY) then   {TEMP[23. BUSY ¬ 0. TEXP ¬ TEXP – 1}] T10: [if TEMP [22] = 1 then T11     else if (TEXP > 0) then T9 else {ERROR ¬ 1. BUSY ¬ 0. OP = 0 for subtract */ SA. MZ Procedure T0: [if SA then { X ¬ input. Declarations Registers: MX[22. If in this process TEXP < 0 it indicates underflow error */ T11: [EZ ¬ TEXP. TEXP ¬ EY} if (MX = 0) then {Z ¬ Y SA ¬ 0. MZ ¬ TEMP. T0}] else if (TED < 0) then [{ SHR MX by TED bits... Z ¬ SZ .0] /* used to count shifts of significand */ Concatenated registers: X = SX. /*Start operation.0]. MZ.0] /*TEMP. T0] FIGURE 5. It is done by first setting counter C to a value equal to TED and shifting mantissa TED bits */ /*In T3 and T4 significands are aligned.20 Algorithm for floating point add/subtract using a hardware description language. during development of the result we will need some temporary registers. BUSY ¬ 0. TEXP ¬ EX}      if (MY = 0) then {Z ¬ X.0) ¬ MX – MY. ERROR ¬ 0. EZ. TEXP ¬ TEXP + 1}] T8: [if TEXP[8] = 1 then {ERROR ¬ 1.0]. . T0}] /* T9 and T10 are used to normalize result significand. SA ¬ 0. MY[22... T0] T2: [if MY = 0 then Z ¬ X.. TEMP ¬ 0. OP /*OP = 1 for Add.0] ¬ MX + MY. BUSY ® 1} else T0] T1: [if MX = 0 then Z ¬ Y. SA ¬ 0.0].0].. SA ¬ 0. Y ¬ input. SY. SA ¬ 0. C ¬ 0... EZ . EZ[7. SZ ¬ SX}      else {TEMP (23. MX. SA ¬ 0. SZ ¬ SY}] T7: [if TEMP[23] = 1 then {SHR TEMP. TED[7. SZ ¬ SX} else if (MX > MY) then {TEMP(23.166 Computer Organization and Architecture Besides these.20. These registers are signified by using first letter T in their names.

Arithmetic Logic Unit–II

167

Block diagram for a floating point adder/subtractor is developed using the algorithm given in Figure 5.20.

FIGURE 5.21

Floating point add/subtract using block diagram.

The major blocks of the hardware are numbered in Figure 5.21. Their functions are explained in what follows: Block 1: A subtractor which computes TED = (EX – EY). From this computation we can find the larger exponent. The significand of the operand with the smaller exponent has to be shifted right a number of bits equal to the magnitude of the difference TED.

168

Computer Organization and Architecture

Block 2: The significand to be shifted right is picked in box 2. If TED > 0, MY is shifted else MX is shifted. Block 3: This block shifts right the signficand of the operand with the smaller exponent by TED bits and adds TED to its exponent. The functions of box 2 and 3 are performed in the algorithm of Figure 5.20 in step T4. Block 4: The exponents have been made equal and thus add/subtract can proceed. The magnitudes of the two operands are compared and depending on whether the operation is add or subtract, signfiicands are added/subtracted. Steps T5 and T6 perform this in the algorithm (Figure 5.20). Block 5: At the end of add/subtract operation the result stored in TEMP[23..0] would be unnormalized. It may either have an overflow bit in TEMP[23] or TEMP [22, 21 .. etc.] may be 0s. In the former case, TEMP should be shifted right by 1 bit and 1 added to TEXP. In the latter case, the number of leading 0s in TEMP has to be found and TEMP shifted left till TEMP[22] is 1. Block 5 counts leading 0s. Block 6: Normalization of TEMP is performed in this block. The output of this block is the normalized significand of the result of add/subtract. Block 7: This block adjusts the value of TEXP. It adds 1 to TEXP if TEMP[23] = 1. It subtracts from TEXP the number of left shifts of TEXP to normalize it. Its output is the exponent of the result of add/subtract. The functions of blocks 5, 6 and 7 are performed by steps T7, T8, T9 and T10 given in the algorithm of Figure 5.20. We will not develop hardware algorithms for floating point multiplication and division as they are much simpler. They are left as exercises for the student.

SUMMARY
1. A digital system consists of a data processor, which performs a sequence of operations that transform given inputs to outputs, and a controller that sequences the operations performed by the data processor. 2. An Algorithmic State Machine (ASM) chart is useful in describing and designing digital systems. 3. ASM is similar to a flow chart. It uses three symbols. A rectangle represents a sequence of operations such as SHR, INCR, LOAD and STORE. A rhombus is used to indicate decisions taken and their outcome, and a rectangle with rounded edges depicts operations performed when decisions are taken. Figure 5.2 shows the symbols. Rectangles are labelled with a symbol Ti to indicate the ‘state’ in which the operations are performed. 4. ASM chart is a generalization of state transition diagram used in sequential systems design. State transition diagram derived from ASM chart is useful in designing control logic of digital systems.

Arithmetic Logic Unit–II

169

5. Examples 5.1, 5.2 and 5.3 describe how ASM charts are obtained for a comparator, serial adder and a 4-bit multiplexer. 6. ASM charts are useful to describe the algorithm used in a notation similar to a programming language. Such languages are widely used to design digital systems and are known as Hardware Description Language (HDL). Two of the widely used languages are known as VHDL (Very High Speed Integrated Circuit Hardware Description Language) and Verilog. They are quite complex. 7. An algorithm in a hardware description language has two major parts: a declaration part which specifies registers used (with their lengths), flip-flops, counters and concatenated registers (if any), and a procedure part which specifies actions to be carried out one after the other. Each action is specified by a labelled statement which would take place normally in one clock period. The labels are clock times. An action has a number of operations which are carried out simultaneously. An operation is typically moving the contents of a register to another, SHR, SHL, Increment a counter, etc. Another important operation is carrying out an action (if condition then action 1 else action 2) which performs one of the two alternative actions. 8. The steps in design consist of: (i) Obtaining an ASM chart using requirement specifications. (ii) Expressing it using a HDL. (iii) Designing a controller whose inputs are conditions appearing in if then else statements and outputs are the clock times which are labels T0, T1, etc., of HDL statements. (iv) Controller is synthesized using state-transition diagram. (v) The timing signals obtained from the controller drive the data processor which consists of registers, counters, etc., specified in the declarations of HDL and used in each HDL statement. 9. We have illustrated the above procedure by designing three sequential arithmetic and logic limits, namely a comparator, an adder and a multiplexer. 10. Floating point add/subtract unit is a challenging system to design. In this chapter we have developed a procedure in a hardware description language for such a unit using the algorithm described in Chapter 4. We also give a hardware block diagram which is evolved using the HDL description of a floating point adder/subtractor.

EXERCISES
1. Obtain an ASM chart for a serial adder/subtract unit for 8-bit integers. Assume 2’s complement for subtract. 2. Obtain an HDL for Exercise 1.

170

Computer Organization and Architecture

3. Obtain a logic circuit for adder/subtractor using Exercises 1 and 2. 4. Obtain an ASM chart for an integer divider for 4-bit operands. Use registers used for multiplier. 5. Obtain an HDL for Exercise 4. 6. Obtain a logic circuit for a divider using results of Exercises 4 and 5. 7. Obtain an ASM chart for a floating point multiplier. Use 32-bit representation of floating point numbers used in the text for add/subtract. 8. Obtain an HDL corresponding to ASM chart of Exercise 7. 9. Obtain logic circuit for a floating point multiplier. 10. Obtain an ASM chart for a floating point divider with 32-bit floating point numbers using representation used in the text. 11. Obtain an HDL corresponding to ASM of Exercise 10. 12. Obtain logic circuit for a floating point divider. 13. Using HDL for floating point adder/subtractor given in Figure 5.20 of the text develop a logic circuit for the same. 14. Modify the HDL description for floating point adder/subtractor given in Figure 5.20 of the text if IEEE 754 representation for 32-bit floating point numbers is used. 15. Obtain a logic diagram for a floating point adder/subtractor using hardware description language algorithm of Figure 5.20.

BASIC COMPUTER ORGANIZATION

6

LEARNING OBJECTIVES
In this chapter we will learn:

â â â

instructions are sequentially fetched and executed.

How stored program computers are organized. How machine language programs are stored in main memory, and how Using a series of hypothetical computers, how an instruction set
architecture is evolved based on application requirements.

6.1

INTRODUCTION

Real life, that is, commercially available computers are quite complex. In order to introduce the fundamentals of computer organization, it is traditional to start with a hypothetical small computer first and then systematically expand it to study the features of a real-life computer. Following this philosophy of learning, in this chapter we will introduce a hypothetical computer and call it SMAC+ (Small Computer). This introduction provides a programmer’s view of a computer system at the level of the machine language. Small ‘machine language’ programs are written for SMAC+ using a small set of instructions, S1. The difficulties encountered in programming this computer to solve more interesting problems will then become evident. In order to facilitate programming, we will expand the instruction set from S1 to S2 and
171

172

Computer Organization and Architecture

then to S3. We will also add extra architectural features to SMAC+ organization in two steps and rename it as SMAC++. Our objective is to make SMAC++ simple enough so that the students can simulate it using C++ or JAVA.

6.2

MEMORY ORGANIZATION

OF

SMAC+

We will assume that the memory of SMAC+ consists of a number of addressable storage ‘boxes’, also called memory locations. Each memory location stores 32 bits (8 hexadecimal digits) which we call a word. To start with we will assume that SMAC+ memory has 1M words addressed from 00000 to FFFFF in Hex. Later we will increase the size of the memory. 1M words = 4MB as one word is 4 bytes. The memory unit has an assembly of binary cells and two registers named MAR and MDR. The Memory Address Register (MAR) holds the address of the word in the memory to be accessed. The data read from the memory or the one to be written in it is held in the Memory Data Register (MDR) which is also known as Memory Buffer Register (MBR). As the memory has 1M addressable locations, MAR needs to be only 20 bits long (220 = 1024 ´ 1024). As each location in memory stores a 32-bit word, MBR will be 32 bits long. Figure 6.1 depicts the memory of SMAC+. The operation to be performed, namely, reading from or writing into memory is initiated by a Read/Write signal which is sent by the Central Processing Unit (CPU). If a memory location is to be read, CPU places the address of the location in MAR and issues a Read control signal. The memory circuit retrieves the data from the specified address and
32 bits Word

Address MEMORY

8 Hex MAR

8 Hex MDR

Read/Write signal

CPU

Task completion signal

FIGURE 6.1

Block diagram of the memory.

Basic Computer Organization

173

places it in MDR. A signal then reaches CPU informing it that the command to read has been carried out. If a data is to be written into memory then the CPU places it in MDR and places the address where the data is to be written in MAR. It then sends a write signal to the memory circuit. The memory circuit writes the contents of MDR in the memory location specified by MAR. After this job is completed, the memory sends a completion control signal to the CPU.

6.3

INSTRUCTION SMAC+

AND

DATA REPRESENTATION

OF

An instruction for the hypothetical computer SMAC+ defines an operation that is native to the machine. An instruction is said to be native to the machine if it is implemented in the hardware by its designer. Each instruction will contain the following information: 1. The operation to be performed is denoted by means of a code (Op-code for short). 2. Address in memory where the operand or operands will be found. The operation will be performed on these operands. 3. Address in memory where the result of the operation will be stored. The operand(s) may be stored either in the main memory or in registers and their addresses will be used for pointing to them. Instructions may have zero, one, two or three addresses of operands and zero or one address for the result (if any). Accessing the main memory, which is a random access memory, takes considerably more time compared to the data transfer rates in the CPU. Thus, the trend has been to build into the CPU local memory capacity in the form of General Purpose Registers. These are registers built with high speed flip-flops. Storing and retrieving data from these registers is thus fast. An instruction should be able to refer to these registers by using register addresses. We assume that SMAC + has 16 registers. For the sake of simplicity, we assume that one 32-bit word of SMAC+ stores one instruction and all instructions of SMAC+ are of the same length. (This need not be the case in real machines when optimization is needed). The 32 bits of an instruction are formatted and coded to specify the operation performed by that instruction and point to (i.e., provide the address of) the operands needed for that operation. This format is known as an instruction format needed for that operation. There are three different instruction formats in SMAC+ called R-format, M-format and J-format, and they are shown in Figure 6.2. Note that four bits are used to address one of the 16 registers. In these instruction formats, we have assumed the Op-code field to be 8 bits long. With 8 bits we can have 28 = 256 different Op-codes or instructions which are more than sufficient for most purposes.

174

Computer Organization and Architecture
Op-code 8 R1 4 R2 4 R-Format Op-code 8 R1 4 Address 20 bits M-Format Op-code 8 Address 24 bits J-Format Unused 16 bits

FIGURE 6.2

Instruction formats of SMAC+.

The R-format of SMAC+ specifies two operand addresses both of which are register addresses. An instruction such as AND R1, R2 is interpreted as the contents of R2 is ANDed to that of R1. To execute such instructions, the CPU need not access the memory during the execution of the program. The M-format is useful to define instructions that will refer to one or two registers for the operands and store the result in memory. LOAD and STORE instructions could be of this type. In this format 8 bits are reserved for the Op-code, 4 bits for the register reference and the remaining 20 bits are reserved for memory reference. The 20-bit memory address of an M-type instruction can directly address any of the 220 locations in memory. Each location is 32 bits long. Thus, we can address 220 words, each 32 bits long (i.e., 1M words or 4 MB). Addressing each byte in a 32-bit word has several advantages particularly in processing character strings. Most modern computers use byte addressing. However, we will use word addressing for simplicity. A question arises, “if SMAC+ has a memory larger than 1M word (220), how to address the other memory locations?” We will answer this later. The J-format is useful to define branch type instructions that are essential in programming. A simple jump instruction has 8-bit Op-code and a 24-bit memory address. According to this format, the jump instruction can branch control to any of the instructions in the addressing range of 224 words. However, as MAR is only 20 bits long we can use only the last 20 bits of this field. Data stored in a word of SMAC+ may be any of the following: 1. A positive or negative integer. The number is stored in binary form. Negative numbers are represented in two’s complement form. The range of integers is thus –231 to (+231 – 1). 2. A character string. Eight bits are used to represent a character in the standard ASCII format and thus 4 characters may be stored in each word or a register of SMAC+.

Basic Computer Organization

175

When a word is read from the memory of SMAC+, it may represent an integer, a character string or an instruction. It is the responsibility of the programmer to keep track of consistent interpretation for the 32-bit data stored in a memory word. The CPU of SMAC+ consists of a set of registers, arithmetic and control circuits which together interpret and execute instructions (see Figure 6.3). It has a set of registers accessible to a programmer and another set of registers that are used in interpreting and executing instructions and are not accessible to programmers. There are seven registers in the CPU of SMAC, that are not accessible to programmers but are essential for program execution. These are the PC, IR, MAR, MBR, FLAG, MQ and MR registers. The MQ and MR registers are used during multiplication and division operations as explained in Chapter 4. The bits of the FLAG register are expanded and named C, V, Z, N, P, M and T (from left to right). The interpretation for these bits are as given below and they are automatically set by the CPU hardware at the end of each instruction execution depending on the status of the instruction execution.
Registers Accessible R0 32 bits Registers Not Accessible IR 32 bits

R1 . . . R15

32 bits

MBR

32 bits

PC 32 bits MAR

20 bits

20 bits

CVZNPMT

Status–8 bits (Flag register)

MQ

32 bits

MR

32 bits

FIGURE 6.3

SMAC+ registers.

Bit 1, C: (carry bit)  Set to 1 if a carry is generated during an add operation or a borrow is generated during subtract operation. It is otherwise cleared to 0. Bit 2, V: (overflow bit)  Set to 1 if an add or subtract operation produces a result that exceeds the two’s complement range of numbers. Else it is 0. Bit 3, Z: (zero bit)  Set to 1 if the result of an operation is zero. Otherwise it is 0. Bit 4, N: (negative bit)  Set to 1 when the result of an arithmetic operation is negative.

176

Computer Organization and Architecture

Bit 5, P: (positive bit)  Set to 1 when the result of an arithmetic operation is positive. Else it is set to 0. Bit 6, M: (compare)  Set to 1 if the ‘bit comparison operation’ succeeds, else it is set to 0. Bit 7, T: (Trace bit)  When this bit is set to 1 by a special instruction (not discussed in this chapter) program execution stops and special debug software is executed. The hardware organization of SMAC+ at the level of registers is summarized as follows: Registers of SMAC+ 1. All registers of SMAC+ are 32 bits long. 2. Registers accessible directly to programmers of SMAC+ are named: R0, R1, R2, …, R15. 3. Registers, which are not accessible to programmers directly but are in the hardware organization, include MAR, MBR, PC, IR, MR, MQ and the FLAG register. 4. The Flag register and its bits are explained above. Before a program is executed, the sequence of instructions of that program is stored in memory. The instructions are then executed one after another. Execution of an instruction has two phases: (a) Instruction fetch cycle (I-cycle). (b) Instruction execution cycle (E-cycle). Following is the sequence of operations carried out during I-cycle: (I-1) The address of the instruction to be fetched is found from the PC. The instruction is fetched from memory and placed in IR. (I-2) The PC is incremented by 1 to get the next instruction during the succeeding I-cycle. (I-3) The Op-code is decoded to see which operation is to be performed. Execution of each operation needs a set of control signals which are initiated after this decoding. The execution cycle (E-cycle) of an instruction involves the steps given below: (E-1) The Operands are fetched either from registers or from the memory. (E-2) Using these operands, the operation is performed. The execution of an operation requires the processing unit, which is also known as Arithmetic and Logic Unit (ALU), and a set of control signals. These control signals differ from one operation to another. There are some instructions which do not need fetching of an operand. At the end of an E-cycle, the execution of the next instruction, that is, the I-cycle of the next instruction begins. Thus, the I and E cycles alternate until the machine comes to a halt.

Such a mechanism is called an input unit. Another approach is to reduce the instructions available in a computer and make them simple and regular. Which approach should be chosen is a designer’s choice.4 INPUT/OUTPUT FOR SMAC+ Every computer needs a mechanism to receive data from sources outside it for processing. This approach has come to be known as the RISC (Reduced Instruction Set Computer) approach. back-up storage systems such as magnetic tapes and discs are also used as intermediaries in I/O operations. printer. While writing these programs we might feel the need for more instructions. keyboard. A variety of I/O devices such as video display. I/O operations have to be repeated if more than one word is to be read or written. Besides these. This is performed by an output unit. it will also be assumed that only one word is written or read at a time. programming them effectively requires a thorough knowledge of their internal structure. This approach is known as CISC (Complex Instruction Set Computers) approach. One of our aims in this process of expansion of the instruction set is to bring out the trade-off in design. A computing system would normally consist of a number of I/O devices besides the CPU and memory. 6. The devices used for input and output units are referred to in the literature as I/O devices. The instruction set of a CISC type microcomputer contains powerful instructions which enable programmers to write efficient programs. Using the set S1. . A certain operation could be carried out either by hardware as a new instruction or with a better software support. Pentium and Motorola 68040. One trend in the design of computers is to introduce a large number of complex instructions. SMAC+ will be assumed to have a keyboard as the input device and a display as the output device. This assumption is not realistic but has been made to simplify the simulator program for SMAC+. The compilers for CISC machines are also very complex. Examples of CISC computers are Intel 80486. First. mouse. For the present we will give a rudimentary I/O capability to SMAC+.Basic Computer Organization 177 6. We will then expand the instruction set S1 to S2. After the data is processed the results are to be communicated outside.5 INSTRUCTION SET OF SMAC+ The instruction set of a computer determines its power. This allows one to gain efficiency by pipelining the execution of simple instructions. we introduce a small set of instructions for SMAC+ and call this set S1. we write some programs. Further. As there are several hundred instructions in CISC computers. The design of SMAC+ will follow this approach. and speech synthesizer are available.

R2 Example: ADD R1. 6.2 Instruction Formats of SMAC+ The following are the instruction formats of SMAC+: 1. One may also have a form CMP R3. R2. R-Type Op-code. 5 the second operand is a constant and is available as part of the instruction. We have chosen to use a constant instead in R4 field as in many cases it simplifies programming.5. R1. mem-address [mem-address=20 bits-long] Example: LOAD R5.5. R. M-Type Op-code. R4 in which case the contents of R3 and R4 are compared.1 Instruction Set S1 of SMAC+ TABLE 6. R2  [in Hex format: 01 34 00 00] In ADD the two operands are available in R1.178 Computer Organization and Architecture 6. 2. The sum is stored in R1.1 lists the instruction set S1 of SMAC+. the maximum value of the constant is 15. In the case of CMP R3. In this case there are two restrictions: (i) The constant cannot be changed during program execution (ii) As only 4 bits are available in R field. X [in Hex: 02 50 FF FF where 0FFFF is the address of the variable X in memory] .1 Instruction Set S1 of SMAC+ Op-code HALT ADD LOAD STORE JUMP JMIN INPUT OUTPUT INC CMP JLT AND OR NOT SUB MULT IDIV JOFL Op-code in Hex 00 01 02 03 04 05 06 07 10 11 12 20 21 22 31 32 33 34 Type R R M M J J M M R R J R R R R R R J Table 6.

Set LT. R6.Basic Computer Organization 179 3. 2. R4. R1 R1. Load stage: When the assembled and stored program is loaded into the main memory and made ready for execution. 0. R5.2 Instruction Format of Semantics of Instruction Set S1 of SMAC+ Instruction Hex form OP 00 01 02 03 04 05 06 07 10 11 12 20 21 22 31 32 33 34 REG 00 34 40 50 00 00 00 00 10 12 00 34 56 40 43 78 67 00 ADDR 0000 0000 FFFF EEEE DDDE CCCF BBBE AAAA 0000 0000 9999 0000 0000 0000 0000 0000 0000 BBBB Symbolic form OP HALT ADD LOAD STORE JUMP JMI INP OUT INC CMP JLT AND OR NOT SUB MULT IDIV JOFL Reg1 R3. 0. 0. Assembly or Translation stage: The symbolic program is automatically translated into an equivalent machine language form by the system software. 0. TABLE 6. 0. Execution stage: When the program is ‘run’ or executed by the hardware one instruction at a time. R4 R6 P ADDR 0000 X Y Z D B A Stop execution C(R3) ¬ C(R3) + C(R4) C(R4) ¬ C(X) C(Y) ¬ C(R5) PC ¬ Z If (N in status Register = 1) PC ¬ D C(B) ¬ INPUT OUTPUT ¬ C(A) C(R1) ¬ C(R1) + 1 Compare C(R1) with B. Programming stage: The program is developed by the programmer in symbolic form using a programming language. 3. EQ. 0. 4. R4 R4. 0. 0. 0. J-Type Op-code. R7. GT flag depending on result of comparison If LT flag set PC ¬ P C(R3) ¬ C(R3) Ù C(R4) (Ù bitwise AND operator) C(R5) ¬ C(R5) Ú C(C6) (Ú bitwise OR operator) Semantics R4 R4 (bitwise complement) C(R4) ¬ C(R4) – C(R3) C(R7) ¬ C(R7) * C(R8) C(R6) ¬ Integer Quotient of C(R6)/C(R7) If OVFL bit set PC ¬ C l Now we will present four stages in the development and execution of programs in SMAC+. in particular we identify the following four stages: 1. R3. 0. jump-address [jump-address=24 bits-long] Example: JLT BEGIN [in Hex: 12 00 16 40 where 001640 is the jump address BEGIN in memory] In Table 6.2 we list the format of all the instructions of SMAC+ and what they do (semantics). B 0. 0. R5. C R3 R8 R7 Reg2 R4. .

Y in FFFFE and Z in FFFFD. R2 and R3). This stage will be referred to as programming stage.1 we note three data variables (X. When loading is completed.180 Computer Organization and Architecture It is intuitively clear that these four stages are executed sequentially in the order specified here. During this stage. the assembler software determines where to store the program and what memory addresses to assign to operands. Let us suppose we use the mapping shown in which the data area starts from the highest address in memory (FFFFF) and gets filled backwards. starting from a given start address in memory. instruction fetch and execution are repeated cyclically. At load time the loader decides where is free space available in memory and chooses a suitable memory region for loading.Z) and reference to three registers (R1. One table used by the assembler is the Op-code table.6 ASSEMBLING THE PROGRAM LANGUAGE FORMAT INTO MACHINE In Program 6. The execution stage of the translated machine language program is known as run time. the program is ready for execution. Thereafter. Thus X will be stored in FFFFF. This phase is referred to as assembly stage. 6. The address of the first instruction to be executed is placed into the PC-register by the loader software and then the execution is started.1).Y. During this time the instructions are executed one at a time. A system software known as loader performs the job of loading the given machine language program. Loading becomes simple if we assume that every instruction in the machine language format occupies exactly one word (or 4 bytes) in memory. Our first task is to map these variables and registers to memory locations and registers in SMAC+. As an example case we have chosen the problem of adding 5 pair of numbers using SMAC+ assembly language (Program 6. Following this. . During this stage an appropriate algorithm is chosen and the programmer takes several other decisions about naming the data items and naming the instructions to branch to. Besides this. R1 will be used to store X. assembler software automatically translates the symbolic program into its equivalent machine language program. It also constructs and uses several tables and takes other important decisions. mnemonics are used to represent operation codes and symbols are used for addressing data and branch addresses. we have a machine language program that is ready to be stored in the main memory. When the assembly process is completed. R2 to store Y and R3 as a counter to control the number of iterations. See Table 6. The process of storing the program in main memory ready for execution is called loading and this stage is called load time.1 where each Op-code is uniquely associated with an 8-bit code which will be decoded and interpreted by the CPU hardware. that is towards decreasing memory addresses. In assembly language.

X R2. We have conveniently assumed that every instruction will fit exactly into one memory word or 4 bytes. The assembler software can automatically build a symbol table by scanning the assembly language program statement by statement. This simulated machine will behave as SMAC+. BEGIN. . Program 6. The physical end of the machine language program will be coded with a data FF for Op-code field. Let us suppose that the first instruction in memory will be located in 10000 in Hex. the symbolic address in the program. This decision automatically makes the address of the instruction INC R3 to be 10001 and the address of HALT instruction to be 1000B.1. Thus SMAC+ machine language program can be fed as data to this SMAC+ simulator and executed. They are absent in the stored machine code) 6. A table showing all the symbols in a program and their associated memory addresses is called symbol table.Basic Computer Organization 181 Our next task is to decide the starting memory address from where the translated machine language program will be located in the memory in successive words. FFFFE and FFFFD. gets memory locations associated with it as 10001. With these assignments we are now ready to translate the symbolic program in Program 6. namely.R2 R1. 12 memory words and the data area occupies memory locations FFFFF.7 SIMULATION OF SMAC+ In this section an algorithm to simulate SMAC+ will be developed. Based on the program’s starting address.1:  Machine language to add 5 pairs of numbers Machine Code 02 10 06 06 02 02 01 03 07 11 12 00 30 30 0F 0F 1F 02 12 1F 0F 30 01 00 00 00 00 00 FF FF FF FE FF FF FF FF 00 00 FF FD FF FD 00 05 00 01 00 00 Stored at 10000 10001 10002 10003 10004 10005 10006 10007 10008 10009 1000A 1000B Explanation R3 ¬ 0 R3 ¬ R3 + 1 C(FFFF) ¬ X C(FFFE) ¬ Y R1 ¬ X R2 ¬ Y R1 ¬ R1. Remember that in SMAC+ we have assumed that by its design the register R0 will always contain the number zero. R2 Z ¬ R1 Output Z Is R3 ³ 5? If No go to BEGIN Halt execution R3.1.0 R3 X Y R1. Then the next instruction will be located in 10001 and the next in 10002 and so on. The machine language of the assembly language program is given as Program 6. (The spaces in the machine code are used for readability.Z Z R3.5 BEGIN Symbolic Code BEGIN LOAD INC INPUT INPUT LOAD LOAD ADD STORE OUTPUT CMP JLT HALT We observe that the program occupies memory locations 10000 through 1000B.Y R1.1 to its equivalent machine language format by using the Op-code table in Table 6. This algorithm can be easily converted by a student to a C or Java program and executed on a real computer such as a PC.

. Instruction overflow ¬ false repeat Read. J or M Case R-type: M-type J-type end of case Opr1 Opr2 Opr1 Addr Addr ¬ ¬ ¬ ¬ ¬ IR[23.. instruction Memory [PC] ¬ instruction PC ¬ PC + 1 if (PC > FFFFF) then memory overflow ¬ true until op-code of instruction = FF or memory overflow Execution phase PC ¬ starting address of program Halt ¬ false Repeat IR ¬ Memory [PC] PC ¬ PC + 1 Op-code ¬ IR[31...0] IR [23.16] IR[23.0] Case Op-code R type 0: halt ¬ true 1: R(opr1) ¬ R(opr1) + R(opr2) 10: R(opr1) ¬ R(opr1) + 1 11: if (R(opr1) > x then set of GT flag     else if R(opr1) = x then set EQ flag         else set LT flag endif endif 20: R(opr1) ¬ R(opr1) Ù R(opr2) /* Ù is bitwise AND */ 21: R(opr1) ¬ R(opr1) Ú R(opr2) /* Ú is bitwise OR */ 22: R(opr1) ¬ NOT R(opr1) /* NOT is bitwise NOT*/ 31: R(opr1) ¬ R(opr1) – R(opr2) 32: R(opr1) ¬ R(opr1) *R(opr2) / Multiplication */ 33: R(opr1) ¬ R(opr1) / R(opr2) /* Integer division */ end case /* R type */ Case Op-code M type 2 : R(opr1) ¬ Memory (addr) /* Load */ 3 : Memory (addr) ¬ R(opr1) /* Store */ 6 : Memory (addr) ¬ Read input (Contd.20] IR [19.182 Computer Organization and Architecture Algorithm 6.) ..20] IR[19.1: Program loading phase SMAC+ Simulator PC ¬ starting address of program in main memory.24] /* bits 31 to 24 of IR*/ Find op-code type as R..

The student should convert this algorithm into a C or Java program.4 Flow chart depicting the sequential execution of instructions. Execute Instruction Is it Control Inst. The PC (also known as IP for instruction pointer) will be initialized to 100 00 as the first instruction of the program is located at that memory address.4. ? Is it Halt ? No Yes No Stop FIGURE 6. As stated Initialize value of PC Modify PC Based on Result of Execution IR ¬ MEM (PC) PC ¬ PC + 1 Decode Instr. For this purpose we will use Algorithm 6. . 6.8 PROGRAM EXECUTION AND TRACING In what follows we will examine in detail how the machine language program (Program 6.Basic Computer Organization 183 7 : Write Memory (addr) end case /* M type */ Case Op-code J type 4 : PC ¬ addr 5 : if (N) then PC ¬ addr 12 : if (LTflag) then PC ¬ addr 34 : if (OFLO) then PC ¬ addr end case /* J-type */ until Halt Algorithm 6.1) would be executed by SMAC+ instruction by instruction in a sequence.1 only gives a general idea of how SMAC+ can be simulated. We have left out a lot of details.1 and the flow chart shown in Figure 6.

During the operation execution subcycle. will point to the next instruction in sequence. In some arithmetic instructions there may be a fourth sub-cycle of storing the result of execution in either a register or memory. Decode Operand Fetch Execute Operation Store Result Execute cycle Fetch cycle FIGURE 6. The cycle keeps repeating.3 one notices that there are two paths and we have followed the path in which the contents of PC is incremented by 1. Thus the PC. after this increment. operand-fetch and operation-execution.5 we show these. the three sub-cycles of the execution cycle are followed one after another. the contents of R0.3. Fetch Instrn. Based on the coding of the Op-code.5 Instruction fetch and execution cycles.3 Program Trace of Program 6.1 Instruction-Address 10000 10001 10002 Effect of executing that instruction C(R3) ¬ 0 C(R3) ¬ 1 C(FFFFF) ¬ the data input for X (Contd. The program trace or instruction by instruction execution of the above program is shown in Table 6. that is zero.) . In the fetch cycle. and the fourth digit refers to source register. is moved into the register R3 which is denoted as: C(R3) ¬ 0.3.1. the SMAC+ hardware knows that this is an R-type instruction. Optional Instrn. Then. The operand-fetch sub-cycle in this case is a null cycle because both operands are in registers and there is no memory reference. and it must be decoded in the format that the first two hex-digits for Op-code.184 Computer Organization and Architecture in Section 6. After the fetch cycle of the first instruction in Program 6. the contents of the IR will be 02 30 00 00. In Figure 6. and this Op-code involves data move at the execution time. We will denote the effect of this fetch cycle as follows: C(IR) ¬ 02 30 00 00 Immediately after the fetch cycle. From the flowchart of Figure 6. the third digit refers to the destination register. The next instruction is fetched and executed. TABLE 6. and the remaining four hex-digits are to be ignored. the memory is read and the contents of memory at the address contained in the PC register are loaded into the instruction register or IR. every instruction execution has two phases or cycles: instruction fetch cycle and instruction execution cycle. the execution cycle consists of three subcycles: instruction-decode.

we observe the need for supporting the following operations more efficiently. For example.) Instruction-Address 10003 10004 10005 10006 10007 10008 10009 1000A 1000B Effect of executing that instruction C(FFFFE) ¬ the data input for Y C(R1) ¬ C(FFFFF) C(R2) ¬ C(FFFFE) C(R1) ¬ C(R1) + C(R2) C(FFFFD) ¬ C(R1) Output ¬ C(FFFFD) C(R3) ¬ compared to 5 LT flag set to False LT flag is False. At the same time. without having to read from the input device. It is. These set . This counter is initialized to some value. hence C(PC) ¬ 10001 [when LT is True C(PC) ¬ 1000B] HALT 6. multiplication can be carried out by successive additions. their applications in many problems have grown enormously.5 we decided to have a set of 18 instructions for SMAC+. because input operation is slow.9 EXPANDING THE INSTRUCTION SET In Section 6. however. We can extend the instruction set of SMAC+ to include three new instructions which are explained below: 1. From the sample program.3 Program Trace of Program 6. 3. In principle it is possible to program a computer even with lesser number of instructions. and a conditional jump executed. Separate instruction for multiplication is not really necessary.Basic Computer Organization 185 TABLE 6. To be able to move the contents of one register to another. 2. With the growth of powerful and affordable computers. computing speed and hardware cost. it should be pointed out that implementation of each instruction results in extra hardware and consequently higher complexity and cost. easy programming of a variety of applications is becoming an important factor in the design of computer systems. Instruction sets for computers have evolved with the primary goal of simplifying programming and the secondary goal of reducing computing time.1 (Contd. The designer thus has to trade-off between programming convenience. Iterative execution of a program segment requires a counter to be used. convenient to have a multiply instruction as it will simplify programming and speed up computation. compared against a limit value. To be able to load a register with a constant other than zero. decremented after every iteration. As a result of this.

0. + 4 [Op-code for LDIMM is 40. this type of instruction will ease writing loops in programs. We call such an instruction BCT or branch and count. As looping in programs is frequently used. MULT and IDIV to contain three register addresses instead of two register addresses. We give below an example with the SUB instruction in this 3-address format. The reader is urged to note the advantage of the three-address formatting in arithmetic operations by studying Program 6. As we have more registers in the CPU. SUB. R-type] 41 24 00 00 C(R2) ¬ C(R4) BCT R4. The symbolic form of these three instructions and their semantics is given below: (i) LDIMM: Load immediate instruction Symbolic form: LDIMM R2.2. we will modify the instruction formats of the ADD. R3 31 5 4 3 C(R5) ¬ C(R4) – C(R3) and the contents of R3 and R4 remain unchanged. NEXT [Op-code for BCT is 42.186 Computer Organization and Architecture of actions (except initializing) can be packaged into a single machine instruction. we will modify the semantics of the INPUT and OUTPUT instructions of SMAC+ to refer to a register as opposed to memory locations. R4 [Op-code for MOVE is 41. . Therefore. This will pressure the two operands and not replace one of them with the result of an arithmetic operation. J-type] 42 40 478B (assuming the address of NEXT is 0478B) C(R4) ¬ C(R4) – 1 If [C(R4) = 0] then PC ¬ NEXT Modifications to I/O and arithmetic instructions of SMAC+: When an I/O instruction is executed normally the data input is transferred from a peripheral register into the computer via a CPU register. M-type] Machine language form: 40 20 00 04 Semantics: C(R2) ¬ + 4 (ii) MOVE instruction Symbolic form: Machine language form: Semantics: (iii) BCT instruction Symbolic form: Machine language form: Semantics: MOVE R2. Symbolic form: Machine language form: Semantics: SUB R5. R4.

Read first number as big LOOP INP R3 . Let us consider the users of a computer system. When we say the architecture of a building is excellent. A system programmer has to keep track of several low level details such as the memory allocation and register allocation. x R4: holds the largest number.2: Read five input numbers one after another and find the largest among them. The term ‘application programmers’ is used to refer to those who program applications using higher level languages. LOAD 4 INTO R1 INP R4 . Not only the number of instructions is reduced but also execution will be faster. In Program 6. its external appearance as well as its functionality are taken into account. if x < big go to NO-CHANGE MOVE R4. R4 . Viewing a computer system at this level of detail has come to be known as instruction set architecture. IF (C(R1) > 0) go to LOOP STORE R4.Basic Computer Organization 187 Program 6. Those who program a computer at the level of the assembly language are called assembly language programmers or system programmers. big LARGE: is a memory location Symbolic Program LDIMM R1. This type of instructions is known as pseudo instructions or assembler instructions. Some of the factors to be taken into account are as follows: .2 shows that writing an iterative loop in a program is facilitated by the BCT instruction. C(R5) ¬ x – big JMINUS NO_CHANGE . store the result in a location named LARGE. R3. There is a lot more to instruction set architecture than the set of instructions. He/she should also remember the mnemonics. Read x SUB R5. instruction types and their semantics. 4 .2 we have introduced a new type of instruction DW. There is a wide range of other problems where these instructions will reduce program size and speed up execution. existing technology and a good knowledge of the underlying architecture. C(R1) ¬ C(R1) – 1 . LARGE . Output big HALT LARGE DW 1 Program 6. Developing a good set of instructions for a computer is based on the designer’s experience. C(LARGE) ¬ C(R4) = big OUT R4 . R3 . What is meant by the term ‘architecture’? The term is well understood in the case of buildings. x > = big & so big ¬ x NO_CHANGE BCT R1. The purpose of the ‘LARGE DW 1’ (define word) instruction is to inform the assembler to introduce a symbol by the name LARGE into the symbol table and assign a memory address for that symbol so that data can be stored is that address or accessed by the programmer. There are several levels of users. Storage allocation R1: used as a counter for loop R3: contains the data Read. This is not really an Op-code like ADD or MOVE. LOOP .

3. This read operation is expressed using a for loop in Algorithm 6. The control and data paths used in the execution of instructions for moving the bits from one place to another (called buses). 7. because the for loop starts counting from 1 and increments the loop variable ‘i’ right at the beginning of the loop we initialize SA to (300–1)} for i = 1 to 20 do begin Read D. Compilers. The organization of I/O devices and their interface to memory and central processing unit. (D is the data item) Memory (SA + i) ¬ D. before designing a new computer system (whether hardware or software). we must develop a skill to analyze.2: Reading a vector SA ¬ 299 {SA: Start address. The matrix is a twodimensional vector in which each row (or each column) can be viewed as a vector.) in the form of a ‘coherent’ system.2. We can have a vector of integers. a thorough understanding of the instruction set architecture of a given computer system is like analysis of a system. real numbers.188 1. end. Strings of bytes or characters occur commonly in word processing applications and in high level language programming. Operating Systems. etc. Let us suppose that we want to read a vector of 20 integers and store it in memory with the vector base address as 300. 9. The basic operation we need with vectors is reading a vector of data items from an input unit and storing it in memory starting from a specified address. 6.10 VECTOR OPERATIONS AND INDEXING An ordered sequence of data items is known as a vector. . 6. 8. Computer Organization and Architecture What are the different formats of the instructions? How are the various registers classified and addressed? How is the status register affected at the end of an instruction? Organization of the memory system which is usually in the form of a hierarchy. The organization of a collection of system software programs (Loaders. How is the memory addressed? Speed of memory. Assemblers. 4. Algorithm 6. In essence. This starting address of the vector is called the base address of the vector. 5. Thus. A sequence of bytes is also known as a string. 2. or ASCII coded bytes. Linkers. A good analysis is prerequisite to any design.

If SA is given a value of 299 and R5 is initialized to zero. but in indexed STORE we have two register references. we had only one register reference. This address modification is achieved through the use of an index register. one data at a time and store them as a vector in memory with the vector base address of 300. the address where D is stored gets modified. As i goes from 1 to 20 in the for loop. Initialize R5 to zero . 20 0 R5. R5. R5 in this case plays the role of an index register. Program 6. The last component will be stored in address 319 Storage allocation R1: Loop counter register R2: Data is read into this register R5: Index register Symbolic program LDIMM LDIMM INP INC STORE BCT HALT R1. the effective address will also get incremented in each iteration by 1. We need a mechanism in SMAC+ to realize such a variation of the address of an instruction in successive iterations of a loop. Load R1 with 20 . The instruction format of SMAC+ has to be redefined in order to accommodate indexed addressing. R1.3: To read 20 data items. We will do this later in this chapter and call the modified SMAC+ as SMAC++. R2 R5 R2. the effective address initially will be 299. Repeat for Loop LOOP In the above program the effective address for the STORE instruction is calculated as follows: Effective address = contents of the index register specified + the value of the address in the symbol field . SA    [indexed addressing] In the previous uses of the STORE instruction. increases from 300 to 319. Indexed address as two registers are used . 299 LOOP . Note that i is the for loop counter which gets incremented by 1 in successive iterations and it starts with an initial value of 1. Consider the following instruction with indexed addressing. R5. The data read can then be stored in successive locations in memory. STORE R2. The effective address for the store operation is obtained by adding the contents of R5 to the address denoted by the symbol SA in the symbol table.Basic Computer Organization 189 In every iteration of the for loop of the above algorithm. As R5 is incremented by 1 in every iteration. that is SA (Start Address) gets incremented by i. the addresses where D is stored.

Because the register field of an M-type instruction can refer to any of the 16 registers, we can use any of them (except R0) for indexing. Since the registers of SMAC+ can be used for multiple purposes, they are sometimes known as General Purpose Registers (GPRs). In certain other computers like 80486, separate registers are reserved only for the purpose of indexing.

6.11 STACKS

A stack is a very useful data structure in computer software design. A pile of plates (or trays) in a cafeteria operates as a stack. Washed plates are added to the top of the stack, and a plate for use is removed from the top of the stack. Thus, in a stack structure both addition (known as push) and removal (known as pop) take place at the same end, usually called the top of the stack. In software systems, for example, a stack could be a stack of bytes, a stack of words, or a stack of other structures (stack of stacks!). A stack when implemented will have a finite capacity. Consecutive PUSH operations carried out without any intervening POP have a chance of causing a stack overflow condition. Similarly, consecutive POP operations without a PUSH operation can possibly lead to popping of an empty stack, leading to a stack underflow condition.

A computer designer has a choice of providing a set of registers in the CPU as a stack or letting the programmer use part of the main memory as a simulated stack. Register sets used as a stack would allow high speed operations (POP and PUSH) because no memory access is required. It is, however, expensive. If GPRs are used as a stack, the stack size will be very small as GPRs are addressable and the number of GPRs in computers rarely exceeds 256. Using the main memory of the computer as a simulated stack yields a low-cost solution and the possibility for a large-sized stack. The size of the stack can be varied and may be as large as the memory. The method, however, will be slower as data storage and retrieval from main memory is much slower compared to register access. It would be desirable to combine both by storing a certain number of top elements in registers and the rest in memory. Such a combination makes the stack faster.

Stack Pointer (SP) is a register which holds the address in memory where the top element of the stack is stored. In certain computers like Pentium there is a separate register for SP, whereas in SMAC+ we have assumed Register 15 (R15) to be SP. Let us suppose that we make an arbitrary choice and designate the CPU register 15 (1111) as SP. This designation has certain advantages and some disadvantages. Because SP can be addressed like other registers, it can be incremented or decremented using the INC or DEC instruction. This is useful in simulating the stack in memory. If we assume that the stack is stored from its top to bottom in successively increasing memory addresses, then every PUSH instruction would increase the SP by a constant and the POP instruction would decrease the SP by the same constant. This constant would be 1 if the data elements of the stack are one word long and the memory is word-addressed. PUSH and POP instructions would increment or decrement the SP register as shown in Figure 6.6. The LOAD and STORE instructions in their indexed mode can be used to read or write the top element of the stack.
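A memory-simulated stack of this kind can be sketched in C. The stack boundaries follow the example used below in the text (04000 to 04FFF); the function names and the overflow/underflow messages are assumptions for illustration.

    #include <stdio.h>

    #define MEM_SIZE  0x5000
    #define STACK_BOT 0x4000      /* bottom of the stack                        */
    #define STACK_TOP 0x4FFF      /* highest address the stack may occupy       */

    int memory[MEM_SIZE];
    int SP = STACK_BOT - 1;       /* R15 plays the role of SP; stack is empty   */

    /* PUSH: increment SP by the constant 1 (one-word elements, word-addressed
       memory), then write the value at the address held in SP.                 */
    void push(int value) {
        if (SP >= STACK_TOP) { printf("stack overflow\n"); return; }
        SP = SP + 1;              /* INC R15                                    */
        memory[SP] = value;       /* store at the address contained in SP       */
    }

    /* POP: read the top element, then decrement SP by the same constant.       */
    int pop(void) {
        if (SP < STACK_BOT) { printf("stack underflow\n"); return 0; }
        int value = memory[SP];   /* load from the address contained in SP      */
        SP = SP - 1;              /* DEC R15                                    */
        return value;
    }

    int main(void) {
        push(10); push(20);
        int a = pop(), b = pop();
        printf("%d %d\n", a, b);  /* prints 20 10: last in, first out           */
        return 0;
    }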

On the other hand, we are restricted in the use of R15 and it cannot be used for any other purpose. Thus, we have one register less for general purpose use. Such trade-offs occur often in any system design and computer design is not an exception.

For convenience, let us assume that the bottom of the stack is at the memory address 04000. Let us also assume that the top of stack is at address 04FFF. Thus the stack size is FFF words (4K words). We also assume that the programmer is very careful so that stack underflow and stack overflow conditions do not occur (later we will relax these constraints in an exercise). A compare instruction can be used to check if SP has gone out of the stack boundary causing stack overflow or stack underflow conditions.

[FIGURE 6.6  Illustrating PUSH and POP instructions: before a PUSH the SP is 04220 and after it the SP is 04221; before a POP the SP is 04221 and after it the SP is 04220; the stack holds entries such as Sita, Rama, Guha and Lax, with the bottom of the stack at 04000.]

When the stack is not implemented in the hardware, we have to realize it by software with other instructions. Then PUSH R1 (push the contents of the register R1 into the stack) will be equivalent to the following two instruction sequence (R15 is assumed to store the stack pointer):

    INC R15
    C(memory location whose address is in R15) ← C(R1)

Similarly, POP R1 (pop the top of stack into the register R1) is equivalent to:

    C(R1) ← C(memory location whose address is in R15)
    DEC R15

Note that SP is decremented after reading the top element of the stack from memory. In order to realize the effect of the data transfers indicated above to implement a software stack, let us introduce a new mode of addressing called register-indirect addressing and two new instructions called store-indirect and load-indirect:

1. LDIND (Load indirect): One of the two registers specified contains the address of the data. Load the data value from that address into the second register.
2. STIND (Store indirect): Similar to load in addressing. Replace the contents of the addressed memory location by the contents of the specified register.

Semantics of LDIND instruction
    Symbolic form:           LDIND R1, R15    [Op-code for LDIND 45, R-type]
    Machine language form:   45 1F 00 00
    Semantics:               C(R1) ← contents of the address stored in R15,
                             i.e. C(R1) ← C(C(R15))

Semantics of STIND instruction
    Symbolic form:           STIND R1, R15    [Op-code for STIND 46, R-type]
    Machine language form:   46 1F 00 00
    Semantics:               Let the contents of R15 be x.
                             C(C(R15)) ← C(R1), i.e. C(x) ← C(R1)

Then PUSH is equivalent to the two instructions:

    INC   R15
    STIND R1, R15        C(C(R15)) ← C(R1)

And POP is equivalent to:

    LDIND R1, R15        C(R1) ← C(C(R15))
    DEC   R15

Instead of using the two instruction sequences, for the sake of readability of programs, we wish to be able to use PUSH and POP. Whenever the assembler software encounters PUSH, we want the assembler to substitute PUSH by the corresponding two instruction sequence, and so also for the POP. Let us assume (for now) that the assembler has this capability. Thus, we can use PUSH and POP in the symbolic programs we write. But the corresponding machine language program will have two equivalent machine instructions substituted for one symbolic instruction. In order to illustrate the concepts presented in this section, let us write Program 6.4 as given below.

Program 6.4: Read three data items each 4 bytes long, one at a time, in the sequence GUHA, RAMA, SITA and PUSH each item into the stack as the item is read. Then POP one item at a time repeatedly as output.

Storage allocation
    R1:  Holds the data read
    R2:  Holds the counter for repeating 3 times
    R15: Reserved for stack pointer

Symbolic program
        LDIMM  R2, 3
LOOP1   INP    R1
        PUSH   R1, R15       ; Push in top of memory stack C(R1)
        BCT    R2, LOOP1
        LDIMM  R2, 3
LOOP2   POP    R1, R15       ; Pop data from top of stack and load it in R1
        OUT    R1
        BCT    R2, LOOP2
        HALT

The reader is urged to observe the following three points by simulating the execution of Program 6.4:
1. In the machine language version, the two-instruction sequence indirect load (LDIND) followed by DEC is substituted for POP, and INC followed by indirect store (STIND) is substituted for PUSH.
2. The readability of the symbolic program.
3. What will be the sequence of output obtained? How is it related to the input sequence?

6.12 MODULAR ORGANIZATION AND DEVELOPING LARGE PROGRAMS

With the growth of computer application in all walks of life, some application programs have become so large that their sizes run into thousands or even millions of lines of code. Such large programs are usually developed by a team of programmers working together and sharing the total work. Each person develops a subsystem. All such subsystems are then combined into one large software. Modular organization is needed even within the task of one programmer to manage the complexity of the software. Each module is normally structured so that it carries out a well-defined quantum of work and is of manageable size. The organization of a program into modules while writing large programs (i.e., at the software level) is supported by the hardware design of a computer system by using a technique known as subroutines and linkages. Subroutines are also known as subprograms.

Organizing a program into subprograms has numerous advantages. Each subprogram can be separately programmed, assembled, tested and then interconnected. In general a large program would have to be broken down into a number of subprograms and held together by a main program which acts as the coordinator. The art of splitting a complex task into a set of subtasks requires careful analysis and experience. Usually factors such as logical independence and program-length would be the guiding factors.

A subprogram may be called (used) by the main program several times and in turn the subprogram may call other subprograms, as shown in Figure 6.7. Since a subprogram can be called or may call another subprogram, we will use the terms calling program (or caller) and called program (or callee). The same subprogram may be a called program at one stage and a calling program at a different stage during execution. The main program is the coordinating program. It can never be a called program; that is, no subprogram may call the main program. For example, in Figure 6.7, the main program calls Sub 1, Sub 2 and Sub 3. Sub 1 and Sub 2 in turn call Sub 4 to complete their task. At the end of its execution, control should return to the appropriate location in the calling program; that is, at the end of Sub 4 the control should return to Sub 1 or Sub 2, whichever was the caller at that time.

[FIGURE 6.7  Subroutine linkages: the Main Program calls Sub 1, Sub 2 and Sub 3; Sub 1 and Sub 2 in turn call Sub 4.]

We should observe that between the caller and the called, two kinds of transfers, control transfer and data transfer, should take place. In the sequel we will explain how these transfers can be achieved. A calling program should transmit the following information to the called program:
1. The point of return, that is, where the control should be returned when the called program completes its task.
2. The addresses of the arguments, or the arguments themselves, are to be communicated. Suppose Sub 4 is a subroutine to compute (x + y)²; then the storage locations of x and y, or x and y themselves, are to be given by the caller to the called program. Similarly, the caller should know the location where the called program puts the result.

Suppose a subprogram is to be called from the main program a number of times, as shown in Figure 6.8.

[FIGURE 6.8  Subroutine call and return: the caller makes a first call at address 0096 (returning to 0097) and a second call at address 0160 (returning to 0161); the called program occupies addresses 0304 to 0399 and returns control to the appropriate caller address.]

We observe that the return address can be easily obtained from the address of the instruction that transfers control to the called program. In the first call (see Figure 6.8) this address is 0096 and the return address will be the address of the next instruction, that is, 0097. We can introduce a new instruction to store the contents of the program counter in a specified register (remember that after the current instruction is fetched, the program counter contains the address of the next sequential instruction) and then transfer the execution to the first instruction in the called program or subroutine. The address of the first instruction of the called program is also known as the address of the callee.

We will enrich the instruction set by adding two new instructions:
1. CALL (to call a subroutine)
2. RET (short for RETURN) (to return to the calling program)

Semantics of CALL instruction
    Symbolic form:           CALL POLY    [Op-code for CALL is 43, J-type]
    Machine language form:   43 30 2EFA   (assuming POLY is located at address 2EFA)
    Semantics:               Top of stack ← address of the memory location
                             following that of the CALL instruction, that is:
                                 PUSH PC      (save return address in top of stack)
                                 PC ← 2EFA    (the starting address of the subroutine)

Semantics of RET instruction
    Symbolic form:           RET          [Op-code for RET is 44, R-type]
    Machine language form:   44 04 00 00
    Semantics:               PC ← top of stack

We note that the semantics of CALL and RET instructions (which are hardware instructions) are defined in terms of the PUSH and POP operations, and a stack in the hardware is used to store the return addresses involved in the subroutine calls. A called subroutine can in turn call another subroutine, and so on. Recursive calling to any depth becomes straightforward with the use of stacks.
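The CALL/RET mechanism can be sketched as a fragment of a simulator. The dispatch structure, the stack size and the sample addresses are assumptions made for illustration; only the push-return-address/pop-into-PC behaviour follows the semantics given above.

    #include <stdio.h>

    unsigned stack[256];     /* return-address stack used by CALL and RET    */
    int sp = -1;             /* index of the top of the stack                */
    unsigned PC;             /* program counter                              */

    /* CALL target: PC already holds the address of the next sequential
       instruction after the fetch; push it, then jump to the callee.        */
    void call(unsigned target) {
        stack[++sp] = PC;    /* PUSH PC : save return address                */
        PC = target;         /* PC <- starting address of the subroutine     */
    }

    /* RET: pop the saved return address back into PC.                       */
    void ret(void) {
        PC = stack[sp--];    /* PC <- top of stack                           */
    }

    int main(void) {
        PC = 0x0097;         /* pretend the CALL at 0096 has been fetched    */
        call(0x2EFA);        /* CALL POLY                                    */
        printf("executing callee at %04X\n", PC);
        ret();               /* RET                                          */
        printf("returned to %04X\n", PC);    /* prints 0097                  */
        return 0;
    }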

We still have the problem of passing parameters between the caller and the callee which has to be solved. Suppose we have written a subroutine to compute z = (x + y)²; then the addresses of x and y known to the caller have to be communicated to the callee and, conversely, the address of z known to the callee has to be communicated to the caller. While writing subroutines, conventions are established for passing these parameters or arguments and sharing the CPU registers for their computation.

Convention 1:  The caller saves the contents of all the needed registers in a known area in memory before calling and restores them when the control returns to the caller.
Convention 2:  The called program saves the contents of the registers it uses and restores them before returning control to the caller. This way the callee can be free to use all the CPU registers for its own computation.

To pass parameter addresses between the caller and the callee we have two options:

Convention 3:  Pass the values of arguments in mutually known registers (call by value).
Convention 4:  Store the arguments in a block of consecutive locations in a reserved part of the memory in an orderly fashion and pass only the starting address of that block (call by reference).

Program 6.5: Read x and y one at a time and then compute (x + y)⁴ using a subroutine that computes (x + y)². Parameters are to be passed by passing the base address through the stack.

Storage allocation
    R1:    MAIN internal working
    R4:    Contains output
    R6–R8: Used by subroutine

Symbolic program
        INPUT  R1            ; Read X
        STORE  R1, X
        INPUT  R1            ; Read Y
        STORE  R1, Y
        CALL   SUBR
X       DW 1                 ; Parameters stored here
Y       DW 1
XYSQ    DW 1                 ; Results stored here
XY4     DW 1
        LOAD   R1, XYSQ      ; SUBR will return here
        MULT   R4, R1, R1
        STORE  R4, XY4
        OUTPUT XY4
        HALT

SUBR    POP    R6            ; Store address from stack in R6;
                             ; R6 contains address of the data X
        LDIND  R7, R6        ; R7 now has the value of X (indirect address)
        INC    R6            ; R6 now points to Y
        LDIND  R8, R6        ; R8 now has the value of Y
        ADD    R7, R7, R8    ; (x + y) in R7
        MULT   R8, R7, R7    ; (x + y)² in R8
        INC    R6            ; R6 now points to XYSQ
        STIND  R8, R6        ; result stored in XYSQ
        INC    R6            ; R6 now points to XY4
        INC    R6            ; R6 now points to return address
        PUSH   R6            ; Top of stack has return address
        RET                  ; stack top points to the return address
        END

The method of passing parameters as shown above is adequate but it is not elegant, as data areas and program instruction areas are mixed in memory. After every CALL instruction, this method will place the arguments of the subroutine in the main memory immediately following the placement of the CALL instruction. This intermixing of program and data can be avoided by putting all the arguments (data) together at the end of the program and by passing their base address through the stack. We will leave rewriting Program 6.5 to follow this convention as an exercise to readers. In order to facilitate this rewriting process, we introduce a new instruction called Load Effective Address (LEA).

Semantics of LEA instruction
    Symbolic form:           LEA R2, DELTA    [Op-code for LEA 47, M-type,
                                               indexed addressing with R3]
    Machine language form:   47 23 789A       (assuming the address of DELTA is 0789A)
    Semantics:               C(R2) ← 789A + C(R3)    {i.e., the effective address}

Note that the LEA instruction leaves the address of the data (not the data value) in the specified register. This distinction must be clearly understood.
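The distinction between LEA and an indexed LOAD can be sketched in C: one yields an address, the other the value stored at that address. The sample memory contents are assumptions for illustration.

    #include <stdio.h>

    int memory[0x10000];

    /* LEA R2, DELTA with index register R3: R2 receives the effective address. */
    int lea(int delta_addr, int r3) {
        return delta_addr + r3;          /* address only; memory is not touched */
    }

    /* LOAD in indexed mode: the register receives the value stored there.      */
    int load(int delta_addr, int r3) {
        return memory[delta_addr + r3];  /* one extra memory access             */
    }

    int main(void) {
        memory[0x789A] = 42;             /* suppose the word DELTA holds 42     */
        int r3 = 0;                      /* index register contents             */
        printf("LEA  -> %04X\n", lea(0x789A, r3));   /* prints 789A (address)   */
        printf("LOAD -> %d\n",  load(0x789A, r3));   /* prints 42   (value)     */
        return 0;
    }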

6.13 ENHANCED ARCHITECTURE—SMAC++

We have introduced eight new instructions into SMAC+ and three new types of addressing modes: indexing for vector operations, LDIMM for conveniently loading a constant into a register (load immediate), and indirect addressing through LDIND/STIND (load/store indirect). There have to be appropriate modifications in the hardware architecture of SMAC+ to accommodate these new features. This modified SMAC+ is summarized below and re-named as SMAC++, with its new instruction set called S2. The architectural enhancement in SMAC++ when compared to SMAC+ is summarized below:

1. The instruction set is enriched to become S2.
2. Three new addressing formats are incorporated (indexing, immediate addressing and register-indirect addressing).
3. The stack has to be supported with a special SP or by dedicating a general purpose register for stack pointing.
4. Hence, the instruction format has to be appropriately modified so that the decoding hardware can do its job properly.
5. The software architecture (the Assembler) has to be so designed as to distinguish the pseudo operations (like DW) from the machine operations and take appropriate actions.
6. The team of assembly language programmers agree on suitable conventions in developing their subroutines (or modules) so that they can share each other's subroutines and pass parameters correctly while using the subroutines.

The instruction set S2 of SMAC++ is given in Table 6.4.

TABLE 6.4  Instruction Set S2 of SMAC++

    Sl.No.   Op-code   Op-code in Hex   Type
    1        HALT      00               R
    2        ADD       01               R
    3        LOAD      02               M
    4        STORE     03               M
    5        JUMP      04               J
    6        JMIN      05               J
    7        INPUT     06               R
    8        OUTPUT    07               R
    9        INC       10               R
    10       CMP       11               R
    11       JLT       12               J
    12       DEC       13               R
    13       AND       20               R
    14       OR        21               R
    15       NOT       22               R
    16       SUB       31               R
    17       MULT      32               R
    18       IDIV      33               R
    19       JOFL      34               J
    20       LDIMM     40               R
    21       MOVE      41               R
    22       BCT       42               J
    23       CALL      43               J
    24       RET       44               R
    25       LDIND     45               R
    26       STIND     46               R
    27       LEA       47               M

In Table 6.5 we give the instruction format and semantics of the newly added instructions of the instruction set S2.

TABLE 6.5  Symbolic Instructions and Semantics of New Instructions in SMAC++

    Symbolic form      Hex form (OP REG ADDR)   Semantics
    INP   R1           06 1-                    C(R1) ← INPUT (modified from S1)
    OUT   R2           07 2-                    OUTPUT ← C(R2) (modified from S1)
    LDIMM R4, Y        40 40 4567               C(R4) ← Y
    MOVE  R1, R2       41 12                    C(R1) ← C(R2)
    BCT   R4, XX       42 40 8543               C(R4) ← C(R4) – 1;
                                                if C(R4) ≠ 0 then PC ← XX
    LOAD  R1, R2, X    02 12 4672               C(R1) ← C(X + C(R2)) [indexed address]
    STORE R1, R2, X    03 12 7432               C(X + C(R2)) ← C(R1) [indexed address]
    INC   R15          10 F-                    C(R15) ← C(R15) + 1 (R15 is assumed
                                                to be reserved as stack pointer)
    DEC   R15          13 F-                    C(R15) ← C(R15) – 1
    PUSH  R1, R15      10 F- / 46 1F            C(C(R15)) ← C(R1)
                                                [equivalent to INC R15; STIND R1, R15]
    POP   R1, R15      45 1F / 13 F-            C(R1) ← C(C(R15))
                                                [equivalent to LDIND R1, R15; DEC R15]
    CALL  POLY         43 00 2EFA               Jump to subroutine label POLY after
                                                storing the return address in the stack
                                                [equivalent to PUSH PC; PC ← POLY]
    RET                44 00                    PC ← (contents of top of stack)
                                                (return to calling program)
    LEA   R2, DELTA    47 23 789A               C(R2) ← C(R3) + DELTA
                                                (R2 stores the effective address)

6.13.1 Modifications in the Instruction Formats for SMAC++

The following are desirable modifications in the instruction formats for SMAC++:

1. In the Op-code field of SMAC++ we have 8 bits. Let us use the two high order bits to denote the type of the instruction (R, J, M) and use the following coding:

    01 → M-type
    10 → J-type
    11 → R-type

Then, the remaining 6 bits in the Op-code field of an instruction are sufficient to code the instruction set S2.

2. In order to facilitate indexing, the format of the M and J type instructions will be modified by adding another register reference that is called the index register. When registers in the CPU are general purpose, any available register can be used by the programmer or the assembler as an index register. In certain other machines like Pentium, there are separate registers available for indexing. Allocating a separate field for index register specification reduces the number of bits available for the operand-address field in the address space to 16 bits.

3. The assembler software will use several tables. One table will contain the Op-codes and their corresponding hex-codes and it will be used in translation to the machine language. Another table will contain the list of all pseudo-ops so that they can be recognized for taking appropriate actions. Yet another table will be the symbol table that is constructed by the assembler during the assembly process and used for translation to machine language.
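Item 3 can be sketched as a small C fragment: a mnemonic-to-hex Op-code table searched during translation. The table entries follow Table 6.4; the lookup function itself is an illustrative assumption about how a simple assembler might be organized.

    #include <stdio.h>
    #include <string.h>

    /* A slice of the Op-code table of Table 6.4: mnemonic and hex code.       */
    struct opcode { const char *mnemonic; unsigned char hex; };

    static const struct opcode optab[] = {
        {"HALT", 0x00}, {"ADD",   0x01}, {"LOAD", 0x02}, {"STORE", 0x03},
        {"LDIMM",0x40}, {"BCT",   0x42}, {"CALL", 0x43}, {"RET",   0x44},
        {"LDIND",0x45}, {"STIND", 0x46}, {"LEA",  0x47},
    };

    /* Linear search, as a simple assembler might do; returns -1 if the
       mnemonic is not a machine operation (it may then be a pseudo-op).       */
    int lookup(const char *mnemonic) {
        for (size_t i = 0; i < sizeof optab / sizeof optab[0]; i++)
            if (strcmp(optab[i].mnemonic, mnemonic) == 0)
                return optab[i].hex;
        return -1;
    }

    int main(void) {
        printf("LEA -> %02X\n", (unsigned)lookup("LEA"));  /* prints 47        */
        printf("DW  -> %d\n",  lookup("DW"));   /* -1: a pseudo-op, not in table */
        return 0;
    }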

6.14 SMAC++ IN A NUTSHELL

In this chapter, we introduced a pedagogical computer called SMAC+ and explained how a machine language program for SMAC+ is stored and executed. We started with SMAC+ and then improved it to SMAC++. We wrote several programs to bring out the issues involved in designing an instruction set for SMAC+. SMAC+ and its derivative SMAC++ are not real computers. They are used to explain the organizational concepts. Our purpose in this chapter has been to explain basic concepts and not to make SMAC++ a real computer. It is not a complete computer yet because we need to study its memory organization and I/O organization. However, in succeeding chapters, we will not be focusing on SMAC++.

We have introduced a number of basic concepts in this chapter and they are summarized as follows:
1. Concepts of stored programs, instruction fetch, instruction execution and instruction sequencing using the PC (IP) register.
2. Simplicity in design—all instructions are 32 bits long and all Op-codes are 8 bits long.
3. Simple modes of addressing data or instructions are introduced (indexing, direct addressing, immediate operands, register-indirect addressing).
4. Access to the main memory for operands is only with a few instructions: Load and Store (in the case of SMAC++).
5. Three simple formats for instruction—R, M and J types (note that the two leftmost bits of the Op-code determine the type).
6. In SMAC++ we introduced the 3-address format for arithmetic instructions. All three addresses refer to registers.
7. The difference between word addressing and byte addressing of main memory.
8. The concept of stack and its uses.
9. How to support subroutines. (Subroutines are needed for modular organization of large programs.)
10. Symbolic programs and their machine language equivalents.
11. An introduction to assembler instructions which are not machine language instructions (like Define Word or DW).
12. The need to write programs in an elegant manner.
13. The basic principle of trade-off in design.

SUMMARY
1. A computer's instruction has five parts. They are: the operation to be performed, the address(es) where the two operands will be found, the address where the result is to be stored, and the location where the next instruction will be found.
2. However, most instruction sets have only an operation code and two or three address fields. The address of the next instruction to be executed is kept in the Program Counter (PC).
3. The instruction length (or the word length) is the number of bits retrieved from the main memory in parallel in one cycle. Currently word lengths of 32 bits are common. However, machines with 64-bit words are currently entering the market.
4. The number of bits to be assigned to the various parts of an instruction depends on the instruction length, the size of the memory, the number of registers in the CPU and how many of them need to be addressed in an instruction, and the number of operations to be supported.
5. The "instruction set" is driven by the application needs. The choice of an instruction, the use of registers, and modes of addressing are all determined by the requirements of providing a simple and convenient method of writing a variety of application programs.
6. It is seen that to implement loops, instructions such as branch on count, compare and various jump instructions are useful.
7. In this chapter we have used two hypothetical computers, SMAC+ and SMAC++, in which we introduced registers, new operations and addressing modes to illustrate how these features simplify programming.

8. Using registers reduces access time to main memory and simplifies arithmetic, logic, and certain types of jump instructions.
9. Use of registers as index registers, instructions to increment/decrement values in registers, and immediate addressing simplify operations with vectors.
10. To implement stacks, indirect addressing is useful. Use of a stack simplifies passing parameters and returning from subroutines to calling programs.

EXERCISES
1. A computer is to be designed with the following specifications: 32K words of memory, each word to be addressable, 64 operation codes and 8 registers.
   (a) What is the instruction format if in an instruction 3 GPRs can be addressed?
   (b) What is the word length of the machine?
   (c) What is the largest integer that can be stored in a word?
   (d) Is the word size sufficient to represent floating point numbers? If yes, pick an appropriate number of bits for the mantissa and exponent. Justify your choice.
   (e) What CPU registers are needed and what are their lengths?
2. A byte addressed machine has 256 MB memory. It has 160 instructions and 16 GPRs. Answer the following questions:
   (a) What is the format of an instruction for this computer?
   (b) What is the number of bits in an instruction?
   (c) If an integer is to be stored in a word using a two's complement representation, what is the range of integers which can be represented in this machine?
   (d) Can a floating point number be represented in this word? If not, what solutions would you suggest for floating point representation?
   (e) What are the advantages of byte addressing over word addressing and what are their disadvantages?
   (f) If the number of registers is to be increased to 32, discuss at least three choices you can make to redesign the instruction format. Discuss their respective advantages and disadvantages.
3. Assume ASCII codes for the letters A–Z. Represent the following sentence using these character codes: "RAHUL FOUND A PEN". Write a SMAC+ program to count the number of 'A's in this sentence.
4. Write a simulator in C to simulate SMAC+. Using this simulator execute Program 6.1 given in the text. Print a trace of the program generated by the simulator.

5. Write a program for SMAC+ to count the number of zeros in a given 32-bit word. Do you find the instruction set to be adequate? Comment.
6. Rewrite Program 6.2 using the two-address format for the arithmetic operations. Compare your program with Program 6.2 and comment. Which instruction set is appropriate for this problem?
7. Write a program in SMAC+ assembly code to find the largest and the second largest integer from a set of 10 integers. Trace your program and show that it is correct. Scan your program carefully to assure that the number of instructions cannot be reduced any further.
8. For Program 6.4 of the text write:
   (a) The machine language equivalent of the assembly code.
   (b) If the contents of the stack have to be X, Y, Z, W from top to bottom, how should the input data be presented?
   (c) In what order will the output characters be printed?
9. Is it possible to implement a stack using indexed addressing instructions instead of indirect addressing?
   (a) If yes, write assembly programs for SMAC++ to simulate the PUSH and POP instructions.
   (b) Rewrite Program 6.4 of the text using these new instructions.
   (c) Write the machine language equivalent of the assembly code of (b).
10. Write an algorithm similar to Algorithm 6.1 in the text to simulate SMAC++.
11. Write a C or Java program to simulate SMAC++. Use this simulator to execute the program you wrote to solve Exercise 7.
12. Use the logical instructions of SMAC++ to replace every occurrence of the byte A5 (in Hex) in a word by the byte B6.
13. Illustrate how a stack can be used when a program calls a subprogram A which in turn calls another subprogram B. Show how the parameters and the return addresses will be transmitted.
14. Write a subroutine to add the 5 data elements in a vector. Use this subroutine to add the elements of the columns of a 5 × 5 matrix. Use SMAC++ assembly language.
15. Introduce instructions in SMAC++ to do the following and define their semantics:
    (a) Shift right the contents of a register by a specified number of bits.
    (b) Shift left the contents of a register by a specified number of bits.
    (c) Shift circular right the contents of a register by a specified number of bits.
    Expand the SMAC++ simulator to include these instructions.

16. In Chapter 5 we gave an algorithm for floating point addition of two 32-bit numbers. Write that algorithm as an assembly language program for SMAC++ with the enhanced instructions of Exercise 15.
17. Write assembly codes for floating point multiply and divide.
18. In SMAC+ we have defined the compare instruction as CMP R1, R2 where R2 is a register address. Modify it to CMP R1, X where X is a constant. Discuss the advantages and disadvantages of defining CMP in this manner. Modify the simulator of Algorithm 6.1 to reflect this changed definition. Rewrite Program 6.1 appropriately.
19. Assemble Program 6.5 into its machine language equivalent using the instruction set S2.
20. Rewrite Program 6.5 using the LEA instruction described in the text. Write the machine language equivalent.

7  CENTRAL PROCESSING UNIT

LEARNING OBJECTIVES

In this chapter we will learn:
•  What features are relevant to a typical Central Processing Unit.
•  How instruction formats and instruction sets are designed.
•  Different modes of addressing the contents of a RAM and their applications.
•  What determines the size of the register set.
•  Encoding and decoding of instructions and addresses.
•  How buses are used for transferring the contents from one place to another.
•  Data paths and control flow for sequencing in CPUs.
•  How clocks are used for timing a sequence of events.
•  Microprogramming of the control unit of a CPU.

7.1 INTRODUCTION

In any computer system, small or large, the central processing unit (also known as CPU) is the heart of the system. However, a powerful CPU alone does not make a powerful computer system. The memory sub-system and the I/O sub-system should also match the power of the CPU to make a computer powerful. In Chapter 6, we presented a hypothetical computer and its organization from the point of view of the instruction set and assembly language programming.

In this chapter we will examine what constitutes the central processing unit and how it is organized at the hardware or microprogramming level. In this case, micro-level operations such as "transfer a register-content to another register" play a major role. Strict timing and sequencing of micro-operations become important.

The CPU of a computer system is characterized by many features such as: the set of instructions supported by that system (instruction set); how the Op-codes and addresses in these instructions are encoded (instruction format); the diverse ways in which the operands and instructions in memory can be addressed (addressing modes); the set of registers in the CPU and how the registers are functionally partitioned or divided for the programmer's use (register organization); how the various registers and the ALU are connected to each other by one or more buses, and the data paths which facilitate efficient data flow from component to component in the CPU so as to enable correct instruction execution; and the timed sequencing of micro-operations in the form of a microprogram and the control signals which facilitate it.

In Chapter 6, we evolved the instruction set of a hypothetical computer called SMAC++ in two steps. The main focus was on how to write concise programs to solve meaningful problems using a set of elementary instructions. In that process of simplifying programming, we needed more registers, more instructions and newer addressing modes. We will summarize below the elements of the CPU of this hypothetical computer SMAC++:

1. There were 16 registers in the CPU. However, this number was limited due to the 4-bit address field reserved for register references.
2. The registers were called General Purpose Registers (GPRs) because they were not pre-assigned to hold any specific data items nor reserved for any specific purpose. This is unlike the Pentium processor.
3. In SMAC++ all registers were 32 bits long.
4. The memory system was assumed to be organized into words of 32 bits each. A string of 32 bits was called a 'word'. Most real computers use byte addressing instead of the word addressing used by SMAC++.
5. The instructions were of fixed length (32 bits). One instruction was stored in a word even if the instruction did not require all the 32 bits.
6. Within an instruction, the Op-code field encoding an instruction was of a fixed length of 8 bits for all instructions. Normally this is not true in many real computers, which use a variable number of bits in the Op-code field with the objective of optimizing coding efficiency.
7. There were only 3 simple instruction formats (referred to as R, M and J types).
8. When a memory location was referenced in SMAC++, it was directly addressed or addressed using indexing. These were the two basic addressing modes provided.
9. Direct addressing restricted the range of memory addresses (in some cases SMAC++ used a 16-bit address giving a 64K word range and in other cases a larger address range).
10. The instruction set of SMAC++ had less than 30 instructions as opposed to hundreds of instructions in a Pentium like processor.
11. There were three registers: IR (instruction), FLAG (status) and IP or PC (instruction pointer or program counter), mainly used to control instruction sequencing.
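The programmer-visible state summarized above can be captured as a C structure. This is only a descriptive sketch; the field names are chosen for illustration and are not part of the SMAC++ definition.

    #include <stdint.h>

    /* Programmer-visible state of the hypothetical SMAC++ CPU. */
    struct smacpp_cpu {
        uint32_t gpr[16];   /* 16 general purpose registers, 32 bits each    */
        uint32_t pc;        /* IP/PC: address of the next instruction        */
        uint32_t ir;        /* IR: instruction currently being executed      */
        uint32_t flags;     /* FLAG: status bits set by compare/arithmetic   */
    };

    int main(void) {
        struct smacpp_cpu cpu = {0};   /* reset state: all registers cleared */
        cpu.pc = 0;
        return (int)cpu.pc;
    }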

In the following sections we will study in detail the various aspects of a typical CPU but will not restrict our study only to SMAC++. Wherever appropriate, we will use SMAC++ as an example.

7.2 OPERATION CODE ENCODING AND DECODING

In general an instruction for a computer should specify the following:
1. What operation is to be performed?
2. Where are the operands stored (in register or where in memory)?
3. Where will the result be stored (in register or where in memory)?
4. What is the address of the instruction to be executed next?

An instruction is designed to codify the information related to the questions given above. There are a number of ways in which the above information can be specified. Computer designers over the years have designed computers with a number of different instruction formats, as there is no optimal way of selecting an instruction format. The selection is based on the electronic and memory technology available at a specified time, the programming method and languages of relevance at that time, the intended applications of the computer, compatibility with the older computers designed by the manufacturer, and so on.

An instruction in SMAC++ had an Op-code field that was 8 bits long and the codes for the 26 operations were assigned rather arbitrarily. With 8 bits we can represent 2⁸ = 256 different operation codes, but we used only 26 of these possibilities. This representation or mapping of a set of Op-code symbols to binary strings on a one-to-one basis is called encoding. This mapping could have been done in a systematic or hierarchical manner. For example, the first two bits of the 8-bit Op-code could have been used to identify the R-type, M-type and J-type instructions (see Figure 7.1). If we do so, then the code space consisting of the 2⁸ combinations will be partitioned into 4 equal sets. Each set can then be assigned to denote one type of instruction. With this coding scheme, a 2-input decoder would determine the type of instruction being considered (see Figure 7.2). The decoder output can be used to trigger the appropriate sequences of micro-level operations for register access or memory access. For an R-type instruction, no memory access is needed, whereas for an M-type instruction, memory access is required.
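The 2-bit type decode can be written out in C. The enum names are illustrative, and the mapping (00 unused, 01 R-type, 10 M-type, 11 J-type) follows the partitioning of Figure 7.1.

    #include <stdio.h>

    enum itype { ILLEGAL = 0, R_TYPE = 1, M_TYPE = 2, J_TYPE = 3 };

    /* Extract the leftmost 2 bits of an 8-bit Op-code, as the 2-input,
       4-output decoder of Figure 7.2 would.                              */
    enum itype decode_type(unsigned char opcode) {
        return (enum itype)((opcode >> 6) & 0x3);
    }

    int main(void) {
        unsigned char opcode = 0x6A;    /* binary 0110 1010: leftmost bits 01 */
        switch (decode_type(opcode)) {
        case R_TYPE: printf("R-type: no memory access needed\n"); break;
        case M_TYPE: printf("M-type: memory access required\n");  break;
        case J_TYPE: printf("J-type\n");                          break;
        default:     printf("illegal Op-code\n");                 break;
        }
        return 0;
    }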

[FIGURE 7.1  Partitioning the code space: the leftmost 2 bits divide the 256 Op-codes into four groups of 64 each—00 xxxx unused (illegal Op-codes), 01 xxxx R-type, 10 xxxx M-type and 11 xxxx J-type.]

[FIGURE 7.2  Instruction type decoder: the leftmost 2 bits of the Op-code feed a 2-input, 4-output decoder whose outputs signal Illegal Op-code, R-type, M-type and J-type.]

In the above encoding, all operation codes are of equal length. The number of instructions of each type is not the same. Recall the instruction set of SMAC++ where there are many more R-type instructions than other types. The type of hierarchical encoding explained above has one disadvantage: if we go beyond the limit of 64 instructions in one type, we cannot use any available free space in another partition.

We can consider variable length codes instead of fixed length codes. We will illustrate this with an example. Suppose 15 operation codes are used much more frequently than others. Call these 15 operation codes as belonging to group 1. We can encode these 15 operations with 4 bits. Out of the 16 combinations of 4 bits, we use the first 15 to code group 1 instructions. The unused 16th combination 1111 is used along with 4 more bits to encode 16 operation codes of group 2 (those that are not used frequently). The coding will thus be as shown in Figure 7.3. If the group 1 instructions were used 80% of the time by different programs, then the 'average Op-code length' with this encoding would be:

    Average length = 0.8 × 4 + 0.2 × 8 = 4.8 bits

This is to be compared with 5 bits needed in simple fixed length encoding. Thus, there is not much gain in terms of saving bits, and we have brought complexity into decoding without substantial gain. However, since a group 1 instruction has a shorter Op-code, more bits are left over in the instruction; thus, group 1 instructions may be used to directly address a larger address space.

[FIGURE 7.3  Use of variable length Op-codes.]
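A variable-length Op-code of this kind can be decoded as sketched below: if the first 4 bits are 1111, four more bits select a group 2 operation. The bit packing into a single byte is an assumption made for the illustration.

    #include <stdio.h>

    /* Decode the variable-length Op-code at the top of an 8-bit field:
       group 1 codes are 0000..1110 (4 bits); the prefix 1111 means the
       next 4 bits encode one of the 16 group 2 operations.               */
    void decode(unsigned char byte) {
        unsigned high = (byte >> 4) & 0xF;
        if (high != 0xF)
            printf("group 1 operation %u (4-bit Op-code)\n", high);
        else
            printf("group 2 operation %u (8-bit Op-code)\n", byte & 0xF);
    }

    int main(void) {
        decode(0x30);   /* 0011 ....  -> group 1 operation 3              */
        decode(0xF5);   /* 1111 0101  -> group 2 operation 5              */
        /* average Op-code length: 0.8 * 4 + 0.2 * 8 = 4.8 bits           */
        return 0;
    }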

EXAMPLE 7.1

A computer has a 32-bit word length. It has 16 registers. There are 15 instructions which use 3 register addresses for operands. The other instructions have 2 register operands and the remaining bits address main memory. Suggest an instruction format for this computer.

Option 1:  To encode 16 register addresses, we need 4 bits. The 15 instructions which use 3 register addresses for operands may be designed as half-word instructions as shown in Figure 7.4. The other instructions, with 2 registers and a memory address, may be designed as full word instructions as shown in Figure 7.5. Observe that we use 8 bits for the Op-codes of the second set of instructions.

[FIGURE 7.4  Half-word instructions (Example 7.1): a 4-bit Op-code field (0000 for instruction 0 through 1110 for instruction 14) followed by three 4-bit register fields reg 1, reg 2 and reg 3.]

The total number of instructions is limited to 31. Direct addressability is 64 KB (as 16 bits are used for the address part of the instruction).

[FIGURE 7.5  Full word instructions (Example 7.1): an 8-bit Op-code (1111 0000 for instruction 15 through 1111 1111 for instruction 31), two 4-bit register fields (reg 1, reg 2) and a 16-bit address field.]

Option 2:  Use all instructions as full word instructions. With this design, we can encode 128 Op-codes. Direct addressability is 128 KB (see Figure 7.6).

[FIGURE 7.6  Using the full word instruction option: Set 1 has a 7-bit Op-code, three 4-bit register fields (reg 1, reg 2, reg 3) and 13 unused bits; Set 2 has a 7-bit Op-code, two 4-bit register fields and a 17-bit address field.]

7.3 INSTRUCTION SET AND INSTRUCTION FORMATS

An instruction has several fields and each field may be encoded differently. The instruction decoder unit (hardware) in a computer decodes these fields individually. The instruction set is a collection of all the instructions that a computer designer has selected for hardware implementation. In a CISC (Complex Instruction Set Computer), the number of different instructions runs into three or four hundred. Normally, in a RISC (Reduced Instruction Set Computer) type computer, the designer chooses a small set so that the assembly language programmer or the assembler software can use this set more effectively. The reader is urged to observe how the various instructions are encoded in a computer they use. This can be done by observing the output generated by the assembler software when assembly language programs are written.

7.3.1 Instruction Set

The instruction set of a CPU determines its power to do various operations on data items. In early machines (designed in the 1960s) one had to program in assembly language, but this is not true any more. Most of the software development for applications is done using HLLs (Higher Level Languages). If no compiler for a higher level language is available, a programmer has to develop the application software using the assembly language and the instruction set of a computer. The assembly language programmer then has to be smart enough to use the instructions judiciously and optimize a chosen objective function. The objective function could be: minimize the execution time of the program, or minimize the storage occupied by the program, or minimize the human effort needed to understand/modify the program developed by a programmer.

In order to understand this, let us enumerate the various classes of instructions needed in an instruction set. The reader is urged to use SMAC++ as an example to appreciate the following classes and note that we have not introduced any instruction of the type C9 below in this textbook.

    Class Name   Type of Instruction
    C1           Load register from and store register to memory
    C2           Move instructions (Register-to-register)
    C3           Arithmetic instructions
    C4           Bit oriented instructions (Logical operations, etc.)
    C5           Branch instructions
    C6           Subroutine control instructions
    C7           Shift type instructions
    C8           I/O instructions
    C9           Instructions to support the Operating System

Instruction set designers have followed two different approaches, mixing and matching them as needed:

Approach 1:  Provide as many instructions as possible as part of the instruction set. When the instruction set is too large (how large is too large?), this will make the CPU design more complex.

Approach 2:  Provide a minimal set of frequently used instructions as part of the instruction set. Additional support needed is provided through suitably designed HLL compilers or macros. When we use the term minimal as above, we do not mean the theoretical minimum but small in the sense of engineering design.

A compiler can enrich the instruction set by providing higher level operations. For example, if a floating point multiplication is not available as part of the instruction set of a CPU, an assembler used with that computer may provide the floating point multiplication using an assembler program known as a macro. If the computer system is mostly used with HLL compilers, then the compilers have to be smart enough to do the above.

Designing an instruction set for a computer system is not an algorithmic process and it requires extensive experience. A designer uses both scientific reasoning and intuition based on past experience. However, once an instruction set is designed, its usefulness can be evaluated through benchmark programs and static or dynamic instruction frequencies. Let there be N instructions in the instruction set of a CPU and let Fi be the frequency of the ith instruction. Without loss of generality, let us assume Fi ≥ Fi+1, i = 1, 2, …, N – 1. In order to estimate or compute the Fis, we need a collection of representative programs which represent the expected use of that computer system. Such programs are known as benchmark programs. We can evaluate a computer system that is in use by writing these benchmark programs and calculating the Fis.

While measuring the Fis for a computer system in use, we are faced with the question: should we consider the Fis for the static case or the dynamic case? In the static case, the benchmark programs are not executed and the frequencies are calculated from the compiled (or assembled) programs. To find Fi for the dynamic case, we execute the benchmark programs using typical data sets as input data and observe the various instructions as they are executed. We could use the trace files, discussed in Chapter 6, for this purpose, or instrument a program by putting counters at suitable places from which dynamic Fis can be calculated. Researchers like Knuth have found that there is a good correlation between the dynamic and static Fis.

From experience and statistical study, it has been observed that a core set of instructions is used very frequently and the remaining set of instructions is used rarely or not at all. Using the Fis, one may decide to provide hardware only for the first k instructions such that

    F1 + F2 + … + Fk ≥ 0.95

and implement the rest with macros. This is the reason for some designers to promote Approach 2 mentioned earlier. But how can we estimate the Fis for a computer system that is being designed? One way to do that would be by simulating the proposed computer system at the level of detail of the instruction set and iteratively refining the instruction set.

The reader is urged to study the complete instruction set of at least one computer system and examine it carefully. You may put those instructions into the classes (Class 1 through Class 9) presented in this section. Comment on the merits and demerits of classifying instructions using this idea. For the case study undertaken, would you need another type called miscellaneous? Explain why.
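The cutoff rule F1 + … + Fk ≥ 0.95 is easy to compute once the frequencies are sorted. Below is a small C sketch; the frequency values are made up for illustration.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical instruction frequencies, sorted so that Fi >= Fi+1.    */
        double F[] = {0.30, 0.22, 0.15, 0.10, 0.08, 0.05, 0.04, 0.03, 0.02, 0.01};
        int N = sizeof F / sizeof F[0];
        double cum = 0.0;
        for (int k = 0; k < N; k++) {
            cum += F[k];
            if (cum >= 0.95) {
                /* hardware for the first k+1 instructions, macros for the rest */
                printf("implement %d instructions in hardware (coverage %.2f)\n",
                       k + 1, cum);
                break;
            }
        }
        return 0;   /* with these values: 8 instructions, coverage 0.97 */
    }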

7.3.2 Instruction Format

The instruction format of a given instruction reveals how the various fields or bits forming that instruction have to be interpreted by the hardware for execution and by the programmer to understand the meaning of that instruction. For SMAC++ we decided arbitrarily that the various instructions will fall into one of the three formats: R-type, M-type and J-type. In doing so, we followed the designers of a commercial RISC type computer, MIPS-R4000.

From what we studied in Chapter 6, without loss of generality, we now enumerate the various fields that might be present in an instruction format:
1. Op-code field.
2. Register reference (there may be zero, one, two or three register references).
3. Memory reference (if one of the operands is located in memory).
4. A small integer constant (that may be present in some instructions).

Of the various fields of an instruction, the memory reference field is the longest and the register reference field is the shortest. At the least, an instruction format should have the Op-code field and nothing else. The HALT instruction could be of this type. A diadic operation such as ADD, SUB, MULT, DIVIDE or MOVE requires two operands (in some cases three, if we count the destination or the result of the operation to be different):

    ADD X, Y    which means X ← X + Y

Consider a typical instruction of the above type. If both operands are in registers, the instruction will be shorter compared to the case when one of the two operands is in the main random access memory (RAM).

In a memory system, normally each memory access fetches one word and stores it in the MBR (Memory Buffer Register). For an instruction fetch cycle the MBR will be moved to the IR (Instruction Register). In some computer organizations (starting from IBM System 360), a short instruction occupied half of a word, that is 16 bits. In such cases two half instructions can be packed into a single word. If two short instructions are packed into one word, we have the benefit of fetching two instructions in one memory access. In order to find the address of these instructions, there is some amount of book-keeping overhead to determine if the next instruction to be executed is already in the IR or not.

All modern computers are byte addressable even though machines built in the 1960s were word addressable. In a byte addressable machine, every byte in the RAM must be addressable. The price one pays for making the RAM byte addressable is the increased address length for addressing the same size RAM. An early word addressable computer that was popular in the 1960s was IBM 7094.

EXAMPLE 7.2

IBM 7094
    Year:                       1962
    Word length:                36 bits
    Memory:                     Word addressable; maximum size 32K words
    Number of index registers:  7
    Number of Op-codes:         185
    Memory cycle time:          2 microseconds (magnetic core memory)

214

Computer Organization and Architecture

In the 1970s byte addressability became popular as it was realized that many applications, particularly in business data processing, required manipulating character strings. Strings were packed in many words before storage and unpacked for processing. This slowed down processing. Short strings such as a single character had to be allocated a whole word which was wasteful. Thus, byte addressing became almost a universal standard during the 1970s for general purpose computers. Even though it increases the number of bits in the address field of an instruction, the convenience of byte addressing in writing both systems and applications software has induced computer designers to use byte addressing almost universally. We had 16 registers in the CPU of SMAC++. In contrast to this, let us suppose that there is only one register for all arithmetic and logic operations. Let us call this register accumulator register (ACC). All operations of this computer will be performed using this ACC register. Thus, the higher level language construct like C = A + B will translate into LOAD A ; ADD B ; STORE C ; A’s value is loaded into ACC B’s value is added to the contents of ACC ACC is stored in C

Such a computer organization is called single address computer because each instruction has (at most) one address. The second operand for diadic operations is implied to be in the ACC register. The assembly language programs for single addressed computers tend to be long; thereby occupying more RAM space and also making the program less readable. However, designing the CPU with a single ACC register made the hardware design simpler during the 1950s and 1960s when LSI chips were not available.
EXAMPLE 7.3

The instruction structure of IBM 7090, which was popular in the 1960s and had an accumulator for all arithmetic operations, was designed as follows:

    To address 32K words of memory                               : 15 bits
    To encode 185 Op-codes                                       : 8 bits
    To address 7 index registers                                 : 3 bits
    To address 4 different index registers in a single
    instruction                                                  : 12 bits

It had indirect addressing and one bit called indirect bit for indicating it. Thus the instruction structure in IBM 7090 was as shown in Figure 7.7.

[FIGURE 7.7  Instruction structure of IBM 7090.]


Characters were encoded using 6 bits per character (lower case letters were not used). Thus each word could accommodate 6 characters. An instruction format of this type is totally inappropriate for a modern processor. Can you list all possible arguments to show why it is inappropriate?

Another way to classify the instruction set of a computer system is based on the number of addresses associated with an instruction. There can be 0, 1, 2, or 3 addresses. Each address could be that of a register or a memory location in the RAM. In CPUs that contain multiple registers, most instructions have two addresses. One of the two addresses refers to an operand in a register and the second address could refer to a register or a memory location. Computers based on such CPUs are called two addressed computers. What percentage of the instructions in Chapter 6 are two addressed?

Consider the task of copying a contiguous set of memory locations (say 125 words) from one region in memory to another region. This task has to be performed when programs are relocated for compacting fragmented free spaces in memory or when a copy of a data set is sent from one program to another. An instruction designed to perform this operation of memory to memory transfer will have the following three fields:
1. Start address (in memory) of the data set to be copied.
2. Start address (in memory) of the place where the new copy is to be placed.
3. Word count of the data set.

An instruction of this kind is possibly the longest instruction, due to the two memory references contained in it. In IBM 370 type computers such an instruction existed. Find out how to do this operation in the computer you are using.
EXAMPLE 7.4

IBM 370 SS-type instruction (see Figure 7.8) Op-code: Function: Length of block: From address in memory: To address in memory: Total length of instruction: Word length: 8 bits Copy a specified number of bytes from one place in memory to another place 8 bits 16 bits (4 bits base register + 12 bits displacement) 16 bits (4 bits base register + 12 bits displacement) 48 bits 32 bits

FIGURE 7.8  IBM 370 instruction formats:

    RR type (16 bits):  opc | Ra | Rc
    RX type (32 bits):  opc | Ra | Rx | Rb | Disp-b
    RS type (32 bits):  opc | Ra | Rc | Rb | Disp-b
    SI type (32 bits):  opc | Imm | Rb | Disp-b
    SS type (48 bits):  opc | L | Ra | Disp-a | Rb | Disp-b

    opc            : 8-bit op-code field
    Ra, Rb, Rc, Rx : 4-bit register address
    Disp-a, Disp-b : 12-bit displacement
    Imm, L         : 8-bit integer constant
    Rx             : refers to use as index register
    Rb             : refers to use as base register
    Ra, Rc         : refers to use as operand register

We can treat an arithmetic instruction like ADD as a zero address instruction if the addresses of the two operands and the result are implicitly known to the implementing hardware. In stack machines (like HP 3000) there are special instructions called stack instructions. For example, an ADD instruction of this type pops the top two elements of an arithmetic stack, adds them, and pushes the result back into the stack. Thus, all the required addresses are implied in this case. Besides stack instructions, a stack machine will have other instruction formats to load from and store into memory.
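A zero-address ADD of this kind can be sketched as follows; the array-based stack representation is an assumption made for the illustration.

    #include <stdio.h>

    int stack[64];
    int top = -1;                      /* index of the top element          */

    void push(int v) { stack[++top] = v; }
    int  pop(void)   { return stack[top--]; }

    /* Zero-address ADD: both operand addresses and the result address are
       implied — pop the top two elements and push their sum.              */
    void add(void) {
        int b = pop();
        int a = pop();
        push(a + b);
    }

    int main(void) {
        push(3); push(4);              /* operands come from the stack      */
        add();                         /* no address fields needed          */
        printf("%d\n", pop());         /* prints 7                          */
        return 0;
    }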

7.4 ADDRESSING MODES

Every instruction, except the immediate type instruction, is associated with addresses that point to an operand (data) or another instruction. Conditional jump, unconditional jump and subroutine CALL instructions contain the memory address of another instruction. Arithmetic, logic, and shift instructions address one or more data items which may be located in memory or in registers. Thus, addressing is important for using computers. A memory system with a 32-bit MAR can address 4 GB of memory. However, with today's technology most personal computers contain close to a GB of memory. If we have a 24-bit address field in the instruction format, we can directly address 16 MB of RAM. After allowing for the Op-code field and the register reference field, a 32-bit instruction length may not have 24 bits available for the address field. The remaining part of memory must be addressable based on addressing modes other than direct addressing. In this section, we will introduce


some new modes of addressing such as base addressing, segmented addressing and relative addressing, and review indirect addressing. Recall that in Chapter 6 we studied four elementary addressing modes:

1. Direct Addressing
2. Indexed Addressing
3. Immediate Addressing
4. Indirect Addressing

An indexed address to memory was obtained as follows:

    Effective address = contents of the index register specified
                        + displacement value contained in the address
                          field of the instruction

7.4.1 Base Addressing

Because the index register contains a 32-bit value, the effective address could be 32 bits long, thus allowing 4 GB addressability. This increases the addressable memory range. However, remember that index registers were introduced not for increasing the address range but for facilitating array-oriented calculation. We do not wish to mix these two roles. Hence, in IBM 370 computers, the concept of base addressing was introduced. This feature requires special registers called base registers. The effective address in the case of base addressing is calculated exactly like indexing:

    Effective address = contents of the base register referenced
                        + the value in the displacement field

The base register contents are used as the reference point or the origin, relative to which the displacement field specifies the memory address (see Figure 7.9). The procedure by which the effective address is calculated is the same both for indexing as
FIGURE 7.9 Base register usage. (The displacement 0016 in the IR is added to the base register contents 24A0B1D2 to form the effective memory address 24A0B1E8.)


The procedure by which the effective address is calculated is the same for indexing as well as for base addressing. In both cases, an integer value contained as part of the instruction is added to the contents of a CPU register. This addition cycle is the extra time overhead in the execution of an instruction when indexing or base addressing is used. The base register contents give the base address. A 32-bit base register can point to any address in the 4 GB (2^32) address space as the base address. The effective address generated in base addressing is relative to the base address. For example, in Figure 7.9 the displacement value of 16 refers to the 16th word from the origin 24A0B1D2, which is the base address. When the base value is changed, say to 24A0C90, the effective address will refer to the 16th word from the new base value. Base addressing is extremely useful when programs are relocated in RAM. Relocation of programs is very common in multiuser or multitasking operating systems.
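The address arithmetic of Figure 7.9 can be verified with a small sketch (hexadecimal values taken from the figure; this only illustrates the addition, not the IBM 370 hardware):

    #include <stdio.h>

    int main(void) {
        unsigned int base_reg = 0x24A0B1D2;   /* base register contents */
        unsigned int disp     = 0x0016;       /* displacement field in the IR */
        printf("%X\n", base_reg + disp);      /* prints 24A0B1E8 */
        return 0;
    }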

7.4.2 Segment Addressing

In Pentium-based computers, a variant of base addressing called segmented addressing is used. The similarity lies only in the fact that the contents of a register are added to the displacement; the purpose of segmented addressing in Pentium is quite different from that of base addressing in the IBM 370 computers. In this case the large address space (2^32) is segmented to serve logically different purposes, such as separately storing the instructions or code, data, stack, etc. There are six segment registers (CS, DS, ES, SS, FS, GS) in the CPU of Pentium and they are 16 bits long. The segment register is used to select one of the segment base addresses stored in a special area in memory called the Segment Descriptor Table. The segment descriptor table is managed by the Operating System. The selected segment base address is then added to the displacement (which is 32 bits in the case of Pentium) to get an effective address that is used to access the operand or instruction in RAM. Note that the effective address is also 32 bits long. This is depicted in Figure 7.10.

FIGURE 7.10 Use of segment registers in Pentium.


7.4.3 PC Relative Addressing

Let us turn our attention to another addressing mode known as PC relative addressing or simply relative addressing. Consider the part of a program shown below:

    016   COMPARE  A, B
    017   JMIN     *+2     ; relative address; * means the address of this instruction
    018   CALL     XYZ
    019   OUT      A
    01A   ...

The instruction at address 016 compares two data values A and B, and the instruction at address 017 causes a jump to the address 017 + 2 = 019 if the result of the comparison is negative. The symbol * refers to the current PC value, that is, the address where the instruction containing the * reference is located in RAM; in this example it is 017. The relative address +2 is with reference to the current PC value. This addressing mode has two major benefits. First, if the above program is relocated somewhere else in RAM, the *+2 (PC-relative reference) would work correctly without any adjustment to the address field. Secondly, the address field has to store only a small constant like +2, which requires fewer bits. For example, if 8 bits are used to store such a constant and it is treated as a signed 2's complement integer, the relative address can range from -128 to +127. A positive constant refers to a forward reference from the PC value, whereas a negative constant refers to a backward reference from the PC value. One of the early computers from the Digital Equipment Corporation (DEC), called the PDP 11, popularized this addressing mode.
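The address arithmetic is simple enough to sketch (illustrative C; the 8-bit signed displacement corresponds to the case discussed above):

    /* PC-relative target = current PC + signed displacement; with an 8-bit
       two's complement displacement the reach is -128 to +127. */
    unsigned int target_address(unsigned int pc, signed char disp) {
        return pc + disp;   /* e.g. pc = 0x017 and disp = +2 give 0x019 */
    }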

7.4.4 Indirect Addressing

The final addressing mode that we will learn is known as indirect addressing. In this mode of addressing, the address field of the instruction points to a location in memory where the address of the data item (not the data itself) is stored (see Figure 7.11).

FIGURE 7.11 Indirect addressing. (The instruction's address field contains 2A00; the RAM location 2A00 contains 3B10, which is the address where the data is stored.)


Thus, data access will require not one, but two memory accesses. The first memory access yields the address of the data item, using which the memory is accessed again to fetch the data. What does one gain for this time overhead paid as the price of indirect addressing? Suppose the instruction shown in Figure 7.11 is part of an already compiled program, and the data item has to be located at the address 3B10 for one execution and at another address, say 4B10, for the next execution. If we have addressed the data indirectly through the address 2A00, there is no need to modify the program at all. For the second execution one simply changes the contents of the memory location 2A00 from 3B10 to 4B10. This kind of requirement arises often while writing system software, such as operating systems and I/O drivers. In the description given above we have used indirect addressing through another memory location (2A00 is an address in RAM). The indirect addressing could very well be through a register; it is then called register indirect addressing. In this case we address a data item through a register, i.e. the specified register contains the address of the data item and not the data itself. The concept of indirect addressing can be recursively extended beyond level 1. That is, in the example shown in Figure 7.11, the location 2A00 may be made to contain another address where the data is stored. Beyond one level, however, human understanding and tracking become quite difficult, and such multilevel indirection is mainly of theoretical value.
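The two accesses can be made explicit in a sketch (illustrative; memory is modelled as an array of words and the addresses are those of Figure 7.11):

    /* Indirect addressing: the instruction's address field points to a
       location that holds the operand's address, so two accesses occur. */
    unsigned int ram[0x10000];

    unsigned int fetch_indirect(unsigned int addr_field) {
        unsigned int data_addr = ram[addr_field];  /* first access, e.g. ram[0x2A00] == 0x3B10 */
        return ram[data_addr];                     /* second access fetches the data itself */
    }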

7.4.5 How to Encode Various Addressing Modes

In Table 7.1, we have summarized the various addressing modes. The role played by each addressing mode is also indicated in this table. Note that some roles are played by more than one addressing mode. Not all addressing modes will be present in a particular computer system. If the indirect addressing feature is not present in a system, its effect can be simulated through software means, although the simulation will be much slower than an equivalent hardware implementation. A computer designer should first decide which addressing modes are to be included in the hardware design. The next question is: how are they to be encoded in the instruction format?
TABLE 7.1 Summary of Addressing Modes

Addressing Mode    Role Played
1. Direct          Simple and fast
2. Indexed         For iterative vector type operations
3. Immediate       Readability, faster access to data, small data range, better usage of bits
4. Base            Extending addressable range; facilitates relocation of programs
5. Segmented       Logical separation of the program from data and stack contents
6. PC relative     Better use of bits, facilitates relocation, shorter range of relative address
7. Indirect        Location independence, system software support


The IBM 370 designers decided to uniformly associate base addressing with every memory address. Thus, except for the RR format (see Figure 7.8), all other formats have a 4-bit field to refer to the base register. The indexing mode is associated with the RX format, and only with the first operand. Since any of the 16 GPRs in the IBM 370 can be used as an index register or as a base register, 4 bits are reserved in the instruction format for such register references. The register R0 is assumed to always hold the constant zero. Thus, 0000 in the index field or the base register field of an instruction denotes the absence of indexing or base addressing for that instruction. In this design, the encoding of the Op-code field is kept separate from the encoding of the index and base address fields. Thus, every instruction in the instruction set can benefit from the indexing and base addressing modes. The Pentium processor did not include base addressing. For indexing, only two special registers (ESI and EDI) are used. Indirect addressing is restricted to indirection through a register's content and not through a memory content. In summary, two design decisions are involved:

1. A designer has to select a subset of addressing modes. (The goal is to provide flexibility and power to programmers.)
2. A designer has to encode the fields in an instruction so that the hardware can decode which addressing mode(s) are used in that instruction and appropriately compute the effective address for fetching the operand or another instruction from the RAM.
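The R0-always-zero convention makes the effective address computation uniform, as the following sketch shows (illustrative C under the stated convention; the names are assumptions):

    /* RX-format effective address: displacement + base + index. A field
       value of 0000 contributes nothing because regs[0] is kept at 0. */
    unsigned int regs[16];   /* regs[0] is always 0 */

    unsigned int rx_effective_address(int x, int b, unsigned int disp) {
        return disp + regs[b] + regs[x];   /* 12-bit disp, 4-bit b and x fields */
    }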

7.5 REGISTER SETS

Every CPU has a set of registers. The IBM 370 had 16 registers, another popular computer of the 1970s called the PDP 11 had 8 registers, the INTEL 80486 has 12 registers, and the Power PC has 32 registers. There is no universal agreement among designers and manufacturers about the number of registers that must be in a CPU. In fact there is no agreement even on how these registers should be partitioned or grouped. For example, the IBM 370 and Power PC group them all as one set of GPRs. The Pentium partitions them based on their functions into three sets: arithmetic registers (EAX, EBX, ECX, EDX); segment registers (CS, DS, ES, SS, FS, GS); and index and pointer registers (ESI, EDI, EBP, ESP). At the other extreme, the PDP 11 has the flexibility of allowing the PC to be an explicitly addressable register available to programmers. An assembly language programmer has direct control over the various registers in developing programs. On the other hand, a programmer using a higher level language like JAVA does not have direct control over the use of registers, because the compiler is responsible for register allocation and management. A computer designer aims at providing a sufficiently large and flexible set of registers to facilitate efficient register allocation and management. Registers are used as short-term memory to store data values or address values during a computation.


Accessing a register is one to two orders of magnitude (10 to 100 times) faster than accessing a memory location. Thus, one might believe that there should be a very large number of registers in a CPU. However, the registers in a CPU have to be effectively allocated and managed, either by a programmer or by a compiler. Also, when the CPU switches from one process to another, all the registers must be saved in memory so that they are free for use by the newly started process. The time needed to save registers can become a significant part of the 'process-switching time' if the number of registers is increased without limit. Yet another issue is that each register has to be connected through hardware and buses (data and control paths) to other registers as well as to other parts of the computer system like the ALU. As the number of registers grows, this connection complexity also grows rapidly. The RISC design philosophy believes in providing a large number of registers (32, 64, 128, etc.) and manages the complexity through regular structures in the VLSI design. The register allocation and management responsibilities are mostly carried out by well-designed compilers. To help in these activities, the instruction formats are kept simple and the instruction set is kept small. The R4000 CPU from a company called MIPS and SUN Micro's SPARC architecture belong to this family of design. For certain types of computer applications called real-time applications, process switching has to be extremely fast because the external stimuli (coming from outside the computer) require very fast response. Computer organizations like SPARC provide more than one set of identical CPU registers. When an external interrupt occurs, there is no need to save the register contents; instead the CPU will leave one set of registers intact and use another set of registers. If there are two sets of registers, say S0 and S1, the CPU can switch between two processes P0 and P1 very fast. Moreover, if the address spaces of S0 and S1 overlap, as shown in Figure 7.12, the two processes using these two sets can communicate (data or control information) with one another by storing items in the shared address space (see the shaded region in Figure 7.12).

FIGURE 7.12 Two sets of overlapping CPU registers. (Process P0 uses register set S0 and process P1 uses set S1; the two sets overlap in a shared region.)


In some other computer systems, special instructions are provided to store (or to load) the CPU registers all at once. Thus by executing a single instruction, multiple registers can be saved in memory in contiguous locations, thereby reducing the process-switching time.

7.6 CLOCKS AND TIMING

The clock is a timing device needed to control the sequential activities that take place in a computer system. In a block diagram, we represent the clock as a rectangle with no input and a single output (see Figure 7.13). It produces pulses at regular intervals called cycles. Each cycle has an ON period followed by an OFF period. In Figure 7.13(a) we have shown these two periods to be equal. There is a rising edge and a falling edge in each cycle. At the rising edge, the clock signal changes from the low signal level (0 volts) to the high signal level (5 volts); at the falling edge the reverse takes place. In Figure 7.13(a), we have shown these changes as 'instantaneous', which is not true in practice. As shown in Figure 7.13(c), the signal takes a finite time (tr) to rise from low to high and a finite time (tf) to fall from high to low. When tr << T we approximate the rise time by zero and assume the change to be instantaneous. If a state change is initiated at the rising edge, this change will in reality be initiated only after the time period tr. In Figure 7.13(b) we have shown how the pulse appears when differentiated by an electronic circuit. The polarity of the differentiated pulse can be used to easily detect the rising and falling edges.
FIGURE 7.13 Clock pulses. ((a) Pulses: the signal alternates between +5 volts (ON) and 0 volts (OFF), one ON-OFF pair forming one cycle. (b) The pulse after differentiation. (c) The rise time tr and fall time tf of a pulse.)

While studying time varying systems, we encounter two terminologies: 'events' and 'states'. Events occur and their occurrence causes changes in the state of the system. An edge of a clock pulse (rising or falling) can be viewed as an event, and the level of the signal (low or high) can be viewed as a binary state. As shown in Figure 7.13(b), the edge information can be extracted from a clock pulse through a device that performs differentiation.

Consider the example of an adder shown in Figure 7.14. There are 3 registers, A and B acting as inputs to the adder and C as the output register. The registers A, B and C hold data, and data flows from one place to another: it flows from A to the adder (also from B to the adder) and then from the adder to C. In Figure 7.14 solid lines with arrows are data paths and dotted lines with arrows are control signals. The control signals start and load (shown above the dotted lines in Figure 7.14) initiate the pre-determined operations. Which control signals should trigger which events is pre-determined by the designer, and this information is encoded in the control flow. Let us assume the following sequence of actions, so that the adder will perform its function within a system:

1. Before the event t1 the input registers A and B are loaded with data.
2. At t1 (t1 = start) the start event occurs, which lets the adder start its adding operation.
3. The adder takes a maximum of (t2 - t1) units of time to perform the addition.
4. At t2 (t2 = load) the result of the addition is available from the adder and it is loaded into the register C.

FIGURE 7.14 An adder with timing. (Registers A and B feed the adder and register C receives its output; the control signals start and load occur at t1 and t2.)

Through this example, we have introduced the need for data and control at a level lower than the programming level. In the following two sections, we will study more about data flow and control flow and the paths they need, which are known as the data path and the control path.

When we use a clock, there is an inherent sequencing mechanism: the event t2 occurs after the event t1. The elapsed time between t1 and t2 is determined by the clock frequency. Suppose, in this example, we use a 10-MHz clock. For a 10-MHz clock the clock period will be 100 ns, or one-tenth of a microsecond. If the adder is fast enough to produce its output in a time less than 100 ns, the two events t1 and t2 could be timed with the rising and the falling edges of one clock pulse. However, if the adder requires 400 nanoseconds, this cannot be done. We can use a counter that will count 4 clock pulses and generate the t2 event at the trailing edge of the 4th clock pulse. Note that 400 ns is an integral multiple of the clock period (100 ns). This is shown in Figure 7.15. The counter's behavioural specification is as follows:

Counter 4 specification
1. Start counting the clock pulses after the 'start' signal is given.
2. The counter value is incremented at the rising edge of the clock pulse.
3. After four clock pulses starting at t1, that is, at the fifth occurrence of the rising edge, give the 'counter output signal'.
4. Immediately after that, reset the counter to zero and wait for the next occurrence of the start signal.

FIGURE 7.15 Using a counter for timing. (A counter driven by the clock and a start signal counts 0, 1, 2, 3 and raises the counter output at t2; with a 100 ns clock the output comes 400 ns after t1.)

For generating the various control signals in the CPU, we will need many such counters and a master clock.
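Before moving on, the Counter 4 specification can be mirrored in a small C model (an illustrative sketch only; the function name and calling convention are assumptions, with the function called once per rising clock edge):

    /* A minimal model of the Counter 4 specification. */
    int count = 0;

    int on_rising_edge(int started) {
        if (!started) return 0;   /* count only after the 'start' signal */
        count++;                  /* incremented at the rising edge */
        if (count == 4) {         /* fourth pulse after start */
            count = 0;            /* reset and wait for the next start */
            return 1;             /* raise the counter output (event t2) */
        }
        return 0;
    }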

We have specified the counter behaviour using an informal language. This could be written in a formal language like HDL (as in Chapter 5) to avoid any ambiguity. From such unambiguous specifications, hardware circuits can be synthesized and tested.

7.7 CPU BUSES

Buses are a set of parallel wires used to connect different subsystems of a computer. Data flow and control flow between subsystems can take place through buses. Buses are slower when compared to dedicated data and control paths, but are flexible and cheaper. Buses may be broadly classified into two types: those inside a CPU chip are called internal buses, and those outside it are called external buses. Internal buses are conducting lines which are used to connect the ALU with a set of registers and on-chip memory (called caches, which we will describe in Chapter 9). External buses are a set of parallel wires printed on an electronic printed circuit board. External buses connect the CPU chip to the external memory and I/O devices. External memory, graphics chips, I/O interface chips, etc. are plugged into this bus. The organization of I/O buses is quite different because I/O devices are slow and there is a large variety of I/O devices. They will be dealt with in Chapter 10. We will mainly concentrate on internal buses in this chapter, and also describe some aspects of the CPU-Memory bus in this section. There will be some repetitions, in general; we make the two discussions self-contained to avoid frequent cross referencing.

Several subsystems share a bus as the common link for the exchange of data or control. The positive aspect of using a bus for interconnection is its low cost and flexibility: new connections can be established easily. When a shared bus is used as a common link for interconnection, its allocation and control have to be managed by proper protocols, because a bus is shared by several subsystems and only one of them can control it at a time. Each bus design embeds a built-in procedure for determining who becomes the sender. This procedure is known as the bus arbitration protocol. A bus in a processor board, in general, consists of three groups of wires, called address lines, data lines, and control lines; they are symbolically shown in Figure 7.16. The control lines are used to select who sends data on the bus and who all will receive it from the bus. A meaningful activity on the bus is called a bus transaction. A transaction has two sub parts: sending the address on the bus, and reading from or writing data in that address.

Buses are classified as synchronous buses and asynchronous buses. In synchronous buses, everything takes place at precise time instants specified by the designer of the bus. The time instants are defined by a clock. This clock is transmitted through one of the control lines. The transaction protocol of a synchronous bus is straightforward and it is hardware realizable as a simple finite state machine.

FIGURE 7.16 A bus connecting multiple subsystems. (The figure shows address lines, data lines and control lines, 80 lines in all, with terminations at both ends and subsystems A, B and C tapped onto the bus.)

An asynchronous bus, on the other hand, does not have a common clock and it uses a hand-shaking protocol. In Figure 7.17, we have shown an abstract representation of the handshake protocol. This protocol can be viewed as a pair of finite-state machines working on two independent clocks and communicating with each other. The communication assures that one machine does not proceed beyond a particular state until the other machine has reached an anticipated state.

FIGURE 7.17 Handshake protocol. (A sender and a receiver exchange request, response and acknowledgement signals.)

Let us consider the processor-memory connection. If we have to make a tradeoff between speed and flexibility in connecting the processor with memory, we will favour speed, because every execution of an instruction involves one or more exchanges between CPU and memory. Thus, the speed of the bus is very critical. The buses used for connecting the processor and memory are synchronous high speed buses. In Figure 7.18, we have shown the interconnections between the CPU and memory of a subset of the hypothetical computer SMAC+ (described in Chapter 6) using a 'single bus'.
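The essence of the handshake protocol of Figure 7.17 can be modelled as two cooperating routines (a deliberately simplified single-transfer sketch; the flag names request, response and acknowledgement follow the figure, and a real bus protocol involves more states and timing rules):

    /* Each side advances only after seeing the other side's signal. */
    volatile int request = 0, response = 0, acknowledgement = 0;

    void sender(void) {
        request = 1;                 /* assert request */
        while (!response) ;          /* wait for the receiver's response */
        acknowledgement = 1;         /* complete the transaction */
    }

    void receiver(void) {
        while (!request) ;           /* wait for a request */
        response = 1;                /* respond, e.g. data is ready */
        while (!acknowledgement) ;   /* wait for the acknowledgement */
    }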

Observe that we have shown only two registers R0 and R1 (instead of all 16 registers) in this figure, primarily to illustrate the basic ideas. We will call an interconnection of components by means of one or more buses, like the one in this figure, a micro-engine. Why we chose this name will become clear from our discussions in the following sections. The two registers R0 and R1 in this figure are connected to an internal bus by means of the control signals (C1, C2) and (C1', C2') respectively. Note that there are several bus control signals, denoted C1 through C13. Only one bus control signal coming into the bus can be asserted at a time, whereas any number of data flows going out from the bus can be simultaneously asserted. This signifies multiple simultaneous transmissions from one source to several destinations; the bus can carry only one set of signals at a given instant of time. In this figure, if C8 and C10 are asserted (that is, turned ON), the contents of PC (also known as IP) are transferred to MAR through the bus. Observe also the broken lines, which signify commands: read and write, which initiate actions, and add, subtract, mult, etc.

FIGURE 7.18 An example of micro-engine with a single bus.

The processor-memory buses are generally proprietary buses and they are specific to a particular computer system. An I/O bus is attached to such a bus through a bus adaptor or interface chip. Some designers wish to minimize the number of tappings made to connect a bus adaptor to the high speed processor-memory bus.

With the development of microprocessors and personal computers, several standards have come into existence in the design and re-use of I/O buses and backplane buses. They will be described in Chapter 10.

Multiple buses are needed for simultaneous independent data transfers, because only one source can transmit data on a shared bus at a time. Instead of a single bus we could use multiple buses to facilitate simultaneous activities, that is, activities in parallel.

7.8 DATAFLOW, DATA PATHS AND MICROPROGRAMMING

In a computer system data is stored in one of three storage entities, namely, flip-flops (flags), registers, or RAM. (We exclude mass storage like disks at this stage.) During a computational process, data flows from one place to another, gets processed, or is reset to an initial value. For our understanding, we will denote this dataflow and data storage in the form of an abstracted graph and call it a dataflow graph. The nodes of such a graph (represented by circles) indicate storage elements, and a directed arc from node i to node j indicates the existence of a path for the data in i to flow to j. Observe that Figure 7.18 can be redrawn as a dataflow graph (see Figure 7.19).

FIGURE 7.19 A partial dataflow graph for SMAC+ instruction subset S1.

A directed arc may or may not be labelled. When an arc is labelled, the name indicates a partial dataflow in which part of the data contained in the source is transferred to the destination; for example, the label 'addr' indicates the address part of the register, and so on. Although a path for dataflow may exist between i and j, and node i may contain the data ready to flow, the data will actually flow only when the appropriate control signal permits such a data flow from node i to node j. Thus, the data path is necessary for data flow but it is not sufficient. The sufficiency condition is met when the corresponding control signal is asserted (see Figure 7.18).

To draw such a graph, let us consider the subset S1 of the instructions of SMAC+ that we designed in Chapter 6, given below:

HALT, LOAD, STORE, ADD, JUMP, JMIN, INPUT, OUTPUT

Further, for the sake of simplicity, we will assume the existence of two registers called INBUF and OUTBUF, such that the INPUT and OUTPUT instructions simply transfer data between these buffer registers and their respective I/O devices. The INBUF is assumed to hold the right data received from the input device. Similarly, when a transfer to OUTBUF takes place, the previously stored data in OUTBUF is assumed to be already consumed, so that the buffer is empty and ready to receive the data being moved. We make all these assumptions so that the dataflow graph can be simple enough for presentation in a textbook.

With the above assumptions we have drawn Figure 7.19. This is a preliminary dataflow graph to implement the instruction set S1. We have used the following conventions for the figure:

Circles: denote registers, i.e. sets of flip-flops (i.e. storage)
Rectangles: functional units like memory
Small squares: denote flip-flops
Rectangle with V groove: denotes a two-input adder (X + Y = Z)

The nodes are labelled with appropriate register names, flag names or functional-unit names. In Figure 7.19, a reader may notice that a dotted line in a path is used to indicate that it performs a control operation, like selecting read or write or incrementing the PC by 1; this is not part of the dataflow diagram. The reader is urged to verify that the micro-engine shown in Figure 7.18 can be modified to support the data paths shown in the graph of Figure 7.19.

Recall that there are two distinct sub-cycles in the execution of every instruction, namely, the fetch cycle and the execution cycle. The fetch cycle involves four ordered operations, which are:

1. Transfer the PC to MAR
2. Issue memory read
3. Increment the PC by 1
4. Transfer the MBR to IR

The interesting point to note is that the control signals have to be asserted in an orderly fashion, one after another in a sequence.

Note that after the memory read operation is completed, the MBR will contain the instruction to be executed. This sequence can be expressed in the form of a step-by-step program. We will call such a program a microprogram. We introduced a hardware description language in Section 5.3; we will use the same notation to represent microprograms in what follows.

Program 7.1: Microprogram for fetch cycle

T0 : MAR ← PC
T1 : Memory Read
T2 : PC ← PC + 1
T3 : IR ← MBR

Each step is labelled with a time T0, T1, etc. Each label represents the time of occurrence of a pulse. Every step in the microprogram is called a micro-operation. The sequencing implies that an operation at T1 takes place after the operation at T0 is completed. The time taken for an operation equals the clock period. Every micro-operation is executable if and only if there is a data path in the micro-engine that will facilitate that data flow. A micro-operation can be executed only when the control signal corresponding to that data flow or operation is asserted.

There are 4 micro-operations in the above microprogram. The designer's objective is to make the microprogram as efficient as possible; in this case the microprogram can be considered efficient if it takes the least number of steps. Then we ask: "Can two or more micro-operations be executed in parallel without any conflict?" In the microprogram shown above, the operations at T1 and T2 can be executed in parallel without any conflict. The step at T2 does not cause any explicit data flow. We will rewrite the above microprogram in 3 time steps as shown below. Note that a comma is used to denote that those micro-operations can be executed in parallel as one single step.

Program 7.1(a): Fetch cycle with parallel operations

T0 : MAR ← PC
T1 : Memory Read, PC ← PC + 1
T2 : IR ← MBR

It should be noted that in order to execute two or more micro-operations in parallel, they should not be conflicting and it should be meaningful to perform these operations simultaneously. Data from two different sources cannot be routed onto the same bus at a given time. Similarly, data from two different sources cannot be simultaneously routed into the same destination register. In some cases, a parallel execution of two micro-operations can be done meaningfully only if two independent simultaneous data paths are available in the micro-engine. Thus, multiple buses will be useful.

We will now consider loading a register from memory. The symbolic code, machine code and semantics of LOAD are:

Symbolic code: LOAD R0, M
Machine code: 02 0 012FF (OP, Reg, Addr)
Semantics: Load into register R0 the contents of memory address 012FF

The microprogram to execute this instruction is given as follows.

Program 7.2: Microprogram to load register R0

T0: MAR ← Address part of IR  /* To read from memory */
T1: Memory Read
T2: R0 ← MBR  /* Contents read are stored in R0 */

Observe that we have enclosed comments between /* and */. A similar microprogram may be written to load register R1. Having done these, we can add. The add instruction is:

Symbolic code: ADD R0, R1
Machine code: 01 01 0000
Semantics: Add contents of R1 to R0 and leave the result in R0

The microprogram is:

Program 7.3: Microprogram to add R1 to R0

T0: X ← R0
T1: Y ← R1
T2: Add control signal to ALU
T3: R0 ← Z  /* The result stored in R0 */

The microprogram corresponding to the execution cycle of the JUMP instruction is the simplest of all: PC ← Address part of IR. However, for the JMIN (jump on minus) instruction the data transfer 'PC ← Address part of IR' will be conditional upon the N (Negative) flip-flop being in the ON state. Such a conditional transfer can be achieved using AND gates. Note that when an arithmetic operation results in a negative number, the ALU will set the N flip-flop ON. It will remain ON until it is reset. The microprogram to execute this instruction is given below:

Program 7.4: Microprogram for the execution of JMIN instruction

T0: if (N) then PC ← Address part of IR
T1: Reset (N)  /* The flag N is reset */

We give below the microprogram to store R0 in memory.

Program 7.5: Microprogram to store R0 in memory

T0: MAR ← Address part of IR  /* To write the data into RAM */
T1: MBR ← R0
T2: Memory WRITE

We have considered only two registers so far. As we have 16 registers in SMAC+, we need a decoder to select one of the 16 registers. Now let us consider the execution cycle of the INPUT and OUTPUT instructions. The hardware has to ensure that indeed the right input has been put in the INBUF when INPUT is executed; else the microprogram should keep the system in a wait state till the data is ready in INBUF. In real life this may not be the case; we will reserve those details to the chapter on I/O organization. For simplicity we will assume that when the INPUT instruction is executed, the INBUF contains the right data ready to be transferred to the RAM. Similarly, when the OUTPUT instruction is executed we will assume that the OUTBUF is free to receive the data being transferred from the RAM into the OUTBUF. It means that whatever was put into the OUTBUF before has been consumed by the intended receiver and the buffer is free to receive new data.

Program 7.6: Microprogram for the execution cycle of INPUT instruction

T0: MAR ← Address part of IR  /* To store the data into RAM */
T1: MBR ← INBUF  /* Data is transferred to MBR for storing */
T2: Memory WRITE

In the above program the two micro-operations at T0 and T1 are independent of each other and they can be executed in parallel. However, in the micro-engine shown in Figure 7.18 there is only one bus, which prevents this parallel execution. We may redesign the micro-engine with two buses. If we were to redesign a micro-engine with two buses instead of one, we can connect MBR, INBUF, OUTBUF, IR and PC to bus b2 and not change the connections to the existing bus, which we call b1. This will allow independent operation of the ALU and RAM. Drawing the micro-engine with two buses is left as an exercise to the students. Note that now we have two buses, and we need to specify with each data transfer which bus is used. This is done in our programs by appending the bus label to the left arrow. Program 7.6 is now rewritten using the two buses as Program 7.7.

Program 7.7: Microprogram for input instruction with 2 buses

T0: MAR ← (b1) Address part of IR, MBR ← (b2) INBUF
T1: Memory WRITE

Similarly, we could speed up the other microprograms; Program 7.5, for example, can also be rewritten and will use only two steps with two buses. Working out the details of this is left as an exercise to the students.

7.9 CONTROL FLOW

In a computer system, data flow and control flow are complementary to each other.

An electronic signal is viewed as data or control solely based on the manner in which it is used or interpreted. The term control flow is used to signify a sequence of elementary operations that are individually controlled. Abstractly we can denote a sequence of operations as an ordered sequence of pairs <d1 c1> <d2 c2> <d3 c3> … <dn cn>, where di denotes the data flow associated with the ith operation and ci is the corresponding control signal. The sequence d1 d2 … dn can be viewed as the dataflow and c1 c2 … cn can be viewed as the control flow. For the ith operation to take place, its data di must be ready and its control signal ci must be asserted. Recall that the term asserted is used synonymously with setting a flip-flop to the ON condition. The sequence implies that operation i is completed before operation i + 1 is started.

The simplest form of control is exercised in selecting one out of several options. The need for selection occurs very often in a computer system. We need to select one out of several registers when there are multiple registers; we need to select one out of several I/O devices for an I/O operation; and we need to select one out of several ALU functions from a set such as {add, subtract, compare, multiply, divide, shift, …}. An integrated circuit called a multiplexer (described in Chapter 3) is shown in Figure 7.20 as a black box.

FIGURE 7.20 A multiplexer. (The inputs a0, a1, a2, a3 enter the MUX; the output P = ai, where i = 0, 1, 2 or 3 as determined by the two control bits C1C0, so the output is the selected input.)

In this abstract specification, we have not discussed the atomicity of an operation, or its granularity. An operation is said to be atomic if the operation is either completed in full or not started at all; there is no possibility of doing it partly. We will examine granularity briefly in what follows. Consider a High Level Language (HLL) like Java or C++ and the programming constructs available in it. We will focus on three constructs:

1. Assignment (sequential flow of one statement after another is implied).
2. Conditional or IF … THEN … ELSE (selection type control).
3. Iterative or FOR loops, REPEAT-UNTIL, etc. (control is embedded in the computation either to repeat the loop or not).

The granularity of HLL instructions such as those given above is much larger than that of the machine or assembly language instructions. Each machine or assembly language instruction, such as LOAD A or JMP, can be treated as atomic, whereas blocks (enclosed between begin … end in a high level language) are too big to be treated as atomic at the hardware level.
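As a side note, the selection performed by the multiplexer of Figure 7.20 amounts to the following (an illustrative C sketch; the function name is an assumption):

    /* 4-to-1 multiplexer: P = a[i], where i is formed by the control bits C1 C0. */
    int mux4(const int a[4], int c1, int c0) {
        return a[(c1 << 1) | c0];
    }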

An instruction execution is further divided into two parts: (i) instruction fetch and decode, and (ii) instruction execution. The instruction fetch part and the instruction execution part of an instruction are realized through their corresponding microprograms. Let us suppose that the microprograms corresponding to the fetch cycle and the execution cycle of an instruction are executed without any interruption. In that case we ensure atomicity at the instruction level. This means the instruction is executed fully or not executed at all. The partial execution of the microprograms corresponding to machine instructions is excluded, and hence the partial execution of an instruction is not possible. This is what is generally accomplished at the hardware level.

Let us consider the dataflow while executing the data transfer 'PC ← address part of IR'. This data path is activated for both the JUMP and JMIN instructions. For the JUMP instruction it is unconditional and for the JMIN instruction it is conditional. We can define the following Boolean expression to capture the control needed for this data path:

Transfer = (JUMP + (JMIN . N))

The Boolean variables JUMP or JMIN will be TRUE when the corresponding jump instruction is executed, and the variable N (negative) would have been set to TRUE or FALSE by the ALU and stored in that flag. When this expression is TRUE, we need to activate this data transfer. In Figure 7.21, we have realized the above Boolean expression with the transfer-control signal, which indicates the time instant within the instruction cycle when this data path is to be asserted.

FIGURE 7.21 PC ← address in IR transfer for jump instruction. (The transfer control signal is derived from the JUMP, JMIN and N signals, and similarly from JEQ and EQ, and gates the transfer from IR to PC.)

From the above discussions we notice that Op-code decoding plays an important role in the generation of control signals. In Figure 7.22, we have divided the instruction decoding into three parts: an Op-code decoder, an addressing mode decoder and one effective address calculator for each mode of addressing.
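The same condition can be written in C-like form (a sketch; the JEQ and EQ signals are included by analogy with Figure 7.21):

    /* Transfer = JUMP + JMIN.N (+ JEQ.EQ for a jump-on-equal). */
    int transfer(int jump, int jmin, int n, int jeq, int eq) {
        return jump || (jmin && n) || (jeq && eq);
    }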

FIGURE 7.22 Use of decoders. (The IR feeds an Op-code decoder, producing signals such as add, sub, etc., and an addressing mode decoder producing index control, indirect address control signals, and the immediate-type IR → B data transfer.)

In order to generate certain types of control signals, we need to keep track of the timings of various inter-related events. This is achieved with the help of a clock. As an example, let us suppose that a clock runs at 20 MHz and one step of a microprogram is executed in one clock time. From the study of the microprograms given above, let us assume that the instruction fetch takes three clock cycles and that the execution of any instruction can be completed in a maximum of seven cycles. Then a counter modulo ten can synchronously generate a control signal to start the instruction fetch cycle, one instruction after another. The instructions can then be executed at a constant rate, with a fixed instruction execution time of 500 ns per instruction. Such a design is simple to implement but it is not very efficient: the resulting 50 ns per step may be too much time for some steps and too little time for other steps (for example, memory READ and WRITE). But let us ignore such fine details for now. In a real computer system both synchronous and asynchronous control need to be properly mixed for efficient implementations.
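To check the arithmetic behind these figures: a 20-MHz clock has a period of 1/(20 × 10^6) s = 50 ns; three cycles for fetch plus at most seven for execution give ten steps, and 10 × 50 ns = 500 ns per instruction, i.e. a constant rate of 2 million instructions per second.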

7.10 SUMMARY OF CPU ORGANIZATION

We have studied various aspects of a CPU in this chapter. Three main aspects are: (1) what makes the CPU of a computer system; (2) the instruction set view of a CPU, the semantics of each instruction, and how microprogramming is used to realize these semantics at the hardware level; and (3) the atomicity of operations at the instruction level and what it means at the microprogramming level. Also, we introduced the basic organization and the instruction set view of a computer using the hypothetical computer SMAC in Chapter 6, which was later expanded to SMAC++. In order to understand the CPU at the instruction set level, we list below 7 focus points. Under each focus point, we list some questions. Those questions are intended to enable a student to make a detailed analysis or study of the CPU of a given computer system. For the convenience of students, we have labelled these focus points T1 to T7.

T1. Registers
- How many registers are there?
- What is the register length?
- Are they general purpose registers or are some of them special purpose registers?
- What are the special roles played by some registers?

T2. Memory
- How large is the memory address space supported?
- What is the basic unit of addressability (byte/word)?
- What are the memory cycle time and the memory bandwidth (number of bytes per access)?
- How is the effective address calculated to access the RAM?

T3. Addressing Modes Supported
- What are the various addressing modes supported at the hardware level?

T4. Instruction Format
- What are the instruction formats used?
- Review the calculation of the effective address for each addressing mode using the specific registers of this computer.

T5. Instruction Set
- Classify the instruction set based on different criteria such as single address, double address, register reference, memory reference, and operational characteristics (ARITHMETIC, TEST, JUMP, etc.).
- How large is the instruction set?
- Examine the instruction set to learn about its versatility, ease of use, speed of execution, etc.

T6. Support for Stacks
- Stacks are very useful for programming, in compiling and in OS run-time operations. How are stacks supported in the design?

T7. Evolution
- Is the CPU (computer system) a member of a family of an evolving set of CPUs (computers)? If yes, what compatibility objectives are maintained from generation to generation?

SUMMARY

1. In Chapter 6 the CPU organization of a hypothetical computer called SMAC++ was given. While SMAC++ provided a simple basic organization, real commercial computers over the years have had a variety of instruction formats in their CPUs.
2. It is possible to design computers in which the number of bits assigned to the Op-code is variable. Some instructions use a smaller number of bits, allowing more bits for addressing memory.
3. It is also possible to design instruction sets in which instruction lengths can be 1 word, 2 words, half word and one and a half words. This is primarily done to optimize memory utilization.
4. Besides direct, indexed and immediate addressing, there are several other addressing modes. They are: base addressing, segmented addressing, PC relative addressing and indirect addressing. Base addressing is useful to increase the addressable space, segmentation is used by the Operating System, PC relative addressing allows jumps within a small range of addresses, and indirect addressing is useful in subroutine linkages and parameter passing.
5. A larger number of CPU registers speeds up computation. The number of CPU registers has increased in successive generations of computers with denser packing of ICs.
6. Clocks are used in CPUs to synchronize operations. The speed of execution of various operations is governed by the clock frequency. Currently processors use clocks in the GHz range.
7. Buses are a set of parallel wires used to connect different subsystems of a computer. Three buses, namely, the address bus, data bus and control bus, are common in all CPUs.
8. Internal buses are fabricated in the IC chip. They are short and can be driven at high clock speeds. I/O devices are connected to the CPU chip by external buses called I/O buses. This will be described in Chapter 10.
9. We have obtained microprograms for a subset of the instructions of SMAC described in Chapter 6. This shows how the CPU of a small computer is designed.
10. The control signals required to control and coordinate data flow among the various sub-units of a CPU may be described using a high level hardware description language similar to the one we introduced in Chapter 5. This is called microprogramming.

EXERCISES

1. If SMAC++ is a byte addressable machine, what will be the advantages and disadvantages?
2. What justifications can you state: (i) for introducing several index registers in a computer; (ii) for index registers being a part of the GPRs in the CPU and not separate registers?
3. A computer has a 48-bit word length. It has 32 registers. There are 30 instructions which use 3 registers for operands, and the others address main memory. Design at least two possible instruction formats for this computer. In each case give the range of directly addressable memory locations. Explain all the assumptions made by you clearly.
4. A PC-relative mode addressed branch instruction is stored in memory at the address 6A0. The branch is to be made to the address 59B. Show the address part of the instruction.
5. In Section 7.3 we have classified the instructions into 9 categories denoted as C1 through C9. For a computer system available to you, list as many instructions as you can find in each of these categories. Comment on the relative execution times of these categories of instructions.
6. Make a table giving the time required to calculate the effective address in each of the different addressing modes. Assume a memory cycle time of 250 ns, an addition time of 90 ns and a gate delay of 5 ns. Assume any reasonable estimate of time for other operations (if you need it).
7. Explain in detail the advantages of the PC being an addressable register in the CPU.
8. IBM 370 has no indirect addressing. Explain how one would get the effect of indirect addressing on this machine.
9. Give a set of instructions to address and manipulate bytes in the word addressed SMAC++.
10. Using the instructions you have suggested for exercise 9, develop a program to reverse a string (the reverse of 'abc' is 'cba') and then to verify whether the string reads the same left to right as well as right to left. Will stacks be useful in developing this program?
11. Some CPUs include two sets of identical registers for fast switching. In what situations are such duplicate sets very useful?
12. Distinguish between data flow and control flow in a processor design: (i) Give a complete data flow graph for the instruction set S1 of SMAC+. (ii) Repeat for the instruction set S2 of SMAC++.

13. Consider a single addressed, single register (only accumulator) machine HYCOM. Let HYCOM have a small instruction set (LOAD, STORE, ADD, COMPLEMENT, JUMP and JUMP ON MINUS). Recommend a micro-engine for realizing its instruction set. Draw a graph showing the essential data paths. Compare this with the SMAC+ micro-engine.
14. We have shown a micro-engine for SMAC+ with a single bus. Obtain for SMAC+ a micro-engine with two buses. The two buses must be configured to maximize parallel execution of micro-operations.
15. Write data flow microprograms (such as Programs 7.1 and 7.2 of the text) to: (i) Swap the contents of registers R1 and R2. (ii) Save the contents of all the registers R0 to R15 in main memory. (iii) Push and pop using R15.
16. For the microprograms of Section 7.8, describe the control signals needed to maintain the control flow.
17. Obtain a single bus micro-engine for SMAC++.

8 ASSEMBLY LANGUAGE LEVEL VIEW OF COMPUTER SYSTEM

LEARNING OBJECTIVES

In this chapter we will learn:

- Computer organization at the assembly language level, using Pentium as the running example and focussing on its 32-bit structure.
- How to write small assembly language programs using the NASM assembler.
- About the Pentium processor, using a 'small subset' of the Pentium instruction set.

The basic notion of trade-off in using the right instructions in a program design will be introduced with the help of small assembly language programs.

8.1 INTRODUCTION

In this chapter we will introduce the essentials of assembly language programming, through which we will review and strengthen our understanding of certain aspects of computer organization that we have already introduced in the earlier chapters. Our objective in this chapter is not to prepare the students as expert assembly language programmers, but to make them understand the power of an assembly language and its closeness to the hardware architecture and the instruction set of a computer. We have chosen the Intel Pentium for studying the instruction set, and the NASM assembly language syntax to introduce the basic concepts.

Although the Intel architecture is not simple to introduce to beginning level students, we have chosen it because it is widely available and thus we believe it will motivate the readers to learn. We have not introduced the large variety of features available in Pentium and its compatibility with the earlier 8-bit and 16-bit predecessors. In this chapter we have restricted ourselves to a small subset of the instruction set of Pentium and a small subset of the features of the NASM assembler. As much as possible we have restricted our discussions to the 32-bit architecture of Pentium.

8.2 REGISTERS AND MEMORY

As we have learnt in earlier chapters, registers are one of the essential components of a CPU. The four registers used for processing in Pentium are named eax, ebx, ecx and edx, and they are 32 bits long. The addressable unit of memory is a byte. Two bytes make a word and four bytes make a double word; in Pentium's terminology, 16 bits are called a word and 32 bits are called a double word.

For the sake of compatibility with the predecessors of Pentium, parts of these registers can also be viewed as 16-bit registers. The 16-bit sub-registers are named ax, bx, cx and dx respectively. The low order 16 bits of the 32 bits are used for these registers and the high order 16 bits are left unused. Early microcomputers were 8 bits long, and to be compatible with them parts of the 32 bits can also be addressed as 8-bit registers. This is shown in Figure 8.1 using eax as an example; it is applicable for the other registers as well.

FIGURE 8.1 The eax register and its sub-registers. (Bits 31 to 0 form eax; the low 16 bits form ax, whose high byte is ah (bits 15 to 8) and low byte is al (bits 7 to 0).)

In addition to the above registers used for processing purposes, there are other registers in the Pentium CPU. Two of them are known as EIP and EFLAGS. EIP is a 32-bit instruction pointer that points to the memory location where the next instruction to be executed is located. The 32-bit long EFLAGS register stores the status or conditions that occur during the course of the execution of a program. In order to facilitate processing an array or matrix of data items, index registers are provided in Pentium. The two index registers are called esi and edi, and they are 32 bits long.

Several arithmetic and other instructions can result in a condition such as overflow, the result being negative (or positive), etc., and these binary conditions will be stored in EFLAGS in the individual bit places reserved for such binary conditions. Conditional branch instructions can test if such a flag is 'Set' or not and branch accordingly. The conditions or status stored in EFLAGS are used by the programmer or the hardware to make changes in the control flow of programs. Other conditions external to the CPU, such as the failure of a parity check in memory operations or an interrupt from an external device, can also be stored in EFLAGS. Table 8.1 shows some of the representative status flags to give an idea to the students.

TABLE 8.1 Sample Status Flags

Name of the bit | Purpose           | Condition represented in that bit
OF              | Overflow          | Arithmetic operation resulted in a number too large to store: True or False.
SF              | Sign              | The operation resulted in the sign bit being 1 (Negative) or 0 (Positive).
ZF              | Zero              | Result is zero: True or False.
AF              | Auxiliary Carry   | Carry out of bit position 8 is 1 or 0. This is used in BCD arithmetic.
PF              | Parity            | Parity check.
CF              | Carry Flag        | Carry out of the most significant bit of the result is 1 or 0.
IF              | Interrupt Enable  | The regular flow of the instruction sequence is interrupted if this bit is set by some privileged instructions.

With a 32-bit memory address register and EIP, one can have a maximum of 2^32 bytes (4 GB) in the RAM of Pentium. The memory organization of Pentium can be viewed as a linear array of bytes. Another way to view the memory organization of the Pentium would be as a number of different segments. A segment is nothing but a sequence of consecutive memory locations (addressed as bytes) starting from an address that is called the segment base. A segment also has an associated length. If the displacement is larger than the segment size, a segment fault is said to have occurred, which needs corrective action. With the use of segments, addressing becomes two-dimensional and is depicted as <segment, displacement within that segment>. The segmented view of memory helps programmers in the development of reliable programs.

Three types of segments are worth mentioning here: the data segment, program segment and stack segment. As their names imply, they are reserved to store the data declared by the programmer, the instructions of the presently active program, and the contents of the stack respectively. The reader should recall the concept of stacks introduced in Chapter 6. In order to define the segment base, special registers are incorporated in Pentium; they are called segment registers.

There are several segment registers in Pentium, but we will deal with only three of them, namely, CS, DS, and SS, representing the code, data and stack segments. There can be many code segments defined by system programmers, and their base addresses are all stored in a table known as the Segment Descriptor Table, which is stored in the RAM. The segment descriptor table is managed by the operating system. The contents of the segment register, namely CS, is used to select one of the segment base addresses from the Segment Descriptor Table as the current code segment representing the program under execution. The selected segment base address is then added to the displacement (which is 32 bits in the case of Pentium) to get an effective address that is used to access the operand or instruction, as the case may be. Note that the effective address is also 32 bits long. This is depicted in Figure 8.2.

FIGURE 8.2 Use of segment registers in computing effective address. (A selector in one of the segment registers CS, DS, ES, FS, GS or SS picks a base address from the descriptor table; the base address is added to a 32-bit offset in a general register to form the 32-bit memory address.)

Most computers support a data structure called a stack in the hardware; stacks were introduced in the earlier chapters. Pentium supports stacks in the RAM. The area of memory allocated for the purpose of storing the stacks is called the stack segment. Stacks limit access to the top element in the stack only; thus an item can only be pushed onto the stack as the top element, or the top element can be popped out of the stack. The PUSH and POP instructions support these two operations. The address of the top element of the stack is contained in a special register known as the stack pointer register, denoted esp.

8.3 INSTRUCTIONS AND DATA

Like many other digital computers, Pentium supports certain standard primitive data types at the instruction set level, that is, in the hardware.

They include: Boolean data with TRUE or FALSE values, binary integers represented in the two's complement notation, unsigned binary integers useful as pointers, floating point numbers using the IEEE 754 standard, standard ASCII characters, and binary coded decimal numbers using 4 bits per digit for specialized arithmetic needs. In addition to this, Pentium also supports processing some of these primitive data types in different lengths (8, 16, 32, 64 bits, etc.) as needed. There are appropriate hardware instructions to process or manipulate such primitive data types.

A machine instruction or hardware instruction generally has two parts: the Op-code part, which specifies what machine operation has to be performed, and the operand(s) part, which specifies the operands needed for that operation. Consider the following example instructions presented as Program 8.1, one instruction at a time:

Program 8.1: Some sample assembly language instructions

    mov ecx, 0        ; initializes the ecx with 0
    mov eax, 0        ; initializes the eax with 0
    mov ebx, [wage]   ; wage is the name of a data item stored in RAM;
                      ; the value associated with the data name wage is put in ebx
    add eax, ebx      ; adds the contents of the register ebx to that in eax
    inc ecx           ; increments the contents of ecx by one
    cmp ecx, 3        ; compares the contents of ecx with 3, by subtracting 3 from
                      ; ecx; sets the flag bits in the EFLAGS register as negative,
                      ; zero or positive
    jnz again         ; jump to the instruction labelled 'again' if the ZERO flag
                      ; in the EFLAGS register is FALSE (jnz = jump if not zero)

The meaning or semantics of each instruction is informally explained in the same line as the instruction. This informal semantics can be formally expressed using the register transfer language or RTL notation described in Chapter 5. The symbols mov, inc, add, cmp and jnz are the Op-codes in symbolic form as defined in the NASM assembly language, and we have to follow that syntax. It is for syntactic reasons that we have separated an instruction from its explanation by a semicolon. Observe that of the two operands or parameters of the first move instruction, one of them is in a register and the other is a literal constant. When mov ecx, 0 is executed, the literal constant zero is moved to the ecx register. In the case of the third 'mov' instruction, the second operand [wage] refers to data in memory, and it is addressed by the symbolic name wage. Here wage is the symbolic name for a data item that is stored in memory, and again is a symbolic name given to an instruction which is also stored in memory. We have not shown which instruction is named again. In this assembler the square brackets are used to denote that the reference is to the value of the data named wage, as distinguished from the address of wage in memory.

In this assembler the square brackets are used to denote that the reference is to the value of the data named wage, as distinguished from the address of wage in memory. For example, if the data name wage were assigned the memory location 0400 by the assembler, and if an integer value of 1750 were stored in that memory location 0400, then wage would refer to 0400 and [wage] would refer to 1750. By this instruction the value of the data named wage is moved to the ebx register.

The add eax, ebx, when executed, would add the contents of ebx to the contents of the eax register. The inc ecx instruction increments the contents of the ecx register by one. The compare instruction compares its two operands. This comparison is achieved by computing the first operand minus the second operand and setting the flags of the EFLAGS register as per the result. Note that the result could be negative, zero or positive. The operand or parameter of the jump-on-not-zero (jnz) instruction is not data but the address of another instruction, which is named again.

In summary, we note that different instructions have different sets of parameters or operands. Their semantics can be precisely described with no ambiguity so that hardware can be built to execute them. Program 8.1 does not do anything useful; in fact it is incomplete and was written only to illustrate some instructions.

It is essential to be able to address data and instructions that are stored in memory in a versatile manner. Thus all computer organizations support different addressing modes such as direct addressing, indirect addressing, indexed addressing, base addressing, and relative addressing. They were introduced in Chapters 6 and 7.
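To make some of these modes concrete, the following fragment is our own illustration, not one of the book's programs; the data names total and tbl are assumed here purely for the example:

segment .data
total   dd 0             ; a double word variable
tbl     dd 10, 20, 30    ; a small table of double words

segment .text
        mov  eax, 5            ; immediate: the literal 5 is part of the instruction
        mov  eax, [total]      ; direct: the address of total comes from the instruction
        mov  ebx, total        ; loads the address of total itself, not its contents
        mov  eax, [ebx]        ; register indirect: the address is taken from ebx
        mov  eax, [ebx+8]      ; base plus displacement: the third element of tbl
        mov  esi, 1
        mov  eax, [ebx+esi*4]  ; indexed with a scale factor of 4: the second element of tbl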

8.4 CREATING A SMALL PROGRAM

We will introduce different aspects of assembly language programming and computer organization at the level of machine instructions in a step by step manner. In this section we will write a very small program. First, we need a problem to be solved for which we will develop a computer program. The problem we consider is: "There are four numbers. Find the average of these four numbers."

We analyze the given problem by asking appropriate questions, finding answers to them, and making certain decisions based on our understanding. It is not difficult for an individual to solve such a problem on his own. However, when the problem is large and one is participating as a member of a big development team, there is an obligation to record such analysis and decisions so that other team members are aware of them. By way of analysis of this problem, the questions we ask are:

1. Are these four numbers integers or floating point numbers?
2. Should we assume that these numbers are to be read as input or are they already stored somewhere?
3. What name should we give to these numbers so that the program can refer to them by that name and we can understand the program if we see it at a later date?
4. Should we write the program as a generic one to add 'n' numbers and find their average, or only four numbers as specified in this problem?
5. How will the output be presented?

Let us suppose that the answers to the above questions are: the numbers are double word integers, each occupying 4 bytes; they are already stored consecutively in the RAM; the memory address of the first number is called numb; the program written will be specialized for this problem; and the output will be stored in the RAM and called average4. From the address of the first number we know how to get the address of the next number of the four-number sequence. With these assumptions we have written the sequence of instructions which form part of the program, presented as Program 8.2. The problem itself is not explained as part of Program 8.2. Observe how we have added comments to the instructions so that anyone reading this program can easily understand the intent with which we have written these instructions to solve the given problem.

Program 8.2: Finding average of four numbers (partial)

        mov  ecx, 4           ; initialize the counter in ecx
        mov  eax, 0           ; use eax to accumulate the sum
        mov  ebx, numb        ; load the memory address in ebx
addnext:
        add  eax, [ebx]       ; add the next number to accumulate the sum
        add  ebx, 4           ; modify the address to point to the next number
        loop addnext          ; if not done, branch to addnext and repeat execution
        shr  eax, 2           ; when finished, divide by 4 by shifting right twice
        mov  [average4], eax  ; the result is stored in destination average4

We have used three different addressing modes in these instructions. In mov ecx, 4 we have used immediate addressing of the data. In add eax, [ebx] we have addressed the data through the content of register ebx and used the notation of enclosing the register within square brackets. In loop addnext we have used direct memory addressing, but symbolically, using the name given to the instruction location addnext. In this program we have used four different instructions (mov, add, shr, loop) and three symbolic names (numb, average4, addnext); two of them are the names given to data items and the third is the name given to one of the instructions for convenient reference.

In Program 8.2 we want to repeatedly execute the two instructions add eax, [ebx] and add ebx, 4 four times. This could be achieved by using a counter that is initialized to zero, incrementing the counter by one after every execution, comparing the counter with 4, and branching (jumping) to addnext if the counter has not reached the value 4. These actions we could very well have achieved by using the inc, cmp, and jnz instructions. However, we did not do so; instead we have used just one instruction, the 'loop addnext' instruction. We need to understand the semantics of the loop instruction.

This one instruction is designed by the computer hardware designers to achieve three operations: decrementing a counter, comparing it with zero, and conditionally branching. The loop instruction assumes that the ecx register is used as a down-counter instead of up-counting. It decrements the ecx register by 1 and then compares the result with zero. If the result is not zero, the execution control branches or jumps to the instruction whose label appears as the operand of the loop instruction. If the contents become zero, the instruction next to the loop instruction is executed as the next instruction.

When the loop terminates, the eax register contains the sum of the 4 numbers that are consecutively stored starting from the symbolic address numb in RAM. Now we need to divide it by 4 to get the average. Instead of using the integer division instruction (idiv) available in Pentium, we have shifted the eax register right twice, which is equivalent to dividing by 4. A typical shift instruction could be executed in one-tenth of the time the fastest idiv instruction would take. Therefore, the use of shift right twice instead of a divide instruction results in a saving of execution time. Finally, the mov [average4], eax moves the result in eax to the memory location named average4.

8.5 ALLOCATING MEMORY FOR DATA STORAGE

Although Program 8.2 is more complete than Program 8.1, it is still not in a complete form. Memory locations must be allocated for the data names numb and average4. For this purpose, we use what are known as assembler instructions or pseudo instructions. These are not instructions to the hardware of the machine; they are directives to the system software assembler.

numb     resd 4    ; reserves 4 double words; the first byte is called 'numb'
average4 resd 1    ; reserves 1 double word (4 bytes) and calls it 'average4'

The reserved word resd is used to instruct the assembler to reserve double words, and the number of double words to be reserved is indicated as a parameter. The label tells what name should be assigned to the allocated location. Similar to resd, one could use resw for reserving words and resb for reserving bytes. The contents of the allocated memory locations are left as they were and are not initialized. If we wish to initialize the allocated memory locations with data, we could use the define directive instead of the reserve directive as shown below:

numb    dd 23, 44, 76, 17    ; the 4 double words are initialized with 23, 44, 76
                             ; and 17; the first byte is called 'numb'
vect    times 100 dd 0       ; allocates 100 double words, all initialized to zero;
                             ; the first byte is called 'vect'

The keyword 'times' is used to allocate large blocks.

Similar to dd, one could use dw to define words and db to define bytes.

The segment statements in Program 8.3 are used to define two data segments and one code segment. Data definitions are placed together in the data segment and the instructions are placed together in the code segment. The code and data segments are given a name as shown in Program 8.3. Observe the statements written in boldface, which use reserved words that are understood by the assembler. The reserved word 'global' in the code segment defines the label of the first instruction in the program, with which the instruction sequencing will begin. When a line starts with a semicolon, NASM treats that line as a comment and does not process it. In this chapter we use the conventions followed in the NASM assembler.

When the program shown in Program 8.3 is executed, it takes the data that are defined in the data declaration, finds the average of those 4 numbers, and places the result in memory at the location called average4. A question arises: how will the programmer see the result, and why did we not print it as we normally do in a higher level language like JAVA or C++?

Program 8.3: Finding average of four numbers

segment .data
numb     dd 23, 44, 76, 17    ; define 4 double words and initialize them

segment .bss
average4 resd 1               ; reserve 1 double word called average4

segment .text
global _start
_start:
        mov  ecx, 4           ; initialize the counter in ecx
        mov  eax, 0           ; use eax to accumulate the sum
        mov  ebx, numb        ; load the memory address in ebx
addnext:
        add  eax, [ebx]       ; add the next number to accumulate the sum
        add  ebx, 4           ; modify the address to point to the next number
        loop addnext          ; if not done, branch to addnext and repeat execution
        shr  eax, 2           ; when finished, divide by 4 by shifting right twice
        mov  [average4], eax  ; the result is stored in destination average4

        ; The following 3 instructions are used to terminate the program execution
        ; and return the control to the operating system. We will not explain them
        ; now. Observe that the contents of eax and ebx are altered by these
        ; instructions.

        mov  eax, 1
        mov  ebx, 0
        int  0x80

8.6 USING THE DEBUGGER TO EXAMINE THE CONTENTS OF REGISTERS AND MEMORY

Input and output are quite complex in assembly language, and for now we will postpone using them. What is the result we expect for the above program with the data defined in Program 8.3? The answer should be 40 in decimal and 101000 in binary. When the program executes, the result will be present in the memory location called average4 as a 32-bit binary number:

0000 0000 0000 0000 0000 0000 0010 1000    (in binary)
00000028                                   (the same number in hex format)

In order to know which location in memory the NASM assembler allocated to the data name average4, the programmer requires what is known as the assembled output of NASM, which shows where in memory the various instructions and data are stored. For a simple program like Program 8.3, if the programmer can display the contents of the various registers on the screen using an interactive 'Debugger Software', he or she can interpret the bits displayed and verify the result.

Generally, a programmer uses different software tools to accomplish his task. He uses an editor to enter the program into the computer or to modify the program as needed, and finally to store it as a file. He uses the operating system to copy or rename the file. Assemblers or compilers are used to translate the program into an executable form, and loaders are used to load the executable file and start the program execution. When a software engineer works in a team to develop large scale software, he will use many more sophisticated software tools.

While translating a source program into the machine language program, the assembler or the compiler dynamically generates a symbol table and uses it on the fly. Each data name encountered in the process of assembling is entered into this table, one or more memory bytes are allocated for that name as needed, and the data name is bound to the memory address allocated to it. Similarly, every instruction name or label defined in the program is also entered into the symbol table, and the address of that instruction in memory is bound to that program label. As the reader can imagine, the symbol table is empty at the beginning of the assembly process, and it increases in size every time a new symbol (data name or instruction name) is defined in the assembly process.

When an array of 'n' data elements is allocated space in memory, a sequence of n*m bytes is allocated, where m = 1 for ASCII characters, m = 2 for word type data and m = 4 for double word type data. The name of the array is bound to the starting address of this sequence of bytes allocated for that array.
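As an illustration (with addresses invented purely for this sketch; the actual values depend on where the assembler and loader place the segments), the symbol table for Program 8.3 might contain entries such as:

Symbol      Kind                   Address (hypothetical)
numb        data, 4 double words   080490A0H
average4    data, 1 double word    080490B4H
_start      instruction label      08048080H
addnext     instruction label      0804808CH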

Another software tool that all programmers learn to use is called a debugger, which helps to find the 'bugs' and correct them. The debugger used with NASM is called DDD (data display debugger). This debugger has a graphical user interface (GUI), which makes the life of beginning level assembly language programmers easy. In Figure 8.3 we have shown a screen shot with the various 'tabs' available in DDD. The reader is advised to learn how to invoke DDD under the operating system used by him or her.

Figure 8.3 can be divided into the four parts (a) through (d) listed below:
1. The various tabs of the DDD console. The available tabs are: File, Edit, View, Program, Commands, Status, Source, and Data.
2. Assembler dump (not explained here)
3. Contents of various registers
4. Debugger controls such as Run, Step and Next

FIGURE 8.3 Screen shots of DDD showing tabs, assembler dump, contents of registers and debugger controls.

Using DDD, one can display the contents of the registers or those of selected locations in memory. It is also possible to execute the program one instruction at a time, so that one can view the contents of the registers or the relevant locations in memory and then step the program execution to the next instruction. (See the button labelled Step in the two columns of control buttons.) Another facility is to set break points. The program will run until the break point is reached and will stop at that point to display the contents.

8.7 HARDWARE FEATURES TO MANIPULATE ARRAYS OF DATA

In problem solving, we often encounter an array or a matrix of elements. These elements could be individual ASCII characters, integers, floating point numbers or complex structures. It is necessary to access one element of such an array at a time for processing; when that is completed we need to access the next element, and so on iteratively until all the elements are processed. The hardware organization of a computer system provides features to facilitate such iterative processing of arrays of elements. Three hardware features that support the manipulation of arrays are: (a) indexed addressing mode, (b) index registers, and (c) an appropriate set of machine instructions for the relevant manipulations.

Indexed addressing mode was introduced in Chapter 6. In Pentium, two special registers called esi and edi are provided to facilitate indexing. In the case of NASM we use the notation [ebx+esi*k] to denote indexed addressing of an operand stored in memory. With the expression ebx+esi*k, the effective address of the operand is obtained by adding the contents of the base value stored in ebx to the index value stored in esi multiplied by a constant k; 'k' is called the scale factor and can be 1 or more. Using this expression, as the index value is varied 0, 1, 2, 3, etc., the various elements of the array can be accessed for processing one after another.

As an example, in Figure 8.4 we have shown an array of ASCII characters that terminates with a period. This is a byte-array. The index of the array is labelled 'i' and the array is named 'str'.

FIGURE 8.4 Array str with index 'i'. (The bytes of the array occupy consecutive memory addresses 200 to 210; the index i takes the values 0 to 10 for the successive characters.)

If we know the length of the array in advance, we can use the loop instruction to repeat the processing that many times and then terminate the iteration. If we do not know the length, but we know that the end of the array is indicated by a special character like '.', then we can repeat the iterations until the end-of-array character is reached. This requires a compare instruction and a conditional jump instruction like jump-on-equal or jump-on-not-equal. We illustrate these aspects by means of a sample problem and the program fragment Program 8.4.
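Program 8.4 below works on byte elements, so its scale factor is effectively 1. For a double word array the same pattern would use k = 4, as in this hypothetical fragment (the array name darr is assumed for illustration):

        mov  ebx, darr         ; base address of a double word array in ebx
        mov  esi, 2            ; index i = 2
        mov  eax, [ebx+esi*4]  ; fetches darr[2], which is 8 bytes past the base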

Let us suppose that we are given a byte-array of characters that terminates with a period. Our task is to count the number of vowels occurring in this string of characters. As a first step in solving this problem, we make some simple decisions: (1) use esi as the index register for indexing; (2) store the final count in a memory location named vowelcount; (3) use the ebx register to store the base address of the array, which is named 'str'. Use the ecx register to keep count of the number of vowels in this array.

The next step would be to devise an algorithm to solve the given problem. For a simple problem like this, the programmer may keep the algorithm in his/her mind and start coding right away. But that will not be a good practice when solving a bigger problem and when one is working as a member of a larger software development team.

Algorithm: Find the number of vowels
Begin
    Initialize ecx and esi to zero and ebx to the base address of str
    While str[i] is not equal to '.'
        Get str[i] into the 'al' register, which is a sub register of eax
        Check if it is a vowel by comparing it one after another with 'a', 'e', 'i', 'o', 'u'
        If it is a vowel then increment ecx by 1
    End-While
End Algorithm

In Program 8.4 we have shown the assembly language program that implements this algorithm. Observe that the 'while loop' in the algorithm is implemented by comparing the i-th character with the period and jumping-on-equal to the instruction labelled done.

In the CPU of the Pentium processor another register, called EFLAGS, is provided which stores the outcome of a compare instruction in its various bits. The 32-bit long EFLAGS register stores the status or conditions that occur in the course of the execution of a program. Several arithmetic and other instructions can result in a condition such as overflow, the result being negative (or positive), etc., which will be stored in EFLAGS in the bit places reserved for them. Other conditions external to the CPU, such as the failure of a parity check in a memory read operation or an interrupt from an external device, can also be stored in EFLAGS. The conditions or status stored in EFLAGS are used by the programmer or the hardware to make changes in the control flow of the program sequences. Conditional branch instructions can test whether such a flag is 'Set' or not and branch accordingly. Table 8.1 explains some of the status flags.

Program 8.4: Counting vowels in a string

segment .data
str         db 'This is a sample text.'   ; define a byte string called str

segment .bss
vowelcount  resd 1                        ; reserve 1 double word to store vowelcount

segment .text
global _start
_start:
        mov  esi, 0            ; initialize esi to 0
        mov  ecx, 0            ; initialize ecx to keep count of the vowels
        mov  ebx, str          ; load the memory address of str in ebx
checknext:
        mov  al, [ebx+esi]     ; get the character for checking into the 'al' subregister
        cmp  al, '.'           ; compare with the end character
        je   done              ; if the end is reached jump to done, else
        cmp  al, 'a'           ; compare to see if it is 'a'
        je   vowelfound        ; if equal jump to vowelfound, else
        cmp  al, 'e'           ; compare to see if it is 'e'
        je   vowelfound        ; if equal jump to vowelfound, else
        cmp  al, 'i'           ; compare to see if it is 'i'
        je   vowelfound        ; if equal jump to vowelfound, else
        cmp  al, 'o'           ; compare to see if it is 'o'
        je   vowelfound        ; if equal jump to vowelfound, else
        cmp  al, 'u'           ; compare to see if it is 'u'
        jne  join              ; if not a vowel, skip vowelfound
vowelfound:
        inc  ecx               ; increment ecx because a vowel is found
join:
        inc  esi               ; increment the index register to get the next character
        jmp  checknext         ; go back and examine the next character
done:
        mov  [vowelcount], ecx ; store the vowel count in its destination

        ; The following 3 instructions are used to terminate the program execution
        ; and return the control to the operating system. We will not explain them
        ; now. Observe that the contents of eax and ebx are altered by these
        ; instructions.
        mov  eax, 1
        mov  ebx, 0
        int  0x80

The reader is urged to observe the need for the unconditional jump instruction (jmp) when implementing the step in the algorithm that states "check if it is a vowel". A vowel can be any one of 5 characters, and we need to check each of the five cases one by one. In a higher level programming language we would have used a case statement to achieve this goal. Observe the two labels in Program 8.4, 'vowelfound' and 'join'. You are asked to follow the control flow embedded in this program and check whether or not it adheres to the algorithm presented earlier.

When we use indexed addressing, the address of the operand is modified in every iteration as we modify the contents of the index register. This is known as address modification. In performing vector arithmetic or matrix operations, address modification is quite useful. As another example, in Program 8.5 we have shown a short program to add two vectors, each of length six, to give a third vector called 'vecC'.

Program 8.5: Adding two vectors

segment .data
vecA    dd 17, 22, 92, 65, 39, 58    ; define and initialize vector A
vecB    dd 23, 44, 71, 84, 53, 21    ; define and initialize vector B

segment .bss
vecC    resd 6                       ; reserve 6 double words for vector C

segment .text
global _start
_start:
        mov  ecx, 6            ; initialize ecx to 6, the loop count
        mov  ebx, vecA         ; base address of vector A in ebx
        mov  edx, vecB         ; base address of vector B in edx
        mov  edi, vecC         ; base address of vector C in edi
        mov  esi, 0            ; initialize index 'i' to 0
addnext:
        mov  eax, [ebx+esi]    ; get a(i) into eax
        add  eax, [edx+esi]    ; add b(i) to it
        mov  [edi+esi], eax    ; store the sum into c(i)
        add  esi, 4            ; advance the index by 4 bytes to the next double word
                               ; (this step is essential; without it the same elements
                               ; would be added in every iteration)
        loop addnext

        ; The following 3 instructions are used to terminate the program execution
        ; and return the control to the operating system. We will not explain them
        ; now. Observe that the contents of eax and ebx are altered by these
        ; instructions.
        mov  eax, 1
        mov  ebx, 0
        int  0x80
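An equivalent way to write the inner loop (a variant sketch of ours, not from the text) keeps esi as an element index and lets the hardware scale it by 4, so that a single inc suffices per iteration:

addnext:
        mov  eax, [ebx+esi*4]  ; get a(i)
        add  eax, [edx+esi*4]  ; add b(i)
        mov  [edi+esi*4], eax  ; store into c(i)
        inc  esi               ; next element index
        loop addnext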

8.8 STACKS AND SUBROUTINES

The stack feature in Pentium is supported through a stack pointer register (esp) and machine instructions like PUSH and POP. The stack in Pentium is stored in the RAM in a reserved area known as the stack segment. A programmer can manipulate the contents of the stack segment register (the SS register) in defining the stack segment. The items pushed into the stack are assumed to be 32 bits long. In the Pentium architecture, the stack grows towards lower addresses in memory. Thus, every time PUSH is executed the contents of esp are reduced by 4 (4 bytes give 32 bits) and the pushed element is stored at that address in memory. Conversely, the POP instruction reads 4 bytes from the RAM starting from the address pointed to by the esp register, after which esp is incremented by 4, always pointing to the top element of the stack. The stack is a dynamic structure in the sense that the stack size grows with every PUSH and shrinks with every POP.

The PUSH and POP instructions deal only with the top element of the stack using the esp register. However, since the stack is stored in the RAM, the assembly language programmer has the power to access any element stored in the stack through the use of move or other instructions. It is the responsibility of the programmer to keep track of the state of the stack in their use of PUSH and POP. To facilitate this further, Pentium provides the ebp register and the PUSHA and POPA instructions. The use of the ebp register will become clear when we discuss an example program.

In every program, we are concerned with the control flow and the data flow. The term control flow refers to how the various instructions are executed in an orderly fashion depending on the input data and how the control logic is embedded in the program. Following the logic embedded in the program, the input data goes through several modifications, finally resulting in the desired output.

Stacks are extremely useful in structuring a large program into manageable subroutines. The context of the caller subroutine can be saved by the called subroutine by using a single instruction, PUSHA. This instruction saves all the registers in the stack and frees them for use in the called program. It saves eax, ebx, ecx, edx, esi, edi and ebp (the value of esp itself is also pushed). Once the called subroutine has saved the contents of these registers, they are freely available for its own computation. At the end, before returning control back to the caller, the context of the caller subroutine can be restored by executing the POPA instruction, which will restore the contents of all the registers that were earlier saved in the stack.

In Figure 8.5 we have indicated the caller and called subroutines in the form of a block diagram. In this example the caller program is named 'MAIN' and the called subroutine is named 'SUM', which computes the sum of all the elements of the array given to it. Let us suppose that SUM is written to take three inputs:
1. The base address of the array V of integers
2. Its length 'n', another integer whose value is put on the stack
3. The address of a variable named TOTAL where the result will be stored
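In terms of esp, the three pushes that MAIN will issue (shown in full in Program 8.6) have the following effect; the initial esp value here is invented purely for illustration:

        ; assume esp = 0018F000H just before the pushes
        push ecx      ; esp becomes 0018EFFCH; [esp] now holds the length 6
        push V        ; esp becomes 0018EFF8H; [esp] now holds the address of V
        push TOTAL    ; esp becomes 0018EFF4H; [esp] now holds the address of TOTAL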

Responsibility is shared between the caller MAIN and the called SUM. The MAIN program will pass the parameters (1), (2) and (3) mentioned above that are needed for the computation by the subroutine. In turn, the called program will keep the context of the caller intact and will return the result of its computation in a predetermined fashion. In the example under discussion, MAIN will pass the base address of the vector V and the address of the scalar TOTAL, whereas it will pass the array length by value, by placing that value on the stack. The order in which these entities will appear on the stack is also known to both programs. Using the address of the data name TOTAL, the called program will put the result in that location. Observe that these agreements between the caller and the called program constitute the protocol used in passing the parameters.

FIGURE 8.5 A block diagram of the caller and called subroutines. (The caller MAIN issues CALL SUM at two places, marked 1 and 2; control flows into the called subroutine SUM and returns to the corresponding point.)

Program 8.6: Subroutine call and return

segment .data
V       dd 172, 220, 926, 534, 659, 585    ; define and initialize vector V

segment .bss
TOTAL   resd 1                             ; reserve 1 double word for TOTAL

segment .text
global main
main:
        mov  ecx, 6     ; the 3 push statements prepare the stack
        push ecx        ; to pass the 3 parameters, one by value
        push V          ; and two by address
        push TOTAL
        call SUM
        ; The programmer will use TOTAL in some fashion in this part of the program.

        ; The following 3 instructions are used to terminate the program execution
        ; and return the control to the operating system. We will not explain them
        ; now. Observe that the contents of eax and ebx are altered by these
        ; instructions.
        mov  eax, 1
        mov  ebx, 0
        int  0x80

; Subroutine SUM begins with the label SUM, the name of the subroutine.
SUM:
        push eax                ; save the contents of eax, ebx, ecx and ebp in the
        push ebx                ; stack, because SUM decided to use these registers
        push ecx                ; internally. By saving these four registers and
        push ebp                ; restoring them at the end, SUM preserves the
                                ; caller's context and keeps the contents of all
                                ; registers unaltered by the subroutine call in MAIN.
        mov  ebp, esp           ; By keeping a copy of esp in ebp we can access the
                                ; parameters passed with reference to this base
                                ; address, while the subroutine may freely use the
                                ; push and pop instructions, which modify the esp
                                ; value dynamically.
        mov  ecx, [ebp+28]      ; this loads the length of the array (6) that is
                                ; in the stack
        mov  eax, 0             ; initialize eax to accumulate the total
        mov  ebx, [ebp+24]      ; the base address of V is stored in ebx
addnext:
        add  eax, [ebx+ecx*4-4] ; the total is accumulated by adding the last
                                ; element of V first (ecx counts down from 6 to 1)
        loop addnext
        mov  ebx, [ebp+20]      ; the address of TOTAL, taken from the stack
        mov  [ebx], eax         ; store the total in its destination
        pop  ebp                ; restore the contents of the four registers saved
        pop  ecx
        pop  ebx
        pop  eax

        ret  12                 ; this will return to the caller and additionally pop
                                ; 12 bytes off the stack, thereby cleaning up the
                                ; parameter information on the stack that was created
                                ; by the caller program, which pushed 3 elements,
                                ; 4 bytes each, into the stack before issuing the
                                ; 'call' statement

In Program 8.6 we have shown fragments of a program describing the details of call, return, and parameter passing. The protocol followed between the caller and the called programs in this example is:
1. The caller will push 3 parameters into the stack before calling.
2. Two of them will be addresses of data and one will be a data value itself.
3. The called program will preserve the contents of the registers.
4. The called program will also be responsible for cleaning up the stack by popping the three parameters passed through the stack.

The states of the stack at various stages are shown in Figure 8.6(a) through (e). Figure 8.6(a) shows the stack being empty at the start of the caller program. Figure 8.6(b) shows the contents of the stack when the execution control enters the subroutine. Observe that the esp at this time points to the stack top, which contains the address of the instruction immediately following the call instruction in the MAIN program. Figure 8.6(c) shows the contents of the stack after the subroutine saves the registers which it plans to use in its computation. Figure 8.6(d) shows the contents of the stack just before executing the ret instruction. Observe that we could very well have used the pusha (push all) instruction instead of saving each register individually. In that case we would restore all the register contents by means of the popa instruction.

FIGURE 8.6 Depicting the stack contents at various stages of a call: (a) empty stack; (b) after the call: addr next, addr TOTAL, addr V, 6; (c) after the register saves: ebp, ecx (+4), ebx (+8), eax (+12), addr next (+16), addr TOTAL (+20), addr V (+24), 6 (+28); (d) just before ret: addr next, addr TOTAL, addr V, 6; (e) empty stack again.

Figure 8.6(e) shows the contents of the stack after executing the return instruction, that is, when control has returned to the caller. At this stage the stack is empty, as in the beginning.

8.9 ARITHMETIC INSTRUCTIONS

Arithmetic operations are vital in computations and problem solving. In Chapters 4 and 5 we discussed the following data types: (a) unsigned binary integers, (b) signed binary integers, (c) binary coded decimal integers and (d) real numbers, also known as floating point numbers. Signed binary integers in Pentium are represented using the two's complement notation, and floating point numbers are represented using the IEEE 754 standard. Floating point arithmetic can be performed on single precision (32 bits) or double precision (64 bits) operands. In addition to this, in the case of Pentium, we have another dimension for classifying the data types. Based on the need for compatibility with the previous generations of INTEL processors, Pentium supports 16-bit and 8-bit data types in addition to the standard 32-bit data types. In this section we will focus mostly on the first two types.

The four common arithmetic operations performed on these data types are: addition, subtraction, multiplication and division. The hardware support for arithmetic processing exists in three forms: machine instructions for the four arithmetic operations on the different data types; flag bits being set or reset by the hardware based on conditions arising out of the arithmetic operations; and conditional jump instructions that can be used to modify the control flow following the arithmetic instructions.

As floating point arithmetic is much more complex than integer arithmetic, INTEL uses a separate hardware subsystem in the CPU called the FPU or floating point (arithmetic) unit. In modern processors, the FPU is integrated into the same VLSI chip as the CPU. We will not discuss the floating point add, subtract, multiply and divide instructions in this textbook; the interested readers may refer to one of the references [3].

The integer 'add' instruction has already been introduced, and the subtract instruction is similar to it, except that the subtract operation is non-commutative and hence the order in which the operands are specified is important. A typical multiply instruction in Pentium looks like:

        mul   ebx    (for unsigned binary integer multiplication)
        imul  ebx    (for signed binary integer multiplication)

In both cases the register specified with the instruction contains the multiplier, and the multiplicand is assumed by the hardware to be contained in the eax register. The product of two 32-bit operands can be as long as 64 bits, and hence we need two registers to store the result of the multiplication. For this purpose the (edx:eax) pair is used. The high order 32 bits of the result are stored in edx and the low order 32 bits are stored in the eax register.
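A short worked sketch (the operand values are our own):

        mov  eax, 100000   ; multiplicand (implicitly in eax)
        mov  ebx, 100000   ; multiplier
        mul  ebx           ; the product 10,000,000,000 = 2540BE400H does not fit
                           ; in 32 bits; edx = 00000002H (high part) and
                           ; eax = 540BE400H (low part)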

The division operation is the dual. It assumes the dividend (numerator) to be contained in the register pair (edx:eax) and the divisor (denominator) in the specified register. After the division, the quotient is contained in the eax register and the remainder is contained in the edx register.

        div   ebx    (for unsigned binary integer division)
        idiv  ebx    (for signed binary integer division)

BCD numbers can be stored in two different formats. In the unpacked decimal format, each BCD digit is stored in one byte. This is useful for quick input and output. In the packed BCD format, two decimal digits are packed into one byte. This is compact for storage. BCD arithmetic is done with one digit in the AL register. To perform BCD addition, first binary addition is performed using the add instruction. For example, if we add 6 to 7 the result in BCD should be 3 with a carry of 1 to the next higher order digit position. The addition is followed by the execution of an instruction named aaa (ASCII adjust for addition), which will convert the binary 1101 (decimal 13) into decimal 3 (0011) and a carry. If the BCD number were stored in the packed decimal format, then the daa (decimal adjust for addition) instruction instead of the aaa will do the job. Similar to the adjust instruction for addition, adjust instructions are provided for the programmer's use to achieve BCD subtraction, multiplication and division.

8.10 BIT ORIENTED INSTRUCTIONS

Using an assembly language, a programmer can manipulate individual bits of an operand stored in a register. Two major categories of bit oriented instructions are the shift type instructions and the logical operations such as AND, OR, NOT and XOR (exclusive OR). The logical instructions perform the specified logical operation on the corresponding bits of the two operands, simultaneously on bit positions 0 to 31 of the 32-bit register.

The shift operations yield a quick way to multiply or divide a number by a power of 2. Shift right twice divides the number by 4, and shift left thrice multiplies the number by 8. Shift instructions take much less time to execute when compared to multiply or divide instructions. Thus, the assembly language programmer can judiciously use the shift instructions instead of the time-expensive multiply and divide instructions whenever possible.

Another important bit oriented instruction is the test instruction, which requires two operands like the compare instruction. It logically ANDs the two operands, and the EFLAG bits are set based on the result being zero or non-zero, etc. The difference between the two instructions is that the compare instruction arithmetically subtracts the second operand from the first operand and sets the EFLAG bits, whereas the test instruction 'ands' the operands. Just like the compare instruction, the test instruction does not alter the values of the two operands. The programmer can then use conditional jump instructions for testing the flag bits.
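A minimal sketch of the unpacked BCD adjustment described above (the digit values are our own):

        mov  al, 6
        add  al, 7     ; al = 0000 1101 (binary 13), not a valid BCD digit
        aaa            ; al = 0000 0011 (the digit 3); the carry is propagated
                       ; into ah and the carry flag is set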

The shift operations are of two types: logical shift and arithmetic shift. When we shift the contents of a register once, left or right, a bit position is 'vacated' at one end and one bit is 'shifted out' at the other end. In the logical shift type operations, the most significant bit (MSB), which represents the sign, does not play any special role and is treated like any other bit while shifting. In the case of a logical shift, the vacated bit position is filled with '0' and the shifted-out bit is stored in the CF position of the EFLAGS register. As a result of this, appropriate jump instructions can be used to test whether the MSB before the shift was zero or one. The number of times the shift must be performed is either specified along with the instruction itself or must be specified in the cl register. The following examples depict this:

        shl  eax, 1    ; will shift the eax logically left by 1 position
        shr  eax, 2    ; will shift the eax logically right by 2 bit positions
        shr  eax, cl   ; the number of times to be shifted is contained in cl

The arithmetic shift right is a special type of shift instruction. The mnemonic for this instruction is sar, and it has the same syntax as the other shift instructions. If we shift right arithmetically, a negative number remains a negative number: the vacated bit position is filled with the sign bit that was there to start with [see Figure 8.7(c)]. The value of the number is divided by two for every shift, whether the number is positive or negative.

The arithmetic shift left (sal), on the other hand, performs exactly the same as the 'logical shift left' instruction; it is simply a synonym for the shl instruction. The arithmetic left shift of a signed number multiplies the number by 2, but the result of this operation will be correct arithmetically only as long as the sign of the number does not change due to the shift. That is, a negative number should not become a positive number and vice versa. In Figure 8.7 we have pictorially shown the effects of the various shift instructions.

FIGURE 8.7 Depicting various shift operations on the bit pattern 1001 0101 (the underscore shows the vacated bit position): (a) logical right shift once gives 0100 1010, with the shifted-out 1 in CF; (b) logical left shift once gives 0010 1010; (c) arithmetic right shift once gives 1100 1010, the sign bit being replicated; (d) arithmetic left shift once gives 0010 1010.
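The difference between shr and sar therefore shows up only for negative numbers; a small sketch with values of our own choosing:

        mov  eax, -8   ; eax = FFFFFFF8H
        sar  eax, 1    ; arithmetic: eax = FFFFFFFCH, which is -4; the sign is preserved
        mov  ebx, -8   ; ebx = FFFFFFF8H
        shr  ebx, 1    ; logical: ebx = 7FFFFFFCH, a large positive number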

As an illustration we have written Program 8.7, a fragment to count the number of 1 bits in the eax register.

Program 8.7: A program fragment to count the number of 1 bits in the eax register

segment .text
global _start
_start:
        mov  eax, 12345678h  ; initialize eax with the desired data
        mov  edx, 0          ; use edx to count the 1 bits
        mov  ebx, 1          ; load the 'lsb' with 1 in ebx
        mov  ecx, 32         ; the loop count is 32
examnext:
        test eax, ebx        ; in the first iteration, if the lsb of eax is 1 the
                             ; result of the 'test' instruction will be non-zero;
                             ; in the second iteration the second bit is tested,
                             ; and so on
        jz   nextbit
        inc  edx             ; the bit was 1, so count up; the number of 1 bits is
                             ; counted and stored in edx
nextbit:
        shl  ebx, 1          ; shift ebx left to examine the next bit
        loop examnext

8.11 INPUT AND OUTPUT

Input and output are complex operations in any computer system, and the I/O organization of computers is explained in Chapter 11. Unlike other operations on data, which are local to the CPU, input and output deal with external entities. Input-output is carried out with the help of different kinds of devices. Keyboard, mouse, joystick, scanners, microphone input, display screen, graphics output, audio output, CD or DVD readers and writers, and disks are some examples of I/O devices. They differ widely in their speed of operation, the way they pack or consume the data, and their synchronization requirements with the CPU, all of which complicates the I/O.

In order to coordinate the functions of the two independent subsystems (CPU and I/O), which operate independently at widely differing speeds, the computer architecture provides a feature called interrupts (explained in Chapter 11). Recall that an interrupt signal can be initiated by an external device at any time to interrupt the program being executed and thereby transfer the control flow to the execution of a specialized interrupt service program. In the case of Pentium there is also an instruction called int. By executing this instruction the software can create the effect of an interrupt signal. When int is executed in a program, its operation is somewhat analogous to the call instruction explained in Section 8.8. The execution of the int instruction invokes an 'I/O program' resident in the memory, pre-written by system programmers. I/O service programs are provided for the benefit of assembly language programmers to perform many complex input and output functions.

Such programs are part of the operating system software, and we call such a collection the 'OS kernel' or simply the kernel. The int instruction can have many different options, which are indicated to the kernel by the parameter passed as part of the int instruction. We will restrict our discussion to one form of the options, denoted by the int 0x80 instruction; it is described in the program fragment Program 8.8.

Input and output using the kernel require four different data items. They are:
1. Which kernel function is to be invoked.
2. The code to be used to address one of the many I/O devices.
3. The starting address of the 'buffer' storage where the data item input or output is located in RAM.
4. The length of the data item in number of bytes.

The int instruction uses the registers eax, ebx, ecx, and edx respectively for the above four data items. It is the responsibility of the assembly language programmer to load these registers with appropriate values before invoking the int instruction. Before calling int for an output operation, the programmer should load the buffer and store its address in ecx and the length in edx. Similarly, when int is called for an input operation, the programmer can assume that the data is stored in the buffer.

In our examples we are assuming that the operation is always successful. However, the programmer should remember that an I/O operation may be a success or a failure. If the operation is not successful an error code will be returned, and the programmer can check this code and take appropriate actions.

Program 8.8: I/O programming using the int instruction (an ASCII prompt displayed on the screen, followed by an ASCII string input read from the keyboard)

segment .data
topline db  'Please type the input string and hit return', 0xA
length  equ $ - topline

segment .bss
buffer  resb 25

segment .text
global _start
_start:
        mov  eax, 4          ; eax = 4 chooses the kernel function for output
        mov  ebx, 1          ; ebx = 1 chooses the default output device, the 'screen'

        mov  ecx, topline    ; ecx points to the buffer to be displayed
        mov  edx, length     ; edx specifies the length of the string to be displayed
        int  0x80
        mov  eax, 3          ; eax = 3 chooses the kernel function for input
        mov  ebx, 0          ; ebx = 0 chooses the default input device, the 'keyboard'
        mov  ecx, buffer     ; ecx points to the buffer where the input will be stored
        mov  edx, 25         ; the maximum number of bytes to read (the size of buffer)
        int  0x80            ; on return, eax contains the length of the input string
                             ; actually read

For simplicity, we have assumed the input and output data type to be a string of ASCII characters in Program 8.8. In general the data could be an integer, a binary number, a BCD number, a floating point number, a vector of such data types, or a table of such data types. No matter what the data type is, for a user the data is always presented in the form of ASCII characters, except when it is audio or graphics. Consider, for example, a number like 1100 (decimal 12) stored in a register that is to be displayed. First it is necessary to convert it into the two digits 1 and 2, and then to translate each digit into its equivalent ASCII form. Recall that the ASCII code for the decimal digit 2 is 0011 0010 (32 hex). Such conversions of general binary numbers to a printable ASCII form are straightforward but lengthy to program. Pre-developed and tested subroutines are generally made available for such conversions in any computer system.
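A minimal sketch of such a conversion for a two-digit number (the register choices and the buffer name outbuf are ours): divide by 10 to separate the digits, then add 30H to each digit to obtain its ASCII code.

        mov  eax, 12         ; the binary number to be displayed
        mov  edx, 0          ; clear the high half of the dividend
        mov  ebx, 10
        div  ebx             ; eax = 1 (the tens digit), edx = 2 (the units digit)
        add  eax, 30h        ; ASCII '1' (31 hex)
        add  edx, 30h        ; ASCII '2' (32 hex)
        mov  [outbuf], al    ; store the two characters in a buffer
        mov  [outbuf+1], dl  ; (outbuf is assumed to be defined elsewhere)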

8.12 MACROS IN ASSEMBLY LANGUAGE

When programs become large, it is essential to organize them to be tractable and readable. In team work, it is essential to communicate the intentions and practices of one programmer, in writing his part of the program, to another programmer. Subroutines and the associated CALL and RETURN instructions are hardware features provided in a modern computer system with this objective. Another software feature, called the macro pre-processor, is provided in support of writing readable programs in assembly language.

Referring back to the program fragment Program 8.6, we observe that we saved the context of the caller by pushing 4 registers into the stack, and at the end we restored them using four corresponding pop instructions. In the program fragment Program 8.9, we have used the macro definition facility in the assembler to define two macros called savecontext and restorecontext. The macro definition begins with the keyword '%macro' and ends with '%endmacro' in the label field. The macro body is contained between these two markers and consists of a sequence of valid assembly language instructions. When the programmer needs to use a macro, he simply names the macro like any other opcode. Please refer to Program 8.9 to see how macros are defined and used.

Program 8.9: Defining and using macros

%macro savecontext 0         ; (the 0 tells NASM that this macro takes no parameters)
        push eax             ; save the contents of eax, ebx, ecx and ebp
        push ebx             ; in the stack
        push ecx
        push ebp
%endmacro

%macro restorecontext 0
        pop  ebp             ; restore the contents of the four registers saved
        pop  ecx
        pop  ebx
        pop  eax
%endmacro

SUM:    savecontext          ; save all the registers used by SUM
        mov  ebp, esp

        mov  ecx, [ebp+28]      ; this loads the length of the array, which is 6
        mov  eax, 0             ; initialize eax to accumulate the total
        mov  ebx, [ebp+24]      ; the base address of V is stored in ebx
addnext:
        add  eax, [ebx+ecx*4-4] ; the total is accumulated by adding the last
                                ; element of V first
        loop addnext
        mov  ebx, [ebp+20]      ; the address of TOTAL, taken from the stack
        mov  [ebx], eax         ; store the total in its destination
        restorecontext          ; restores all the registers saved
        ret  12                 ; this will return to the caller and additionally
                                ; pop 12 bytes off the stack, thereby cleaning up
                                ; the parameter information placed there by the
                                ; caller

In macro assemblers, a pre-processor phase precedes the assembly phase. In that phase every occurrence of the macro name is simply substituted by the body of the macro. This is called macro expansion. An assembler is called a 'macro assembler' when it supports the definition and use of macros. When the macro facility is used appropriately by programmers, their assembly language programs become shorter and more readable.

Like subroutines, macros can also take parameters. In the program fragment Program 8.9 we have not used any parameters. We will introduce the parameter passing feature in macros with an example. Let us suppose we want to define a macro called ErrDisplay that will accept two parameters, messaddr and messlen. The two data names stand for the address of an error message stored in a buffer and its length, respectively. The macro is intended to display the error message passed, on the standard output device. In Program 8.10 we have defined this macro and have shown a typical use of it. Observe that within the macro we have saved all the registers using the pusha instruction and restored them using the popa instruction. The number 2 in the macro definition line indicates to the pre-processor that there are two parameters to this macro definition. The first parameter is referenced within the body of the macro as %1 and the second parameter is referenced as %2. During macro expansion, wherever %1 occurs in the body, the pre-processor will substitute the string that appears as the first parameter in the 'macro call', and similarly for the second parameter. We have shown two macro calls with two different sets of parameters in Program 8.10.

The body of a macro can implement any logic that the programmer desires. In doing so, he/she may use labels for some instructions within the body of the macro and refer to them in jump instructions. Remember that the pre-processor will simply substitute the macro body for every occurrence of the macro invocation. If a macro with internal labels for instructions gets invoked multiple times, the address labels inside the macro will be duplicated with every macro substitution. Multiple definitions of address labels are not acceptable. To circumvent such problems, all address labels for instructions within the body of a macro definition are prefixed by %%. Every time the macro is invoked, the pre-processor generates unique labels where %% appears, so that duplicate label definitions are avoided. This feature is not depicted in Program 8.10, but see Exercise 12 of this chapter.
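As a sketch of a macro that does need an internal label (our own example, in the spirit of Exercise 12), consider a macro that leaves the absolute value of its parameter in eax; the jump target must be written with %% so that each expansion gets a fresh label:

%macro absval 1
        mov  eax, %1
        cmp  eax, 0
        jge  %%done          ; %%done expands to a unique label every time
        neg  eax             ; negate to get the absolute value
%%done:
%endmacro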

Program 8.10: Macro with parameters

%macro ErrDisplay 2
        pusha
        mov  eax, 4          ; eax = 4 chooses the kernel function for output
        mov  ebx, 1          ; ebx = 1 chooses the default output device, the 'screen'
        mov  ecx, %1         ; ecx points to the buffer to be displayed
        mov  edx, %2         ; edx should contain the length of the string to be
                             ; displayed
        int  0x80
        popa
%endmacro

segment .data
mess1   db  'This is error Message one', 0xA
length1 equ $ - mess1
mess2   db  'This is error message two', 0xA
length2 equ $ - mess2

segment .text
MAIN:
        ....                          ; some other instructions
        ErrDisplay mess1, length1     ; first call of the macro
        ....                          ; some other instructions
        ErrDisplay mess2, length2     ; second call of the macro
        ....                          ; other instructions here

8.13 INSTRUCTION SET VIEW OF COMPUTER ORGANIZATION

An assembly language programmer views a computer system through the instruction set of that computer and the semantics of those instructions. In order to develop programs, he/she also has to understand the organization of the various registers and their roles in a given computer organization. In this chapter we have viewed a small subset of the design features of the Pentium processor and some of its organizational features. The instruction set that we presented can be summarized into the following categories:
1. Load registers and store registers (the move instruction in the case of Pentium)
2. Arithmetic instructions on different data types
3. String matching through compare or test instructions
4. Conditional and unconditional jump instructions
5. Bit oriented instructions for logical and shift operations

6. Stack based instructions
7. Subroutine call and return instructions
8. Instructions in support of I/O operations

Programming at the hardware level using the machine language would be very cumbersome, as we would have to deal with long sequences of bits and codes in binary form. Hence, the basic software provided for programmers is the assembler. The assembler permits a programmer to use meaningful symbols instead of binary sequences. The instructions have mnemonics, and addressing can be done using data names and instruction labels. Thus a program written by one programmer can be read and understood with ease by another programmer, or by the same programmer at a later time, for further improvements or for maintenance. In addition, the assembler provides features like macros, which act like an instruction but at the software level.

The instruction set of a computer is fixed at the time of design and manufacturing. However, while developing application software, the software engineer might wish to have additional instructions that would suit his task. Programmers using higher level languages like JAVA or C++ are mostly insulated from the level of details that we have been discussing here, because the compiler takes care of the translation to machine instructions. However, the compiler designers and operating system designers are very much concerned with the instruction set view of computer systems. In this introductory book we have not addressed many issues that are found in advanced textbooks that deal with operating systems or compiler writing and their needs regarding the architectural and organizational features of a computer system.

The ultimate end users of computer systems are people in every walk of life. They do not view the computer system through the instruction set but through the 'user interface' of the application software that they use. The ultimate goal of computer organization is to provide a flexible and effective tool for software developers to create large and complex application software that can be used reliably and efficiently. If we take the view of the application software developers instead of the end users, they are concerned with a different set of features in a computer system.

8.14 ARCHITECTURE AND INSTRUCTION SET

Let us interpret the term 'architecture' as one would in the case of building architecture in civil engineering. In that case the architectural view is concerned with how the building appears to the end user, and is not concerned with the details of how the structure is supported or the strength of the building.

As an example, consider a computer system running one of the modern operating systems that uses windows. Several windows may be open: a chat session may be going on, computer music may be playing in the background, and so on. The application software typically starts several 'processes' that will be running simultaneously. All these processes have to share common resources such as the RAM space, the screen space and the CPU time for processing. At the least, we want these processes to be protected mutually from each other; that is, an error in one should not propagate to and destroy the other.

The segment registers of the Pentium processor provide an independent segmented view of the RAM space for protection. Sharing common resources would need a resource manager which operates at a higher priority level than the application software.

In a real computer system it is a common occurrence that multiple entities, like processes or devices, compete for the services of a central manager of resources. Examples include multiple processes competing for the CPU time, or multiple I/O devices competing for the attention of the CPU. In such cases the competing entities are assigned priorities so that simultaneous requests can be resolved smoothly. We would need hardware elements to store the priority codes of the different entities, and hardware circuits to recognize the priorities of the entities requesting services and to resolve which entity gets the service. Very often it becomes necessary not to keep the priorities static but to change them dynamically. Specialized instructions are provided to do this in any large computer. The central software process which sets these priorities is given different privileges to execute such instructions compared to the competing entities.

Another architectural feature that we need is a 'timer'. Time is an essential parameter that needs to be measured and used for controlling various events in a complex system. For example, multiple processes timesharing a single fast processor are normally each given a slice of time, in turn, for their own computational needs. As every computer has a master clock running at gigahertz speed, providing the timer is as simple as having a counter register. There should be specialized instructions to load and store such registers. Instructions dealing with such timer registers have to be privileged instructions that are available for manipulation only by certain highly reliable and responsible processes.

In 'real time computing' a computer system is used to monitor the conditions of an external process and control its parameters in the real time of the external process. In modern applications, there are many such real time external processes that are controlled by computers. Examples include household appliances like microwave ovens; transportation systems like traffic lights, space crafts and auto pilots; and chemical processes like sugar mills, paper mills and textile mills.

The cost of a single processor, and of the computer based on such processors, has come down dramatically these days. At the same time the communication network capable of connecting many such computers as a network has also improved dramatically in speed, capacity and cost. As a result, multiple computers and processors can be employed to function in parallel to solve complex problems. Based on how close the coupling is between the various computers connected together, we have 'distributed computers' and 'parallel processing computers'. One of the most popular forms of distributed computing is based on the 'client server architecture'. Parallel computers are discussed in Chapter 13.

SUMMARY

1. Before learning an assembly language of a computer we need to learn about the memory organization of that computer, its registers and the operations supported and executed in the hardware.
2. Learning the assembly language amounts to learning the syntax and semantics of various instructions.
3. Assembly language makes use of symbolic addressing for data and instructions.
4. An assembler is a software and assembly is the process that the software implements to create an equivalent machine language program.
5. Symbol table is a dynamic table constructed and used by the assembler. It stores the one-to-one association between symbols and the RAM addresses allocated for them in the assembly process.
6. Assembler instructions or pseudo instructions are like hardware instructions but are executed in the software.
7. Macros and subroutines are useful features in the development of large programs.
8. Macros can have zero or more parameters and differ from subroutines in several ways.
9. Stacks are quite useful to link subroutines.
10. Assembly language programs are much lengthier and harder to understand than higher level language programs. They are also machine-dependent.

EXERCISES

1. Create a table of all the instructions of Pentium that you have learnt in this chapter. Compare this table with the instruction set of SMAC introduced in Chapter 6. What types of instructions are common and what types are different?
2. Compare and contrast the register structures of SMAC++ with that of Pentium.
3. In some computers there are instructions to LOAD specified registers from memory and to STORE a register into memory. In the case of Pentium this is achieved with the MOV instruction. What are the advantages and disadvantages of these?
4. Pentium uses 16-bit long segment registers. Understand the way they are used in memory addressing and answer 'what limitations the 16-bit length imposes' with regard to the calculation of the effective address of the memory for instruction or data access.
5. Write a macro called MAX with two parameters X and Y that will return the larger of X and Y. Then write a program to read p, q, r, s and call the macro first to compute pm = Max(p, q), second to compute rm = Max(r, s), and third to compute large = Max(pm, rm).
6. Write a program to compute the dot product of two vectors.
7. Write an efficient subroutine to multiply the two given matrices A and B and store the result in C. Will the use of subroutines simplify the development of your program?
8. A string S is called a palindrome if it reads the same from left to right and from right to left. Using the stack, write a program to test if a given string S is a palindrome or not.
9. If S and T are two strings, write a program to check if one is a substring of the other.
10. Write a program to check if the number of 0s is greater than, less than, or equal to the number of 1s in a given 32-bit word. If this program were to be executed over several thousand 32-bit words, what result do you expect for this comparison and why?
11. In what ways do the two features 'macros' and 'subroutines' resemble each other and in what ways do they differ from each other?
12. The loop instruction in Pentium provides a convenient way to execute a set of instructions iteratively for a predetermined number of times. In a single use of this facility what are the savings obtained? Express it as the number of instructions saved due to the use of the loop instruction.

MEMORY ORGANIZATION 9

LEARNING OBJECTIVES

In this chapter we will learn:

• The different parameters of a memory system.
• A memory cell as the building block of a semiconductor memory.
• Dynamic and Static Random Access Memories (RAMs).
• Integrated circuit chips to fabricate memory systems.
• 2D and 2.5D organizations of memory systems using IC chips.
• Importance of error detection and correction in memory systems.
• Read Only Memories (ROMs) and their applications.
• Dual ports in a RAM and concurrency in memory access.

9.1 INTRODUCTION

In Chapter 3 we saw how flip-flops could be organized to form storage registers. Registers are normally used for temporary storage of a few items. For storing the bulk of data needed in digital computation, a Random Access Memory (RAM) is used. A memory unit consists of a large number of 'binary storage cells', each cell storing one bit. Besides the billions of storage cells, a memory has a small number of registers to facilitate storage and retrieval of data in units of bytes or groups of bytes called words.

9.2 MEMORY PARAMETERS

A set of binary cells are strung together to form a unit called a word. A word is always treated as an entity and moves in and out of memory as one unit. A typical word consists of 4 bytes or 32 bits. In order that a physical device is usable as a binary storage cell in a memory unit, it must have the following desirable characteristics:

1. It must have two stable states. They are usually denoted as 0 and 1.
2. While it is in one of the stable states, it should not consume any power. If it does consume power, it must be small so that the total energy dissipated by the memory is small.
3. It should be possible to switch between the two stable states an infinite number of times.
4. The data stored in a cell should not decay with the passage of time.
5. Each binary cell should occupy very little space.
6. The cost of each cell must be low.
7. The time taken to read data from a group of cells (word) or for storing data in them must be small. This is also known as the cycle time.
8. When power is turned OFF, the cell should not lose the data stored in it.

Semiconductor RAM cells lose the data stored in them when the power is turned off. This is called a volatile memory. On the other hand, cells used in ROMs do not lose the data stored in them when power is switched off. These are called non-volatile memory cells. Other non-volatile memories are magnetic surface recording and laser disks. Binary cells which are currently popular are semiconductor ICs used in RAMs (Random Access Memory) as the main memory of a computer system, and magnetic surfaces on a disk or tape and pits/lands on a laser disk used in secondary memory. There is also another type of semiconductor storage cell used in memories called Read Only Memory (ROM).

From our discussions above it is clear that a memory unit is organized in such a way that it has a number of addressed locations, each location storing a word. The two typical registers used in a memory system are the Memory Address Register or MAR and the memory data register, which is also known as the Memory Buffer Register or MBR. The movement of words in and out of memory is controlled by signals called write and read signals. A word to be written in the memory unit is first entered in the memory buffer register (MBR), and the address where it is to be written is entered in MAR. A write signal is now initiated. The word in MBR is copied into the address specified in MAR. In order to read a word from memory, the address from where it is to be read is entered in MAR. When a read signal is sent to the memory unit, the word is copied from the specified address and placed in MBR, where it remains until it is transferred to another register in the CPU.

The addresses normally start at 0, and the highest address equals the number of words that can be stored in the memory, which is called its address space. For example, if a memory has 1024 locations, then the address ranges between 0 and 1023. 1024 is usually abbreviated as 1K. Figure 9.1 depicts the block diagram of a memory system. This memory is assumed to store 1024 words (i.e. 1K words) with 32 bits per word. In order to store the value of the address, the MAR register, in this example, should have 10 bits.

FIGURE 9.1 Block diagram of a memory (1024 words of 32 bits each, a 10-bit MAR, a 32-bit MBR and memory control with read and write signals).

If a number 64 is to be stored in location 515, then this number is placed in MBR. The address where the number is to be stored is entered in MAR. The write signal is then initiated. This signal replaces the current contents of the location 515 by the contents of MBR. The time interval between the initiation of a write signal and the storing of the data in the specified address in the memory is called the write time.

If the contents of some location are to be read, then its address is entered in MAR. The read signal is initiated by the control unit. The contents of the specified location are copied into MBR. The time interval between the initiation of a read signal and the availability of the required word in MBR is known as the access time of the memory.

Whether read out is destructive or not depends on the device used as a binary cell in the memory. Flip-flop binary cells are non-destructive: the contents of the specified location are copied into MBR and whatever data was in the selected location is left undisturbed. This is called non-destructive read out. Reading from binary cells made with capacitors, in contrast, is destructive. Reading data from a memory must be non-destructive for the memory unit to be useful in a data processing system. If reading data from memory is destructive, then whatever was read from the selected memory cell must be written back, that is, it is necessary to write it back in memory to preserve its contents. The time required for this combined read and write operation is known as the memory cycle time.
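The relation between the number of addressable words and the width of MAR can be checked mechanically. The following is a minimal sketch in C (not part of the original text; the sizes are those of the example of Figure 9.1):

#include <stdio.h>

/* Number of address bits needed to address 'words' locations:
   the smallest n such that 2^n >= words. */
static unsigned address_bits(unsigned long words)
{
    unsigned n = 0;
    while ((1UL << n) < words)
        n++;
    return n;
}

int main(void)
{
    unsigned long words = 1024;                           /* 1K words */
    printf("MAR width : %u bits\n", address_bits(words)); /* 10 */
    printf("Addresses : 0 to %lu\n", words - 1);          /* 0..1023 */
    printf("MBR width : 32 bits (one word)\n");
    return 0;
}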

Even if reading from memory is non-destructive, the time that should elapse between two successive references to memory for read or write is larger than the access time. This time is the cycle time of the memory. Figure 9.2 illustrates these terms.

FIGURE 9.2 Read/Write time (t0: address in MAR; t1: word in MBR; t2: ready for next access; access time = t0 to t1; rewrite time = t1 to t2; cycle time = t0 to t2).

The method of accessing memory depends on the particular device used to construct the binary cell and how the devices are interconnected to form a memory. Memory systems may be constructed with IC flip-flops in such a way that the access time is independent of the address of the word. Such a memory is known as a random access memory or RAM. In contrast to this, if the binary cells are on the surface of a magnetic tape or disk, then the access time would depend on the actual physical location of the word.

9.3 SEMICONDUCTOR MEMORY CELL

In the early days (1955–70) magnetic cores were used as the storage elements of the main random access memories of computers. With the rapid development of integrated circuits, semiconductor storage elements have replaced magnetic cores. There are two types of semiconductor storage elements. One is a dynamic memory cell and the other a static memory cell. A dynamic cell uses a capacitor whereas a static cell uses an RS flip-flop fabricated with transistors to store data. Memories made using dynamic memory cells are called Dynamic Random Access Memories (DRAMs for short) and those fabricated with static cells are called Static Random Access Memories (SRAMs for short). SRAM cells are faster compared to DRAM cells. Memories fabricated using SRAM cells have an access time of around 15 ns whereas DRAMs have an access time of around 80 ns. The DRAM is, however, preferred for fabricating main memories as it is possible to realize an order of magnitude more memory cells per chip compared to SRAM. The cost per cell is thus an order of magnitude lower in DRAM. We will first see how the dynamic and static cells function.

9.3.1 Dynamic Memory Cell

Figure 9.3 illustrates a simple dynamic storage cell. In this figure we have simplified the circuit to its essentials in order to explain the working of the cell. It incorporates a transistor T (called a pass transistor) which controls the charging of a capacitor C. The pass transistor is connected to an address line and a bit/sense line (see Figure 9.3). A cell is selected for writing or reading by applying a voltage V to the address line.

FIGURE 9.3 A dynamic storage cell (a pass transistor T, gated by the address line, connecting a capacitor C to the bit/sense line).

To write a 1 in the cell a voltage V is applied to the bit/sense line. This switches on T and C is charged to voltage V. If 0 voltage is applied to the bit/sense line, then if C is charged it will discharge and a 0 is stored. Capacitance C is not an ideal capacitance. It has a very large but finite leakage resistance. The capacitor is of the order of 0.1 picofarad (pico = 10^-12) and can hold only a very small charge when it is charged. Thus the charge stored in C when a 1 is written will slowly leak away (in a few milliseconds) and the data will be lost. It is therefore necessary to rewrite the data periodically. This is called refreshing.

To read a cell, the address line is selected and a voltage V is applied to it. This switches on the pass transistor T. If a 1 is stored in the cell, the voltage of the bit/sense line will tend to go up to V, and if a 0 is stored in the cell, it will tend to go down to 0. The direction of change of voltage in the bit/sense line is sensed by a sense amplifier. A positive change is taken as a 1 and a negative change as a 0. As the charge stored in C is very small, reading data from the cell is a little tricky. Observe that the read operation is destructive and a write should follow a read.

9.3.2 Static Memory Cell

A static memory cell is essentially a flip-flop. The cell is illustrated in Figure 9.4(a) within a circle. The flip-flop consisting of transistors T1 and T2 can be in one of two stable states. One stable state is with T1 conducting and T2 non-conducting, and the other is with T2 conducting and T1 non-conducting. When T1 is conducting, the voltage at the point P (Figure 9.4(a)) is nearly 0. This voltage is applied to the gate of transistor T2, which keeps it switched off. The voltage at Q is thus at V. In this state the memory cell is said to be storing a 0. The opposite state is with T2 conducting and T1 non-conducting. In this case the voltage at P would be V (logic 1) and that at Q will be 0 (logic 0). In this state the memory cell is said to store a 1.

FIGURE 9.4(a) A static MOS cell (cross-coupled transistors T1 and T2 with outputs P and Q, connected through switches T3 and T4 to bit/sense wires B0 and B1 and selected by the word wire W).

To summarize, we observe that the two-transistor memory cell can be in one of two stable states. One state, with the voltage P = V, is (arbitrarily) called the 1 state and the other state, with P = 0, is called the 0 state.

9.3.3 Writing Data in Memory Cell

In order to use this memory cell in a memory system, we connect the cell to a pair of bit wires B0 and B1 and a word wire W via two transistor switches as shown in Figure 9.4(a). The transistor switches are necessary to select a cell for reading or writing. The two bit wires B0 and B1 are normally kept at a voltage V and the word wire W is kept at 0. This keeps the two transistor switches T3 and T4 open.

If a 0 is to be written in the cell, then the cell is selected by applying a voltage V to the word wire W. The voltage on the bit wire B0 is taken to 0. Applying a 0 to the B0 wire will switch on transistor T3 and the point P will be forced to 0 volt. If the cell had a 1 stored in it, then the voltage at P would have been at V; forcing P to 0 will switch off transistor T2, taking point Q to V, which will turn on T1 and hold P at 0. If B0 now goes back to its normal voltage V, then T3 is switched off and the voltage at P will remain at 0. If the cell had a 0 to begin with, then taking B0 to 0 will tend to keep P at 0 and thus the state of the memory cell will not be altered. Thus, regardless of what is currently stored in the cell, the cell will have a 0 stored in it.

If a 1 is to be written in the cell, the bit wire B1 is taken to 0 volt. This will switch on T4 [Figure 9.4(a)] and take point Q to 0. When Q goes to 0, transistor T1 will be switched off and P will go to V. Thus, the memory cell will be in the 1 state. If the cell originally was in the 1 state, then Q would have been at 0, and taking B1 to 0 would not change the state of the cell. Figure 9.4(b) summarizes the writing of 0 or 1 in the memory cell.

FIGURE 9.4(b) Writing 0 or 1 in a MOS cell (waveforms of the word wire W, bit wires B0 and B1 and cell output P while the cell is selected).

9.3.4 Reading the Contents of Cell

For reading the contents of a cell, a small voltage is applied to the word line, keeping the voltage on the bit wires equal to V. This makes both transistors T3 and T4 tend to conduct. If the point P is at 0 volts (in other words, if the cell stores a 0), then a current flows through T3, bringing down the voltage on line B0 slightly. This wire is connected to a sense amplifier which will detect this change and store it as a 0 in the memory buffer register. If P is at V, then Q would be at 0 volts. In this case the small voltage applied to W would cause transistor T4 to conduct and the voltage on bit wire B1 will dip slightly, and this will be detected by the sense amplifier. The sense amplifier output will be stored as a 1 in the memory buffer register. The sensing is non-destructive, since the word pulse only senses which state the cell is in without changing the state of the cell. Figure 9.5 illustrates the reading operation.

FIGURE 9.5 Reading with MOS cell (waveforms of W, P, B0 and B1 during a read).

The memory cell may be represented by the equivalent block diagram of Figure 9.6. We will use this block diagram to describe memory organization in the next two sections.

FIGURE 9.6 Block diagram of a static memory cell (a cell with bit wires B0 and B1 and word wire W).

9.4 IC CHIPS FOR ORGANIZATION OF RAMS

The IC manufacturers package millions of memory cells into a chip which forms the building block for the organization of large size RAMs whose capacity can be of the order of several hundreds of MB or a few GB. One of the limiting factors of such packaging has been the number of different pins which connect the internal circuitry of the chip to the external world. The block diagram of an example IC chip is shown in Figure 9.7. In this figure we have shown a chip that has 24 pins and internally contains 256K addressable memory cells. Each of these cells, in this example, is one bit in size and is individually addressable; each cell when read or written stores one bit. Thus, we need 18 address pins (labelled A0 to A17) to select one out of 256K cells. The data input pin and the data output pin are separate in the case of this chip (see Table 9.1), but in some other cases the same data pins are used for both input to as well as for output from the chip.

FIGURE 9.7 256K × 1 RAM chip pin out representation (address pins A0–A17, chip enable E, write enable W, data input D, data output Q, supply VCC and ground VSS on a 24-pin package).

This chip's access time is around 15 nanoseconds. This memory chip is packaged as a 24-pin integrated circuit. In addition to the address and data pins, a typical memory chip has other control pins such as CS (chip select or chip enable) and R/W (read or write operation on the enabled chip), besides VCC for power and VSS for ground.

TABLE 9.1 Pin Assignment for 256K × 1 SRAM Chip

Pin names   Functions         Pin names   Functions
A0–A17      Address input     Q           Data output
E           Chip enable       VCC         +5V supply
W           Write enable      VSS         Ground
D           Data input

The truth table for selecting a chip when it is used as a part of a large memory and for reading and writing is given in Table 9.2.

TABLE 9.2 Truth Table to Enable, Read/Write in SRAM Chip

E      W            Mode           Output           Cycle
High   Don't care   Not selected   High impedance   —
Low    High         Read           Data out         Read
Low    Low          Write          High impedance   Write

For the sake of drawing the organization of a large size RAM using such chips as building blocks, it is customary to abbreviate Figure 9.7 as shown in Figure 9.8. The reader should note the convention used to indicate multiple lines by means of a single line with a '/' mark and an associated integer. As an exercise, the reader is asked to draw an abbreviated diagram for an IC chip that has 16M addressable elements where each element is a 4-bit group (half of a byte). Typically we denote such a chip as 16M × 4.

FIGURE 9.8 Abbreviated representation of 256K × 1 SRAM chip (an 18-line address, data in, data out, write enable and chip enable).
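Since Table 9.2 describes purely combinational selection behaviour, it can be mimicked by a small behavioural model. The sketch below is a hypothetical C model of our own (the function name and the use of -1 for the high impedance output are conventions we have chosen for illustration, not the manufacturer's):

#include <stdio.h>
#include <stdbool.h>

#define WORDS (256 * 1024)       /* 256K one-bit cells */

static bool cells[WORDS];        /* internal storage of the model */

/* One access cycle, following Table 9.2. E and W are active low:
   E high -> chip not selected (output high impedance, modelled as -1);
   E low, W high -> read cycle; E low, W low -> write cycle. */
static int sram_cycle(bool E, bool W, unsigned addr, bool D)
{
    if (E)                       /* not selected */
        return -1;               /* high impedance */
    if (W)                       /* read cycle */
        return cells[addr];
    cells[addr] = D;             /* write cycle; output stays high-Z */
    return -1;
}

int main(void)
{
    sram_cycle(false, false, 515, true);             /* write 1 at 515 */
    printf("%d\n", sram_cycle(false, true, 515, 0)); /* read -> 1 */
    printf("%d\n", sram_cycle(true,  true, 515, 0)); /* not selected -> -1 */
    return 0;
}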

9.5 2D ORGANIZATION OF SEMICONDUCTOR MEMORY

Let us suppose that we want to organize a 4 MB memory using the 256K × 1 chips described in the previous section. How many such chips will we need? Conceptually, if we put 8 such chips in a row, each chip contributing one bit of a byte, then we can construct a 256 KB memory system. This 256 KB memory will need an 18-bit address. The address lines can be fed as A0 to A17 to each of the 8 chips in a row so as to select one out of 256K elements. Since we need 4 MB, sixteen such rows can be used together to make up the total memory. A memory with 4 MB capacity will need a 22-bit address, as 2^22 = 4M. The high order 4 bits of the 22-bit address can then be used to select one out of 16 rows in the memory organization.

In a 2D organized memory system the chips are thus laid out as several rows and several columns. Figure 9.9 illustrates the organization described above in the form of 16 rows and 8 columns. A 4-bit decoder is used to decode the high order 4 bits of the MAR. The decoder has 16 outputs, one output for each row, and these 16 decoder outputs are connected to the 16 rows of chips. The decoder output will select one (exactly one) of the 16 rows, and all the chips in the selected row will be enabled for reading or writing. The 8 bits coming from the 8 chips (one bit from each chip) in the selected row are routed to the 8 bits of the MBR of this memory system.

FIGURE 9.9 A 4 MB memory constructed using 256K × 1 chips (an 18-bit chip-internal address fed to all chips, a 4-bit row decoder selecting one of 16 rows, and read/write control).
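The decoding just described is simply a split of the 22-bit address into two fields. A small C sketch of our own (the example address is arbitrary):

#include <stdio.h>

/* Split a 22-bit address of the 4 MB memory of Figure 9.9 into the
   4-bit row select (decoded to enable one of 16 rows of chips) and
   the 18-bit address A0..A17 fed to every chip in that row. */
int main(void)
{
    unsigned long addr = 0x2ABCDE & 0x3FFFFF;  /* some 22-bit address */
    unsigned row       = (addr >> 18) & 0xF;   /* high order 4 bits   */
    unsigned long chip = addr & 0x3FFFF;       /* low order 18 bits   */

    printf("row select   : %u of 16\n", row);
    printf("chip address : %lu of 256K\n", chip);
    return 0;
}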

that is. Our first step is to determine the number of chips required. These 16 chips will be organized in the form of a matrix with 4 rows and 4 columns as shown in Figure 9.Memory Organization 283 Let us consider another example. In this example we are asked to organize a memory 128M ´ 32. The principle of operation in this example is similar to the previous example. FIGURE 9. We will assume that we are given the IC chips of size 32M ´ 8 or 32 MB which has 32M addressable units with each unit being a byte. it should have 128M addressable units where each unit (word) is 32 bits long. In this case we will need 16 IC chips. The IC chip will have 25(225 = 32M) address pins and 8 data pins shared for both input and output. The 2 higher order bits of the memory address will be used to select one of the 4 rows and the 4 chips in the selected row together will supply the 32 bits of that word. The memory system will need a 27-bit address to address one of the 128M words where each word is 32 bits long. .10.10 A 128M ´ 32 memory system constructed using 32M ´ 8 chips.

9.6 2.5D ORGANIZATION OF MEMORY SYSTEMS

In the 2D organization an address decoder is used to select one out of W addressable units which we called words. Let us call the decoder outputs the word wires. The electrical signals through the word wires perform the function of chip selection. The number of word wires in a 2D organized memory system is equal to the number of words in the memory. Obviously the number of word wires in a memory system will depend upon the chip size given and the size of the memory system required. When the number of word wires runs into hundreds and thousands, the address decoder can become too complex. Another organization called 2.5D (read as 'two-and-half D') organization allows the use of simpler decoders. In this case the bits to be decoded are divided into row decoder bits and column decoder bits.

We will illustrate this organization by examining a 256 word, 1 bit/word memory. Figure 9.11 illustrates the way the cells are arranged in this memory. The 256 words are organized in the form of a 16 × 16 matrix. The 8-bit word address required for the 256 words is split into 4 bits for the row address and 4 bits for the column address. The row bits are decoded by a decoder which has 4 inputs and 16 outputs in this example. During the memory READ operation the 4-bit row address will select one of the 16 rows. All the 16 bits in the selected row will be passed to the column MUX/DEMUX unit. This unit does multiplexing in one direction and demultiplexing in the other direction. The 4 column address bits of the memory address will be used as the selector bits for this multiplexer/demultiplexer unit, and the memory system output is the one bit output from this unit. During the WRITE operation, the row address bits select one of the 16 rows. The bit to be written is input to the MUX/DEMUX unit. The bit will be routed only to the appropriate column in the selected row, as determined by the 4 column address bits that are used as the selector bits for the multiplexer/demultiplexer unit. This form of selection of a word is also known as coincidence selection.

FIGURE 9.11 A semiconductor 2.5D memory (256 W, 1 b/w): a 4 × 16 row decoder fed by A0–A3, a 16 × 16 cell array, a 16 × 1 MUX for reading and a 1 × 16 DEMUX for writing, with A4–A7 as the selector bits.

In the above example, we have used the trivial case of 1 bit/word. When more than one bit per word is required, identical matrix and selection units can be modularly constructed as one unit per bit. We will illustrate this by showing how a 64 KB SRAM is organized as a 2.5D memory. The organization is illustrated in Figure 9.12. The 64-KB memory is made up of an array of eight 256 × 256, 1 bit output memory chips; observe that there are 8 such arrays so as to get a 1 byte output. The organization uses an 8 × 256 row decoder to select one of the 256 rows. Observe that the 256 outputs of the decoder are simultaneously fed to all eight 256 × 256 chips. Having selected a row, if a read signal is initiated all the 256 outputs in this row will be read. To select one of these, each chip has a 256 × 1 MUX to which the 8-bit column address is fed; the column address A8 to A15 is applied simultaneously to all the MUX/DEMUXes. Writing is by placing 1 bit at each of the eight 1 × 256 DEMUXes.

FIGURE 9.12 A 64-KB memory using eight 256 × 256 bit chips organized as a 2.5D memory (address bits A0–A7 to the row decoder and A8–A15 to the MUX/DEMUX of each chip).

In commercial applications further reduction in the number of pins can be achieved if we time multiplex the same pins for both row selection and column selection. Then READ/WRITE will be done as a two-step process: in step 1 the pins will be used to carry the row address bits and in step 2 the same pins will be used to carry the column address bits. This can greatly reduce the complexity of the chip by reducing the number of pins, but will increase the memory cycle time, thus resulting in a trade-off.
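In this organization, address decoding again reduces to splitting the word address into two halves. A small C sketch of our own for the split used in Figure 9.12 (the example address is arbitrary):

#include <stdio.h>

/* 64K one-byte words: A0..A7 go to the 8x256 row decoder, A8..A15
   select one of the 256 columns through the MUX/DEMUX of each of the
   eight 256 x 256 chips (one chip per bit of the byte). */
int main(void)
{
    unsigned addr = 0xBEEF & 0xFFFF;      /* a 16-bit word address   */
    unsigned row  = addr & 0xFF;          /* A0..A7: one of 256 rows */
    unsigned col  = (addr >> 8) & 0xFF;   /* A8..A15: one of 256 cols */

    printf("row %u, column %u\n", row, col);
    return 0;
}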

Commercially available memories are packaged with multiple IC chips on a printed circuit board in such a way that all the pins form a single line. They are known as SIMMs or single in-line memory modules (see Figure 9.13). They can be easily plugged into the motherboard of a computer system, connecting the memory system to the system bus.

FIGURE 9.13 30 pin single in-line memory module (SIMM).

The design considerations in memory systems using IC chips are chip count, pins per chip and ease of expansion. For example, a 64K × 1 chip has 16 address lines, 1 read/write line, 1 chip enable, a data-in line and a data-out line, 1 power supply line and 1 ground line, for a total of 22 pins. In contrast, an 8K × 8 chip, which has the same memory capacity of 64K bits, has 13 address lines, 8 bidirectional data lines, 1 read/write, 1 chip enable, 1 power supply line and 1 ground line, for a total of 25 pins. It is quite often advisable to use chips such as 64K × 1 or 256K × 1 which give a 1-bit output. On the other hand, if a 64 KB memory is to be built, we can use 2 chips of 32K × 8 as opposed to 8 chips of 64K × 1; this would involve fewer soldered connections and hence give higher reliability.

9.7 DYNAMIC RANDOM ACCESS MEMORY

In this section we will examine the organization of a typical Dynamic Random Access Memory (DRAM). Due to their lower cost, DRAMs are commonly used as main memories in PCs and workstations. The main differences between DRAM and the static RAM or SRAM are as follows:

1. DRAM is much cheaper than SRAM and allows denser packing of bits per chip.
2. DRAM needs periodic refresh (to keep the contents from being lost).
3. DRAM is a destructive read-out memory.
4. DRAM has a slower cycle time than SRAM.

In fact, up to 256M bit-cells have been realized on a single chip, and this number keeps increasing as the technology advances. We will consider a typical commercially available DRAM chip and how it is organized (Motorola MCM511000A). This chip is a 1 Mbit (1M × 1) CMOS DRAM. The access time of this chip is 70 ns. It has 18 pins whose assignment is given in Table 9.3.

TABLE 9.3 Pin Assignment of 1M × 1 DRAM

A0–A9   Address bits (row and column addresses multiplexed)
D       Data input
Q       Data output
W       Read/Write enable
RAS     Row Address Strobe
CAS     Column Address Strobe
VCC     +5V power
VSS     Ground
TF      Test function enable

Observe that 20 bits are needed to address 1M bits whereas this chip has only 10 address pins. These pins are used to apply the 10 row address bits and the 10 column address bits by time multiplexing these signals. The 10 bits used to select a row and the 10 bits used to select a column together address one bit in the (1M × 1) memory. The block diagram of the organization of this memory is given in Figure 9.14. Observe that the bits are organized into 2 banks of (512 × 1024) bits. Thus, there are 512 physical rows and 2048 columns. This is done to reduce the number of rows, which facilitates refreshing as we will see later in this section. (Observe in Figure 9.14 that 1 bit of the row address buffer is diverted to the column decoder.)

The chip has two internal clock generators, which we call clock #1 and clock #2. The active transition of RAS is strobed by clock #1 and the 10 row address bits are selected. Following this, CAS transitions and is strobed by clock #2, and the 10 bits appearing on A0 to A9 are used to select the column address. For reading, W is set high and the row and column of the bit to be read are strobed. The bit is read out into the buffer. The data output is tristate to enable tying together the outputs of many chips to allow the design of larger memories. When CAS transitions to inactive, the output goes to the high impedance state. For writing, the data in the D buffer is referenced by the column address line and written at the end of strobing the row and column by the RAS and CAS signals.

The other modes are a Read/Write cycle, in which a write follows a read, and a page mode. Page mode operation keeps RAS active and toggles CAS between VHI and VLO while selecting successive columns on the selected row. The page mode allows fast successive data access to all 2048 cells of a selected row in the two banks of the 1M bit DRAM. Page mode is useful to read a chunk of a program or data from DRAM and store it in an SRAM (called a cache memory) from which individual words can be accessed faster. (We will see this use in greater detail later in the book.)
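The time multiplexing of the 20-bit cell address over the ten pins A0–A9 can be sketched as below. This is a behavioural illustration of our own, not the chip's electrical interface, and the example address is arbitrary:

#include <stdio.h>

/* A 20-bit cell address of the 1M x 1 DRAM is presented on the ten
   address pins in two steps: the row half is latched on the active
   transition of RAS, the column half on the active transition of CAS. */
int main(void)
{
    unsigned long cell = 0xA5F3C & 0xFFFFF;   /* 20-bit cell address */
    unsigned row = (cell >> 10) & 0x3FF;      /* strobed in with RAS */
    unsigned col = cell & 0x3FF;              /* strobed in with CAS */

    printf("pins A0-A9 while RAS falls: %03X\n", row);
    printf("pins A0-A9 while CAS falls: %03X\n", col);
    return 0;
}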

FIGURE 9.14(a) A dynamic random access memory (the 1M × 1 organization: row and column address buffers fed from A0–A9; RAS and CAS strobing internal clocks #1 and #2; a refresh counter and refresh control; a 512-row × 2048-column memory array with row decoder, column decoder, sense amplifiers and I/O gating; and data-in/data-out buffers).

FIGURE 9.14(b) Typical arrangement of bits in the memory array (rows selected by the row address; columns by the bit/sense lines, i.e. the column address).

Refresh cycle: As data is stored as charge on capacitors in a DRAM, periodic recharging of the capacitors, known as the refresh cycle, is necessary. The maximum time allowed between memory cell refreshes depends on the specific design of the DRAM. Typically it is a few milliseconds. The DRAM is refreshed by cycling through the 512 rows within the specified refresh time (which is 8 ms in the MCM511000A DRAM). When a row is selected for refresh, all 2048 cells in that row are refreshed. If the number of rows were larger, it would be difficult to cycle through all of them within the specified time, and that is the reason the memory cells are arranged as two (1024 × 512) banks. Observe also that a normal read, write or read/write operation will also refresh all the 2048 cells in the selected row. There are other modes of refreshing which we will not discuss.

A more recent chip is Infineon's 256 Mb Double Data Rate SDRAM (DDR 400A). The clock rate is 200 MHz and two data transfers take place per clock; this is called double data rate (DDR). The 256-Mb chip can be configured as 64M × 4b, 32M × 8b or 16M × 16b. The particular configuration is selected by using 2 bits available on external pins. It has 66 pins, out of which 13 are for the address. With 13 pins we can address only 8192 locations; these pins are therefore used for 13 row and 13 column addresses by time multiplexing, giving a maximum of 2^26 = 64M address space. The memory is internally organized in 16 planes of 8192 rows and 512 columns; there are 4 such sets which can be accessed to give a 4-bit output per address specified. This memory has a 7.8 μs maximum average refresh interval (8192 row refresh).

9.8 ERROR DETECTION AND CORRECTION IN MEMORIES

Memory is an important part of a computer. Can you imagine what will happen if one bit of an instruction or one bit of important data stored in the memory is incorrect? Sometimes data read from or stored in a memory are corrupted due to noise or hardware faults. It is thus essential to ensure that the data read from a memory do not have errors. When data read has an error, it should be detected and, if possible, corrected. Various error detection/correction techniques have been used in designing memories. Most of the commonly used techniques are based on the principles of the Hamming code (discussed in Chapter 2).

If the number of bits in a memory word is i (for a 32-bit word i = 32), we add k error detecting/correcting bits to the i information bits and write (k + i) bits in each addressable word in memory. The k bits are called parity bits. When a word is read, the (k + i) bits stored in it are read, and the k bits are used to find out whether there were any errors in the i information bits. In the simplest case, to detect a single error in the data read, a single parity bit is sufficient. However, if we want to detect and correct a single error, we have shown in Chapter 2 that the number of parity bits k required must be such that (2^k – 1) is greater than or equal to (i + k). Thus, for single error correction in a 32-bit word, the number of parity bits needed is 6.
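The condition (2^k – 1) ≥ (i + k) can be evaluated mechanically to find the smallest adequate k. A minimal C sketch of our own:

#include <stdio.h>

/* Smallest k with 2^k - 1 >= i + k (single error correction for
   i information bits, as in the Hamming code of Chapter 2). */
static unsigned parity_bits(unsigned i)
{
    unsigned k = 1;
    while (((1UL << k) - 1) < i + k)
        k++;
    return k;
}

int main(void)
{
    printf("i = 32 -> k = %u\n", parity_bits(32));  /* 6 */
    printf("i = 64 -> k = %u\n", parity_bits(64));  /* 7; SECDED adds one more */
    return 0;
}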

Usually in memory systems one uses a code for Single Error Correction, Double Error Detection (SECDED). For this at least 7 parity bits are needed when 32 bits are used in a word. This amounts to an overhead of 21.8%. For a 64-bit word the number of parity bits needed for SECDED is 8, resulting in an overhead of 12.5%. The overhead goes down as the word length becomes larger.

9.9 READ ONLY MEMORY

Read only memory, abbreviated ROM, is one in which data is permanently stored. Switching the power OFF in a computer system does not erase the contents of a ROM. Writing of data into a ROM memory is impossible after it is stored. Data stored in a ROM can be accessed as fast as in a RAM made with similar storage cells. There are four types of ROMs which we will describe later in this section, and in some of them the user can erase and rewrite the contents of the read only memory in a special manner.

ROM is a specific case of the Programmable Logic Devices discussed in this book. In a ROM the AND array of the PLD is fixed and the OR array is programmable. The output bits are programmed by blowing fuses selectively in the OR array. A 3-bit input, 4-bit output Read Only Memory (ROM) is shown in Figure 9.15. Observe that all the 8 minterms corresponding to the 3 input bits are generated. These can be thought of as addressing 8 words in the ROM. The contents of the ROM of Figure 9.15 are shown in Table 9.4.

FIGURE 9.15 A Read Only Memory (ROM): a 3-bit MAR, a row decoder generating the 8 minterms, CMOS gates forming the OR array, and a 4-bit MDR.

TABLE 9.4 Contents of ROM of Figure 9.15

        MAR                 MBR
  I2   I1   I0       O3   O2   O1   O0
  0    0    0        1    0    1    0
  0    0    1        0    1    1    1
  0    1    0        1    0    1    1
  0    1    1        0    1    1    0
  1    0    0        0    1    0    1
  1    0    1        1    0    0    1
  1    1    0        0    0    0    0
  1    1    1        1    1    0    1

We pointed out in the beginning of this section that there are 4 types of ROMs. The four types differ in the way the link is programmed. In the factory-programmed ROM, the links are placed during fabrication and cannot be altered later. Thus, in this ROM the stored data is permanent. The primary advantage of this type of ROM is its excellent reliability. It is also the least expensive of all the ROMs. Unless a ROM is to be mass produced with identical content, however, factory programming would be uneconomical.

The second type of ROM is called user programmable ROM (PROM). In this the factory supplies the PROM with fusible links. Wherever a 0 is to be stored, the link may be fused by the user by sending a high current through the link. Once a link is fused, it is permanent. Thus programming has to be done with care. Software-assisted programmers are used to program these ROMs.

The third type of ROM is called an ultraviolet erasable programmable ROM (UVEPROM). This ROM is supplied by the manufacturer with links which can all be disconnected by shining intense ultraviolet light. Links can then be grown selectively by the user by selecting a word and applying a voltage on the bit wire.

The last type of ROM is electrically erasable and reprogrammable (EEPROM). Among all the ROMs this is moderately expensive, but it is also the most flexible. It is also known as a writable control store and is used to design the microprogrammed control unit of CPUs. Table 9.5 summarizes the classification of ROMs.

TABLE 9.5 Types of ROMs

Type                     Cost/bit   Programmability
Factory Programmed ROM   Very low   Once only at factory
PROM                     Low        Once by end user; cannot rewrite
UVEPROM                  Moderate   Ultraviolet erasable; programmable several times
EEPROM                   High       High voltage erasable and programmable several times
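Functionally, a ROM is a combinational circuit stored as a lookup table: the MAR value indexes an array whose entries are the MBR outputs. The following sketch models the contents of Table 9.4 in C (an illustration of our own, not a hardware description):

#include <stdio.h>

/* Each entry holds the 4 output bits O3 O2 O1 O0 of Table 9.4,
   indexed by the 3-bit input I2 I1 I0. */
static const unsigned char rom[8] = {
    0xA,  /* 000 -> 1010 */
    0x7,  /* 001 -> 0111 */
    0xB,  /* 010 -> 1011 */
    0x6,  /* 011 -> 0110 */
    0x5,  /* 100 -> 0101 */
    0x9,  /* 101 -> 1001 */
    0x0,  /* 110 -> 0000 */
    0xD   /* 111 -> 1101 */
};

int main(void)
{
    for (unsigned in = 0; in < 8; in++)
        printf("MAR=%u%u%u  MBR=%u%u%u%u\n",
               (in >> 2) & 1, (in >> 1) & 1, in & 1,
               (rom[in] >> 3) & 1, (rom[in] >> 2) & 1,
               (rom[in] >> 1) & 1, rom[in] & 1);
    return 0;
}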

The primary advantage of a ROM is that, unlike a RAM, the data stored in it is not lost when no power is applied to it. In other words it is a non-volatile memory. Thus computer programs can be safely stored in it. The cost per bit of ROM storage is cheaper than that of RAM storage. Another advantage is the higher density of packing memory cells in it. 64 KB ROM chips are now commercially available. In general a ROM with an n-bit MAR and m-bit MBR can realize any n input, m output combinational circuit. ROMs are very useful in designing digital systems and some applications are given below.

Code converter: This application is the most obvious use of a ROM. For instance, the ROM of Figure 9.15 implements the truth table given in Table 9.4. ROMs for common code conversions such as NBCD code to 7-segment display code are readily available. Other code converters may be made to order.

Function generators: Tables for commonly used functions such as sine, cosine and arctangent may be stored in ROMs. The arguments are entered in MAR and the function values appear in the output register.

Character generators: Character displays in dot matrix form use ROMs for decoding and activating the display. Usually a 5 × 7 matrix is used for displaying characters from a 64-character set. Consider the 5 × 7 dot matrix shown in Figure 9.16. If each of the dot positions is a lamp, then a letter, for instance the letter B, may be displayed by selecting a set of lamps and turning them ON as shown. If an ON lamp corresponds to 1 and an OFF lamp to 0, then a character may be displayed by setting a subset of the 35 bits to 1. The 6-bit code for a character is fed as the input to the ROM. The output is a 35-bit number which selects the correct dots to be lighted.

FIGURE 9.16 A ROM character generator (a 6-bit character code and a column select feeding a ROM which drives the 5 × 7 lamp matrix).

The ROM may be simplified if, instead of requiring all 35 bits to be retrieved in parallel, we require that only the 7 bits belonging to a specified column of the dot matrix be available. The column address in such a case is fed as an additional input to the ROM (Figure 9.16). In microcomputers, ROMs are used to store programs, particularly those which are for dedicated applications such as washing machines and motor cars. To summarize, it is emphasized that a number of functions performed by combinational switching circuits may now be delegated to systems which incorporate ROMs.

Flash memory: A variant of EEPROM which uses a special CMOS transistor (which has a second gate called a floating gate) is called a flash memory. Use of this special CMOS gate makes flash memory non-volatile and allows erasing and rewriting almost a million times. The read time of flash memories is tens of nanoseconds per byte and the write time is several microseconds. Flash memory has become extremely popular since the late 1990s, because its storage capacity has increased while its cost has remained moderate. The main reasons for this popularity are that flash memories are non-volatile, energy efficient and can be battery operated. They are compact and are made in several shapes such as pens (a few centimetres long) and flat disks (2.5 cm²). They are very commonly used in digital cameras. In PCs they are used by plugging them into Universal Serial Bus (USB) ports. Flash memory is fast replacing floppy disks in PCs, and even hard disks in portable devices such as portable music players and laptops. Currently (2006) flash memory capacity has reached a maximum of 32 GB.

9.10 DUAL-PORTED RAM

The RAM we have discussed so far in this chapter has a set of memory cells which can be addressed using a memory address register (MAR), and the data stored in the specified address appears in a memory data register (MDR). It has only one address port and one data port. A dual-ported RAM, on the other hand, has two sets of address, data and read/write control signals, each of which accesses a common set of memory cells (see Figure 9.17). Each set of memory controls can independently and simultaneously access any word in the memory, including the same word. When both ports try to access the same word, the control logic acts as an arbitrator and allows one of them access; even if both access requests come simultaneously, one of them is given access first (normally the request identified first gets access first). Thus, two devices can store and exchange data using this memory. The main advantage of a dual-ported RAM is that it can be used both as a working storage and as a communication interface. Among the applications of dual-ported RAMs are CPU to CPU communication in multiprocessors, and digital signal processing in which data are received simultaneously from many devices and stored in a common memory for processing.

4.17 Block diagram of a dual-port memory. The organization used is to arrange several cells to make a word which is addressable. It stores 0 or 1 when it is in a stable sate. A word is written in the memory by placing its address in a register called Memory Address Register (MAR) and data to be written in a register called Memory Buffer Register (MBR) and a write command is issued. the memory is called a Random Access Memory (RAM). A memory device which loses data stored in it when no power is supplied to it is called a volatile memory. 6. that device is called a destructive read out device. There are three major devices used in memories. If the data stored in a device is lost when the data is read. . They are: a capacitor. 5.294 Computer Organization and Architecture Data CPU 1 or Device ‘‘L’’ Address R/W L Data I/O R Data I/O Data CPU 2 or L address decode Dual-Port RAM cells R address decode Address R/W Device ‘‘R’’ Busy OR Interrupt Control logic Busy OR Interrupt signal FIGURE 9. 8. 7. A memory made using flip-flop is called a Static Random Access Memory (SRAM). reading from it is non-destructive. Millions of addressable words are assembled to make up a main memory. The data is placed in MBR by the memory circuits. To read. SUMMARY 1. the address is placed in MAR and Read command is issued. Several million storage devices are organized as the main memory of computers. 2. a flipflop and magnetic surface. If the Read/Write time is independent of the address of the word accessed. For a physical device to be usable as a memory device ideally it must have two stable states which are called 0 and 1. It is volatile. 3.

9. A memory made using capacitors and MOS switches is called a Dynamic Random Access Memory (DRAM). DRAM is volatile and the read out from it is destructive. Further, as capacitors lose their charge, it is necessary to periodically refresh the memory.
10. Memory cells are arranged as 2D or 2.5D organizations inside IC chips called RAM chips. Standard chips such as 256K × 1 bit and 1M × 1 bit are available. Several of these chips can be organized to construct memory systems of the required size.
11. Read Only Memory (ROM) is a random access memory which is made using non-volatile, non-destructive read out devices.
12. There are several types of ROMs. The cheapest ones have data written in them permanently. Once data is written, it is difficult to alter it, but the data can be read.
13. There are ultraviolet erasable and electrically erasable ROMs. Once data is erased, new data can be stored in such ROMs; that is, one can re-write the data stored in them.
14. A variety of electrically erasable ROM called flash memory has recently become popular. Flash memories of size 256 KB to 32 GB are now available.

EXERCISES

1. A computer memory has 8M words with 32 bits per word. How many bits are needed in MAR if all the words are to be addressed? How many bits are needed for MBR? How many binary storage cells are needed?
2. What is the difference between a volatile and a destructive read out memory? Are destructive read out memories necessarily volatile?
3. What is the difference between access time and cycle time of a memory? Which is larger?
4. What are the differences between a static memory cell and a dynamic memory cell? Which of these cells can be non-destructively read out? Which technology allows larger memories to be fabricated? Which of these is faster?
5. Is DRAM or SRAM more expensive per bit of storage? Justify your answer.
6. What is 'refreshing'? Which type of memory needs refreshing? How is refreshing done?
7. Draw a detailed diagram of a small DRAM which has 16 words, 4 bits/word. Show the configuration of the dynamic cells used and the detailed layout of the cells in rows and columns.
8. Illustrate a 2D organization of a 2M word, 16 bits/word SRAM memory.
9. Illustrate a 2.5D organization of a 2M word, 16 bits/word SRAM memory.
10. What is the advantage, if any, of the 2.5D organization over the 2D organization?

11. Illustrate a 2D organization of a 2M word, 16 bits/word memory using dynamic memory cells.
12. Illustrate a 2.5D organization of a 2M word, 16 bits/word DRAM memory.
13. Illustrate how a 16M word, 32 bits/word SRAM may be constructed using 1M × 1 chips as a 2D organization.
14. Illustrate with a block diagram a 4M × 1 DRAM organization using an 8192-column, 512-row cell array.
15. Why does the memory cell arrangement in DRAMs use a cell organization with more columns and a smaller number of rows, as in Exercise 14?
16. Illustrate how a 16M word, 32 bits/word SRAM may be constructed using 1M × 1 chips as a 2.5D organization.
17. Draw the block diagram of a 2M × 8 DRAM which uses sixteen (1M × 1) chips. Show how they are organized as a SIMM module.
18. Show how a ROM may be used as an 8421 to excess-3 code converter.
19. Draw a block diagram of a dual-ported RAM of (1M × 8) capacity.
20. Draw a block diagram of a (2M × 4) dual-ported memory constructed using (1M × 1) DRAM chips.
21. How would you simulate a FIFO memory using a dual-ported RAM? Design a (4K × 8) FIFO with 4K × 1 dual-ported RAMs.
22. Can you use a dual-ported RAM to construct a LIFO memory? If yes, explain how you would do it.
23. Design a (2K × 8) LIFO memory using shift registers. How many shift registers do you need?

CACHE AND VIRTUAL MEMORY 10

LEARNING OBJECTIVES

In this chapter we will learn:

• Locality in memory references when programs are executed.
• How to use the locality property to organize hierarchical memory systems.
• Cache memory organization and three different mappings.
• Performance of cache memory systems.
• The need for an addressing space larger than the main memory address space.
• How to design a virtual memory by combining a large capacity disk with a smaller main memory.

10.1 INTRODUCTION

In the last chapter we described the technology used to design the main RAMs of computers. We saw that DRAMs have access times of around 50 ns while their cost/bit is the lowest in the semiconductor RAM family. DRAMs also need periodic refreshing. SRAMs, on the other hand, are designed using flip-flops. Their access time is lower (around 10 ns) and they are 10 times more expensive compared to DRAMs. Another device used as secondary memory (not as the main RAM) is the magnetic disk. Disks have very high capacity (around 100 GB), but their access time is in the millisecond range.

Their cost per byte is about 1000 times lower than that of DRAMs.

Over the years the applications of computers have become very complex and diverse. With this increase in complexity, application programmers have been demanding higher capacities of memories. Many applications are also online, requiring fast response times, which in turn demands faster memories. Thus, a memory system designer's challenge is to provide as large a size of memory as possible, as fast as possible, and to keep the cost as low as possible. This objective is fulfilled by appropriately designing computer systems as a combination of SRAMs, DRAMs and disks. In this chapter we will explore how this is achieved.

In order to appreciate this, Table 10.1 shows the cost, speed and typical capacities of three popular memory technologies. Over the years the capacities of all these memories have increased (semiconductor memory sizes double every 18 months and disk capacity doubles every 12 months) with no increase in cost. However, the relative sizes and cost ratios have remained almost constant.

TABLE 10.1 Memory Technology Cost vs Speed

Technology                      Typical Access Time   Economical Size   Approximate Relative Cost per Byte
Static RAM (for cache)          10 ns                 1 MB              x
Dynamic RAM (for main memory)   50 ns                 1 GB              x/10
Disk                            10 ms                 100 GB            x/1000

10.2 ENHANCING SPEED AND CAPACITY OF MEMORIES

There are two methods used to meet the objectives explained in the last section. They are cache memories and virtual memories. Both of these are dependent on the principle of locality, which will be described in the next section. The cache memory idea is to combine a small expensive fast memory (such as a Static RAM) with a larger, slower, cheaper memory (such as a Dynamic RAM) in such a way that the combined system has the capacity of the DRAM and the speed of the SRAM at a cost slightly higher than that of the DRAM. The virtual memory idea, on the other hand, is to provide a user with a large logical memory address space to write programs even though the available physical memory (DRAM) may be small. This is achieved by combining a DRAM and a disk in such a way that the memory cycle time of the combined system is closer to the DRAM cycle time while a large storage space is provided by the disk. The cost per byte of the combined system is only slightly larger than the DRAM cost. Thus, while a cache memory provides a higher speed memory, a virtual memory provides a higher capacity memory. Cost-effectiveness is an important engineering aspect in the design of both cache and virtual memories.

10.3 PROGRAM BEHAVIOUR AND LOCALITY PRINCIPLE

Let us suppose that we execute a sample program and observe the sequence of memory addresses accessed by the CPU. Recall that memory is accessed both for instruction fetch and for operand or data fetch. The sequence of memory addresses accessed by a program for a given data set is known as its trace. For the same program, different traces will be generated when different data sets are used. It has been found that many programs exhibit good or strong locality by dwelling in small parts of the memory address space (e.g. 128 or 256 bytes) when we consider rather long sequences of addresses in their address traces.

To illustrate this point, let us consider the sample program given in Program 10.1. This program is written in the language of SMAC++ to find the largest of one hundred numbers stored in memory starting at the symbolic address A. A program loop is used for the iterative execution. The first data item at the address A is compared with the last data item at A+99, and the larger of the two numbers is kept in R1. This comparison is iterated with the other numbers of the data set. For convenience, all instructions and data are assumed to be of one unit in length.

Program 10.1: To find the largest of 100 numbers stored in memory from the address A

Instruction
address           Instruction        Remarks
0010              LDIMM R3, 99       Load register R3 with 99 ((63) Hex)
0011              LOAD R1, A         Load the first number in R1
0012    LOOP      LOAD R2, A, R3     Load the last number in R2 when C(R3) = (63) Hex;
                                     R3 is the index register
0013              SUB R2, R1         C(R2) ← C(R2) – C(R1)
0014              JMIN *+2           Here, C(R1) is larger than C(R2)
0015              LOAD R1, A, R3     Here, C(A indexed by R3) is larger
0016              BCT R3, LOOP       Decrement R3 and branch to LOOP if not zero
0017              HALT

Let the data set be stored from 2000 to 2099 in memory. For the sake of presenting the trace, let us assume a trivial data sequence of 100 numbers in which the two integers 60 and 50 alternate as: 60, 50, 60, 50, 60, 50, …. The address trace for Program 10.1 would then be as follows (in arriving at this trace we have used decimals and omitted leading zeros). When instructions are accessed, the memory addresses read are in the sequence: 10, 11, 12, 13, 14, 16, 12, 13, 14, 15, 16, 12, 13, 14, 16, …. At location 0014 the conditional branch instruction will transfer control either to address 0015 or to 0016; that is, after the branch instruction at the address 0014 the control branches to 0015 or to 0016 in alternate iterations of the loop. When data are accessed, the memory addresses read are in the sequence: 2000, 2099, 2098, 2097, 2096, …, 2002, 2001.

In the address trace described above, we notice that the addresses (12, 13, 14) followed by either 15 or 16 are repeatedly accessed about 100 times for the instruction fetches, and the data addresses are accessed sequentially one after another. We can conveniently present this address trace in the form of a scatter diagram as shown in Figure 10.1. A program trace can be visualized as a graph in one dimension, that is, as a straight line on which marked points denote the neighbouring addresses accessed in memory. In this figure, along the linear scale of memory addresses, we find two distinct clusters: one around the memory addresses where the instructions are stored and the other around the memory addresses where the data set is stored.

FIGURE 10.1 Program trace represented as clusters (an instruction address cluster and a data address cluster along the line of memory addresses accessed).

The locality principle implies that there are such clusters. Normally there will be several clusters in a typical program trace, and each cluster has an associated cluster length. It is also obvious that the control or access to memory during program execution will jump from one cluster to another. For example, a CALL instruction when executed could branch to a far away address and thus create a jump from one cluster to another (see also the BCT instruction in SMAC++ and its branch address). In Program 10.1 the program block following the conditional jump instruction, JMIN, included only one instruction, at the address 0015. In large programs such a program block can be wider, thus separating the instruction cluster into multiple clusters. If we can keep a copy of that part of the memory pertaining to a cluster of addresses in a high speed buffer and access data/instructions from it, we can effectively increase the speed of memory accesses. The knowledge of the average cluster length could be used to determine how large such a buffer memory should be.

Empirical studies have shown that the clustering of addresses produced by the address trace of a program depends on the following factors:

1. The nature of the algorithm and the data structures used to solve the problem.
2. The programming style employed by the programmer.
3. The programming language and its compiler.
4. The data set on which the program is run.

Since a majority of programs are written in a high level language, the compilers can play an important role in how storage is referenced. In some cases the programmer can control the locality by proper programming. As an example, consider the problem of summing all the elements of a large two-dimensional matrix. Suppose the matrix is stored in column major form, that is, one column after another. A program written to sum the elements column by column will exhibit more locality than a program written to sum the elements row by row, as the sketch following this paragraph illustrates.
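The following C sketch of our own contrasts the two summation orders for a matrix held in column major form (here the first array index plays the role of the column, so that the elements of one column are contiguous in memory):

#include <stdio.h>

#define N 100

/* a[col][row]: with this layout, consecutive memory locations hold
   consecutive elements of one column (column major order). */
static double a[N][N];

int main(void)
{
    double sum = 0.0;

    /* Column by column: addresses are visited sequentially, one
       cluster sweeping through the array -> good spatial locality. */
    for (int col = 0; col < N; col++)
        for (int row = 0; row < N; row++)
            sum += a[col][row];

    /* Row by row: successive references are N elements apart, so the
       trace keeps jumping between far away addresses -> poor locality. */
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            sum += a[col][row];

    printf("%f\n", sum);
    return 0;
}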

As an example, consider the problem of summing all the elements of a large two-dimensional matrix. Suppose the matrix is stored in column major form, that is, one column after another. A program written to sum the elements column by column will exhibit more locality than a program written to sum the elements row by row. The latter program would give rise to an address trace in which the references jump between clusters in a systematic way (e.g. 101, 401, 102, 402, 103, 403, 104, 404, ...). (A short C sketch contrasting the two traversals is given at the end of this section.)

The following programming techniques are recommended by Hellerman and Conway [29] to obtain good locality:
1. Placing data near the instructions that reference it, whenever possible. (See the BCT instruction in SMAC++ and its branch address.)
2. Separating exception handling routines (like error routines) from the main section of a program.
3. Placing the instructions of frequently called subroutines in line rather than using a CALL.
4. Organizing programs into modules and specifying the frequencies of use of these modules and the module interaction patterns. Such details can then be used by a linking loader to place the modules in a preferred order to enhance locality.

The principle of locality in programs can be summarized as follows: programs during execution access a relatively small portion of their address space in small intervals of time. In programs, we can notice two types of locality.

Temporal locality: If an item is referenced, it will tend to be referenced again soon. In an iterative loop, an instruction inside the loop is repeatedly accessed. This is a good example of temporal locality.

Spatial locality: If an item is referenced, there is a good chance that items whose addresses are near it will tend to be referenced soon. Consider the addition of the components of a vector A; Ai and Ai+1 normally occupy two adjacent locations, and during the addition the adjacent locations are accessed in an orderly fashion as a sequence. This is a good example of spatial locality.

This property can be exploited in designing what is known as two-level memories. In the following section we will learn how to make good use of this principle of locality to increase the speed of a memory system with only a small increase in cost.
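As promised above, the following minimal C sketch contrasts the two traversals of a column-major matrix. The matrix size N and the one-dimensional layout a[col*N + row] are illustrative assumptions (C's built-in arrays are row major, so column-major storage is emulated by hand).

#include <stdio.h>
#define N 512

static double a[N * N];        /* column major: element (row, col) is a[col*N + row] */

int main(void) {
    double sum1 = 0.0, sum2 = 0.0;

    /* Column by column: consecutive iterations touch adjacent addresses,
       so the trace stays inside one cluster (good spatial locality). */
    for (int col = 0; col < N; col++)
        for (int row = 0; row < N; row++)
            sum1 += a[col * N + row];

    /* Row by row: consecutive iterations jump N*sizeof(double) bytes,
       so the trace hops between far-apart addresses (poor locality). */
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            sum2 += a[col * N + row];

    printf("%f %f\n", sum1, sum2);
    return 0;
}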

10.4 A TWO-LEVEL HIERARCHY OF MEMORIES

Temporal and spatial locality are properties of sequential programs. In two-level memories, a memory system is designed by combining memories of two types: a very large but slow memory and a smaller fast memory. The adjectives large, small, slow and fast are relative; their absolute values continuously increase with technology. Memory sizes are currently doubling almost every 18 months at the same absolute cost per byte. The important point is that the relative cost difference per byte between the two types has remained almost invariant.

We will now examine how a two-level memory system is able to exploit the locality of memory reference. Let us suppose a memory system has, besides the RAM, a cache memory of 256 bytes. In this organization the main RAM is combined with the fast, small cache memory. Let the system software store a duplicate copy of the contents of the RAM pertaining to a cluster in the cache. When the computer accesses an address in RAM, it first looks for it in the cache. If the required item (instruction or data) is found in the cache, it is taken from there. This condition is called a hit. A hit item will have two copies in existence, one copy in the RAM and the other in the cache. The opposite of hit is miss. When an item is not found in the cache memory, the RAM is accessed, which is much slower than accessing the cache. Hits and misses can be expressed as a fraction of the total number of memory accesses in the execution of a program, normalized to fall in the range 0 to 1. Thus, we can define the hit ratio (for a sufficiently large number of references) as:

h = (number of times an item is found in the cache)/(number of addresses referenced)

Hit ratio (h): 1 ≥ h ≥ 0        Miss ratio = (1 – h)

Figure 10.2 is the schematic representation of a two-level storage model: a Level 1 storage (faster) with cost C1 per bit, volume S1 bits and access time T1 units/byte, combined through storage management (hardware/software) with a Level 2 storage (slower) with cost C2 < C1, volume S2 > S1 and access time T2 > T1. Let C and T be, respectively, the effective cost per bit and the effective access time of the combined two-level system. Then

C = (S1·C1 + S2·C2)/(S1 + S2)  units/bit

T = h·T1 + (1 – h)·T2  units/byte access

FIGURE 10.2 A model of a two-level storage.
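The two formulas are easy to check with a small calculation. The following C sketch evaluates C and T for illustrative values of S1, S2, C1, C2, T1, T2 and h; all of the numbers are assumptions chosen only to exercise the formulas.

#include <stdio.h>

int main(void) {
    double S1 = 256e3 * 8, S2 = 256e6 * 8;    /* sizes in bits (assumed)   */
    double C1 = 100.0,     C2 = 1.0;          /* relative cost per bit     */
    double T1 = 10.0,      T2 = 50.0;         /* access times in ns        */
    double h  = 0.95;                         /* hit ratio (assumed)       */

    double C = (S1 * C1 + S2 * C2) / (S1 + S2);  /* effective cost per bit */
    double T = h * T1 + (1.0 - h) * T2;          /* effective access time  */

    printf("C = %.3f units/bit, T = %.1f ns\n", C, T);
    return 0;
}

With these numbers the effective cost stays close to C2 while the effective access time stays close to T1, which is exactly the point of the two-level organization.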

Let us define two more terms, the access time ratio r and the access efficiency e, as below:

r = T2/T1   (a ratio greater than 1)

e = T1/T    (desired to be 100%)

Substituting for T,

e = T1/[hT1 + (1 – h)T2] = T1/[T2 – h(T2 – T1)] = 1/[r + h(1 – r)]

In Figure 10.3 the access efficiency e is plotted as a function of the hit ratio h for different values of the access time ratio r. When r = 1, the efficiency is 100% for all values of h; but no one would choose a two-level storage system when r = 1, and hence this case is only of academic value. For all values of r, the efficiency is 100% when h = 1. However, h = 1 is not obtainable in practice because the size of the faster memory will be many orders of magnitude smaller than that of the slower memory. In all other cases, the access efficiency becomes very poor if the hit ratio goes below 80%. This can be used as a good rule of thumb in designing a two-level storage system.

FIGURE 10.3 Access efficiency vs hit ratio (curves for r = 1, 2, 10, 50 and 100).

The hit ratio depends upon many factors such as the size of the storage in level 1, the access pattern or the locality exhibited, and the mapping between the two levels of storage. In a memory system based on a 10 MB RAM and caches of various sizes, a particular type of mapping between the contents of the RAM and that of the cache, called direct mapping (to be discussed later), gives the results shown in Table 10.2 for different sizes of the cache. Typically, in modern computers the cache memory size is of the order of KB whereas the RAM size is of the order of GB.

TABLE 10.2 Cache Size vs Hit Ratio in Direct Mapped Cache for a 10 MB RAM

Cache Size    Hit Ratio
8 KB          91.3%
16 KB         93.2%
32 KB         95%
64 KB         96%
128 KB        97.5%

So far we have presented a memory system with a two-level hierarchy. There is no reason why we cannot extend this to multiple levels, if there is a clear benefit.
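A quick way to appreciate Figure 10.3 is to tabulate e = 1/[r + h(1 – r)] directly. The following small C sketch does this for the r values plotted in the figure; the step size for h is an arbitrary choice.

#include <stdio.h>

int main(void) {
    double rs[] = {1.0, 2.0, 10.0, 50.0, 100.0};
    for (int i = 0; i < 5; i++) {
        double r = rs[i];
        printf("r = %5.0f:", r);
        for (double h = 0.0; h <= 1.0001; h += 0.2)      /* h = 0.0 .. 1.0 */
            printf("  e(h=%.1f)=%4.2f", h, 1.0 / (r + h * (1.0 - r)));
        printf("\n");
    }
    return 0;
}

For r = 10, for example, the efficiency at h = 0.8 is only about 0.36, which illustrates why a hit ratio well above 80% is needed in practice.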

10.5 CACHE MEMORY ORGANIZATION

As we saw in the last section, a cache is a very high speed, small capacity memory introduced as a buffer between the CPU and the main memory (RAM). Millions of instructions are executed every second in a computer system. Each instruction requires one access to memory for fetching the instruction, and zero, one or more memory accesses to fetch the data for the execution of that instruction. Thus, the speed with which memory can be accessed is important for reducing the execution time of an instruction. On the other hand, the demand for large size memory is ever growing with more sophisticated operating systems, real time and interactive applications, and the networking of computers. The ubiquitous use of computers also demands that the cost be as low as possible. The locality property in memory accesses, at the system level, makes the use of a cache effective in reducing the access time to main memory without unduly increasing the cost of the memory system. The reader should observe that as we increase the number of levels in a hierarchy, the complexity in accessing increases, and the designer will not choose this complexity unless the benefits outweigh and justify it.

The current trend is to have two separate small caches (around 16 KB) inside the CPU chip, one for data and the other for instructions, called L1 caches, and another larger unified cache of size around 1 MB, also inside the chip, called the L2 cache.

Parts of the current program (and its data) are copied from the main memory into the high speed cache. When the CPU refers to an instruction (or data), it is fetched from the cache if it is already there. If the required item is in the cache, it is called a hit. If the required item is not in the cache, it has to be fetched from the main memory. When the probability of a hit is very high, the CPU will be reading instructions or data from the cache most of the time.

However, there are two problems. First, there should be enough cache memory to store a large enough 'cluster', about which we discussed in the previous section; the clusters vary in length and depend on several factors. Secondly, when the program control jumps from one cluster to another, there is a need to flush the cache (that is, save it in main memory) and reload it with the contents of the new cluster from the RAM. During a memory WRITE operation, if there was a hit, the contents of the cache memory would have been changed while its copy in the RAM would not have been updated. Hence, there is a need to write the contents of the cache back into main memory before it is reloaded with new contents from the main memory. This is sometimes called flushing the cache to main memory. Cache memory organization deals with solving such problems.

In a cache memory organization, to facilitate management, the available cache is divided into equal sized blocks called cache lines, and the RAM is also divided into memory blocks of the same size. The term cache line is used instead of cache block primarily to distinguish it from a memory block, even though a block in a cache line is of the same size as a memory block. Since there are many more memory blocks than cache lines, this mapping will be many-to-one. One of the management issues is to decide which memory block should be mapped to which cache line. A hardware unit called a cache controller does cache management operations such as flushing a cache line, reloading a memory block into a cache line, and determining whether an access to memory is a hit or a miss (see Figure 10.4).

FIGURE 10.4 Configuration of a cache controller (the cache controller sits between the CPU and the main memory, controlling the cache, with the address and data buses connecting to the system bus).

The memory design is such that the existence of the cache is 'transparent', or invisible, to a programmer. When a memory READ is initiated by the CPU, the address is automatically checked by the hardware to find if its contents are already in the cache. If so, the word is read from the cache; see the path a-b-c in the data flow diagram shown in Figure 10.5. If it is not in the cache, then the contents of the specified address are read directly from the main memory (path d-b-c) and the word so read (or a block containing it) is also stored in the cache (path d-b-a) for future reference.

If the same address is referenced again in the near future, it will be found in the cache, leading to a hit. We have defined hit and miss ratios in the last section. A hit ratio of 90% is usually achieved with the strong locality exhibited by programs. If Te and Tm are the cycle times of the cache and the main memory respectively, the effective memory cycle time is given by:

T = hTe + (1 – h)Tm

For a case with h = 90% and Te and Tm of 10 ns and 50 ns respectively, the effective cycle time T would be 14 ns. This is a substantial improvement in memory speed. Thus, the additional cost of the cache and that of the associated cache-management hardware is worth the speed improvement.

As stated earlier, the cache capacity is much smaller than that of the main memory and could be smaller than a user's program. Thus, every time something new has to be stored in the cache, something else already in the cache has to be copied back into the main memory and its space vacated. Data is moved between the cache and the main memory in an integral number of blocks. A block size of 128 or 256 bytes is normally used.

Assume that a command is received to write a word in the memory. If that word is in the cache, both the cache contents and the memory contents have to be changed. If both are updated simultaneously, it is known as the store through or write-through method [see Figure 10.5(i)]. The alternative method updates the cache location only and flags the corresponding cache line by setting a flag called the altered bit. Some authors call the altered bit the dirty bit. Whenever a cache line has to be reloaded by a newly read memory block, the altered bit of that cache line is checked.

FIGURE 10.5 Data flow in cache: (i) write-through, and (ii) copy back.

If the bit is ON, the cache line is first copied back to the corresponding memory block to permanently store the updates; otherwise it is simply over-written [see Figure 10.5(ii)]. The disadvantage of this method is that there are time periods during which the cache line has been updated but the corresponding memory block has not. This inconsistency between the two copies (one copy in the RAM and another copy in the cache) is called the cache coherence problem. Such inconsistency may be unacceptable in some operational environments.

In the above paragraphs we have used the term 'corresponding memory block' more than once. We will illustrate the concept of mapping between the main memory blocks (denoted as MB) and the cache lines (denoted as CL) through an example. Consider Figure 10.6, wherein we have assumed the following sizes and taken the unit of memory to be a word. (It could very well have been bytes.)

Block size: 16 words (4 bits to address a word inside a block)
Main memory size: 64K words (2^16 = 64K), that is, 4K memory blocks (2^12 = 4K)
Cache memory size: 1K words, that is, 64 cache lines (2^6 = 64); each cache line stores one memory block
Address length: 16 bits (2^16 = 64K)

FIGURE 10.6 Direct mapping of cache.

The 4096 memory blocks are mapped into the 64 cache lines, and hence it is a many-to-one mapping. Dividing the 4096 memory blocks equally into 64 groups yields 64 memory blocks in a group. Any one of the 64 memory blocks belonging to a group can map into one cache line, and which one of them is resident in the given cache line at any one time is indicated by the 6-bit (2^6 = 64) tag field associated with each cache line. This is called direct mapping. The 16-bit memory address is first divided into two parts: a 6-bit tag field and a 10-bit cache address (Figure 10.6). The 10-bit cache address itself is further divided into a 4-bit word address (because there are 16 words per block) and a 6-bit block address (64 cache lines is the total cache size and we need 6 bits to address one of them).

The sequence of actions that takes place in accessing the memory can be summarized as follows.

Memory access with cache
Operation: Read a word from main memory. The address is given.
Step 1: First, the BLOCK-NO field of the address is used to select a cache line, and then the contents of its tag bits are compared with the TAG field contents of the given address.
Step 2: If the two match, the required memory block is already in the cache and it is a hit. The required word (1 out of 16) is selected from the cache using the word field of the address.
Step 3: If the two tags do not match, the required memory block is not in the cache and it is a miss. Hence a main memory read has to be initiated. A cache line is selected using the BLOCK-NO field of the address and the altered bit of the selected cache line is checked. If it is ON, the cache line is first written into the corresponding memory block. The desired memory block is then read from the main memory and loaded into this cache line. The tag bits of the cache line are set to the value contained in the TAG field of the address and the altered bit is reset to zero. Simultaneously with these operations the selected word is passed on to the CPU.

In the foregoing mapping method, we note that a given memory block can map into only one specific cache line. Referring to Program 10.1, suppose that its two clusters are stored in memory blocks 0 and 128 respectively. Note that these two memory blocks map into the same cache line, namely CL-0. Suppose we need these two main memory blocks to execute the program. When the successive memory addresses alternate between the two memory blocks 0 and 128, there will be too many cache misses. Unfortunately, since both of them map into the same cache line, we could not map the memory block 128 into CL-1 even if the cache line CL-1 were free; the direct-mapping method does not permit it. Hence, even if another cache line were available we cannot use it, due to the rigidity of direct mapping. This constraint is 'fully' relaxed in the following method of mapping.
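The direct-mapped lookup steps above can be expressed compactly in code. The following C sketch models only the tag store for the 6/6/4 address split of Figure 10.6 (data storage and write-back of altered lines are omitted), and replays the pathological alternation between memory blocks 0 and 128 just described.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define LINES 64

static uint8_t tag_store[LINES];               /* 6-bit tag per cache line */
static bool    valid[LINES];

static bool cache_lookup(uint16_t addr) {
    uint16_t line = (addr >> 4)  & 0x3F;       /* bits 4-9  : cache line no. */
    uint16_t tag  = (addr >> 10) & 0x3F;       /* bits 10-15: tag            */

    if (valid[line] && tag_store[line] == tag)
        return true;                           /* hit                        */
    tag_store[line] = (uint8_t)tag;            /* miss: reload this line     */
    valid[line] = true;
    return false;
}

int main(void) {
    /* Memory blocks 0 and 128 both map to line 0 (128 mod 64 = 0),
       so alternating between them misses every single time. */
    for (int i = 0; i < 4; i++) {
        printf("block 0:   %s\n", cache_lookup(0x0000) ? "hit" : "miss");
        printf("block 128: %s\n", cache_lookup((uint16_t)(128 * 16)) ? "hit" : "miss");
    }
    return 0;
}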

Fully associative mapping

A flexible method of mapping should allow a memory block to reside in any cache line. In this case a 16-bit address will be divided into a 12-bit tag field and a 4-bit word field, as shown in Figure 10.7. We then need 12 tag bits with each cache line to identify which of the 4K memory blocks is presently contained in that cache line. This type of mapping is known as fully associative mapping.

FIGURE 10.7 Address partition for fully associative mapping: Tag (12 bits), Word (4 bits).

In order to find whether the required address is in the cache or not, we have to first find whether the access is a hit or a miss. Since a memory block could be in any one of the 64 cache lines, and we have no way of knowing which one, we have to compare the tag field contents of the given address with the 12 tag bits of every one of the 64 cache lines. If a sequential comparison of the 64 entries is performed, the time taken will be too much and will make the method impractical. To make associative mapping worthwhile, the comparison with the 64 entries must be done in parallel using 64 different comparators. This is called associative searching, and it makes the method costly. In the case of direct mapping we do not have this problem of searching, because a given main memory block goes into a fixed cache line: in the example above, the memory block n can possibly go only into the cache line m, where m = n modulo 64, because there are 64 cache lines. In this sense, fully associative mapping provides complete freedom in placing a memory block into any available cache line, whereas direct mapping provides no freedom at all. Intermediate methods exist between these two extremes. These are known as block-set-associative mappings.

Block-set-associative mapping

The block-set-associative mapping is a combination of both direct mapping and associative mapping. In this case, cache lines are grouped into a number of sets, and the number of lines grouped in a set is a design parameter. In Figure 10.8 we have chosen four cache lines per set; for the above example there are then 16 sets of cache lines. Let us number them from 0 to 15. Dividing the 4096 memory blocks equally among these 16 sets, we find that 256 memory blocks map into one set of cache lines. Figure 10.8 shows those memory blocks that are mapped into set 0, namely 0, 16, 32, 48, 64, 80, ..., 4080. In that a given memory block can go only into one particular set, the mapping resembles direct mapping. But the 256 memory blocks that map into set 0 can be placed into any of the four cache lines within set 0, namely CL0, CL1, CL2 or CL3, indicated in the figure; in this way it resembles associative mapping. Note that in this figure the number of tag bits is determined as:

Tag bits = address length – (number of bits needed to address a word in a block + number of bits needed to address one of the sets of cache lines)

In this example, tag bits = 16 – (4 + 4) = 8.

FIGURE 10.8 Four-way block-set-associative mapping: the 4K blocks of the 64K-word main memory (16 words/block) map into 16 sets of four cache lines each (CL0 to CL3 form set 0, ..., CL60 to CL63 form set 15); memory blocks 0, 16, 32, ..., 4080 map into set 0. The address is partitioned into an 8-bit tag, a 4-bit set field and a 4-bit word field.

When accessing the memory, one of the 16 sets of cache lines is selected by the 4-bit SET field of the given address. The 8-bit TAG field contents are then matched with the tag bits of every line in that set. In this example there are 4 cache lines in a set, and a parallel comparison with 4 entities is relatively simple. If a match occurs with the tag bits of any one of the 4 lines of the set, the access is a hit. If the match fails with every line of the selected set, it is a miss. Then one of the four lines of the set is chosen according to a replacement rule. As a preparation for the reload, the contents of the chosen line are first written back into the main memory if they were altered since the line was last brought from the RAM. Then the desired memory block is read from the RAM and placed into that cache line of that set. In Figure 10.9 we compare the three methods of mapping memory on to a cache.
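The set-associative lookup can be sketched the same way as the direct-mapped one. The following C fragment models the 16-set, 4-way organization of Figure 10.8; for brevity the victim within a set is chosen round-robin rather than by a proper replacement rule, and write-back of dirty lines is omitted.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define SETS 16
#define WAYS 4

static uint8_t tags[SETS][WAYS];
static bool    valid[SETS][WAYS];
static int     next_victim[SETS];          /* round-robin pointer per set */

static bool lookup(uint16_t addr) {
    uint16_t set = (addr >> 4) & 0xF;      /* bits 4-7  : set number */
    uint8_t  tag = (uint8_t)(addr >> 8);   /* bits 8-15 : tag        */

    for (int w = 0; w < WAYS; w++)         /* done in parallel in hardware */
        if (valid[set][w] && tags[set][w] == tag)
            return true;                   /* hit */

    int v = next_victim[set];              /* miss: choose a victim line   */
    next_victim[set] = (v + 1) % WAYS;
    tags[set][v] = tag;
    valid[set][v] = true;
    return false;
}

int main(void) {
    /* Memory blocks 0, 16, 32 and 48 all fall in set 0 and can be
       cache-resident together, unlike with direct mapping. */
    uint16_t blocks[] = {0, 16, 32, 48};
    for (int pass = 0; pass < 2; pass++)
        for (int i = 0; i < 4; i++)
            printf("block %2u: %s\n", (unsigned)blocks[i],
                   lookup((uint16_t)(blocks[i] * 16)) ? "hit" : "miss");
    return 0;
}

On the first pass all four blocks miss; on the second pass all four hit, since the set holds them simultaneously.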

FIGURE 10.9 Comparing the three methods of mapping memory on to a cache: direct mapped cache (6-bit tag, 6-bit block, 4-bit word), fully associative cache (12-bit tag, 4-bit word) and 4-way block-set-associative cache (8-bit tag, 4-bit set, 4-bit word; 4 blocks per set).

We consider one more example of direct mapping below.

EXAMPLE 10.1
A computer has a 16 MB main memory and a 64 KB cache. The block size is 16 bytes.
1. How many cache lines does the computer have?
2. How many blocks does the main memory have?
3. Give the starting addresses of the memory blocks which are directly mapped to the cache lines.
4. Explain how a given address is retrieved from the memory system.

1. Here, the cache size is 64 KB = 2^6 × 2^10 = 2^16 bytes and the block size is 16 = 2^4 bytes. Thus the number of cache lines = 2^16/2^4 = 2^12 = 4K.

2. The number of blocks in the main memory = 2^24/2^4 = 2^20 = 1M.

3. As there are 4K cache lines and 1M blocks, the number of blocks mapped into each cache line = 2^20/2^12 = 2^8 = 256. The mapping is given in Table 10.3, with addresses in hexadecimal. Thus, in the cache line with address (000)Hex any one of 256 blocks can be mapped, as shown in Table 10.3. (Remember that at any time only one of these blocks will occupy a cache line.)

TABLE 10.3 Mapping of Main Memory Blocks in Cache (block depth is 16 bytes)

Cache line address    Block numbers mapped into it       Tag
000                   00000, 01000, 02000, ..., FF000    00, 01, 02, ..., FF
001                   00001, 01001, 02001, ..., FF001    00, 01, 02, ..., FF
002                   00002, 01002, 02002, ..., FF002    00, 01, 02, ..., FF
003                   00003, 01003, 02003, ..., FF003    00, 01, 02, ..., FF
...                   ...                                ...
FFF                   00FFF, 01FFF, 02FFF, ..., FFFFF    00, 01, 02, ..., FF
(4K cache lines in all; 256 blocks map into each cache line)

4. The 24-bit main memory address is divided into three parts:

Tag: 8 bits (2 hex digits)    Cache line no.: 12 bits (3 hex digits)    Byte address: 4 bits (1 hex digit)

Given an address, the 12-bit cache line number is used to locate the cache line where the memory block would be mapped. The 8 tag bits of the address are then matched with the tag stored with that cache line. If they match, the item is in the cache and is retrieved; otherwise it has to be retrieved from the main memory.

EXAMPLE 10.2
Repeat the example given above for 2-way block-set-associative mapping.

As there are 2K sets of cache lines (with 2 cache lines/set) into which a memory block can map, the number of bits needed to indicate a set of cache lines is 11. However, remember that the cache size is the same; it is merely divided into 2K sets of two lines each. The number of blocks mapped into each set is 2^20/2^11 = 2^9 = 512, and thus 9 bits are used as the tag. The main memory address is divided into three parts:

Tag: 9 bits    Set: 11 bits    Byte: 4 bits

To access a byte, the 11 set bits select one of the 2K sets of cache lines. The 9-bit tags of the two cache lines in the selected set are simultaneously matched with the tag of the address. If there is a hit, the item is retrieved; else it is a cache miss.

TABLE 10.4 Caches in Some Representative Computers

                 IBM 360/95                     80486      Pentium                    Power PC 603
Block size       64 B                           16 B       32 B                       32 B
Cache size       32 KB                          32 KB      8 KB data +                16 KB data +
                                                           8 KB instruction           16 KB instruction
Memory size      4 MB                           16 MB      256 MB                     256 MB
Mapping          4-way block set associative    Direct     2-way set associative      2-way set associative
                 (4 cache lines/set)                       (2 cache lines/set)

Table 10.4 gives some specific implementations of caches by computer manufacturers. The values shown in this table show how technology has progressed. As memory technology improves, the cache access time decreases and the cache size used in computer systems grows much larger, but it will always be a fraction of the main memory size.

The hit ratio (or miss ratio) depends on many factors such as the cache size, block size, replacement rule and program locality. Figure 10.10 shows how the choice of block size affects the miss ratio, that is (1 – hit ratio), for various values of the cache memory size. When the cache size is reasonably large (8K or 16K), the choice of block size is not too critical. Because of high miss ratios, small cache sizes like 512 bytes are not effective. A good design will exhibit low sensitivity, or variation, in the miss ratio and also a very low value for the miss ratio.

As we have seen in earlier chapters, memory is accessed both for instructions and for data. One can consider using two separate caches, one as a data cache and the other as an instruction cache. Experience with such twin systems has shown that the patterns in memory access differ between instruction fetch and data fetch or store. In Pentium processors, two separate 8 KB caches are used. Pipelined processors require access to both caches simultaneously, and thus the two caches are independently controlled. Obviously, we then need two separate cache controllers.

FIGURE 10.10 Effect of cache size: miss ratio (1 – h) plotted against block size (16 to 256 bytes) for cache sizes of 512 bytes, 1 KB, 2 KB, 4 KB, 8 KB and 16 KB.

Cache memory in Pentium processors [34]

The above discussions on cache memory organization are broken into neat little categories for pedagogical simplicity. In real life computer systems like the Pentium processor, these neat categories are combined in an effective manner, with many more variations and adaptations. CPU chips have reached very high levels of integration, allowing hundreds of millions of transistors on a chip, and the chip sizes have increased. This increased capacity has led to incorporating 8 KB data and 8 KB instruction caches in the CPU chip itself. With the reduction in the cost of memories, main memories have also become very large (around 1 GB is now common). For example, in the Pentium processor we note the following adaptations:

1. There are two caches, referred to as the L1 cache and the L2 cache. The L1 cache and its controller are embedded into the CPU chip, and accessing this cache is very fast. For instance, Pentium processors have two 8 KB caches in the CPU, one called the instruction cache and the other the data cache; both of them are labelled as Level 1 (or L1) caches. Besides these, on the CPU board there is another SRAM memory of around 512 KB capacity called the L2 cache, which is in turn connected to the DRAM main memory. The current trend is to have both L1 and L2 caches integrated in the chip; L1 has separate data and instruction caches, whereas L2 is a single unified cache. The access time to L2 is larger compared to L1 due to the internal bus structure of the chip. Thus, when an instruction or data is to be retrieved, one first looks up the L1 cache to see whether it is there; if not, the L2 cache is examined. Only if there is a miss in the second cache also is there a need to access the main memory. Usually there is a 99% probability of the required data being found in either the L1 or the L2 cache, and thus the average access time is quite small, a few nanoseconds.

2. The L1 cache is faster to access than the L2 cache, which is external to the CPU. Because of this internal embedding, the L1 cache size is limited to 8 KB. The external cache is referred to as the L2 cache and is relatively slower to access because it uses the bus external to the CPU chip. It is larger (512 KB).

3. Intel uses separate caches for instructions and data, of size 8 KB each. Each 8K internal cache is divided into two 4K units referred to as way-0 and way-1; the mapping used is 2-way set associative mapping. The cache block (known as a cache line in Intel's terminology) is 32 bytes long, and there are 128 cache lines in a 4K unit.

4. When data is updated in the cache but not in the main memory, the data in the cache is called dirty. Similarly, when the data in main memory is updated but not in the cache line, the cache line's data is called stale. Keeping the cache consistent, without being stale or dirty, is an important responsibility of the cache controller hardware. The consistency of the two copies (one in the cache and the other in the main memory) in Pentium is assured by following a protocol known as the MESI protocol. The protocol is named after the four different states in which a cache line in Pentium can reside: Modified (M), Exclusive for the processor (E), Shared access with other processors (S) and Invalid (I).

5. The write policies on Pentium are software controllable by the system programmer.

10.6 DESIGN AND PERFORMANCE OF CACHE MEMORY SYSTEMS

In the design of cache memory systems, apart from choosing the right block size and cache memory size, two policy decisions are relevant:

1. When a cache 'miss' happens, leading to a main memory access, which cache line should be replaced? This is known as the replacement rule or policy.
2. When a memory WRITE takes place, does the system update only the copy in the cache, leaving the copy in the main memory un-updated (an inconsistency), or does the system update both copies? This is called the write policy.

In the case of direct mapping, a main memory block can be loaded only into its pre-determined cache line and the question of choice does not arise. But in the case of the other two mappings, the question of which cache line should be replaced becomes important. The simplest replacement rule is to select a line within the set randomly. This rule is simple to implement but not the best. The best replacement rule would be to select the line in the set that has not been referenced for the longest time. Such a line is called the LRU line (least recently used), and the algorithm used to select an LRU line is called the LRU algorithm. The implementation of the LRU algorithm is not that simple.
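One simple software realization of the LRU rule keeps a 'last used' counter per line, a stand-in for the hardware age registers discussed later in this chapter. The following C sketch is illustrative only; real caches implement the equivalent in hardware.

#include <stdio.h>

#define WAYS 4

static unsigned last_used[WAYS];   /* time of last reference per line */
static unsigned now;               /* global reference counter        */

static void touch(int way) {       /* called on every hit to 'way'    */
    last_used[way] = ++now;
}

static int lru_victim(void) {      /* line not used for the longest   */
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[w] < last_used[victim])
            victim = w;
    return victim;
}

int main(void) {
    touch(0); touch(1); touch(2); touch(3);
    touch(1); touch(0);            /* lines 2 and 3 grow "old"        */
    printf("LRU victim: way %d\n", lru_victim());   /* prints way 2   */
    return 0;
}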

There are two common write policies. Let us assume that the memory access is a hit and the operation is a memory WRITE. One policy is known as the write-back policy. According to this policy, when a memory WRITE is performed, the contents of the cache are updated and the processor is allowed to proceed without waiting; the cache controller will initiate a main memory write later to update the copy of the block residing there. In this case the cache controller is more complex, as it has to handle the memory write operation separately. The other policy is known as the write-through policy. According to this policy, the data is updated both in the cache line and in the memory block at the same time. This will slow down the processor.

A similar analysis holds good for the memory READ operation. Let us suppose that the memory READ is a miss. There are two options. One is to transfer the memory block into the destination cache line and then forward the required item from the cache to the processor; this is called the look-through policy. In the second case, as the desired item is read from the main memory block it is immediately sent to the processor, and the selected cache line is reloaded after this; this is known as the look-aside policy.

Cache performance can be studied experimentally or empirically, using approximate models. An experimental study requires typical programs and typical data from which address traces can be generated. Using such address traces, hit and miss ratios can be calculated. Recall that the hit ratio would vary from one program to another, and again, within a program, from one data set to another. For a given cache memory design, a benchmark of programs and data sets is needed to obtain an average of such numbers, so that one can compare the performances of two different cache designs. In what follows, as Example 10.3, we will use the program and the data set given in Program 10.1 earlier and compute the hit and miss ratios.

EXAMPLE 10.3
For Example 10.3 let us assume the following:
1. The computer is word organized.
2. One instruction occupies one word.
3. The block size for the cache is 5 words (so that we can highlight the important points).
4. There are two separate caches, one for instructions and one for data.
5. Direct mapping is used for the cache design.
6. The operating system has allocated one cache line for instructions and one cache line for data; they are called CL0 and DL0 respectively.
7. When the program is started the caches are properly loaded:
   CL0 contains the contents of memory from 10 to 14
   DL0 contains the contents of memory from 2001 to 2004

By observing Program 10.1 we note that there are 8 instructions loaded in the address range 10 to 17, and there are 100 data items loaded in the address range 2000 to 2099. As the block size is 5, the program and its data will occupy 2 blocks for instructions and 20 blocks for data in the main memory.

The program trace for this program was described earlier and is reproduced as follows. For instruction access, the memory addresses read are in the sequence:

10, 11, 12, 13, 14, 16, 12, 13, 14, 15, 16, 12, 13, 14, 16, ..., 17

For data access, the memory addresses read are in the sequence:

2001, 2002, ..., 2089, 2090, 2091, 2092, 2093, 2094, 2095, 2096, 2097, 2098, 2099

Certain addresses in these sequences (shown bold and underlined in the print edition) cause a 'miss' when they are accessed; the reason is explained in Table 10.5. The total number of memory accesses and the total number of misses can be calculated by observing the sequences and the addresses which cause the misses:

Total number of memory accesses               Number of misses
One time addresses 10, 11:               2    0
First time (12, 13, 14, 16):             4    1
49 times (12, 13, 14, 16):             196    98
50 times (12, 13, 14, 15, 16):         250    100
One time address 17:                     1    0
Data accesses:                         100    20
Total                                  553    Total misses 219

Hit ratio = (553 – 219)/553 = 0.6

TABLE 10.5 Hit and Miss for Program 10.1

Instruction addressing                      Followed by data addressing
10  CL0 hit
11  CL0 hit                                 2001  DL0 hit
12  CL0 hit                                 2099  DL0 miss [loaded 2095 to 2099]
13  CL0 hit
14  CL0 hit
16  CL0 miss [loaded 15 to 19]              2099  DL0 hit
12  CL0 miss [loaded 10 to 14]
13  CL0 hit
14  CL0 hit
15  CL0 miss [loaded 15 to 19]              2098  DL0 hit
16  CL0 hit                                 2098  DL0 hit
12  CL0 miss [loaded 10 to 14]
13  CL0 hit
14  CL0 hit
16  CL0 miss [loaded 15 to 19]              2097  DL0 hit
12  CL0 miss [loaded 10 to 14]
13  CL0 hit
14  CL0 hit
15  CL0 miss [loaded 15 to 19]              2096  DL0 hit
16  CL0 hit                                 2096  DL0 hit
12  CL0 miss [loaded 10 to 14]
13  CL0 hit
14  CL0 hit
16  CL0 miss [loaded 15 to 19]              2095  DL0 hit
... (and so on for the remaining iterations)

The hit ratio of 0.6 in the above example is a pathological case and not a realistic one. This simple example was chosen to show how a program trace can be used to calculate the hit ratio. An important point for the reader to observe from this example is the set of factors which cause a cache miss, and how a programmer can minimize the chances of causing a miss. In this example, if we were to make a moderate improvement by increasing the cache line size to 10 words instead of 5, there would be no misses during the instruction accesses and there would be only 10 misses during the 100 data accesses, giving a hit ratio of (553 – 10)/553 = 0.98.
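The accounting above can be reproduced by a small trace-driven simulation. In the following C sketch the trace is regenerated from the loop structure of Program 10.1 rather than read from a file; the alternating branch outcome is modelled, as an assumption, by the parity of the loop index, and the data references are generated in the descending order the program makes them. With these assumptions the sketch reports 553 accesses, 219 misses and a hit ratio of 0.60, in agreement with the accounting above.

#include <stdio.h>

static int cl0 = 10 / 5;          /* instruction line holds block 10-14 */
static int dl0 = 2001 / 5;        /* data line holds block 2000-2004    */
static int accesses, misses;

static void access_addr(int addr, int *line) {
    int block = addr / 5;         /* 5-word blocks */
    accesses++;
    if (*line != block) { misses++; *line = block; }
}

int main(void) {
    access_addr(10, &cl0);                  /* one-time prologue */
    access_addr(11, &cl0);
    for (int i = 99; i >= 0; i--) {         /* 100 iterations    */
        int five_path = (i % 2 == 0);       /* assumed alternation of JMIN */
        access_addr(12, &cl0);
        access_addr(2000 + i, &dl0);        /* LOAD R2, A, R3    */
        access_addr(13, &cl0);
        access_addr(14, &cl0);
        if (five_path) access_addr(15, &cl0);
        access_addr(16, &cl0);
    }
    access_addr(17, &cl0);                  /* HALT              */
    printf("accesses = %d, misses = %d, hit ratio = %.2f\n",
           accesses, misses, (double)(accesses - misses) / accesses);
    return 0;
}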

10.7 VIRTUAL MEMORY—ANOTHER LEVEL IN HIERARCHY

In large computer applications, there is a need to store data that runs into hundreds of giga (10^9) or tera (10^12) bytes. With the advent of multimedia and the World Wide Web, there is a need for storing and addressing very large amounts of data. In comparison to the cost and speed of main memory, disk memories are cheaper, larger and slower, by a factor of about 1000.

Consider the following scenario. A program or data (an object) to be handled is too large to fit into a given main memory. The programmer normally divides such an object into segments so that no segment is too large for the main memory, stores them on a disk, and dynamically loads and overlays the segments, as needed, during program execution. In a similar situation earlier, we exploited the 'locality' property to design a cache memory system. Using the same idea, we can define another level of hierarchy between the main memory and the very large disk space. The cache is one level closer to the CPU than the main memory, and similarly the main memory is one level closer to the CPU than the disk memory.

Virtual memory is a method used in many computer systems to give a programmer the illusion of having a very large addressable memory at his disposal, although the computer may not have a large main memory. A virtual memory system resembles the cache memory system. Using the notion of a virtual memory system, large application software can address and

manipulate objects stored on disks as if they were stored in main memory. It frees the application programmer from the chore of dividing an object into segments and of managing the transfer of segments between the disk and the main memory. Although a virtual memory system resembles the cache memory system, there is one important difference: the cache management is done in hardware and is completely transparent to all programmers, including the system software developers, whereas the virtual memory is managed by system software that is a part of the operating system.

The virtual memory system provides the following benefits to the application programmer:
1. It provides the user with a large virtual memory, which is a combination of memory space on a disk and in main memory. This memory can be used as though it were all main memory: a programmer can use an addressable memory as large as the virtual address space. The application programmer is freed from the management of the virtual memory system.
2. It helps to develop programs that are independent of the configuration and capacity of the memory system, because virtual memory management is part of the system software.
3. It permits efficient sharing of the memory space among different users of a multiuser system.

Consider a computer system like the 80486 that has a 32-bit effective address. It can support a maximum memory capacity of 2^32 or 4 GB. It is not economical to have a main memory as large as 4 GB. Let us suppose the memory system designer has chosen to have a 512 MB (2^29) main memory. The physical address space is then 512 MB, and the 4 GB is called the virtual address space. A programmer can use addressable memory as large as the virtual address space, as if it were fully available for program or data in main memory. Note that we must have a disk that is at least 4 GB in size to store the objects. The virtual memory management system will take care of the mapping between these two address spaces. For the sake of this management, the address spaces are partitioned. When the partition is into arbitrary but equal sizes, it is called paging; when the partitioning is based on logical reasoning leading to variable size partitions, it is called segmentation. In what follows we will focus only on paging. The concepts introduced in paging are quite similar to what we studied under cache memory systems.

10.7.1 Address Translation

We will use the terms VM page and MM page to refer to the partitions of the virtual memory and the main memory respectively. The VM page and the MM page are assumed to be of the same size. The conceptual mapping between them is shown in Figure 10.11. As the number of VM pages is much larger than the number of MM pages, the translation is many to one. Given the virtual address to be accessed, we have to find its VM page address, from this the corresponding main memory page address, and then retrieve the required word if it is there. (This is similar to cache access.) The fastest method would be to have a table giving which VM pages are in MM and to search this table associatively. Assuming that the VM is 4 GB, the MM is 256 MB and the page size

is 4 KB, there will be (2^32/2^12) = 1M pages in the VM and 64K pages in the MM. Such a table would be too large to search associatively. Thus another method of virtual address to main memory address translation is used; it is shown in Figure 10.12.

FIGURE 10.11 Conceptual mapping from VM pages to MM pages (a large set of VM pages mapped onto a much smaller set of MM pages).

In this method a page table is used to map a VM page to an MM page. The page table contains one entry corresponding to each VM page; thus there are 1M (2^20) entries in this table, each of at least 18 bits (see Figure 10.12). In this example, the page size is assumed to be 4 KB, needing a 12-bit address within the page. The virtual address from the CPU is split into a 20-bit VM page number and a 12-bit WORD field; the VM page number, used as a displacement from a page table base register, selects one of the 2^20 page table entries stored in the main memory. The selected entry supplies control bits and a 16-bit MM page number, which together with the 12-bit WORD field forms the physical address sent to the main memory.

FIGURE 10.12 Virtual address translation.

Each page table entry contains some

control bits and a 16-bit MM page address giving where the corresponding VM page is stored, if it is presently stored in the main memory. The control bits are used to indicate whether the VM page is in main memory or not, whether the page was altered after it was loaded into the main memory, the page-access frequency, and so on. The page table itself can be relocated anywhere in the main memory by using a page table base address register.

From Figure 10.12 we find that for every memory access one access to the page table is needed. If the page table is stored in main memory, it is necessary to access the main memory twice to translate a virtual address to a main memory address, which would reduce the effective memory speed by a factor of two. To reduce this access time, most computers use high speed buffers to store part of the page table, which we explain in the next subsection.

When the CPU accesses memory and the required VM page is not found in the main memory, a page fault is said to have occurred. Upon a page fault, the page replacement algorithm selects which MM page is to be rolled out from the main memory. The selected MM page is copied to the disk if its altered bit was set to ON. The required VM page is then read from the disk and loaded into the vacated MM page. Finally, the page-table-entry bits are properly set for further reading. This sequence of actions is similar to the one we discussed in the section on cache memories. The mapping methods discussed earlier, namely direct, associative and set-associative mappings, are applicable also to virtual memory management.

10.7.2 How to Make Address Translation Faster

We saw that if the page table is stored in the main memory, every translation costs an extra main memory access. To reduce this access time, a small fast buffer which stores the relevant part of a page table is used. Let us call this buffer the TLB (Translation Look-aside Buffer). If the required VM page entry is in this buffer, the address of the MM page is immediately found. Recall the principle of locality (temporal locality): if the same virtual address is accessed again in the near future, we can look into the TLB and obtain the MM page address without going through the slow process of address translation and page table access. Because the TLB is a fast and small buffer which is a part of the hardware system, the TLB access time is practically negligible when compared to the address translation time. Design experience has shown that a TLB of as few as 64 entries can give a TLB hit ratio ranging from 90% to 99.9%. In the example we are considering, we can have a small 64-location TLB with each entry having the following fields:

VM page address:                               20 bits
MM page address:                               16 bits
Valid bit (present/absent):                    1 bit
Dirty bit:                                     1 bit
Other bits (for example, usage frequency):     10 bits
TLB word length:                               48 bits
TLB size:                                      64 entries
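The TLB fields listed above translate naturally into code. The following C sketch models a fully associative 64-entry TLB with 4 KB pages; the sequential loop stands in for what is a parallel comparison in hardware, and the replacement shown is a simple rotating choice.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TLB_SIZE 64
#define PAGE_BITS 12                           /* 4 KB pages */

typedef struct {
    uint32_t vm_page;                          /* 20 bits */
    uint16_t mm_page;                          /* 16 bits */
    bool     valid;
    bool     dirty;
} TlbEntry;

static TlbEntry tlb[TLB_SIZE];
static unsigned next_slot;                     /* simple rotating replacement */

static bool tlb_translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpage  = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_SIZE; i++)         /* parallel in hardware */
        if (tlb[i].valid && tlb[i].vm_page == vpage) {
            *paddr = ((uint32_t)tlb[i].mm_page << PAGE_BITS) | offset;
            return true;                       /* TLB hit */
        }
    return false;                              /* TLB miss: walk the page table */
}

static void tlb_insert(uint32_t vpage, uint16_t mpage) {
    tlb[next_slot] = (TlbEntry){vpage, mpage, true, false};
    next_slot = (next_slot + 1) % TLB_SIZE;
}

int main(void) {
    uint32_t pa;
    tlb_insert(0x12345, 0xBEEF);               /* made-up mapping */
    if (tlb_translate(0x12345678, &pa))
        printf("physical address: 0x%08X\n", (unsigned)pa);  /* 0x0BEEF678 */
    return 0;
}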

When a VM page address is given as an input to the TLB, it must be matched against all the VM page address entries to determine whether it is in the TLB. If it is in the TLB, the corresponding MM page address is read (see Figure 10.13). In order to achieve speed, the given VM page number is compared simultaneously, by appropriate hardware, with all 64 entries of the TLB; all these activities are completed in one clock cycle. Such a fully associative search cannot be done if the TLB size is very large. When a new entry must be brought in, we need a replacement strategy for the TLB entries. The TLB controller determines what is to be replaced; usually a simple strategy like random replacement is used. Recall that the TLB is used in a way similar to the main memory cache.

FIGURE 10.13 TLB use in address translation.

10.7.3 Page Table Size

In computer systems having a large virtual address space (4 GB or 2^32), even when we use a large page size like 4 KB, the page table becomes quite large: 2^20, or 1 million, entries. From one task to another the page table mapping will be different, and hence page tables have to be stored and retrieved as tasks or programs are

switched from one to another. This could be very expensive, needing both more time in process switching and a large memory. Although the virtual address space is as large as 4 GB, a single task may not use this entire address space. One way to reduce the page table size is to include the page table entries for a limited address space and let this limit grow dynamically. The reader is referred to advanced textbooks for details of this kind.

With hardware and OS interactions such as this, we can also implement memory protection using virtual memory and paging. For example, a user task can be prevented from writing into the page tables but permitted to read them. This requires interaction between the system hardware and the operating system software.

10.8 PAGE REPLACEMENT POLICIES

A major objective of a good page replacement policy is to maximize the hit ratio. Unlike in the cache memory system, a miss is very costly in the VM system because a disk access is required: the access time ratio of disk to RAM is much higher (10^4 : 1) than that of RAM to cache (10 : 1). The computer system designer has to take steps to minimize this cost. The following is a list of commonly studied page replacement policies:

(a) FIFO policy: Replaces the VM page that came into the main memory first. This policy is very easy to implement using a queue data structure. It does not, however, adapt when the program behaviour changes.

(b) Random page replacement policy: This is the simplest of all the policies to implement.

(c) LRU policy: The least recently used policy selects for replacement the VM page that was least recently accessed by the processor. It is based on the assumption that the least recently used page is the one least likely to be referenced in the future. Its implementation can be supported by suitably designed hardware. For example, each MM page can be associated with a counter called an age register. Whenever an MM page is accessed, its age register is set to a predetermined positive integer; periodically, all age registers are decremented by a fixed amount. At any time, the least recently used page is the one whose age register contains the smallest number. There are other ways of implementing the LRU policy through the use of hardware/software counters.

(d) MIN policy: This is the best policy, but it is impractical. MIN considers each page with respect to the address trace of the program and selects for replacement that page whose next reference is furthest in the future. Such knowledge is not available during program execution. However, in experiment and design studies, where the complete address trace is available (as in simulation), the MIN algorithm gives a useful baseline against which replacement algorithms being considered can be compared.


(e) Working set policy: Earlier we introduced the notion of an address trace, a sequence of addresses corresponding to references to memory locations. Since each address falls into one MM page, the address trace can also be represented using MM page numbers instead of memory addresses. Let us assume such a representation is used. The working set is defined over a window T at a reference point t, and is denoted as W(t, T). It is the set of all unique page numbers appearing in the window of T pages, looking back in the address trace from the page at the reference point. Figure 10.14 gives an example of a working set. Since the working set is a set of page numbers, we can refer to its cardinality, denoted |W(t, T)|. The mean working set size is obtained by taking the average of |W(t, T)| over t, and is denoted W(T).
Time t:            1   2   3   4   5   6   7   8   9   10  ...
Address trace
(page numbers):   21  22  26  27  26  22  26  25  31  32  ...

Working set at t = 8 for T = 6: W(t = 8, T = 6) = {25, 26, 22, 27}
(observe that page 26 repeats and is thus included only once)
|W(t = 8, T = 6)| = 4

FIGURE 10.14 Explanation of working set.

The mean working set size is a function of the window size T. It has been found that the mean working set size possesses the following properties:

(a) 1 ≤ W(T) ≤ min(T, N), where N is the number of MM pages in the system
(b) W(T) ≤ W(T + 1)   (monotonically non-decreasing)
(c) W(T + 1) + W(T – 1) ≤ 2W(T)   (implies the curve is concave down)

The above properties yield the general shape of the curve shown in Figure 10.15. The smaller the working set of a program, the better it is from the point of view of the virtual memory system. It is the locality property of programs that makes their working sets much smaller than the respective program sizes. In the working set policy of page replacement, pages that are not in the working set are replaced, to release their MM pages; such replacements are done at page fault times. The choice of the window size is a design parameter that one optimizes by keeping the general shape of the curve (Figure 10.15) in mind. Because the working set model is based on the dynamic behaviour of memory references, this kind of policy can be expected to do better than the other methods. Working set measurements are also useful in comparing programs from the point of view of their suitability for virtual memory systems.

FIGURE 10.15 Shape of the working set curve: W(T) rises with the window size T, staying below the line W(T) = T and flattening out for large T.
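W(t, T) can be computed directly from a page-number trace. The following small C sketch reproduces the example of Figure 10.14.

#include <stdio.h>
#include <stdbool.h>

static int ws_size(const int *trace, int t, int T) {
    /* unique pages among the T trace entries ending at 1-based point t */
    int unique[64], n = 0;
    for (int i = t - T; i < t; i++) {
        bool seen = false;
        for (int j = 0; j < n; j++)
            if (unique[j] == trace[i]) { seen = true; break; }
        if (!seen) unique[n++] = trace[i];     /* count each page once */
    }
    return n;
}

int main(void) {
    int trace[] = {21, 22, 26, 27, 26, 22, 26, 25, 31, 32};
    printf("|W(t=8, T=6)| = %d\n", ws_size(trace, 8, 6));   /* prints 4 */
    return 0;
}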

Space-time product (SP) is sometimes used as a measure of goodness for evaluating various page replacement policies. It may be defined as follows:

SP = M(n + f·T2)

where
M  = primary memory allocated to that program (this will not be constant for the working set policy)
n  = number of memory references made by the program (an indirect measure of time)
f  = number of page faults
T2 = average time required to transfer a VM page to an MM page (expressed in the same unit as n)
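The following small C sketch evaluates SP for two hypothetical policies; all the numbers are illustrative assumptions, not measurements.

#include <stdio.h>

static double sp(double M, double n, double f, double T2) {
    return M * (n + f * T2);       /* SP = M(n + f*T2) */
}

int main(void) {
    double n = 1e6, T2 = 1e4;      /* references made; page transfer time */
    printf("policy A: SP = %.3g\n", sp(64.0, n, 120.0, T2));
    printf("policy B: SP = %.3g\n", sp(64.0, n, 300.0, T2));
    return 0;
}

With the same memory allocation, the policy that incurs fewer page faults yields the smaller (better) space-time product.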

It is easy to visualize that when M increases, f should decrease. The general trend of the relation between M and f is shown in Figure 10.16.
FIGURE 10.16 A typical parachor curve: the number of page faults f falls as the primary memory M allotted to the program grows, with a thrashing region below a critical allocation Mc.


Graphs of this shape are known as parachor curves, and they are very common in design problems. From this figure we note that if the primary memory allotted to a program is less than Mc, page faults increase rapidly. When page faults are too many, more time is spent in page replacement than in useful computation. This phenomenon is known as thrashing, and one should avoid the chances of thrashing. In a particular performance evaluation study, eight different benchmark programs were used and the average space-time products were computed for the following three page replacement policies [30]:

Policy         Average SP
MIN            12.84 units
Working Set    15.63 units
LRU            16.82 units

10.8.1 Page Fetching

Let us consider the problem of fetching a VM page from the high speed disk. There are two strategies for fetching a VM page: demand paging and pre-paging. In demand paging, a page is brought in only when a page fault occurs; until the virtual memory operating system completes the page replacement, the process has to wait. Pre-paging is a technique in which a page is pre-fetched in anticipation. Since a pre-fetched page occupies an MM page, one will not choose this strategy unless one is reasonably sure about the predictions on the use of the pre-fetched page. In the processing of large arrays, for instance, one may predict with reasonable accuracy that the page required next will be the next sequential page. Throughout this section we have seen many similarities between paging and cache memory concepts. The differences between them are summarized in Table 10.6.
TABLE 10.6 Differences Between Cache and Virtual Memory
Description                              Cache                   Virtual Memory
Access time ratio, r (typical)           10 : 1                  10000 : 1
Memory management by                     Special hardware        Mainly software
Typical partition size                   100 bytes               Several kilobytes
Processor's access to slower memory      Can access directly     Only through main memory

10.8.2 Page Size

Page size is yet another design parameter in virtual memory systems. Small pages lead to a large page table and increase the complexity of memory management. On the other hand, they minimize what is known as internal fragmentation. Suppose a program occupies m complete pages and one byte in the (m + 1)th page. If the page size is p bytes, (p – 1) bytes of the last page are unused because of this program. This space cannot be


allotted to another program. This type of main memory fragmentation is known as internal fragmentation. When the page size is small, every page replacement brings from the disk only a small amount of code/data, which also affects the hit ratio. Large page sizes have the converse properties. As a compromise between these two opposing factors, a designer has to choose an optimal page size; we will not go into the details of the design. The page size used in some of the commercial computers popular during the 1990s was 4 KB. Current computer systems favour much larger page sizes, of the order of 16 KB or 64 KB.

Let us now summarize the various decisions to be made by an operating system designer in the design of a virtual memory system using paging:
1. Select a page size.
2. Choose a suitable mapping for VM to MM.
3. Implement the page table search efficiently, using hardware/software.
4. Support the page replacement algorithm with suitable hardware, if the support function is more suitable for hardware than for software.
5. Decide upon the page fetching strategy.
6. Devise a rule to decide the minimum number of pages that should be allotted to a program to avoid thrashing.
7. Introduce software counters for measurement and evaluation of the chosen policies and strategies.
8. Select a representative set of benchmark programs for testing, evaluation and tuning of the implementation.
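Returning to the internal fragmentation argument above, the wasted space in a program's last page is easy to quantify. The following C sketch computes it for a hypothetical program size and the page sizes mentioned above.

#include <stdio.h>

static long wasted(long program_bytes, long page_size) {
    long r = program_bytes % page_size;
    return (r == 0) ? 0 : page_size - r;   /* unused tail of the last page */
}

int main(void) {
    long sizes[] = {4096, 16384, 65536};   /* 4 KB, 16 KB, 64 KB pages */
    long prog = 1000001;                   /* m complete pages + 1 byte */
    for (int i = 0; i < 3; i++)
        printf("page %6ld B: %6ld B wasted\n", sizes[i], wasted(prog, sizes[i]));
    return 0;
}

As the sketch shows, the expected waste grows with the page size (on average about half a page per program), which is the cost paid for the smaller page table.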

10.9 COMBINED OPERATION OF CACHE AND VIRTUAL MEMORY

We have discussed two different enhancements of main memory. The first was to increase its speed using a cache and the second to increase its address space using a virtual memory. In any computer both of them will be used. The question that arises is how do they work together. The major coordination is performed by a Memory Management Unit which interacts with both cache and main memory as shown in a simplified block diagram of Figure 10.17.
FIGURE 10.17 Combining cache and virtual memory: the CPU issues a logical address; the MMU, with its TLB, produces the physical address; and the cache controller, main memory and disk complete the access, with data flowing back to the CPU.


A step by step description of the operation is given as follows:

Step 1: The CPU generates the logical address of the required data/instruction.
Step 2: The TLB is searched to see whether the address is there. If it is, a physical address is generated and sent to the cache controller, and the system proceeds to the next step. Else go to Step 4.
Step 3: The cache controller initiates a search in the cache for the specified address. If it is in the cache, the data is retrieved and sent to the CPU. If it is a miss, the cache is updated from main memory and the value is sent to the CPU.
Step 4: The page table is searched. If the page is found, the physical address is generated, the TLB is updated, and control goes to Step 3. Else continue.
Step 5: As the required address is not in the page table, the page is obtained from the disk and placed in the main memory. The page table and the TLB are updated. The data is delivered to the CPU from the main memory.

Many of these complex operations are controlled by the operating system and the memory management unit hardware. An application programmer will not be aware of what is going on 'behind the scenes'.
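The five steps can also be summarized as control flow. In the following C sketch every helper function is a named placeholder for the corresponding hardware/OS mechanism (none of them is a real API); the stubs are wired only so that the program compiles and runs.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* hypothetical helpers standing in for MMU/OS actions */
static bool tlb_lookup(uint32_t la, uint32_t *pa)        { (void)la; *pa = 0; return false; }
static bool page_table_lookup(uint32_t la, uint32_t *pa) { (void)la; *pa = 0; return true;  }
static void tlb_update(uint32_t la, uint32_t pa)         { (void)la; (void)pa; }
static bool cache_read(uint32_t pa, uint32_t *data)      { (void)pa; *data = 0; return false; }
static void cache_fill_from_memory(uint32_t pa)          { (void)pa; }
static void load_page_from_disk(uint32_t la)             { (void)la; }

static uint32_t memory_access(uint32_t logical_addr) {
    uint32_t pa, data;
    if (!tlb_lookup(logical_addr, &pa)) {                /* Steps 1-2 */
        if (!page_table_lookup(logical_addr, &pa)) {     /* Step 4    */
            load_page_from_disk(logical_addr);           /* Step 5    */
            page_table_lookup(logical_addr, &pa);
        }
        tlb_update(logical_addr, pa);
    }
    if (!cache_read(pa, &data)) {                        /* Step 3    */
        cache_fill_from_memory(pa);
        cache_read(pa, &data);
    }
    return data;
}

int main(void) {
    printf("data = %u\n", (unsigned)memory_access(0x12345678u));
    return 0;
}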

SUMMARY
1. The memory system of a computer uses a hierarchy of memories, appropriately organized, to provide applications a cost-effective overall memory with high capacity and fast access.
2. The main RAM uses DRAMs which are low cost, high capacity solid state memories. Their access time is around 50 ns. To get faster access the DRAM is combined with a small SRAM memory called a cache memory. A cache is smaller but faster than the DRAM main memory.
3. Programs usually follow a pattern of memory access in which instructions stored in a small range of addresses are frequently referenced. The same is true for data accesses. Instruction accesses and data accesses are clustered in two distinct ranges of addresses. This is called locality of reference. This fact is used to keep frequently referenced instructions and data in small fast memories called caches. Normally there is a data cache and an instruction cache. The size of caches has increased to 32 KB over recent years and caches are integrated in the CPU chip. Besides these, many new computers have second level unified caches, called L2 (level 2) caches, also inside the CPU chip, whose sizes are around 1 MB. They are usually SRAMs.
4. When the CPU wants to access an instruction or data, it first accesses the cache. If the data is there, it is called a cache hit. In the case of a hit the access time is that of the cache. Else the access time is the main memory access time, which is around 10 times larger. Thus a designer tries to increase the hits.


5. The cache hit ratio is defined as h = (the number of times a required data is found in the cache)/(total number of data accesses to main memory). The maximum value of h is 1. Usually a good cache design, and a good policy of placing data from main memory in the cache, gives h > 0.95.
6. A cache is much smaller than the main memory (~1000 times smaller) because its cost per byte is high. Therefore, we need to follow a policy which anticipates the requirement of data and places it in the cache. The data must be removed from the cache and put back in the main memory when it is not needed.
7. The cache can store only a small part of the contents of main memory. When the CPU specifies an address from which data is to be retrieved, one should be able to quickly find out whether the data in that address is in the cache or not by searching the cache. To expedite this search, blocks of memory addresses should be systematically mapped to blocks of cache addresses known as cache lines. This mapping is an important design decision. Three mapping strategies known as direct mapping, associative mapping and set-associative mapping are available. Figure 10.9 in the text summarises these three strategies.
8. Two other design decisions are: (1) When is data updated in the cache? Should it also be updated in the main memory immediately? If it is updated immediately it is called a write-through policy and if it is done later it is called a write-back policy. (2) When the required address is not in the cache, what should be removed from it and replaced with the required addressed data from the main memory? Several policies have been used. The most popular one is to replace the least recently used cache line (called the LRU policy). Use of cache memory is now the standard in all computer organizations.
9. Another commonly used method in the design of memory systems is called virtual memory. A virtual memory provides a user with a large logical memory address space to write programs even though the available physical main memory (DRAM) may be small. This is achieved by combining a DRAM and a magnetic disk memory in such a way that the memory cycle time of the combined system is closer to the DRAM cycle time while a large storage space is provided by the magnetic disk.
10. Virtual memory design also uses the locality principle and is designed broadly using ideas similar to cache design. However, the details are quite different. In other words, the ideas of mapping and replacement strategies between DRAM and disk are used but the detailed methods differ significantly.
11. In any computer both cache and virtual memory will be used. Thus a computer system should be able to use both effectively. Hardware and the operating system cooperate to achieve this.
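The arithmetic behind points 2, 4 and 5 can be made explicit. With hit ratio h, cache access time t_c and main memory access time t_m, the average access time is h × t_c + (1 – h) × t_m. A small illustrative computation (the timing figures are assumed, not prescribed):

```c
#include <stdio.h>

int main(void) {
    double h  = 0.95;   /* hit ratio, as in point 5             */
    double tc = 5.0;    /* cache access time in ns (assumed)    */
    double tm = 50.0;   /* DRAM access time in ns, see point 2  */

    double t_avg = h * tc + (1.0 - h) * tm; /* average access time */
    printf("average access time    = %.2f ns\n", t_avg);
    printf("speed up over no cache = %.2f\n", tm / t_avg);
    return 0;
}
```

With these numbers the average access time is 7.25 ns, a speed up of about 6.9 over accessing the DRAM alone, which is why increasing h is the designer's main goal.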


EXERCISES
1. Write a program to add two 5 × 5 matrices. Let the matrices be stored column by column in the memory and let the program address the elements of the matrices row by row. Obtain the address trace for your program. Comment on the program locality. Could this program be bad for a cache memory system? If so, under what conditions?
2. A block-set-associative cache consists of a total of 128 cache lines with two lines per set. The main memory contains 4K blocks with 16 words per block. Draw a figure explaining the mapping and show the partitions of an address into TAG, SET and WORD.
3. A computer has a main memory of 1 GB and two caches of 8 KB each, one for data and the other for instructions. Assume a block size of 2K and answer the following questions:
(i) How many lines are there in each of the caches?
(ii) Assuming direct mapping, which memory blocks will get mapped to the cache lines in the instruction cache and which get mapped to the cache lines in the data cache?
4. For the data given in Exercise 3 describe the mapping in—
(i) a two-way set associative cache.
(ii) a four-way set associative cache.
(iii) What are the advantages and disadvantages of a two-way set associative cache compared to a four-way set associative cache?
5. For the data given in Exercise 3 describe how fully associative mapping is done. What are the advantages and disadvantages of associative mapping compared to (i) direct mapping (ii) two-way set associative mapping?
6. What are the policies available to update cache memory when there is a read or write miss? Compare the policies which you have enumerated.
7. A computer's memory system has a direct mapped cache. If there is a hit to the cache the access time is 30 nsec. If there is a miss the access time goes up to 100 nsec. If the probability of a hit with this cache is 98%, estimate the speed up due to the cache.
8. A computer designer runs a set of typical benchmarks with and without a cache. He finds that the speed up due to the cache is 60%. If the cache has an access time of 50 nsec, estimate the access time of the main memory.
9. In order to implement the LRU policy of replacing a memory block in a cache line, what data needs to be maintained by the cache management unit? Explain how this data will be maintained and used.
10. What is the disadvantage of the write-through policy in a cache memory system? Similar to write-through, there is a notion of read-through, in which case the CPU


is given the accessed word without waiting for the entire block to be written into the cache. Will you recommend read-through in a system? Explain.
11. Explain clearly, as a designer, when you will recommend a virtual memory system.
12. A virtual memory system has a page size of 2K words and contains 8 VM pages and 4 MM pages. The page table contents at some instant of time are as shown below (– means not in main memory):
VM page: 0, 1, 2, 3, 4, 5, 6, 7
MM page: 3, 0, –, –, 1, –, 2, –
What addresses will result in page faults?
13. Consider the address trace given in the text for Program 10.1. Let the page size of a virtual memory system on which this program is run be 32 words. The program is allocated one page for instructions and one page for data. Calculate the space-time product SP for (a) FIFO, (b) LRU, (c) MIN replacement policies. Comment.
14. Would you recommend prefetching the ‘data page’ in the above problem? How will your recommendation differ if the data array is accessed iteratively for some computation?
15. A virtual memory is designed using a TLB. The main memory access time is 70 nsec. When the virtual memory is accessed, the TLB hit ratio is 80% and the TLB access time is 5 nsec. What is the effective access time to main memory assuming there is no need to access the disk?
16. Repeat Exercise 15 assuming that the page table hit ratio is 98% and the time to access a page from the disk is 5 msec.
17. A computer has a main memory of 1 GB and the virtual address is 64 bits long. The page size is 4 KB.
(i) Give the layout of the TLB you would design.
(ii) Give the page table layout. How many entries will you have in the page table and why would they be needed? Give the length in bits of each entry.
18. A computer system has a memory hierarchy consisting of a cache memory, main memory and a disk. In a set of benchmarks it is found that the cache hit ratio is 98%, the TLB hit ratio is 90% and the page table hit ratio is 95%. Given that the access time of the cache is 25 nsec, that of the main memory is 100 nsec and that of the TLB is 5 nsec, estimate—
(i) the performance degradation with no cache.
(ii) if the TLB is also absent, how much the performance will degrade.
19. When a set of benchmarks was run on a computer it was found that 5% of the time the disk had to be accessed due to the required page not being in the page table. If the page table is redesigned so that only 1% of the time the disk has to be accessed, find the improvement in performance. Assume a main memory access time of 50 nsec and a disk access time of 5 msec.
20. Explain how you would use the LRU policy in a virtual memory system. In what way does it differ from that used in a cache memory?

11

INPUT-OUTPUT ORGANIZATION

LEARNING OBJECTIVES
In this chapter we will learn:

• How Input/Output (I/O) devices are interfaced with the CPU and memory systems of a computer.
• How the speed mismatch between CPU speed and I/O device speeds influences I/O organization.
• How the three major methods of I/O data transfer to memory, called program controlled, interrupt controlled and Direct Memory Access based, are designed, and their major differences.
• Why special I/O processors are needed and their connection to the computer system.
• The need for buses to interconnect CPU, Memory and I/O devices, and details of some standard buses.
• The need for serial data communication between computers located at different geographic locations.
• Methods of interconnecting computers within a small geographic area as a Local Area Network.


11.1 INTRODUCTION

When I/O devices are to be connected to the CPU and memory, we should remember the following important facts:
1. There is a vast variety of I/O devices. Some devices such as the keyboard and mouse are very slow (nearly 100 bps) whereas others such as disks and flash memories are much faster (nearly 10⁸ bps).
2. The method of data transfer differs from device to device. There is no uniformity.
3. All I/O devices transfer data at a rate much slower than the rate at which the CPU and memory can process them.
4. The number of data bytes and their formats differ from one device to another. For example, data from a keyboard is sent byte by byte asynchronously whereas that from a disk is a sector, which is usually 512 bytes and is sent as a stream.

It is thus necessary to have an I/O interfacing unit which acts as an interface between the I/O devices and the CPU/Memory. In Figure 11.1 we sketch this interconnection.
FIGURE 11.1 I/O interface unit to connect a device to CPU/Memory.

The main functions of this I/O interface unit are as follows:
1. To receive instructions from the CPU to Read/Write, interpret them, and send appropriate commands to the specified I/O device.
2. For a write command, to store temporarily the data received from the main memory. The specified device is then informed and made ready to accept the data. When the device is ready, it accepts the data from the buffer.
3. For a read instruction received from the CPU, to send a command to the specified device. When the requested data is received from it, to put the data in a buffer and inform the CPU that the data is ready. The CPU should now start accepting data from the buffer and store it in the memory.

4. To control all I/O operations using information such as read/write priority sent by the CPU.
5. To monitor errors, if any, during data transmission, correct them (if possible) and inform the CPU if data is corrupted.

The I/O interface unit should communicate the following to the CPU/Memory system for correct transfer of data:
1. When data is ready in the interface unit buffer to be read by the memory/CPU system, send an appropriate signal.
2. During a write, whether the interface buffer is ready to accept data.

In this chapter we will describe the design of the I/O interface unit and the means of communicating to and from I/O devices to the CPU/Memory system.

11.2 DEVICE INTERFACING

Let us suppose that an I/O device has an integrated programmable unit. As an example, we will assume a dedicated processor to be resident in a computer keyboard device that constantly scans the rows of keys in the keyboard and does the following:
1. Determines if any key is pressed.
2. If a key is pressed, stores the corresponding ASCII code in a ‘buffer storage’.
3. Requests the CPU to read the buffer.
4. Waits till the buffer is read and then goes back to scanning the keyboard.

Figure 11.2 depicts the sequence of operations listed above in the form of a state diagram.

FIGURE 11.2 A state diagram for the keyboard scanner.

The set of keys in a keyboard is organized in the form of a matrix

with several rows and columns. The scanner sub-system contains device dependent operations that perform sub-operations such as scan one row, check each column of that row and move to the next row.

As I/O devices differ widely in their characteristics and in their operational principles, it is convenient to separate device dependent aspects from device independent aspects. Thus, we introduce two terms: device controller and interface unit. The device controller performs device-dependent electro-mechanical or electro-optical functions. The interface unit, on the other hand, performs the logic of communicating with the receiver of the input read or with the sender of the output to be displayed. An I/O organization based on this division of responsibilities is shown in Figure 11.3.

FIGURE 11.3 Communication between CPU and peripheral devices.

A computer system contains many I/O devices and each of them will be assigned a unique address for identification. The interface units of the various I/O devices are connected using a system bus to the rest of the computer system as shown in Figure 11.3. In order to perform its functions, the interface unit should have the following features (see Figure 11.4):

FIGURE 11.4 Parts of an I/O interface.

A device address decoder and control logic: Each device has a unique address. This address is decoded by the address decoder in the interface unit. Besides address decoding, the logic also sends control signals to devices to read/write and monitors their response.

A data register: In a data register the data to be moved to or from the memory is stored. In the simplest case, such as a character printer, this may be a one-byte register. In some devices, it would be a one-word register and there will be another register to store the number of words to be moved in or out of the device.

A status register: A status register is used to specify when the data from the device is ready to be read by the CPU or whether the device is ready to receive data from the memory. In the simplest case, this will be just a flip-flop. This register is vital to synchronize the operation of devices with that of the CPU.

When the CPU wants to send data to a device or receive data from it, it places the device's address on the I/O bus. In some microcomputers, part of the memory address space is used to address I/O devices. This has the advantage of allowing access to I/O devices as though they are memory locations. Special I/O instructions need not be defined, as a MOVE to or from one of these addresses will, by implication, move data to an I/O device from memory or from an I/O device to memory.

The I/O interface unit varies in complexity depending on the number and type of I/O devices it controls. In smaller machines the CPU does a fair amount of the device control functions. In large computers with a powerful CPU, the complete control of all devices and I/O operations is delegated to the I/O module. In such a case the I/O module is a full fledged special purpose computer which is usually called an I/O processor or channel. In the case of faster peripheral devices such as magnetic disks, the I/O interface is a VLSI chip called an I/O or device controller. In our discussions we will mostly use the more general term interface unit and describe its working.
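The register set and memory-mapped addressing described above may be pictured in C roughly as follows. This is only a generic sketch: the base address 0x4000 and the bit layout are invented for illustration, and every real device defines its own.

```c
#include <stdint.h>

/* Hypothetical memory-mapped I/O interface registers.       */
/* 'volatile' because the device may change them on its own. */
typedef struct {
    volatile uint8_t data;     /* data register               */
    volatile uint8_t status;   /* bit 0: data ready,          */
                               /* bit 1: device busy          */
    volatile uint8_t control;  /* bit 0: 1 = read command     */
} io_interface_t;

#define DEV0 ((io_interface_t *)0x4000)  /* assumed device address */

/* A MOVE (a plain assignment) to these addresses is, by      */
/* implication, an I/O operation - no special instruction.    */
static inline void    start_read(void) { DEV0->control = 0x01; }
static inline int     data_ready(void) { return DEV0->status & 0x01; }
static inline uint8_t get_data(void)   { return DEV0->data; }
```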

11.3 OVERVIEW OF I/O METHODS

When data is to be transferred from I/O units to or from memory, three methods are possible. They are as follows:

Program controlled data transfer: In this case, when an I/O instruction is encountered in a program, the CPU issues an I/O command to the I/O interface module and waits. The I/O interface performs the I/O operation and when it is completed, the status flip-flop in the I/O interface is set, indicating that the job is over. The CPU continuously monitors this flip-flop and when it is set, it takes the data from the I/O interface buffer (for read) and goes to the next instruction. In the case of a write, it places the data to be written in the I/O interface buffer and proceeds to the next instruction. It may be seen that the CPU waits till the I/O is complete. This wastes CPU time. Data is stored in memory via the CPU. This method is summarized in the flow chart given in Figure 11.5.

FIGURE 11.5 Program controlled I/O transfer.

Interrupt driven data transfer: In this case, when an I/O instruction is encountered, the CPU issues an I/O command to the I/O interface. It then continues with the next instruction in the program. The assumption is that the program does not require the data to be read from the device immediately. The I/O interface assembles in its buffer register the data read from the device. When the interface is ready with the data needed by the CPU, it sends an interrupt signal to the CPU. The CPU now suspends the program it is executing, retrieves the data from the I/O interface and places it in memory. It then returns to program execution. This method of reading data from a device is summarized in the flow chart of Figure 11.6.

FIGURE 11.6 Interrupt controlled I/O transfer.

Direct memory access: In the previous two methods the CPU is involved directly in the I/O operation. In DMA the CPU issues an I/O command to the I/O interface unit and proceeds with the rest of the program. It has no further role in the data transfer. When the data is ready, the DMA interface transfers it directly to the specified address of memory. Observe that in this case the CPU does not wait. We thus see that in the first two cases data is transferred to memory via the CPU whereas in DMA it is directly transferred to memory. We will now describe these three methods in greater detail.

11.4 PROGRAM CONTROLLED DATA TRANSFER

Let us view the I/O operation as a process and the main computation performed in the CPU as another process, and call them the I/O process and the computation process respectively. (The state diagram shown in Figure 11.2 depicts the I/O process.) These two processes communicate with each other in order to exchange information, such as data, coded commands, control signals or status signals. As we saw in Section 11.3, the information transfer between them can take place in several ways. This is one of the major aspects of I/O organization. In this section we describe in greater detail program controlled data transfer between the I/O process and the computation process. In Figure 11.7 we give a block diagram of the major units of a computer. The various buses connecting the units are also shown. The interface unit has the following registers:

IBR: Interface Buffer Register (one byte)
Data-Ready: Flip-flop that indicates that data is ready in the IBR
Busy: Flip-flop that indicates that the I/O interface unit is engaged (or busy)

FIGURE 11.7 Details of CPU-I/O-memory communication.

The instructions may be of the form: I/O operation code, device address, device command. A series of steps used for actual data transfer is given as follows:

Step 1: An I/O instruction is encountered in the program being executed. The CPU sends on the I/O bus the device address and the command to be executed, namely, read or write. (It is assumed that the programmer knows that the device is free before issuing the command.)
Step 2: The device address on the I/O bus is recognized by the address decoder of the desired I/O device interface unit.
Step 3: The interface unit commands (for a read command) the concerned device controller to assemble a word in the interface buffer register (IBR) and at the same time turns on the interface busy flip-flop. As long as this busy flip-flop is set, the interface will not entertain any other I/O requests. The data ready flip-flop in the interface unit is reset to 0.
Step 4: The device controller fills the IBR. As soon as the data is ready in the IBR, the data ready flip-flop of the device interface is set to 1.
Step 5: The CPU continually interrogates the data ready flip-flop in a wait loop. It gets out of the loop as soon as the data ready flip-flop is set to 1.
Step 6: When this happens, the contents of the IBR are transferred to the specified CPU register through the data bus.

This method of transfer of data from I/O to the CPU is called program controlled transfer. In this method synchronization of the I/O device and the CPU is achieved by making the CPU wait till the data is assembled by the device. As I/O devices are normally much slower compared to the CPU, this method of synchronization is not desirable as it wastes CPU time.
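Seen from the software side, Steps 1 to 6 collapse into a short busy-wait loop. The sketch below assumes hypothetical memory-mapped registers like those sketched in Section 11.2; it is illustrative, not the instruction sequence of any particular machine.

```c
#include <stdint.h>

/* Hypothetical memory-mapped registers of one interface unit. */
extern volatile uint8_t *dev_status;  /* data-ready flip-flop, bit 0 */
extern volatile uint8_t *dev_data;    /* interface buffer register   */
extern volatile uint8_t *dev_cmd;     /* command register            */

uint8_t program_controlled_read(void) {
    *dev_cmd = 0x01;                   /* Step 1: issue read command   */
    while ((*dev_status & 0x01) == 0)  /* Step 5: wait loop on the     */
        ;                              /*         data-ready flip-flop */
    return *dev_data;                  /* Step 6: copy IBR to CPU reg  */
}
```

The empty `while` loop is exactly the CPU time wasted by this method of synchronization.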

There are two ways in which the CPU waiting time can be reduced.

Method 1: The programmer estimates the time required to read data from a specified device. A start reader command is issued early in the program, several steps ahead of the need for the data. Concurrently the interface assembles in the IBR the data needed. When the CPU actually needs the data, a read command is given. If the programmer had correctly estimated the time needed to assemble the data, then the data would be available when the read command is encountered and the CPU would not have to wait. This method requires the programmer to know the correct instruction timings.

Method 2: In this method also the programmer issues a start reader command well ahead of the need for the data, and the CPU continues executing other instructions in sequence. The interface proceeds to assemble the data in the IBR. As soon as the data is ready, the interface sends a signal to the CPU and interrupts it. When the CPU is interrupted, it jumps to a special sub-routine which reads the contents of the IBR and stores it in the specified location in memory. This method is called program controlled interrupt transfer. This method is better than the previous method as a programmer need not carry out detailed timing calculations. However, the method becomes complicated when the number of devices available is large. In addition, if many devices are active, then the CPU will receive many interrupt signals and a method should be evolved to handle them systematically. We will consider this aspect in detail in the next section.

11.5 INTERRUPT STRUCTURES

In the last section we introduced the idea of transferring data between a peripheral and the CPU using an interrupt. We will expand on this idea in this section. An interrupt signal is sent by an I/O interface to the CPU when it is ready to transfer data to (or from) the memory. That is, instead of the CPU trying to find out whether the data is ready, the CPU is told that the data is ready by the interrupt signal. There are two processes: one is the interrupted process and the other is the interrupting process. In the case of I/O transfer between the CPU and I/O, the CPU is the interrupted process and the I/O interface is the interrupting process.

The interrupt process may be compared to the action of a housewife when she hears a doorbell while she is working in the kitchen. She may have been boiling milk when the bell rings. She would then shut the flame and go to the door to attend to the call. After receiving the visitor and attending to him or her, she will return to the original job she was doing, namely, boiling milk. We can extend this analogy recursively. When she is interrupted while boiling milk and she is walking towards the door to open it, if the telephone rings she rushes to pick up the phone. That is, the interrupting process is now interrupted by a higher priority event.

When the CPU is interrupted, the CPU completes the current instruction

it is executing and then attends to the interrupt. The interrupt request is serviced by executing a pre-written interrupt-service program. In order to restart the computation, the CPU should store (possibly in a stack) the process state before branching to the interrupt service program. After the interrupt is serviced, the CPU will return and restart the interrupted computation. Restarting the computation can then be achieved by a simple process-switching. If we assume there is only one interrupt, the sequence of steps involved in interrupt processing is as shown in the following subsection.

11.5.1 Single Level Interrupt Processing

Step 1: Interrupt signal is received by the CPU.
Step 2: CPU completes the current instruction.
Step 3: CPU stores the contents of the program counter, the processor status word and other general purpose registers in a reserved area in the main memory or in a stack.
Step 4: CPU sends an acknowledge signal to the I/O interface and jumps to an interrupt service routine. The interrupt service program carries out the required transfer of data from or to the I/O device.
Step 5: The last instructions in the service program restore the contents of the general purpose registers. The content of the program counter is then restored to that stored in Step 3 and the CPU returns to execute the original program it was executing.

These steps are summarized in Figure 11.8.

FIGURE 11.8 Sequence of events in interrupt processing.

Figure 11.9 depicts the status of memory during interrupt processing.

FIGURE 11.9 Memory map in interrupt processing.

Interrupt levels: The assumption that only one interrupt occurs is not realistic. In practice the CPU should handle a variety of interrupts from many I/O devices. Besides interrupts from device interfaces to transfer data from external devices, other interrupts are also generated. The order of importance or priority of these interrupts is different. An example is the occurrence of an emergency condition such as power failure. When a sensor detects that power is about to fail, it sends an interrupt signal to the CPU. The CPU initiates a power failure service routine whose execution may be supported by a small battery backup power supply. This routine would store all important registers in non-volatile memory so that the program may be

resumed when power is restored.

Another type of interrupt is caused if the hardware of a computer develops a fault. When the error detecting circuitry detects a fault, the CPU is informed of the location and the nature of the fault. The CPU then jumps to a fault service routine which determines if it is a correctable fault (such as when error correcting codes are used), carries out the correction and resumes the user's program. If it is a non-correctable fault, operation is suspended after printing an appropriate message.

Another class of interrupts, internal interrupts (also known as traps), can occur when there is a mistake in a user's program. Common mistakes which lead to traps are: an attempt to divide by zero, accumulator overflow or use of an illegal operation code. When such a mistake is detected by the CPU, it interrupts the user's program and branches to a trap routine, which usually prints out the place where the trap occurred and the reason for the trap. The CPU may then resume the program if the mistake is not a serious one or suspend execution in case it is meaningless to continue with the interrupted program.

Interrupts generated by a real time clock in a computer are used to regulate the allocation of CPU time to users. Such interrupts are necessary to ration time to different users and to throw out jobs stuck in endless loops. Interrupts caused by I/O devices and emergency events outside the CPU's control are called external interrupts.

As there are many types of interrupts, it is necessary to have a method of distinguishing the types of interrupts. The number of types of interrupts that can be distinguished by a CPU is called its interrupt levels or interrupt classes. The relative importance given to an interrupt is called its priority. Further, if during interrupt processing another interrupt occurs, a procedure to attend to this is to be formulated.

11.5.2 Handling Multiple Interrupts

All interrupt systems use similar procedures to request CPU attention. A simple model of this is shown in Figure 11.10.

FIGURE 11.10 Handshaking between CPU and I/O interface unit.

Two flip-flops are used in the I/O interface unit. The interrupt request flip-flop R is set by the device controller when the device needs the attention of the CPU. When the CPU is ready to accept the interrupt, it

sends a signal to the interrupt acknowledge flip-flop A and sets it. As soon as A is set, the device can transfer data to the CPU. This procedure of exchanging signals between processes to communicate each other's state is known as handshaking. The model shown in Figure 11.10 is simplified and shows only one device. Let us suppose that several I/O devices of the same interrupt class are involved. Then two problems arise:

• How to connect the INTR (interrupt request) lines of these devices to the CPU?
• How will the CPU determine which of the many devices has sent the interrupt so that the INTA (interrupt acknowledge) can be given to the right device?

Further, if several devices interrupt simultaneously, the CPU should decide in which order (or priority) the devices should be serviced.

11.6 INTERRUPT CONTROLLED DATA TRANSFER

There are two ways to connect INTR line(s) to the CPU. A simple method would be to use one INTR line and one INTA line per device and connect them to the CPU. The other is to OR all the INTR lines and send only one INTR line to the CPU. In this case only one signal is received on the INTR line; the CPU should find out which device interrupted and service it. Even though handling interrupts using the first method is simple, it is impractical as bus widths are limited and computers normally have a large number of I/O devices. It is thus necessary to group sets of devices and service the devices within each group. In this case the INTR lines of each group are ORed, and a smaller number of lines are thus used to connect to the CPU. Within each group one has to assign priority for servicing. Thus, the second method is essential and we will describe it now.

There are five major methods of servicing interrupts. They are as follows:
1. Software polling
2. Bus arbitration
3. Daisy chaining, which is a hardware arrangement
4. Vectored interrupt handling (which uses one component of a vector of interrupt routine addresses)
5. Use of multiple interrupt lines

We will describe each one in turn.

11.6.1 Software Polling

Polling is a method in which a process periodically checks the status of an object. Usually the status is that of a flip-flop indicating a status bit. When the status bit changes, a specified action is taken. The flow charts of Figure 11.11 (a) and (b) illustrate polling. The flow chart of Figure 11.11(a) shows the process of filling the buffer of an interface unit from a device controlled by it. The status bit is 0 during

the process of filling the buffer. When the buffer is full, the status bit is set to 1 in the interface unit. The flow chart of Figure 11.11(b) shows how the status bit is continuously checked by the polling routine and, when it is set to 1, the buffer is read. Observe that the instruction (Is buffer full bit = 1?) returns to itself as long as the buffer full bit is 0 and goes to the next instruction, namely, Read Buffer, when the bit is set to 1.

FIGURE 11.11 Polling of interface unit.

If the instruction (Is buffer full bit = 1?) takes 200 ns to execute, the polling rate is 1/(200 × 10⁻⁹) = 5 million per second.

When there are several I/O devices capable of interrupting the CPU, polling is done by checking the INTR flags of all the I/O units one after another cyclically to find which unit's INTR flag is set. The sequence in which the INTR flags of the units are polled is based on their priority. If more than one INTR flag is set, then the unit with the higher priority should be serviced first.

The CPU may choose to ignore an interrupt by setting an interrupt mask bit to 1 in the CPU using an appropriate instruction. Thus, a program would set the mask bit to 1 if it does not want to attend to an interrupt from that device immediately. It can choose when it wants to service the interrupt by resetting the mask bit to 0. This scheme is quite flexible.
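A software polling pass over several devices, with priorities and mask bits as described above, might look like the following C sketch (the device count, flag arrays and service routine are invented for illustration):

```c
#include <stdio.h>

#define NDEV 4

/* Simulated INTR flags and mask bits (device 0 = highest priority). */
/* As in the text, a mask bit of 1 means "ignore this device".       */
int intr_flag[NDEV] = { 0, 1, 0, 1 };  /* devices 1 and 3 request    */
int mask[NDEV]      = { 0, 1, 0, 0 };  /* device 1 is masked off     */

void service_device(int d) { printf("servicing device %d\n", d); }

/* One polling pass: scan the INTR flags cyclically in priority     */
/* order and service the first unmasked unit whose flag is set.     */
void poll_devices(void) {
    for (int d = 0; d < NDEV; d++) {
        if (intr_flag[d] && !mask[d]) {
            intr_flag[d] = 0;
            service_device(d);   /* higher priority serviced first  */
            return;
        }
    }
}

int main(void) {
    poll_devices();              /* prints: servicing device 3      */
    return 0;
}
```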

11.6.2 Bus Arbitration

Polling is time consuming. To reduce the time, a method called bus arbitration is used. In this case an interrupting unit gains access to the data bus and posts its device address in a device address register in the CPU (see Figure 11.12). The interrupt service routine reads this address, thereby uniquely identifying the device. When there are several I/O devices capable of interrupting the CPU, a method should be found to arbitrate among them and service one of them at a time.

FIGURE 11.12 Use of address bus for device identification.

11.6.3 Daisy Chaining

Daisy chaining is a hardware method used to assign the order of priority in attending to interrupts. In this scheme the interrupt acknowledge (INTA) lines of the I/O interface units are serially connected. This serial connection of INTA lines is called daisy chaining. The INTR lines from all the I/O interface units are ORed and a single INTR line is terminated in the CPU. The daisy chained INTA line also terminates in the CPU. In the daisy chain the highest priority I/O interface is placed nearest to the CPU and the lower priority ones are placed behind in order of their priority. A controllable switch is incorporated in each of the blocks (see Figure 11.13). Referring to Figure 11.13, I/O interface D1 has the highest priority and Dn the lowest priority.

FIGURE 11.13 Daisy chaining.

The arbitration method used is described as follows. (It is assumed that all switches S1, S2, …, Sn–1 are initially closed.)

Step 1: When a device, say D2, needs interrupt service it sends a signal on the INTR line. The CPU recognizes INTR and sends the INTA signal, which is captured by D2. D2 opens the switch S2 so that no lower priority device can receive a signal on the INTA line.

Step 2: D2 now places its address on the address bus. This address is sensed by the CPU.
Step 3: Using the device address the CPU jumps to the appropriate interrupt service routine.
Step 4: After D2 is serviced, S2 is closed so that interrupts by lower priority devices can be acknowledged by the CPU.
Step 5: If, during the processing of the interrupt of a device, a higher priority device interrupts, the current processing is suspended by the CPU at an appropriate point and it jumps to process the interrupt of the device with the higher priority.

11.6.4 Vectored Interrupts

A major aim in I/O organization is to reduce the overhead involved in interrupt handling. In Step 2 above, we said that the I/O device identifies itself to the CPU by posting its own address on the bus. This idea is extended in vectored interrupts. We will explain this with an example. Let us suppose there are four different I/O devices connected to a CPU. Let us associate the vector of addresses (1000, 1004, 1008, 1012) with the four devices (#1, #2, #3, #4) respectively. Assume that the system programmer or the operating system designer has written the interrupt service routines for each of these devices and stored these routines in a reserved area of memory as shown in Figure 11.14. The vector of addresses stored starting from address 1000 in Figure 11.14 contains the starting addresses of the four interrupt service routines. When the I/O device #3 is ready, it posts the code 1008 on the bus. This information can be used by the CPU to straightaway branch to the corresponding interrupt service routine stored starting at address 7300. The interrupt service routines may be relocated to some other areas of memory but the vector of addresses must reside in the same place, otherwise this method will not work correctly.

FIGURE 11.14 Vector of addresses.
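In software terms, a vector of addresses is simply an array of pointers to the service routines. The sketch below models Figure 11.14 with invented routine names; a real system would fill the vector with the actual routine addresses.

```c
#include <stdio.h>

/* Interrupt service routines for the four devices (invented). */
void isr_dev1(void) { puts("servicing device #1"); }
void isr_dev2(void) { puts("servicing device #2"); }
void isr_dev3(void) { puts("servicing device #3"); }
void isr_dev4(void) { puts("servicing device #4"); }

/* The vector of addresses: entry i holds the starting address */
/* of the service routine for device i (cf. Figure 11.14).     */
void (*vector[4])(void) = { isr_dev1, isr_dev2, isr_dev3, isr_dev4 };

int main(void) {
    int device = 2;        /* device #3 posted its code on the bus */
    vector[device]();      /* branch straight to its routine       */
    return 0;
}
```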

11.6.5 Multiple Interrupt Lines

So far, we have considered models of interrupt systems in which we had only one interrupt request line entering the CPU and one interrupt acknowledge line leaving the CPU, and we assumed that the INTR lines of the different devices are ORed together. We can extend this model to associate multiple pairs (INTR, INTA) with the CPU. We already pointed out that it is impractical to have a pair of INTR and INTA lines for each device. We thus group the I/O devices, and each group has a pair of INTR and INTA lines. For example, the I/O devices may be grouped into 4 classes and connected through 4 separate INTR lines. If there are many devices, the CPU has to assign priorities among multiple requests and select one of them at a time for service, since only one interrupt service routine can be executed at a time. Such a selection may be done using hardware devices called priority encoders or by software means with the help of an interrupt analyzer software.

In Figure 11.15 we have shown a block diagram of a priority encoder and bus grant. We have shown in this example four devices D1, D2, D3 and D4. The INT register has one bit reserved for each INTR line, which registers the interrupt requests from the various lines. In this scheme two or more simultaneous interrupt requests from different lines can be registered in the INT register. If a device wants to be serviced, it sets the bit corresponding to it in the interrupt register to 1. A mask register has 1 bit corresponding to each device. Depending on the priority to be assigned in a particular situation, the interrupt handling software will set the mask bits of devices not to be serviced to 0 and of those to be serviced to 1. The interrupt register bits are ANDed with the mask register bits and the result is input to a priority encoder. The priority encoder is designed to give as its output the address of the device to be placed on the address bus. The interrupt acknowledgement (INTA) from the CPU is used to capture the address bus and put the device address on it.

FIGURE 11.15 Priority encoder for multiple interrupts.
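The job of the priority encoder, namely to AND the interrupt register with the mask register and select the highest priority set bit, is easy to state in C. The sketch below assumes bit 0 (device D1) has the highest priority; the register values in `main` are invented.

```c
#include <stdio.h>
#include <stdint.h>

/* Returns the device number (1..4) to service, or 0 if none.  */
/* intr and mask model the INT and mask registers; here a mask */
/* bit of 1 means the device may be serviced, as in the text.  */
int priority_encode(uint8_t intr, uint8_t mask) {
    uint8_t pending = intr & mask;     /* AND of the two registers */
    for (int bit = 0; bit < 4; bit++)  /* bit 0 = highest priority */
        if (pending & (1u << bit))
            return bit + 1;
    return 0;
}

int main(void) {
    /* D2 and D4 both interrupt (0x0A); D2 wins on priority. */
    printf("service device %d\n", priority_encode(0x0A, 0x0F));
    return 0;
}
```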

11.6.6 VLSI Chip Interrupt Controller

Microprocessor families such as the 80x86 have as a support chip an external interrupt controller. One such controller is the 82C59A, which is used with the Intel 80386. The primary function of this chip is the management of interrupts. External devices are connected to this chip. A simple controller can control up to eight devices. Eight controllers may be connected to a master controller to allow control of up to 64 devices (see Figure 11.16).

FIGURE 11.16 VLSI chip interrupt controller.

The controller accepts interrupts from devices, determines which has the highest priority and forwards the Interrupt Request (INTR) signal from this device to the microprocessor on its INTR output. The processor accepts the interrupt (when it is ready) and acknowledges it via the INTA line. This signals the controller to place the device's address on the data bus (which

acts as the device address bus) of the 80386. The processor now proceeds to process the interrupt and communicates directly with the I/O interface unit to read or write data.

The 82C59A is a programmable chip. The priority scheme to be used is conveyed by the 80386 by sending information to a control word in the 82C59A. Three interrupt modes are available, namely:
1. Fully nested, in which the priority order is IR1 to IR8.
2. Rotating, in which after a device is serviced, it is put last in the queue for the next servicing.
3. Masked, in which a mask is used to alter priority by inhibiting interrupts of specified devices.

11.6.7 Programmable Peripheral Interface Unit

There are also VLSI chips available as I/O interface units. These are called Programmable Peripheral Interface Units (PPIU) and can be used for program controlled as well as interrupt controlled data transfer from/to devices to the processor. A block diagram of such a device is shown in Figure 11.17. This figure shows two devices, namely, a keyboard as input and a Video Display Unit as output, connected to the PPIU which is in turn connected to the microprocessor.

FIGURE 11.17 Application of programmable peripheral interface unit.

Observe that

in this example 8 data bits, each from the two devices, are sent to or received from the PPIU. The PPIU communicates with the microprocessor using the data/address bus and the necessary control bits. Appropriate control bits for the devices are also sent by the PPIU. Interrupt requests from the devices are forwarded to the interrupt controller, which forwards them to the microprocessor. We have highlighted only the main principles. For the detailed configuration of the chips the reader should refer to the data sheets of microprocessor manufacturers such as INTEL or AMD.

11.7 DMA BASED DATA TRANSFER

In both program controlled data transfer and interrupt controlled transfer of data, data is transferred to or from I/O devices to the memory via a CPU register. This is slow as data is transferred byte by byte. For the transfer of each byte, several instructions need to be executed by the CPU. Input or output of a block of data from a fast peripheral such as a disk is at high speed. Such devices operate synchronously using their own internal clock and are independent of the CPU. The combination of the high speed of such devices and their synchronous operation makes it impractical to use program controlled transfer or interrupt controlled transfer via the CPU to read or write data in such devices.

Another method, which uses a Direct Memory Access (DMA) interface, eliminates the need to use CPU registers to transfer data from I/O units to memory or vice versa. Of course, it is necessary for the DMA device to know from the CPU where in memory the data should be stored and how many bytes are to be transferred. Thus, the CPU is also connected to the DMA device as shown in Figure 11.18. We could consider DMA based I/O to be memory centred, that is, memory is the main entity for whose service both I/O and CPU compete.

FIGURE 11.18 Block diagram explaining DMA interface.

A DMA interface contains the following registers to facilitate direct data transfer from I/O devices to memory:
1. A Memory Address Register (MAR) which contains the address in memory to which data is to be transferred or from which data is to be received.
2. A Memory Buffer Register (MBR) which contains the byte to be sent to the memory or that received from memory.
3. A counter register which contains the count of the number of bytes to be sent to or received from memory.
4. A register containing the I/O command to be carried out. This command is received from the CPU and stored in the DMA to enable the DMA interface to carry out the command independent of the CPU.
5. A status register indicating DMA busy/free, data ready and other status information.

The configuration of the DMA interface and its connection to the CPU through buses is shown in Figure 11.18. The DMA interface functions as follows:

Step 1: When an I/O instruction is encountered by the CPU, it sends the device address and the I/O command to the DMA interface. The address in memory where a byte is to be stored or from where a byte is to be received is also sent to the DMA by the CPU. If several bytes are to be stored or retrieved, then the address of the first byte in the group and the count of the number of bytes are sent by the CPU to the DMA interface. All the information sent by the CPU to the DMA interface is stored locally in appropriate registers by the DMA. The DMA busy flip-flop is set. The DMA cannot be accessed by the CPU till this busy flip-flop is reset.
Step 2: After sending the above information to the DMA interface, the CPU continues with the next instruction in the program. CPU resources are not required hereafter.
Step 3: The DMA uses the device address to select the appropriate device.
Step 4: As soon as the IBR is full, the DMA transfers its contents to the MBR of the memory. The address is transferred to the MAR of the memory.
Step 5: After the byte is transferred, the counter in the DMA is decremented by 1. The DMA acquires the next byte in the IBR and performs Steps 4 and 5 until the byte count is zero.

We have assumed here that the memory has two independent ports, one connected to the CPU through a pair of buses and the other to the DMA. With a dual ported memory the DMA interface can transfer data to the memory as long as the CPU is not accessing the same address of memory. This allows complete independence of CPU data transfer and I/O data transfer.
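From the CPU's point of view, Steps 1 and 2 amount to loading the DMA registers and moving on. The following C sketch uses invented register names and an assumed register block address; it is a schematic of the sequence, not the programming model of any actual DMA chip.

```c
#include <stdint.h>

/* Hypothetical DMA interface registers (names invented). */
typedef struct {
    volatile uint32_t mar;      /* memory address register      */
    volatile uint32_t count;    /* byte counter                 */
    volatile uint8_t  command;  /* I/O command to carry out     */
    volatile uint8_t  device;   /* device address               */
    volatile uint8_t  busy;     /* DMA busy/free flip-flop      */
} dma_t;

#define DMA ((dma_t *)0x5000)   /* assumed register block address */
#define CMD_READ 0x01

/* Step 1: send device address, command, memory address and byte */
/* count to the DMA interface.  Step 2: the CPU then goes on with */
/* its program; the DMA interface performs Steps 3-5 on its own.  */
void start_dma_read(uint8_t dev, uint32_t mem_addr, uint32_t nbytes) {
    while (DMA->busy)           /* DMA cannot be accessed while busy */
        ;
    DMA->device  = dev;
    DMA->mar     = mem_addr;
    DMA->count   = nbytes;
    DMA->command = CMD_READ;    /* this also sets the busy flip-flop */
}
```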

In Step 4 we assumed that the main memory is dual ported. Dual-ported memories are, however, small in size. Main memories of computers are thus not dual ported. Thus, if the DMA does not have access to a separate port of memory, it shares the MAR and MBR with the CPU. The CPU and the DMA interface must then share the data and address buses, and only one of them can access memory at a time. In such cases, when the IBR in the DMA interface is filled and ready to be transferred to memory, the DMA sends an interrupt signal to the CPU. The CPU completes the execution of the instruction it is currently carrying out and yields the next memory cycle to the DMA. During this memory cycle the IBR is sent to the MBR of memory and the address to the MAR, and a byte is transferred to memory. After the byte is transferred, the counter in the DMA is decremented. This method of transfer of a word from I/O to memory, in which a memory cycle is taken away from the CPU by the DMA, is known as cycle stealing. This procedure is illustrated in the flow chart shown in Figure 11.19.

FIGURE 11.19 DMA interface procedure (for I/O read).

This method of transferring data byte by byte to memory by the DMA is appropriate for slow I/O devices such as a keyboard. If the data being transferred is from a fast I/O device such as a disk, a burst of bytes (e.g. the contents of a sector) is to be sent to the main memory. In this case the data and address buses are taken over by the DMA for several cycles till all the bytes are transferred.

In Figure 11.18 we showed two separate buses: one from the CPU to memory and the other from the DMA to memory. This was mainly done for ease of explanation. A more accurate block diagram of a DMA interface is shown in Figure 11.20. This figure shows a single pair of address and data buses being shared by the CPU and the DMA.

FIGURE 11.20 Details of CPU, memory, DMA, I/O interface unit interconnections.

The DMA controller has three control lines to the CPU: an interrupt line, a bus request (BR) line and a bus grant (BG) line. When the DMA wants to send data to memory, it requests the address and data buses from the CPU by sending a signal on the BR line. The CPU, after completing the current instruction, logically disconnects the buses from itself and sends a signal to the DMA on the BG line. The DMA now uses the buses to transfer a burst of data. At the end, it informs the CPU by sending a signal on the interrupt line. The CPU now gets back the buses. This method is illustrated in the flow chart of Figure 11.21. In this flow chart we have assumed that the I/O instruction is a “read” instruction. If it is a “write” instruction to a fast device, the starting address and the number of bytes to be transferred to the I/O device are sent by the CPU to the

DMA. The DMA now requests the CPU to yield the buses for reading from memory by sending a signal on the BR line. The CPU grants the buses by sending a signal on the BG line. Now the DMA reads data from memory and writes it on the fast I/O device.

FIGURE 11.21 DMA data transfer procedure from a fast I/O device.

Observe that this entire process of data transfer is performed by the hardware without any program intervention. In other words, the programmer does not have to worry about the detailed transfer of data. In computer literature such hardware procedures, which are hidden from or not visible to the programmer, are known as transparent to the programmer.

This method of data transfer between I/O devices and memory is non-program controlled. Before we conclude this section we will compare DMA data transfer with transfer of data via the CPU. The DMA transfer is faster and convenient to use, and the CPU does not unnecessarily waste time. This advantage is gained by providing extra hardware in the DMA and also by introducing extra information paths. The programmer must, however, give a read instruction sufficiently in advance so that the data is available in memory when needed for use in the application program.

11.8 INPUT/OUTPUT (I/O) PROCESSORS

So far we have assumed that each I/O device has a controller and is connected to an I/O interface unit. We have also seen that I/O interface units may be program controlled, interrupt controlled or may provide direct memory access (DMA). Nowadays computers have several I/O devices. For example, many computers have a keyboard, VDU, printer, hard disk, floppy disk, CD-ROMs, tapes, flash memory, audio I/O, scanner, video camera, etc. Very often it is more economical to consolidate the functions of several I/O interface units into one special unit called an I/O processor. An I/O processor is a full fledged special purpose processor with an instruction set optimized for reading from and writing in I/O devices. To distinguish I/O processor instructions from those of the CPU, these instructions are called commands. I/O processors are connected to the main memory of the computer just like a DMA interface. This allows them to perform I/O operations without loading the CPU. The CPU and the I/O processor work independently, communicating using the memory bus. A block diagram of a computer reorganized with an I/O processor is shown in Figure 11.22.

FIGURE 11.22 Block diagram of a computer system with an I/O processor.

The following gives the sequence of steps followed for I/O transfer using an I/O processor:

Step 1: The CPU tests if the I/O processor is available to read/write data. This information is available in an I/O processor status word stored in the main memory by the I/O processor.
Step 2: If the I/O processor is available, the CPU sends a request to the I/O processor for service. The starting address of the appropriate I/O program stored in memory is sent to the I/O processor.
Step 3: The CPU, having delegated the I/O tasks, continues its work.
Step 4: The I/O processor uses the I/O program to read/write data in the specified I/O device. It communicates with memory in DMA mode. At the end of the I/O, the I/O processor stores its status word in main memory and interrupts the CPU.
Step 5: The CPU checks the status word to confirm that the I/O task is over. If all is OK it continues with its processing.

Observe that the I/O processor and the CPU work concurrently, doing their own assigned tasks. They, however, compete for the main memory. Normally the I/O processor works with devices which are relatively slow. Therefore the CPU will not be significantly slowed down unless there are several high speed I/O devices handled by the system.

11.9 BUS STRUCTURE

In the chapter on the CPU we described two types of buses: an internal bus within a chip which connects the processor with the cache memory, and an external bus which connects the CPU chip with the external memory chips. Another external bus is one which connects I/O interface units with the CPU/memory. In this section we will describe the external buses in greater detail.

11.9.1 Structure of a Bus

A system bus usually has around 50 to 100 parallel wires which are normally printed on a printed circuit board, because component boards such as the CPU board, memory board and I/O processor board are plugged into the back plane to connect them to the bus. This is called a back plane system bus. The bus usually has data lines, address lines and control lines. Some buses may also have lines to distribute power.

Data lines are used to transmit data and the collection of these lines is also called a data bus. The width of a data bus is the number of parallel wires which carry data. Usually data bus widths are 8, 16, 32 or 64. Microprocessor word lengths have progressively increased from 8 to 16 to 32 and currently 64, and data bus widths normally keep up with this increase. The larger the bus width, the better will be the system performance.

Address lines carry the address of data in main memory and are collectively called an address bus. The bus width in this case depends on the direct addressability

of memory. With 16 address bits 64K words of main memory can be addressed, whereas with 32 bits 4G words can be addressed. Here again, with the increase in memory sizes, address bus widths have increased. Besides addressing main memory, these lines are also used to address I/O devices.

The control bus is used to control the access to and use of the data and address buses by the various units which share the bus. Because the bus is shared by several units, the unit intending to read or write must first acquire the bus before transmission can begin. The unit then becomes the bus master and remains the bus master until the control transfers to another unit. Besides this, the control lines carry command signals to specify operations such as memory write, memory read, I/O read, I/O write, transaction type, bus request, bus grant, interrupt request, interrupt acknowledge, data acknowledge and a clock signal to synchronize operations on the bus.

Apart from bus width and controls, there are three other characteristics of buses which are important. They are: the type of bus, the method of timing and the method of arbitration for bus use. We will explain them now.

11.9.2 Types of Bus

There are two major types of buses: the dedicated bus and the multiplexed bus. In a dedicated bus, the address and data lines are two sets of independent lines, whereas in a multiplexed bus the same set of lines is time shared. For a specified period they are used to transmit an address and for another specified period they carry data. The primary advantage of multiplexing is that the bus is cheaper as it uses fewer lines. Dedicated buses, on the other hand, are faster. Earlier microprocessor families used multiplexing whereas current high performance processors use dedicated buses.

11.9.3 Bus Transaction Type

A bus transaction can be a read or a write. The terms read and write are used for transactions between I/O and memory: a write will transfer data from memory to the I/O device, and a read from the I/O device to memory. The bus transaction has two distinct parts: (i) send the address to identify the receiver and (ii) transmit the data. Some buses also permit block transfer of data. In this case the first transaction gives the start address k of the n data words to be stored. Subsequently n words are transmitted in a burst and stored starting from the address k. This is useful for transferring data from/to fast I/O devices.

11.9.4 Timings of Bus Transactions

Another characteristic of buses is the presence or absence of a clock. The operation of a synchronous bus is based on a clock which is an integral part of the bus. A bus in which transactions are controlled by a clock is called a synchronous bus; else it is an asynchronous bus. The bus protocol is based on, or timed by, the number

11.9.4 Timings of Bus Transactions

Another characteristic of buses is the presence or absence of a clock. A bus in which transactions are controlled by a clock is called a synchronous bus; otherwise it is an asynchronous bus. The operation of a synchronous bus is based on a clock which is an integral part of the bus. The bus protocol is based on, or timed by, the number of clock pulses between events. Such protocols based on counting clock pulses can be easily implemented in a small finite state machine and can be fast.

Synchronous transfer

Let us consider a sequence of events in a synchronous data transfer. For convenience, we will assume the address lines and data lines to be different. First the bus master acquires control of the bus. In read mode the following four events take place in the order given, and the sequence repeats thereafter cyclically (see Figure 11.23):

t0: The address of the device to be read is posted on the address lines of the bus by the bus master, and the read/write control line is set to either read or write mode. This information is kept on the address lines throughout the bus cycle.
t1: This event refers to the data being placed on the data lines by the addressed device.
t2: The signals on the data lines have settled and the data is ready for strobing (or reading) by the master.
t3: The cycle is completed. At the end of the cycle the address lines are reset.

Transfer of one unit of data (normally an 8-bit byte or a 32-bit word) takes place in each bus cycle. The above sequence of events is represented in the timing diagram in Figure 11.23. The events t0, t1 and t3 are recognized in this diagram by the rising edge or the falling edge of the bus clock-pulse. The event t2 occurs during the OFF period of the clock.

[FIGURE 11.23 Events represented as a timing diagram (synchronous data transfer—input).]

Data must be posted by the device on the data lines after t1 and removed at t2. The OFF period of the clock should be so selected that it is long enough for the device to put the data on the bus, for the data to propagate to the receiver, and for it to be read by the receiver. If a wide variety of devices with differing speeds is used, the period will be selected to meet the needs of the slowest device. This choice should not be unusually large, otherwise the frequency of data transfer will be reduced.

When the length of a bus increases, clock skewing becomes high. Thus long buses are designed without clocks. Such buses are called asynchronous buses. They use a hand-shaking protocol rather than a clock for correct operation.

Asynchronous transfer

Consider two finite state machines communicating with each other in such a way that one does not proceed until it knows that the other has reached a certain state. This handshaking is achieved by following a well-defined protocol in the execution of both the machines. As an example, let us consider the CPU and the I/O as two machines and let us say that the CPU wants to read a data unit (byte or word) from the I/O. In Figure 11.24 we have shown the sequence of 7 events marked 1 to 7 that take place during an asynchronous read. The two vertical lines denote the time as it progresses in the two machines. A directed horizontal line denotes the occurrence of an event in one machine and its communication to the other machine. Note that a zero slope horizontal line implies zero propagation delay. The events (Figure 11.24) are:

1. The CPU, being the bus master, sends the address of the device on the address lines.
2. After a delay D1 it sends the READY signal for accepting data.
3. The I/O unit receives the READY signal, notifies its acceptance of the data transfer and transmits data.
4. The CPU reads the data.
5. The READY signal is removed to indicate that the data has been read.
6. The CPU de-asserts the address lines indicating the end of the cycle.
7. The I/O acknowledges the receipt of the signal by de-asserting the ACCEPT and DATA lines.

(A dotted line arrow means "trigger the event at the head of the arrow". The triggered event is simply a state change, low-to-high or high-to-low.)

[FIGURE 11.24 Event sequence diagram showing the handshake protocol for read (D1 to D5 are delays, 1 to 7 are events).]
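The seven events can also be viewed as a two-flag handshake. The C sketch below is a simplified software model of the protocol (one data unit, busy-waiting, the delays D1 to D5 not modelled); the two functions stand for the two machines running concurrently, and the variable names are illustrative.

    #include <stdint.h>

    /* Shared "lines" between the two machines. */
    volatile int     ready;     /* asserted by the CPU (bus master)   */
    volatile int     accept;    /* asserted by the I/O device (slave) */
    volatile uint8_t data_bus;  /* models the data lines              */

    uint8_t cpu_read(void)      /* the CPU side of events 1 to 7 */
    {
        uint8_t d;
        ready = 1;              /* events 1-2: address posted, READY asserted */
        while (!accept)
            ;                   /* wait for event 3 */
        d = data_bus;           /* event 4: CPU strobes the data */
        ready = 0;              /* event 5: READY removed */
        while (accept)
            ;                   /* wait for event 7 */
        return d;               /* event 6 (address release) omitted here */
    }

    void io_respond(uint8_t d)  /* the I/O side */
    {
        while (!ready)
            ;                   /* wait for READY */
        data_bus = d;
        accept = 1;             /* event 3: accept and post data */
        while (ready)
            ;                   /* wait for event 5 */
        accept = 0;             /* event 7: de-assert ACCEPT and data */
    }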

In practice it is unrealistic to assume zero delay. Delays occur due to bus skew, interface circuits, etc. In Figure 11.25 we have taken the delays into account. The events and delays are shown in the form of a timing diagram. The events are represented by the rising or falling edges of the timing pulses. D1 to D5 are time intervals which have minimum requirements for correct operation. The 'read data' strobing occurs during the time interval D3, as shown in the last line of the figure.

[FIGURE 11.25 Timing diagrams for asynchronous read.]

11.9.5 Bus Arbitration

The term bus arbitration refers to deciding which device gets the bus mastership next. The set of bus control lines contains a BUS REQUEST line and a BUS GRANT line. Several devices may simultaneously request bus mastership. This situation necessitates an arbitration mechanism. The arbitration may be achieved by a special piece of hardware or by using a bus arbitration protocol.

The arbitration techniques can be divided into three broad categories:
1. Based on central bus-arbiter hardware.
2. Daisy chain based (the bus grant is daisy chained).
3. Distributed arbitration scheme based on a protocol.

In the case of the centralized-arbiter scheme, the bus controller is the sole arbiter. In order to deal with the request and grant lines, we need a bus controller hardware unit. All the bus-request lines from the various devices will be connected to this controller. It receives the bus requests, perhaps through a set of multiple and parallel request lines. Based on some priority, it determines who gets the bus next and gives the bus mastership to that device. We have described the daisy chain concept in an earlier section. The same concept is applicable here with respect to giving the BUS-GRANT line to one of the 'daisy chain' contenders. In both these methods the reliability of the bus system depends on the reliability of the central unit; if the central unit (arbiter or bus controller) fails, the bus operation fails.

The distributed arbitration scheme can be divided into two categories, one based on pre-assigned priorities and the other on collision detection. In the first case, each device that is capable of becoming a bus master is assigned a priority code. For example, in a group of 4 devices, the highest priority device may be assigned the code 1000 and the next successive priority devices the codes 0100, 0010 and 0001 respectively. Lower priority codes will be lower in binary value. When a device requests the bus-mastership, it posts its code onto a set of bus-arbitration lines that are shared by all the devices. If there are multiple codes posted, the design is such that the code values will be ORed. After placing its code on the arbitration lines, the device compares the bit values on the lines with its own code in order to decide if any other device with higher priority is requesting the bus. If yes, the device will back out and request the bus at a later time.

The distributed arbitration scheme based on collision detection is commonly known as CSMA/CD (CS—carrier sense, MA—multiple access, CD—collision detection). This is the protocol used by Ethernet, a Local Area Network. We will discuss this scheme in detail in Section 11.12.
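The pre-assigned priority scheme is easy to illustrate in software. In the following C sketch the shared arbitration lines are modelled as the bit-wise OR of all posted one-hot priority codes, and a device backs out when it sees a code higher than its own; the function names are illustrative, not part of any bus standard.

    #include <stdint.h>

    /* OR together the codes posted by all requesting devices,
       as the shared (wired-OR) arbitration lines would. */
    uint8_t arbitration_lines(const uint8_t codes[], int n)
    {
        uint8_t lines = 0;
        for (int i = 0; i < n; i++)
            lines |= codes[i];
        return lines;
    }

    /* A device wins only if no bit higher than its own one-hot code
       is set on the lines.  For codes 1000, 0100, 0010 and 0001 this
       selects the highest-priority requester; all others back out. */
    int device_wins(uint8_t my_code, uint8_t lines)
    {
        uint8_t higher_bits = (uint8_t)~(my_code | (my_code - 1));
        return (lines & higher_bits) == 0;
    }

For example, if devices with codes 0100 and 0001 both request, the lines carry 0101; device_wins() returns true for 0100 and false for 0001, which will retry later.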

The choice or design of a bus will depend on several of the following factors:
1. The number of devices connected to the bus, their speed or bandwidth, and the range of their speeds.
2. Required bus bandwidth.
3. Desired bus protocol (synchronous—simple, relatively slow, but accommodates a low range of speeds; asynchronous—allows a wide range and higher speeds, but needs to support handshaking).
4. Desired bus arbitration scheme (centralized or distributed, or a combination of both).
5. Desired bus organization (bus width or the number of address lines, data lines, control lines; single bus master or more).
6. Cost (bus cost must be commensurate with the cost of the total computer system).

11.10 SOME STANDARD BUSES

Over the years many standards for buses have been proposed and used. As the technology of processors, memory and I/O devices has evolved, so have buses. Intel designed a bus called the PCI bus (Peripheral Component Interconnect bus) for Pentium-based systems. To proliferate the use of this bus they developed several VLSI chips to support several configurations of processors, memory and peripherals. They also developed chips to connect this bus to older buses and systems. Further, they made the specifications of the bus public knowledge and made them available to peripheral developers and chip designers. These decisions made the PCI bus popular and currently it has become a standard used by several manufacturers.

The PCI bus standard allows 32 or 64 bits for data or address lines. These lines are multiplexed for data/address. The standard specifies a synchronous-clocked bus with a clock speed of up to 66 MHz. Thus the raw data transfer rate is 66 × 64 × 10^6 = 4.224 Gbps. The other major characteristics of the PCI bus are:
1. Multiplexed data and address
2. Centralized arbitration
3. Transaction types allowed: read, write, burst transfer
4. Error detection with a parity bit

The detail of the number and purpose of the PCI bus lines is given in Table 11.1.

TABLE 11.1 PCI Bus Lines and their Purpose

Purpose of Lines                              Number of Lines       Remarks
System (clock and reset)                      2
Data and Address                              64 (32 mandatory,     Multiplexed
                                              64 optional)
Interface control (control and timing         6                     Coordination; 2 extra for
  of transactions)                                                  64-bit transfer requests
Arbitration (bus request and grant lines)     4                     Centralized arbitration
Error reporting (parity and system error)     2                     2 optional for 64-bit data
Multiplexing commands                         8                     1 bit/byte indicating which
                                                                    bytes carry meaningful data
Interrupt commands                            4                     Used by PCI devices that
                                                                    must request for service
Cache support                                 2                     To support memory on PCI
                                                                    which can be cached
Testing bus                                   4                     Optional—uses IEEE
                                                                    Standard 1149.1

The standard has 49 mandatory lines, primarily for supporting 32-bit data and address, and 50 optional lines to support the 64-bit address and data version.

In Figure 11.26 we show a typical configuration of a desktop system using a PCI bus. Observe that the system shown is a two-bus system. The CPU-main memory bus is a proprietary high speed bus to support fast CPUs and memory. As CPU speeds increase, this bus can be improved. This bus is connected to the PCI bus by an integrated circuit called the PCI bridge. High speed devices such as a graphics board and high speed disks with SCSI controllers are directly connected to the PCI bus.

[FIGURE 11.26 Bus architecture (using PCI bus) of Intel Pentium based desktop computers.]

The ISA (Industry Standard Architecture) bus was the previous standard used by personal computer manufacturers. Another integrated circuit called the ISA bridge connects devices supported on the ISA bus to the PCI bus. Thus the ISA bus is useful to permit connecting older peripherals to the new PCI bus.

Another innovation is an adapter to connect devices using the Universal Serial Bus (USB). USB is a standard evolved by an industry group to allow connecting lower speed I/O devices such as mouse, keyboard, flash memory, etc., to the PC. We now briefly explain the USB interface. The USB standard is designed to allow the following:

1. Users to connect a new device with a standard socket while the computer is working. This is known as hot pluggable.
2. Recognition of a newly plugged device and allocation of an address to it.
3. Up to 127 devices to be connected.
4. Support of real time operation of devices such as a telephone.
5. The I/O devices on USB to get their power from the computer using one of the USB lines.

The USB cable has four wires: two for data, one for power (+5 V) and one for ground. The USB cable hub is connected to the PCI bus and a serial port is brought to the back of the PC cabinet. The USB plug is standardized for easy plug and play. A device connected to the PC via USB in turn has some USB ports. This allows more devices to be daisy chained.

11.11 SERIAL DATA COMMUNICATION

The widespread use of personal computers at homes and desktop computers in offices has created two important requirements: (1) the need to share resources (data, disk storage, high quality printing facilities) and access them from remote sites and (2) the need for computer networking. Computer networks connecting people located within a radius of a few kilometres are known as Local Area Networks or simply LANs. Computers all around the globe can be connected by means of Wide Area Networks or WANs. Such networks are used for several end-user applications. Some examples are: electronic mail, electronic file transfers, electronic bulletin boards, remote log-in to use remotely located computing facilities, computer-supported collaborative work among geographically distributed workers, etc.

WANs make use of public telecommunication lines for the transfer of digital data from one place to another. Historically, public telecommunication lines were designed primarily to carry voice in analog form. But modern telecommunication lines are digital and carry both digital data and analog voice (in digitized form, if necessary). Digital data can be carried over the analog transmission lines by converting the 1's and 0's transmitted by computers to analog form, and the analog signals received from the telecommunication lines to 1's and 0's for use by computers. The former process is known as modulation (while transmitting) and the latter as demodulation (while receiving). A physical device which achieves these two processes is known as a modem.

Data transmission in LANs and WANs is serial, as it is rather expensive to run parallel lines over long distances. The I/O interface of a computer system transmits or receives 8 bits or 32 bits at a time (see the PCI bus for examples). Thus, it is necessary to group 8 bits arriving serially and present them as parallel inputs to the computer, and to serialize the output of the computer for transmission. Specialized integrated circuit chips are available to perform these serial-to-parallel and parallel-to-serial conversions.

Figure 11. strips them. The buffer register usually stores a byte. As soon as the buffer is full an indicator is set to inform the processor that the data is ready. The receiver part receives a series of bits serially at a predetermined rate called the bit rate (bits per second). The chip marketed by INTEL Corporation is called UART chip and Motorola has a chip which it calls asynchronous communication interface adapter (ACIA). In the output mode the interface should accept 8 bits in parallel from the CPU. then the information bits are stored in a buffer register.27 An asynchronous serial communication bits. We will first describe the nature of an asynchronous serial data communication.2 Asynchronous Communication Interface Adapter (ACIA) Many I/O units such as keyboards accept and send data only serially. As seen from this figure the chip has two logical parts.366 Computer Organization and Architecture speed). Start bit b0 b1 b2 8 bits b3 b4 1 frame b5 b6 b7 Parity bit Start bit FIGURE 11. The data bus of a microprocessor has a width of 8 or 16 bits and receives and transmits data of 8 or 16 bits in parallel. an interface circuit is needed which would receive serial data from serial devices. assemble groups of 8 or 16 bits and send them in parallel to the CPU. A clock at a frequency equal to the bit rate synchronizes the received bits. If parity fails. 11. and checks whether the parity of the received group of bits is correct. The time-interval between successive bytes is arbitrary. Thus. Thus.1 Asynchronous Serial Data Communication In asynchronous transmission there is no clock to time bits.28 gives the functional characteristics of such a chip. If parity is correct.27 we show an asynchronous data frame. for each frame of 8 bits there must be a special start bit to signal its start and a stop bit to indicate the end.11. The receiver recognizes the start and stop bits sent by the input peripheral unit. The processor can interrogate the indicator and then receive the 8-bit data via the data bus. 11. Chips for this serial to parallel and parallel to serial conversion and synchronization are marketed by microprocessor manufacturers. store the bits in a register and send them serially to the I/O device at a rate determined by the I/O device. The receiver must know the configuration of the frame and the speed at which bits are being transmitted.11. In the following paragraphs we describe an IC-chip known as Asynchronous Communication Interface Adapter (ACIA) and the concepts involved in the operation of a modem. asynchronous mode is used for serial communication. an error indicator is set. the receiver part and the transmitter part. . In Figure 11. A parity bit (either odd or even) may be appended as a ninth bit as an optional bit.

[FIGURE 11.28 Functional diagram of asynchronous communication interface adapter (ACIA).]

The transmitter part of the chip performs the following functions:
1. Receives 8 bits in parallel from the processor and stores them in a buffer register.
2. Prefixes a start bit to the group of 8 bits, generates and appends a parity bit, and follows this with a stop bit.
3. Shifts the bits out of the buffer at a preassigned clock rate. These bits are sent out serially via an output port. When the buffer has been emptied and serialized, an indicator is set to inform the processor that the buffer is empty.

When the processor wants to send out a word, it reads the status register (e.g., by polling) to check whether the transmit buffer register is empty. If it is empty, the processor sends the 8 bits to the interface adapter and continues with its program. The interface adapter serializes and sends the data on the output line. If the buffer is full, the processor must wait, either by looping or by jumping to another part of the program and returning when the transmit buffer is empty.

Besides this, the chip can be programmed for different clock speeds and different byte sizes. We have not given in detail all the error controls performed by the chip (which are extensive). The reader is urged to consult such details in the manufacturers' data sheets.
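In C, the polling sequence described above looks roughly as follows. STATUS_REG, DATA_REG and TX_EMPTY are hypothetical names; the actual register addresses and bit positions must be taken from the manufacturer's data sheet.

    #include <stdint.h>

    #define TX_EMPTY 0x02u   /* hypothetical "transmit buffer empty" status bit */

    /* Hypothetical memory-mapped ACIA registers. */
    volatile uint8_t * const STATUS_REG = (uint8_t *)0x0000E000;
    volatile uint8_t * const DATA_REG   = (uint8_t *)0x0000E001;

    void send_byte(uint8_t b)
    {
        while (!(*STATUS_REG & TX_EMPTY))
            ;                    /* loop (poll) until the buffer is empty */
        *DATA_REG = b;           /* the adapter serializes and transmits  */
    }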

Figure 11. 1070 Hz for a 0 and 1270 Hz for a 1 and sent on the telephone line. 3p/4.6 Kbps on ordinary voice grade telephone lines. With the method of using only two frequencies the speed of the modem will be limited as the bandwidth available on telephone lines is limited. 3. As we need communication in both directions. In this method the phase of a sine wave is shifted by p/4. Thus the terminal side of the link is called the originate end and the computer end of the link is called the answer end. the frequencies for 0 and 1 at originating end are chosen as 1070 Hz and 1270 Hz respectively. 10 of the input bit string. Data is to be sent in both directions simultaneously.368 Computer Organization and Architecture bits are modulated to signals at two frequencies. With this coding the bit rate is doubled.30. 01. At the receiving end these frequencies have to be converted back to 0 and 1. 4. 5p/4 and 7p/4 and each phase is used to represent 00. Flexibility to be used either at the originating or at the answering end. In recently manufactured modems (2006) a standard known as V92. A user normally logs on the terminal and requests service from the computer. 2.29 Use of modems with telephone lines. To enable such a communication. A block diagram of a modem is given in Figure 11. 11. redundant bits are used to correct errors in data transmission together with data compression to obtain speeds of 56. namely. This is done by a demodulator. Controls to allow automatic answering and disconnecting. Thus a method called phase modulation is used. With 8 phase to represent 3 bits in the input. Modulation of 0 and 1 to two audio frequencies and demodulation from two audio frequencies to 0 and 1. . From the answer end the 0 and 1 are transmitted by signals of frequencies 2025 Hz and 2250 Hz. Two sets of frequencies for the transmitting and the receiving ends to enable full duplex communication.29 shows a link between a computer and a terminal. from the computer to the I/O device and vice versa each end of the telephone line needs a modulator and a demodulator. 5. As modems are extensively used special IC chips are marketed to replace a large number of discrete circuit elements. Devices to modulate 0 and 1 to sine waves and demodulate the received sine waves to 0 and 1 are known as modems. This is known as full duplex communication. A C I A M O D E M M O D E M A C I A From or to computer Telephone line From or to computer FIGURE 11. the speed is tripled. These chips have the following functional characteristics: 1.

As modems are extensively used, special IC chips are marketed to replace a large number of discrete circuit elements. A block diagram of a modem is given in Figure 11.30. These chips have the following functional characteristics:
1. Modulation of 0 and 1 to two audio frequencies, and demodulation from two audio frequencies back to 0 and 1.
2. Two sets of frequencies for the transmitting and the receiving ends to enable full duplex communication.
3. Flexibility to be used either at the originating or at the answering end.
4. Controls to allow automatic answering and disconnecting.

[FIGURE 11.30 Block diagram of a MODEM.]

11.12 LOCAL AREA NETWORKS

When computers located within a small geographical area such as an office or a university campus (within a radius of 10 km) are connected together, we call it a Local Area Network (LAN). There are four important aspects which characterize different types of LANs. They are as follows:
1. The topology of the network, i.e., the way in which the individual computers are interconnected.
2. The type of transmission medium used to interconnect the individual computers.
3. The protocol used by a computer connected to the LAN to access the physical medium. This is called Media Access Control (MAC).
4. How smaller LANs called subnetworks are interconnected to form a network encompassing the whole organization, which may have thousands of computers.

11.12.1 Ethernet Local Area Network—Bus Topology

Ethernet is a standard developed by Digital Equipment Corporation, Xerox and Intel for interconnecting computers within a small geographic area. This was later refined and standardized as IEEE standard 802.3. There are two major topologies of connecting computers using Ethernet protocols: a bus structure (Figure 11.31) and a star interconnection (Figure 11.35). We will first describe the bus topology.

The original standard was for the interconnection of computers using a bus. Ethernet has advanced much beyond this earlier standard; many versions have appeared and we will discuss them later in this section. The physical layer was specified as a shielded coaxial cable supporting a data rate of 10 Mbps. The data link layer defines the control of access to the network (called media access control or MAC for short) and how data packets are transmitted between computers connected to the network.

Referring to Figure 11.31 we observe that all computers are connected to the bus and each computer communicates with the bus via a Network Interface Unit (NIU), which is a combined transmitter and receiver. Each computer delegates to the NIU the task of sending/receiving packets as a set of unmodulated bits. Transmission of bits on the cable without modulation is known as base band transmission. Modulation is not necessary as the maximum length of the cable is small.

[FIGURE 11.31 An Ethernet LAN.]

Exchange of data between NIUs proceeds as per the following protocol. When a NIU wants to send data, its receiver listens to the bus to find out whether any signal is being transmitted on the bus. This is called Carrier Sense (CS). If no signal is detected, it transmits a data packet. As the bus is accessible to all NIUs connected to it, another NIU could also find no signal on the bus at that instant and try to transmit a packet. If both NIUs transmit a packet on the bus, these packets will collide and both packets will be spoiled. A collision is detected if the energy level of the signal in the bus suddenly increases. Thus the receiver part of the transceiver of a NIU must listen to the bus for a minimum period T to see if any collision has occurred. The period T is the time which the packet will take to reach the farthest NIU on the bus and return back to the sender.
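As a rough illustrative calculation (the figures are assumptions, not parameters of the standard): if the cable is 500 metres long and signals propagate at about 2 × 10^8 m/s, one-way propagation takes 2.5 microseconds, so the round-trip period T is about 5 microseconds; at 10 Mbps the sender must therefore keep transmitting, and listening, for roughly 50 bit times before it can be sure that no collision has occurred.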

Once a collision is detected, the NIU which detected the collision sends a jamming signal which is sensed by all other NIUs on the bus so that they do not try to transmit any packet. The NIU also stops transmitting, waits for a random time and retransmits the packet. As it waited for a random time, the probability of another collision is low. If there is again a collision, it waits for double the previous random period and transmits. By experiment and analysis it is found that this method is quite effective and collisionless transmission will take place soon. This method of accessing the bus and transmitting packets is known as Carrier Sense Multiple Access with Collision Detection (CSMA/CD). It is called Multiple Access as any of the NIUs can try to send a packet on the bus or receive a packet from the bus. The protocol is explained as a flow chart in Figure 11.32.

[FIGURE 11.32 CSMA/CD protocol used in Ethernet.]

Due to the simplicity of the Ethernet protocol and its flexibility, it is the most widely used Local Area Network standard. The Ethernet protocol is implemented as part of an Ethernet chip. Thus all computer manufacturers have provided a built-in Ethernet chip to allow the computer to be connected to a LAN.
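The retransmission policy just described is a binary exponential backoff. A C sketch of a sender's loop follows; carrier_sense(), transmit_and_detect_collision() and wait_random() are hypothetical helper functions standing in for the transceiver hardware.

    int  carrier_sense(void);                  /* 1 if the bus is busy (assumed)   */
    int  transmit_and_detect_collision(void);  /* 1 if a collision occurs (assumed)*/
    void wait_random(int max_slots);           /* wait a random time (assumed)     */

    void csma_cd_send(void)
    {
        int max_slots = 1;
        for (;;) {
            while (carrier_sense())
                ;                              /* CS: wait until the bus is idle   */
            if (!transmit_and_detect_collision())
                return;                        /* packet sent without collision    */
            /* CD: on a collision, jam, then wait a random time; the range
               doubles after every collision, so repeated collisions become
               increasingly unlikely. */
            wait_random(max_slots);
            max_slots *= 2;
        }
    }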

The format of a packet consists of some bits for clock synchronization (the preamble), followed by the address of the sender, the address of the receiver, the data packet and check bits (Figure 11.33). The length of the data packet is between 46 and 1500 bytes. The length is based on the length of the bus and the number of NIUs connected to the bus. A packet sent by an NIU is monitored while it is in transit by all other NIUs on the bus, and the NIU to which it is addressed receives the packet and stores it. Other NIUs ignore it. Ethernet is a broadcast medium. It is possible to broadcast a packet to all NIUs. A packet can also be multicast, that is, sent to a subset of NIUs. Currently Ethernet is one of the most popular Local Area Network protocols used, as it is well proven, standardized and supported by all vendors of computers.

[FIGURE 11.33 The format of a frame or packet in Ethernet LAN: preamble 64 bits, destination address 48 bits, source address 48 bits, frame type 16 bits, data packet 368 to 12000 bits, check bits 32 bits.]

Ethernet may be extended using a hardware unit called a repeater. A repeater reshapes and amplifies the signal and relays it from one Ethernet segment to another. Each Ethernet segment is usually limited to 500 metres. No two computers can have more than two repeaters between them if they have to communicate reliably. Use of repeaters is an inexpensive way of interconnecting Ethernets. The main disadvantage of repeaters is that they repeat any noise in the system and are prone to failure, as they require a separate power supply and are active elements, unlike a cable which is passive.

A typical use of repeaters in a building is shown in Figure 11.34. A backbone cable runs vertically up the building. A repeater is used to attach the Ethernet segment running in each floor to the backbone.

[FIGURE 11.34 Use of repeaters to extend Ethernet connection.]

Transmission media

Recently, several options have emerged for the physical layer of Ethernet. The first standard, using a coaxial cable, is called 10 Base 5 Ethernet. The number 10 stands for Mbps, BASE indicates base band transmission and 5 stands for a coaxial cable with 50 ohm impedance. A cheaper version is called 10 Base 2, where the coaxial cable is thinner and cheaper. It is also known as thin-wire Ethernet. This Ethernet supports fewer computers over a shorter distance compared to the 10 Base 5 standard (see Table 11.2).

TABLE 11.2 Physical Wiring of Ethernet

Type of Wiring                                IEEE Standard   Maximum Cable      Topology
                                                              Length (metres)
Shielded coaxial cable RG-8u (Thicknet)       10 Base 5       500                Bus
Shielded coaxial cable RG-8u (Thinnet)        10 Base 2       185                Bus
Unshielded twisted pair (telephone cable)     10 Base T       100                Star with hub
Unshielded cat 3 twisted pair                 100 Base T      100                Star with hub

11.12.2 Ethernet Using Star Topology

A star topology for interconnecting computers, which logically follows the same principle for media access as a bus-based Ethernet, has now become the preferred method for LANs. This is called 10 Base T. The physical transmission medium used is an unshielded twisted pair of copper wires normally used in telephone networks. The topology is shown in Figure 11.35. In this topology each node (i.e., a computer with an NIU) is connected to a central hub using a twisted pair of wires.

[FIGURE 11.35 Ethernet using unshielded twisted pair of wires and hub.]

The hub has electronic circuits which receive the signal from each twisted pair connected to it, amplify and reshape it, and broadcast it to all the other connections. The hub detects collisions and sends this information to all the nodes connected to it. The protocol is the same as Ethernet's, namely CSMA/CD. Each hub can normally handle up to 16 nodes. (This number is increasing with improvements in technology.) The distance between a node and the hub must be less than 100 metres. If the capacity of a hub is exhausted, more hubs may be used, as shown in Figure 11.35(b).

The main advantage of this type of connection compared to a cable connection is higher reliability and ease of troubleshooting. Unlike a cable connection, where a fault in the cable affects all nodes and troubleshooting is time consuming, in a hub connection a failed node can be isolated and repaired while the other nodes work. Adding new computers to the LAN is easy. Hub-based wiring is thus much more flexible compared to cable wiring. Another advantage is that most buildings are wired with twisted pair telephone lines and it is easy to connect them to a hub. Currently 100 Base T Local Area Networks (100 Mbps) using CAT3 unshielded twisted pairs (UTP), and gigabit Ethernet LANs using fibre optic cables, are available.

11.12.3 Wireless LAN

The use of wireless media to transmit digital information started in the late 1960s with the ALOHA Project at the University of Hawaii. The motivation was the requirement of communication between computers located in a number of scattered islands of Hawaii. The ALOHA network which was set up used an early version of the CSMA protocol. Early systems used narrowband technology in which a low power carrier signal was modulated by digital data using amplitude modulation. The power had to be low to avoid interference with other transmitters. The system was thus affected by noise, and special coding methods and error detection/correction algorithms were implemented.

The situation changed dramatically in the 1990s with the emergence of portable computers and better wireless technology. Executives moving around with their laptops wanted to be connected to the computers in their organizations to look at their emails, retrieve information from databases and also send email. Wireless technology also improved, leading to the widespread use of cellular telephones. Thus, the cellular radio technology used by telephones has been adopted to communicate between mobile laptops and stationary local networks.

In order to communicate using radio waves between a mobile computer and a fixed LAN, the mobile computer should have a transceiver (a combination of a wireless transmitter and receiver) and the LAN must have a base station with a transceiver to transmit and receive data from the mobile computer (see Figure 11.36). The transmitter uses a frequency in the so-called unlicensed band (2.4 GHz) which is not used for commercial radio and other purposes. It was necessary for the transmitter power to be low to avoid interference with other systems using this frequency. This technology currently provides a peak bandwidth in the range of 1 to 11 Mbps.

To reduce error while maintaining low power transmission, a newer method called spread spectrum is now being used. In this method the input signal is transmitted (at a low power level) over a broad range of frequencies. This spreading of frequencies reduces the probability of jamming of the signal and makes it difficult for unauthorized persons to acquire and interpret the data. Thus it is more reliable and secure.

[FIGURE 11.36 Wireless communication with LAN. (C1, C2: Fixed Computers; BS: Base Station; MC: Mobile Computer)]

There are three standards for wireless connection which are currently prevalent. They are IEEE 802.11b, which works in the 2.4 GHz band and gives a data rate of up to 11 Mbps; IEEE 802.11a, which works at 5.4 GHz and gives data rates of up to 54 Mbps; and IEEE 802.11g, which is just now emerging and works at 2.4 GHz with a data rate of 54 Mbps. Both the 2.4 GHz and 5.4 GHz bands are so-called 'free' or unlicensed wireless bands in the USA. Only 2.4 GHz is currently a free band in Europe, Japan and India. As equipment at 5.4 GHz is expensive, IEEE 802.11b is more popular. The IEEE 802.11g standard is now becoming popular as it gives a higher speed of up to 54 Mbps.

Normally a wireless transmitter/receiver at a fixed spot (called a wireless hotspot) is connected to a wired backbone (such as an Ethernet LAN). This hotspot is accessed from a mobile computer. In 2003, Intel announced a new processor called Centrino, which has the Pentium IV architecture with a built-in wireless interface using the IEEE 802.11b standard. Thus, a mobile laptop with a Centrino chip can access hotspots maintained by ISPs at airports, hotels and many public places. This is the most common use as of now.

11.12.4 Client-Server Computing Using LAN

We described I/O processor organization in Section 11.8. That organization is primarily used in mainframe computers where the CPU is very powerful and expensive. I/O systems such as high speed laser printers, large RAID systems, etc., cost more than 10 times the cost of a desktop processor. With the proliferation of powerful desktop computers and the rapid reduction of the cost of microprocessors, the scenario has changed. Thus, we now have an additional method of catering to I/O requests. This method of performing I/O consists of one or more client computers connected to a LAN, to which several server computers are also connected (Figure 11.37). Clients request services, such as printing and access to large files, which are provided by servers.

Servers are full-fledged computers and are programmed to cater to the requests coming from several clients. They receive requests, queue them, and carry them out. For example, a print request will be accompanied by a file to be printed, which will be transmitted by the LAN, received by the server and queued along with other requests. A specialized print program stored in the server will print the file. Observe that a client, having delegated the print job, proceeds with the rest of its computing.

[FIGURE 11.37 Client-server computing.]

SUMMARY

1. I/O systems organization deals with methods of connecting I/O devices to the CPU-memory system of a computer.
2. The major challenge in I/O organization arises due to the fact that there is a large variety of I/O devices which vary in their speeds and data formatting.
3. The speed of data transfer from an I/O device is at least 1000 times slower than that of the CPU. Thus I/O transfer methods are required which reduce the impact of this speed mismatch on a computer's performance.
4. To provide uniform I/O access, each I/O device has a controller which is device dependent. This is connected to an I/O interface unit which has a uniform logical structure independent of the devices' specific characteristics.
5. There are three methods used to transfer data to/from I/O devices to the main memory. They are: (i) program controlled data transfer, (ii) interrupt controlled data transfer and (iii) DMA based data transfer.
6. In program controlled data transfer the CPU, after sending an I/O request, is idle till the I/O transfer begins, at which stage it transfers data to or from the main memory. This method wastes CPU time.

7. In interrupt controlled I/O the CPU issues the command to the I/O interface unit and continues with the program being executed. When the I/O is ready, the I/O interface unit sends an interrupt signal to the CPU informing it that the I/O is ready to transmit/receive data. The CPU suspends its current activity and attends to the I/O using an I/O procedure which sends/receives data to/from main memory. CPU idle time is thus reduced or eliminated.
8. In DMA based I/O, a special unit called a Direct Memory Access controller is used as an intermediary between the I/O interface unit and the main memory. The CPU, after issuing the I/O command, proceeds with its activity. When the I/O data is ready, the DMA controller requests the CPU to yield the data and address buses to it. The DMA controller then directly sends/receives data to/from main memory. The CPU has no role to play in the data transfer other than yielding the buses.
9. As computers have several I/O devices, several devices may need to transact I/O simultaneously. Thus they should be attended to one by one based on their priority. Either software or hardware methods are used to assign priorities to devices and give them access to the main memory.
10. Some computers have separate I/O processors to which the CPU delegates I/O. I/O processors have a specialized instruction set to perform efficient I/O. They work in parallel with the main CPU and perform I/O using DMA mode.
11. A set of parallel wires which carries a group of bits in parallel and has an associated control scheme is called a bus. The main units of a computer, namely CPU, memory and I/O, are interconnected with buses. The lengths of buses are limited to tens of centimetres because the number of parallel lines in a bus is large and their capacitance increases with length, affecting their speed. To carry out transactions on buses, specially designed timing signals are needed, which are discussed in detail in the text.
12. A bus called the Peripheral Component Interconnect (PCI) bus has been standardized by industry for the higher speed Pentium class of computers. In this class of computers an internal bus connects the CPU to main memory. The PCI bus is attached to this bus using a special IC chip. All I/O interface units of fast peripherals are connected to the PCI bus.
13. To allow older I/O devices (used with Intel 80486, etc.) to be used, the PCI bus is connected with a special chip to the ISA bus used by the older I/O devices. Another bus called the Universal Serial Bus (USB) is connected to the PCI bus. USB is standardized for use with slow and cheaper I/O devices such as keyboard and mouse.
14. To connect a computer to a terminal or another computer a few metres away we use standard telephone lines (such as a twisted pair of wires) which carry bits serially. An IC chip called the Asynchronous Communication Interface Adapter (ACIA) is designed to take a byte (8 bits) appearing in parallel and convert it to serial data, and vice versa.

15. When serial digital data is to be transmitted using the telephone lines provided by a Public Switched Telephone Network (PSTN), we need a device called a modem, because PSTNs are designed to transmit analog telephone conversation, whose frequency is in the band 50 Hz to 3500 Hz, and not digital data. A modem converts 1's and 0's to two distinct sinusoidal signals of frequencies in the range 1000 Hz to 2000 Hz.
16. To increase the speed of transmission via modems, serial data is grouped, 3 bits per group. Each group is represented by a single sine wave with a phase shift of π/8. This is called Phase Shift Keying. PSK signals can be transmitted at a higher speed.
17. Computers located within a small geographical area may be connected to constitute a Local Area Network (LAN). Ethernet is a very popular LAN. The protocol used is the Ethernet protocol. Speeds of 10 to 100 Mbps are supported by Ethernet.
18. Ethernet uses a multidrop coaxial cable or unshielded twisted pairs of wires to interconnect computers. Ethernet uses a protocol called CSMA/CD (Carrier Sense Multiple Access with Collision Detection) to communicate between any two computers connected to it.
19. Earlier Ethernets used coaxial cable to interconnect machines. With coaxial cables it is difficult to add new computers to the network and also to troubleshoot if there is any problem in the LAN. Nowadays individual computers are connected using unshielded twisted pairs of wires to a hub as a star connection. The hub is an electronic circuit which helps in implementing the Ethernet protocol in such a LAN.
20. Communication between a mobile computer (such as a laptop) and computers on a LAN is established by using wireless communication. Wireless transceivers are added to the mobile computer, and a wireless transceiver is connected to the LAN as a base station. Mobile machines establish communication with computers on the LAN via the base station. IEEE standard 802.11b with a data rate of 11 Mbps and IEEE 802.11g with a data rate of 54 Mbps are currently prevalent. The 2.4 GHz wireless band is used.

EXERCISES

1. (i) Add instructions to SMAC++ to enable it to use program controlled transfer to transfer a word from an I/O buffer register to a specified location in memory.
   (ii) Use these instructions in an illustrative machine language program.
2. Make a comparative chart comparing the hardware and software features of programmed data transfer, interrupt based data transfer and DMA data transfer.

3. Distinguish between traps and external interrupts. What are the major differences in handling these in a computer system?
4. Enumerate the programming steps necessary in order to check when a device interrupts the computer while it is serving the previous interrupt from the same device.
5. Draw a comparative chart showing the differences between the five methods of servicing interrupts discussed in Section 11.6.
6. Draw a priority logic and encoder for an interrupt with six interrupt sources.
7. A DMA interface receives 8-bit bytes from a peripheral, packs them in a 48-bit word and stores the word in memory. The computer is byte addressable. Draw a block diagram of the DMA registers and obtain a sequence of micro-operations of the DMA.
8. Give the steps which would be taken by an I/O processor that byte multiplexes one high speed device, which can transfer 4 bytes in parallel in 500 ns (once every 0.5 microseconds), with 10 low speed devices that can each transfer 1 byte in 5 ms (once every 100 ms).
9. A disk sector with 512 bytes is to be sent to the main memory starting at address 258.
   (i) What information is to be sent by the CPU to the DMA controller when it initiates a read command?
   (ii) Draw a flow chart explaining how the sector is stored in the main memory.
10. For a Pentium V based desktop PC find out the following:
   (i) Peripheral devices attached to the computer.
   (ii) How is each peripheral device logically connected to the computer?
   (iii) How is the information represented and coded in the peripheral device?
   (iv) What is the speed of data transfer from the peripheral device to the computer's memory?
11. In a LAN used at your institution find out the following:
   (i) How many computers are connected to the LAN?
   (ii) What is the distance between the farthest computers connected to the LAN?
   (iii) How is each computer logically connected to the LAN? Is it a hub? If it is a hub, how many computers are connected to a hub?
   (iv) If any modems are used, what are their characteristics?
12. Observe that in Figure 11.20, giving the interconnections between the DMA controller, CPU and Memory, the R and W control lines are bi-directional.
   (i) Why are bi-directional lines used?
   (ii) When does the read signal go to the DMA and when does it go to Memory?
   (iii) When does the DMA controller send the write signal to Memory?

13. In Figure 11.13 we have shown daisy chaining of devices which interrupt a CPU. Assume that D2 interrupts the CPU at t1 and the INTA signal comes at t2 > t1. If D1 interrupts before INTA is initiated by the CPU, which device will be attended? If D1 interrupts after INTA reaches D2, which device will be attended by the CPU?
14. Assume a computer uses software to assign priority to devices. Explain how interrupt servicing programs assign priorities. Draw a flow chart to explain priority interrupt servicing.
15. A message ABRACADABRA is sent from a keyboard to be stored in main memory. Which method of data transfer will be appropriate and why? If it is interrupt controlled, how many times will the CPU be interrupted?
16. Is DMA given higher priority to access the main memory of a computer compared to the CPU? If so, why?
17. A DMA device transmits data at the rate of 16 KB/s, a byte at a time, to the main memory by cycle stealing. The CPU speed is 100 MIPS (million instructions per second). By how much is the CPU slowed down due to DMA?
18. In exercise 17 assume interrupt controlled data transfer is used. An interrupt processing program needs 50 machine instructions. Estimate by how much the CPU is slowed down due to data input.

12
ADVANCED PROCESSOR ARCHITECTURES

LEARNING OBJECTIVES

In this chapter we will learn:

- How the architecture of processors evolved from the sixties to the present.
- The motivation in designing Complex Instruction Set and Reduced Instruction Set (CISC and RISC) Computers.
- How pipelined processors improve the speed of processing instructions.
- The factors which upset pipelined execution, and hardware as well as software methods of alleviating them.
- Instruction level parallelism available in programs and how it is exploited by superscalar processors.
- The factors which govern the performance of processors and how they change with time.

12.1 INTRODUCTION

So far in this book we examined the design of a simple CPU and saw how the instruction set is evolved. In this chapter we will explain the evolution of CPU architectures over the decades and the general principles which govern their design. We will examine the way in which Complex Instruction Set and Reduced Instruction Set computers came into being and the main motivation which led to the development of Reduced Instruction Set Computers (RISC).

While the speed of individual processors has been doubling, the complexity of problems being solved using computers has been increasing at a faster pace. To solve these complex problems, computer designers have been developing systems which exploit the parallelism inherent in the problems to be solved. Parallelism exists both at the single instruction level and at the level of a group of instructions or a program. In this chapter we will mainly discuss parallelism which is exploited at the instruction level. Parallelism is of two types: temporal parallelism and data parallelism. In temporal parallelism the attempt is to overlap in time the phases of execution of a group of instructions. In data parallelism, groups of instructions are carried out on different data sets simultaneously.

Besides this, there are many computing systems being designed and marketed which use several independent computers interconnected to form what are known as parallel computers. These machines are programmed to cooperatively solve a single problem in a short time. We are also witnessing the evolution of new processors with multiple independent computing elements integrated within a single chip. These computing elements cooperate and solve problems faster than a single computing element in a chip. We will examine how to design such parallel computers in the next chapter.

12.2 GENERAL PRINCIPLES GOVERNING THE DESIGN OF PROCESSOR ARCHITECTURE

Over the decades a set of design principles has remained invariant and is likely to remain so in the future. In this section we will review these general principles.

12.2.1 Main Determinants in Designing Processor Architecture

Three major factors have driven the design of processors over the last four decades. They are:
- Application requirements
- Programming convenience
- Prevailing technology

We will now briefly review how these have been changing over the years.

Application requirements

Processor design has always been driven by the requirements imposed by the major application domains. In the sixties the major application was scientific computing, which was dominated by the need for fast arithmetic speeds and floating point calculations. A number of innovations in the design of arithmetic units, particularly floating point units, emerged. A major requirement in scientific and engineering computation is computation with vectors and matrices.

Towards the mid sixties, applications of computers in business data processing emerged as a big potential market. This market, which earlier had mechanical punched card equipment, was dominated by the International Business Machines Corporation (IBM). They saw computers as electronic equivalents of the older accounting machines. The requirements of data processing were somewhat different from scientific computing. Character oriented processing was important. For accounting applications, arithmetic with rounding (as in floating point) was not acceptable, and decimal arithmetic with BCD numbers was considered more appropriate. As I/O needs were dominant compared to CPU needs, these computers paid special attention to this area. Consequently, smaller data processing machines such as the IBM 1401 dominated the scene. Since memory was a scarce resource, byte-oriented storage with variable length data and instructions was popular with CPU architects. CPU architects during this decade tried to design "general purpose machines" which would work effectively for both scientific and business computing. A number of ideas such as index registers, indirect addressing and base registers emerged during that period. The IBM 360 and 370 families were dominant in the market.

The 1970s saw rapid growth in the use of computers in industries, besides large scale scientific computing. During this decade a new application area was emerging, namely, online real-time control of complex industrial plants and processes. These applications required analog to digital (A/D) and digital to analog (D/A) conversion. Further, such computers had to be low cost, work in the plant environment, and often be integrated with the plant as controllers. This led to the emergence of mini-computers, dominated by the pioneering effort of Digital Equipment Corporation's PDP series of machines. The PDP series introduced many innovations in CPU architecture, such as displacement addressing and variable length operation codes. They also pioneered the development of time-shared machines. Time sharing came from the demand for quick debugging of application programs.

The 1980s was the decade of the microprocessors, which led to the development of the IBM PC. Personal computers, as their name implies, were designed for use by individuals. Once a low cost Personal Computer (PC) was introduced, plenty of applications appropriate for individuals emerged. Applications were driven not only by individuals' requirements such as word processing, spread sheets and computer games, but also by the need to share objects like data files and programs. Applications demanded better human interfaces to the operating systems, leading to the emergence of graphical user interfaces or GUI. The processor architecture was dominated by Intel with their successful x86 series processors.

The 1990s was dominated by improved PCs and the emergence of distributed computing. Large databases emerged with the growth in the size of disk storage. Towards the later part, there was a significant change from numeric and character oriented applications to applications using audio, graphics and video data.

In the current decade of 2000, we have several applications requiring multimedia, along with new applications in which the use of wireless and mobility are becoming important. Thus this decade will not only see the convergence of computers, communications and entertainment, but will also demand processors whose computing power is high and power consumption is low. Besides this, applications require very fast processing coupled with the need to address huge amounts of memory, of the order of giga and tera bytes.

Programming convenience

Ease of use of computers has always been paramount from the point of view of computer designers. This led to the development of assembly language in the 50s and high level languages in the 60s. High level languages continued to improve during the succeeding decades, and we saw FORTRAN followed by COBOL, BASIC, Pascal, PL/I, Ada, C, C++ and Java, to highlight some of the trends. Languages have evolved to meet several criteria, such as ease of expressing algorithms, ease of debugging, efficiency, and portability and maintainability of programs written using the language. Another major requirement of a good language is the possibility of reusing programming modules.

In early designs, the CPU was built first and then a language was written and adapted to the CPU architecture. However, with improved knowledge of compiler design, it became apparent that the efficiency of application programs can be enhanced if there is close interaction between the CPU architect and an expert on compilers. This trend started in the mid 80s and progressively compiler expertise has become more and more important. Thus all modern CPU architecture teams have as an important member a compiler expert, besides a logic designer and an expert on VLSI/solid-state technology.

Prevailing technology

The prevailing technology, particularly the number of logic gates which can be fabricated on a chip, has always been an important influence in CPU architecture design. Over the years the level of integration in integrated circuits has progressively increased, almost doubling every 18 months. Further, the size of the silicon wafer on which a single system can be integrated has also been going up. These two trends together have led to VLSI circuits with over 100 million transistors today. The CPU designer has the interesting task of deciding how to effectively use the silicon area. The main challenge has been to take decisions regarding word length, how many registers to have, on-chip cache size, whether to have multiple processors on a chip, and numerous similar questions. Currently we have emerging systems with multiple processors on a chip and a doubling of word length to 64 bits. Of late, issues of maintaining low power, to conserve battery power in mobile computers, and of reducing wire delays in a chip have become important.

To summarize, the main determinants in designing a processor architecture are the application requirements, programming convenience, the available technology, the constraints of CPU architecture and picking an appropriate instruction set.

12.2.2 General Principles

There are five important principles which have guided the design of processors over the last four decades[25]. They are:
- Upward compatibility
- Locality of reference of instructions and data
- Parallelism in applications
- Amdahl's law
- 20%–80% law

Upward compatibility

It has been realized over the years that the design of good application software is time consuming and expensive. Thus it is essential to conserve the investments made in software development when an organization changes or upgrades a computer system. In many organizations the source code may not be available for the applications they routinely use. Thus it may be necessary to run object code as is when a computer is changed. In fact, this used to be one of the most important considerations while planning the replacement of computers with newer models. This led to the idea of a 'family' of CPU architectures. The most important examples of this are the IBM 360 and 370 families of mainframe computers and the Intel 80x86 series of microprocessors. The x in the microprocessor family has progressively increased up to 5, with the common trait that an object program which ran on the 80186 can be executed without change on the 80286, 80386, …, 80586 (Pentium), the later generations of processors in this series. In fact, the success of early versions of Intel's 8086 architecture and the availability of a large number of application programs on it has constrained Intel to stick to this general CPU architecture, even though its shortcomings became evident with improvements in compiler design and semiconductor technology. Even other architects have been forced to ensure 80x86 compatibility in their CPUs, either by hardware emulation or by a recent technique called "code morphing"[32].

Locality of reference of instructions and data

A cornerstone of Von Neumann architecture is storing a sequence of instructions in the order in which they are executed. Sequential thinking seems to be natural for humans, and sequential programming has always been the dominant programming style. A number of data structures such as vectors, sequential files and strings are also linear and sequential. In other words, if one data item in the structure is referred to, it is most likely that its neighbour will be referred to next. Over the four decades of evolution of CPU architecture, locality of reference has remained an important determinant in design. This fact has been very effectively used in all the generations of CPU architectures to reduce the execution time of programs and to expand addressable memory. The use of cache memory, the interleaving of main memory and the evolution of virtual memory are a direct consequence of the locality of reference.

Parallelism in applications

There is inherent parallelism in many application programs. With increasing silicon area in a CPU chip, exploiting parallelism has become an important design goal, and a CPU designer has to keep this aspect in mind while proposing innovations. A simple example is computing an arithmetic expression such as a × b – c/d + f × g. In this expression the two multiplications and the one division can be executed simultaneously, provided we have a sufficient number of arithmetic units. This is called instruction level parallelism. Other forms of parallelism which occur are temporal parallelism and data parallelism. Temporal parallelism, that is, the possibility of overlapping the execution of successive instructions, has been an important method of speeding up the execution of instructions by pipelining. Data parallelism occurs when the same instruction can be executed simultaneously on several data items. Typical examples are simultaneously processing the pixels in a picture and simultaneously processing all the components of a matrix.

Amdahl's law

Amdahl's law states the simple fact that systems consist of many sub-systems, and in order to increase the speed of the system it is necessary to uniformly speed up all the sub-systems. Otherwise the slowest sub-system will slow down the entire system. To give a simple day-to-day example, air travel between Bangalore and Delhi takes two and a half hours, but to get to the airport from one's residence in Bangalore takes an hour, and it takes another hour to go from the airport to the place of work in Delhi. The road journey time cannot be easily reduced; in fact it is becoming longer with the increasing number of vehicles on the road. Even if we decrease the air travel time, say to half an hour, the reduction in overall time will not be proportionate. Thus there is no point in allocating huge resources to reduce the air travel time without simultaneously reducing the time taken for road travel. In order to decrease the overall time for solving an application, it is essential to uniformly increase the speed of all parts of the computing system.

20%–80% law

This law states that in a large number of applications 20% of the code takes 80% of the execution time. Thus it is advisable to focus on how to reduce the time taken to execute this 20% of the code. In other words, the total effort expended in optimization can be considerably reduced if this 20% is identified and attention is paid to it.

12.2.3 Modern Methodology of Design

Over the years the methodology of design of processors has been evolving. It has now been realized that we need to measure the performance of the systems we build. The measurement of the performance of a computer is not simple, as it is a very complex system and the performance depends on the type of use of the computer.

Moreover, the CPU of a computer is not its only part. There are three major sub-systems in a computer: the CPU, the memory and the I/O. We have to devise performance measures to evaluate each of them individually and also when they are integrated into a computer system.

CPU performance: The CPU's role is to execute instructions. The number of instructions in the instruction set varies from one computer to another. Let there be n instructions in the instruction set of a computer. Among the possibly many attributes of an instruction, the following are of interest to us:

1. Execution time, denoted Ti for the ith instruction and measured in microseconds. If Ti is data dependent or variable, we take its average.
2. Instruction length, denoted Li for the ith instruction and measured in bytes.
3. Frequency of occurrence: the static frequency of occurrence of an instruction is denoted Si and the dynamic frequency Di. Static frequency is computed when the program is loaded in memory but not executed; dynamic frequency is computed during program execution. Loops and iterations in programs make Si differ from Di.

In order to estimate Si and Di we need a collection of representative programs called benchmark programs. Benchmark programs differ from application to application. Scientific applications, business data processing applications, multimedia applications and real time applications are four typical categories.

A vector of frequencies F1, F2, …, Fn, where Fi is the frequency of occurrence of the ith instruction, is used in the evaluation of the CPU. This is known as the instruction mix. Fi could be derived from the Si and Di of benchmark programs or obtained synthetically by other means. An instruction mix can also be derived from the actual workload of a computer by direct measurement. However, when a computer system has not yet been designed or installed, it is customary to use standard instruction mixes. Typical instruction mixes, obtained from the work loads of a large number of representative programs in different areas of application, are shown in Table 12.1.

The mean execution time T of an instruction mix is defined as follows:

T = (Σi=1..n Ti × Fi) / (Σi=1..n Fi) microseconds

The value of T can be used to measure the computational power of a CPU. The reciprocal of T is expressed as so many million instructions per second, or simply as MIPS.
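As a small illustrative sketch in C (the instruction classes, times and frequencies below are invented for illustration and are not taken from Table 12.1), the computation of T and the resulting MIPS rating can be expressed as follows:

#include <stdio.h>

/* Mean execution time of an instruction mix:
 *   T = (sum of Ti * Fi) / (sum of Fi)   microseconds
 * MIPS rating = 1 / T (with T in microseconds).            */
double mean_execution_time(const double t[], const double f[], int n) {
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; i++) {
        num += t[i] * f[i];
        den += f[i];
    }
    return num / den;
}

int main(void) {
    /* Hypothetical mix: execution times (microseconds) and
     * relative frequencies of four instruction classes.      */
    double t[] = {0.2, 0.3, 0.5, 1.0};
    double f[] = {45.0, 30.0, 20.0, 5.0};
    double T = mean_execution_time(t, f, 4);
    printf("T = %.3f us, rating = %.2f MIPS\n", T, 1.0 / T);
    return 0;
}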

TABLE 12.1 Some Instruction Mixes. (The table gives the percentage frequencies of the instruction classes load/store, branches, integer add/subtract, compare, floating add/subtract, floating multiply, floating divide, integer multiply, integer divide, shifts, miscellaneous and indexing for three standard mixes: the Gibson mix, the Flynn mix (all floating point) and DEC's PDP-11 mix.)

Floating point arithmetic operations are of special interest in scientific data processing. Further, these operations are much slower than other instructions and require more hardware resources. As a result it has become a common practice to exclude them from the calculation of MIPS and to separately express floating point power as so many millions of floating point operations per second, or megaflops.

The concept of an instruction mix can be extended from the domain of machine languages to that of procedure oriented languages. It should then be appropriately called a statement mix. A statement mix obtained by Knuth[33] and by Patterson[33] is shown in Table 12.2. The dynamic frequencies obtained from a set of representative programs in FORTRAN, Pascal and C are shown in this table.

TABLE 12.2 Relative Dynamic Frequency of HLL Statements

Reference      Knuth 71     Patt. 82       Patt. 82
Language       FORTRAN      PASCAL         C
Work Load      Student      Application    System Programming
Assignment        67           45             38
Loop               3            5              3
Call               3           15             12
if                11           29             43
goto               9            —              3
Other              7            6              1


Recall that an instruction execution involves the following five steps:

Step 1: Instruction fetch (memory access)
Step 2: Instruction decode (encoding complexity) and effective address calculation (addressing modes used)
Step 3: Operand fetch (memory access)
Step 4: Perform the specified operation (ALU speed)
Step 5: Store the result in memory

An instruction's execution time depends on all these steps. Thus CPU performance also depends upon the memory speed and the ALU speed.

Memory system performance: The performance of a memory system can be measured using the following parameters:

1. Memory cycle time: If a cache is used, the effective cycle time should be used to judge the performance. The number of levels of caches also affects performance.
2. Memory bandwidth (measured in bytes/second): In certain cases, each memory access brings n bytes (n = 4 or 8).
3. Presence of memory access parallelism through interleaving. Its impact could be included in the memory bandwidth measure.
4. Memory capacity (measured in MB): The existence of virtual memory can be noted as a remark.

I/O system performance: An I/O system can be evaluated at two levels: the individual device level, or the I/O sub-system level. A disk device, for example, has certain parameters with which its performance can be measured: disk rotation speed measured in revolutions per minute, the disk arm's lateral movement speed, recording density in bits per inch, length of a track, number of tracks per surface and number of surfaces. Each device has its own characteristics, and its performance at the individual device level can be measured through such a set of attributes (a rough device-level estimate for a disk is sketched after the list below). However, an I/O device is not used in isolation. In Chapter 11 we have studied I/O organization. To study an I/O device's performance as a sub-system, we have to consider the following five aspects:

(a) Device characteristics.
(b) Device interface and its characteristics.
(c) I/O organization (bus width, interrupt structure).
(d) I/O software (part of the operating system).
(e) File system (buffers, file organization, indexes).
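As a hedged device-level illustration (this decomposition is a standard first-order model and the symbols are ours, not a formula from this chapter), the average time to service a disk request can be estimated from the parameters listed above as:

t(service) ≈ t(seek) + (1/2) × (60/rpm) + (bytes requested)/(transfer rate)

Here the middle term is the average rotational latency, half a revolution at the given rotation speed, and the transfer rate itself follows from the recording density, track length and rotation speed.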

Specialized evaluation programs have been developed which evaluate an I/O sub-system as a whole. The unit of measurement is the number of transactions executed per second.


12.2.4 Overall Performance of a Computer System

In order to evaluate the performance of a computer system as a whole, we need representative programs. There are several tough questions in this context:

• Who develops these representative programs?
• Who decides what are really representative of users' jobs?
• Who runs them and measures the performance?
• How to interpret the vector of numbers obtained by running these programs?

There are no good answers to these questions. But the more one knows about the kind of applications the computer will be used for, the better the performance comparisons one can make between two computers. In the past, people have used kernels and synthetic benchmarks to evaluate the performance of computers. The current trend is to use real application programs, and we will focus our study on this.

Kernels: They are key pieces selected from real programs. Linpack, consisting of a set of linear algebra subroutine packages, is one example. It is used mainly for evaluating computers used for scientific calculations. The Livermore loops, which consist of a series of 21 small loops, are another example of a kernel used to evaluate high performance computers used for scientific applications.

Synthetic benchmarks: They are not real programs but are synthesized to match an average execution profile. Whetstone and Dhrystone are two popular synthetic benchmarks. System evaluators do not use them any more.

The SPEC benchmarks

A group called the Systems Performance Evaluation Cooperative (SPEC) was formed in 1988. Several computer companies became members of SPEC. Its objective was to "establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high performance computers." These programs are collectively referred to as the SPEC benchmark suite and are periodically revised. In 1992 the SPEC benchmark suite was modified (and called SPEC92) to provide separate sets of programs to measure integer and floating point performance. Currently these numbers are quoted by every computer manufacturer, and often a manufacturer tries to design the hardware and software to obtain a high SPEC mark.

Transaction processing benchmarks

SPEC marks are appropriate for evaluating the CPU power of programs run in batch mode; they do not consider the needs of interactive users, such as reservation clerks in a booking office working online with computers and databases to reserve tickets for customers. In this case many clerks share a computer and database, and all require fast service. Each interactive request of a clerk, together with its response, can be considered a transaction. For such applications there are transaction processing benchmarks.


Two benchmarks known as TPC-A and TPC-B are commonly used by the industry, and a new benchmark, TPC-C, is emerging. These benchmarks are evolved by the Transaction Processing Council (TPC), representing 40 computer and database vendors. The performance criteria for a given TPC rating are specified by the council. The choice of database engine is left to the vendor conducting the test. Thus the results can vary based on the choice of software tested and how well the software is tuned for the machine. The benchmarks are also periodically revised. In this chapter we are primarily concerned with the design of CPUs, so we will not discuss this in greater detail.

The availability of powerful computers has facilitated the design of new processors using simulation of the proposed architecture. The empirical data available from benchmarks is used as input data in the simulation experiments. Various levels of simulation, beginning with the functional level and going up to the gate level, are carried out before the fabrication of a chip.

Having looked at the general design principles which have been used over the years in designing CPUs, we will now briefly review the history of the development of CPUs during the past four decades.

12.3 HISTORY OF EVOLUTION OF CPUs

We will trace the evolution by looking at the most significant innovations which occurred in each decade, starting with the 60s and going up to the 90s. We have picked only the innovations which have made a lasting impact on the design of CPUs.

Decade of 60s

Up to the early 1960s, separate families of computers were developed for scientific computing and for business data processing. It was, however, realized that such compartmentalization was not appropriate, and a general purpose computer which could be used for a number of different purposes was then preferred by organizations. As computers became more powerful, programs became more complicated and expenses on software development increased. Thus organizations were reluctant to change or upgrade their computers unless programs developed for earlier models could be executed on newer models without change. The upward compatibility of object code has remained an important requirement even today. This has remained invariant over five decades! It led to the concept of a family of computers. For instance, one of the most popular computers of the 60s was the IBM 360 series. Starting with the IBM 360/30 there was a progression of models up to model 360/67, all of which had a common architecture. Thus object code of the 360/30 could be executed without change on all higher models. In order to implement more instructions in the higher models while keeping the architecture the same, the idea of microprogramming was implemented in this series. Even though the idea of a microprogrammed control unit for CPUs was proposed by Wilkes in the 50s, the first commercial implementation of this idea was in the IBM 360 series. Microprograms could also be used to emulate the instructions of, say, machine A on another model B.


Besides the family idea and microprogramming, the other major innovation during this decade was the introduction of I/O processors to work in parallel with the CPU of the computer. I/O speeds are about 1000 times lower than CPU speeds; thus CPUs wait for I/O. To alleviate this problem, I/O processors with their own memories, which perform I/O and buffer the data in their own memories, were introduced. The data already read and cleaned up/formatted was then forwarded directly to the CPU's memory without making the expensive CPU wait. This idea, called Direct Memory Access (DMA), was also an interesting innovation of this decade. To summarize, the decade of the 60s saw the following innovations:

• The concept of a unified architecture for a series of computers.
• Use of microprogramming to design the control unit of a commercial computer family.
• I/O processors and the idea of DMA.

Decade of 70s

The decade of the 1970s may be called the 'decade of minicomputers'. Towards the end of the 60s, IBM was dominating the computer market with over 85% of the world market. There were competitors such as CDC who made high performance computers appropriate for scientists and engineers. Towards the later part of the decade, Cray introduced supercomputers. All these machines were characterized by their very high cost, which forced scientists to use batch mode for computing and delayed the solving of problems. Meanwhile, transistor technology had matured and component costs were rapidly coming down. Microprocessors were also being designed, and they became commercially available in the later part of the 70s. Digital Equipment Corporation saw the opportunity of designing low cost computers for use in individual laboratories, and also the possible use of low cost computers in process control. They introduced the PDP range of minicomputers. At that time memory was expensive and it was necessary to reduce its use. Thus, instead of having a uniform instruction length, half word instructions, 1½ word instructions, etc., were introduced depending on the number of operands used. High level languages had more or less replaced assembly language in application programs. Thus attempts were made to add instructions to the instruction sets of computers so that each high level language statement could have a one-instruction machine language equivalent. This increased the number of machine instructions to over 300 in some computers. The large number of instructions, coupled with the non-uniformity of instruction structure, made the control units of processors very complicated. This type of CPU architecture is nowadays known as Complex Instruction Set Computers (CISC). This was, however, necessary with the then prevailing technology. Magnetic core memory was replaced by semiconductor memories during this decade. In spite of this, main memories were about 10 times slower than the CPU. Thus the CPU often waited to get operands from the main memory. This led to the development of a small fast memory to act as a buffer between the CPU and the main memory. This was called the cache memory.

Advanced Processor Architectures

393

In CPU architecture, the idea of pipelining instructions was also emerging and was being used in the high-end expensive machines made by CDC, Cray, IBM, etc. To summarize, this decade saw the following important developments in architecture:

• A large variety of instructions.
• Variable length instructions with complex structure.
• Pipelining of instruction processing.
• Introduction of cache memory.
• Virtual memory, to enable large programs to be stored on disk while allowing a user to view the combination of main memory and disk as a one-level addressable memory.
• Beginnings of time shared operating systems to allow multiple users to use a computer interactively.

Decade of 80s

This decade may be characterized as the decade of the micros and of Reduced Instruction Set Computer (RISC) processors. By the 80s the levels of integration in VLSI chips had reached a level where entire processors could be integrated on one chip. This led to Intel's 80x86 series of processors. IBM saw a business opportunity and designed the first desktop personal computer (PC). The advent of the PC was a revolution in computing, and it changed forever the way computers were used. The relatively low cost of PCs made them accessible to a large number of users, and novel applications such as word processing and spreadsheets made them almost indispensable equipment in offices.

On the CPU design front there was another revolution brewing. Whereas Intel had continued with their 80x86 architecture, which was quite complex, a group at the University of California, Berkeley, began a serious empirical study on the systematic design of CPUs[33]. Their main goal was to design a CPU using empirical data on the frequency of occurrence of various instructions (in a set of commonly used high level language programs written in C) to optimize the selection of the instruction set. They used the fact that only 20% of the instructions in a CPU's instruction set are used 80% of the time, so it is better to concentrate on implementing these efficiently. They were thus able to reduce the instruction set drastically. Their other goals were to have a single chip implementation of the processor, to simplify the structure of instructions and to reduce CPU to memory traffic. These goals led to an architecture known as the RISC (Reduced Instruction Set Computer) architecture. RISC emphasized the importance of quantitative data in designing CPUs, and this was a landmark in the evolution of CPU design. Starting in the 80s there has been considerable emphasis on gathering and using quantitative empirical data in CPU design. As the RISC idea is an important landmark in processor evolution, we will devote the next section to discussing it. To conclude, the important innovations which happened in the 80s were:


• The emergence of the IBM PC architecture.
• Serious use of empirical data in CPU design.
• VLSI improvements leading to a CPU on a chip.
• Emergence of the RISC processor architecture.
• Aggressive use of pipelining in CPUs, which exploited temporal parallelism.

Decade of 90s

By the early 90s, levels of integration in VLSI chips had reached over a million transistors, leading to the interesting question of how to use the available chip area. At the same time the science of compiler design had advanced considerably. The availability of inexpensive powerful processors acted as an impetus to application developers. Two innovations in applications were the use of multimedia, namely audio, graphics and video, and good graphical user interfaces. These developments in applications changed the requirements of processors, leading to a change in their design. One attempt was made by Intel, who introduced the MMX coprocessor for graphics. Another route was to enhance the CPU instruction set. Yet a third method was to design special purpose processors for media processing. The earliest such processors were for audio processing and became popular as DSPs (Digital Signal Processors). With the growth of multimedia applications, more innovative processors were being designed towards the end of the decade.

Another important development was the emergence of computer networks and the Internet. This led to the development of network processors to interconnect computers and, in recent years, to the provision of extra facilities in the CPU to facilitate networking and to allow wireless communication with other computers in the network.

As we pointed out in the beginning, compiler design had made significant advances. The available chip area having increased, it was possible to fabricate multiple arithmetic processing units on a chip. Using better compilers it was possible to create object code which could use the multiple arithmetic units in a processor simultaneously in one clock cycle. In other words, it was possible for the CPU to carry out multiple instructions in one clock cycle, increasing the speed of processing. This is called superscalar processing.

The other major advance which took place was the emergence of inexpensive parallel computers using multiple microprocessors. The monopoly of the vector processors which had dominated supercomputer architecture was broken, and speeds similar to those of Cray supercomputers could be obtained by a set of powerful microcomputers working in parallel. The cost of such machines was a fraction of the cost of machines manufactured by Cray, CDC, etc. This led to the demise of traditional Cray-like supercomputers; Cray, CDC, Convex and ETA Systems closed shop. To summarize, the major landmarks of this decade were:

• Emergence of superscalar and superpipelined CPUs.
• Emergence of DSPs and, towards the later years of the decade, multimedia processors.


• Emergence of parallel computers using a large number of microprocessors as a high performance computing alternative to traditional vector supercomputers.
• Emergence of local and wide area networks to interconnect computers.

12.4 RISC PROCESSORS

We saw that the idea of RISC processors was introduced in the 80s and that it is an important development in the methodology of processor design. In this section we will examine the ideas used in designing RISC and the major interesting characteristics of RISC processors[33]. As was pointed out earlier, high level languages (HLLs) became the most important vehicle for communicating algorithms and executing them on computers. These languages not only reduced the effort in developing software but also increased its reliability. There is, however, what is known as a semantic gap between the structure of HLLs and that of CPU architectures, as their objectives are different. In earlier generations of computers, when memory was a scarce resource and CPUs were slow, the overhead of translating HLL to machine language was considered a 'waste'. In order to reduce this overhead, CPU designers introduced a large number of instructions, several addressing modes and hardware implementations of some HLL statements. These did ease the task of compiler writers and to some extent improved the efficiency of running HLL programs. In the meanwhile, however, VLSI technology had improved by leaps and bounds and some of the concerns of earlier CPU designers were no longer valid. This led to a rethinking of the basic principles of CPU design. The most important step was to study the dynamic execution time behaviour of the object code obtained by compiling HLL programs. The HLL frequently used for system programming at the time the study was conducted was C. The major data obtained was on how often operations are performed, the types of operands used and how execution was sequenced. Table 12.3 shows the result[25].
TABLE 12.3 Dynamic Frequency of Occurrence of Operations

Operation            Dynamic Frequency    Proportional Time Taken in
                     of Occurrence        Existing CISC Processors
Assignment                 38                        13
Looping                     3                        32
Call to procedure          12                        33
If                         43                        21
Other                       4                         1

Statistical data on the use of operands in HLL programs showed that the majority of references were to simple scalar variables (55%). Further, more than


80% of the scalars were local variables in procedures. Thus it is important to optimize access to scalar variables. From Table 12.3 we see that a large proportion of execution time is spent in looping and in calls to procedures. Thus CPU designers should pay special attention to optimizing these. Based on these considerations, RISC designers decided to:

• Introduce a large number of registers in the CPU. This allows efficient reference to scalar variables without accessing main memory.
• Optimize call/return from procedures by using the available registers and an artifice called register windowing. This reduces call/return time. This idea is also appropriate as the number of variables passed and returned was found in the statistical studies to be less than 8 words.
• Pay special attention to pipelined processing so that looping and conditional jumps do not adversely affect the pipeline.

Another important decision that was taken was to simplify the control unit of the CPU so that the entire CPU could be integrated on one chip and it would be easy to debug errors in chip design. A complex control structure complicates chip design and increases the time and cost of designing and fabricating a chip. These decisions led to the following consequences on the instruction structure of RISC machines:

• Simple instruction format. All instructions are preferably of the same length, namely one word (32 bits). The number and variety of instructions are reduced.
• Simple addressing modes. Almost all RISC instructions use simple register addressing. Addressing modes such as indirect addressing are avoided.
• Most operations are register to register. This reduces access to main memory, thereby speeding up instruction execution. The only references to main memory are load and store operations. A side advantage of this decision is the optimization of register use by compilers.
• One instruction is carried out in one clock cycle. This is achieved by pipelining instruction execution. (As pipelining is an important part of RISC, we will analyse the pipelining idea in detail in the next section and see how it is used in RISC.)

To summarize, the RISC idea emerged when integrated circuits could accommodate a large number of logic gates, enabling a single chip implementation of an entire CPU. To achieve this and to speed up processing, RISC designers took the view "simple is beautiful". The number of instructions was reduced, the structure of each instruction was simplified, all instructions were made of equal length, addressing modes were reduced, registers were increased to reduce memory accesses, and compilers were optimized to enable pipelined execution of instructions so that one instruction could be carried out in each clock cycle. Thus the RISC idea was a landmark in the evolution of CPU architecture and has had a lasting impact on processor design.


12.5 PIPELINING

All recently designed processors speed up computation by a technique called pipelining. This technique uses temporal parallelism, which can be exploited in solving many problems. We will first illustrate the idea of pipeline processing, or assembly line processing, with an example. Assume that an examination paper has 4 questions to be answered and 1000 answer books are to be graded. Let us label the answers to the four questions in the answer books as Q1, Q2, Q3, Q4. Assume that the time taken to grade the answer to each question Q1, Q2, Q3, Q4 is 5 minutes. If one teacher grades all the answers, he/she will take 20 minutes to grade a paper. As there are 1000 answer papers, the total time taken will be 20,000 minutes. If we want to increase the speed of grading, we may employ 4 teachers to cooperatively grade each answer book. The four teachers are asked to sit in a line as shown in Figure 12.1. The answer books are placed in front of the first teacher. The first teacher takes an answer book, grades Q1 and passes it on to the second teacher, who grades Q2; the second teacher passes the paper on to the third teacher, who grades Q3 and passes it on to the last teacher in the line. The last teacher grades Q4, adds up the marks obtained by the student, and puts the paper in a tray kept for collecting the graded answer books (output).
FIGURE 12.1 Four teachers grading papers in a pipeline.

It is seen from the figure that when Q1 in the first answer book is being graded, three teachers are idle. When the second answer book is being graded, two teachers are idle. However, from the fourth answer book all teachers are busy, with teacher 1 grading Q1 of book 4, teacher 2 grading Q2 of book 3, teacher 3 grading Q3 of book 2 and teacher 4 grading Q4 of book 1. As the time to grade answer to each question is 5 minutes, the first answer book will take 20 minutes to be output but subsequently


graded papers will be output every 5 minutes. The total time taken to grade 1000 answer books will be 20 + (999 × 5) = 5015 minutes. This is about one-fourth of the time taken by one teacher. Observe that the time taken to grade a student's answer book is still 20 minutes (4 teachers take 5 minutes each), but the total number of papers graded in a given interval is increased by a factor of four. Thus the rate at which papers are graded increases by a factor of four.

In this example we define a job as that of correcting an answer book. This job has been divided into four tasks: correcting the answers to Q1, Q2, Q3 and Q4 respectively. As this method breaks up a job into a set of tasks to be executed overlapped in time, it is said to use temporal parallelism. This method of processing is appropriate if:

1. The jobs to be carried out are identical.
2. A job can be divided into many independent tasks. In other words, each task can be done independent of other tasks.
3. The time taken for performing each task is the same.
4. The time taken to transmit a job from one processing stage to the next is negligible compared to the time needed to execute a task.
5. The number of tasks into which a job is broken up is much smaller than the number of jobs to be carried out.

We can quantify the above points as worked out below:

Let the number of jobs = n
Let the time taken to do a job = p
Let a job be divisible into k tasks, each taking time p/k

Time required to complete n jobs with a pipeline of k stages
= p + (n – 1)(p/k) = p(k + n – 1)/k

Speed up due to pipeline processing
= np / {p(k + n – 1)/k} = nk/(k + n – 1) = k/{1 + (k – 1)/n}

If the number of jobs n is much larger than the number of stages in the pipeline k, then (k – 1)/n << 1 and the speed up is nearly equal to k.
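The speed up expression is easily evaluated. A minimal sketch in C (the function and variable names are ours) shows how quickly the speed up approaches k as n grows:

#include <stdio.h>

/* Speed up of a k-stage pipeline processing n identical jobs:
 *   speedup = k / (1 + (k - 1) / n)                            */
double pipeline_speedup(double n, double k) {
    return k / (1.0 + (k - 1.0) / n);
}

int main(void) {
    printf("n = 1000, k = 4 : speedup = %.2f\n", pipeline_speedup(1000, 4));
    printf("n = 4,    k = 4 : speedup = %.2f\n", pipeline_speedup(4, 4));
    return 0;
}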
EXAMPLE 12.1

A job consists of 4 tasks. The times taken by the four tasks are respectively 20 ns, 10 ns, 15 ns and 20 ns. Pipelining is used to reduce the processing time. If the number of jobs entering the pipeline is 120, find the efficiency of pipelining.

Time taken by each job = 65 ns. If no pipelining is used, the time taken by 120 jobs = 120 × 65 = 7800 ns. If pipelining is used, all tasks must be allocated equal time, which should be the maximum time for a task, namely 20 ns. The time taken to complete 120 jobs = 80 + 119 × 20 = 2460 ns.

Therefore speed-up = 7800/2460 = 3.17
Ideal speed-up = 4
Therefore pipeline efficiency = (3.17/4) × 100% = 79.25%

The main problems encountered in implementing temporal parallelism are:

Synchronization: Each stage in the pipeline must take equal time for the completion of a task so that a job can flow between stages without hold up. If the time taken by stage 1 is, say, t1 and is less than the time t2 taken by stage 2, then the job has to wait for a time (t2 – t1) before entering stage 2. The job will thus need a temporary storage area (called a buffer) between stages, where it can wait.

Bubbles in pipeline: If some tasks are absent in a job, 'bubbles' form in the pipeline. In the example of teachers grading papers, if an answer book has only two questions answered, two teachers will be forced to be idle while that answer paper is being corrected.

Fault tolerance: The system does not tolerate faults. If one of the stages in the pipeline fails for some reason, the entire pipeline is upset.

Intertask communication: The time to transmit a job between pipeline stages should be much smaller than the time taken to do a task.

In spite of these disadvantages, this is a very effective technique, as it is easy to perceive in many problems how jobs can be broken up into tasks to use temporal parallelism. Pipelining is a common technique used to speed up arithmetic operations on a sequence of numbers. It is also used to increase the number of instructions carried out by a processor in a specified time. Further, the pipeline can be made very efficient by tuning each stage in the pipeline to do a specific task well.

EXAMPLE 12.2

A 4 stage pipeline adder is to be designed. The times taken by the four pipeline stages are 3 ns, 3 ns, 10 ns and 5 ns.
(a) What should be the clock frequency to drive the adder?
(b) What is the time taken to add 100 pairs of operands?
(c) If 20 operands are zero (at random), what is the time taken to add 100 pairs of operands?
(d) What is the efficiency of pipeline addition if all the operands are non-zero?
(e) What is the efficiency in case (c)?
Refer to Figure 12.2 while giving the solution.

FIGURE 12.2 Adding vectors in a pipelined adder.

(a) We should design for the slowest operation in the pipeline. Therefore clock frequency = 1/10 ns = 100 MHz.
(b) 40 + 99 × 10 = 1030 ns.
(c) Same as (b), namely 1030 ns, as there is no method of detecting zeros in the adder.
(d) Efficiency = actual speed-up/ideal speed-up = (2100/1030) ÷ 4 = 0.51 or 51%.
(e) Efficiency = ((80 × 21)/1030) ÷ 4 = 0.407 ≈ 40.7% (assuming both input operands of a pair are not equal to 0).

12.6 INSTRUCTION PIPELINING IN RISC

We will illustrate instruction pipelining using a model of a computer with RISC architecture. We will assume that the computer has separate instruction and data caches. We also assume a large register file in the CPU. As it is a RISC architecture, the only instructions which refer to the main memory are load and store. A block diagram of this abstract architecture is given in Figure 12.3.

FIGURE 12.3 An abstract architectural diagram of a RISC CPU.

The following steps are repetitively carried out by this computer:

Step 1: Retrieve instruction from the instruction memory address given by PC. Increment PC.
Step 2: Decode instruction and access register file.
Step 3: Execute instruction using ALU. The ALU is used either for arithmetic operations (add, sub) or to calculate the data memory address in LOAD or STORE operations.
Step 4: Read/Write data from/in data memory.
Step 5: Write in register.

In Figure 12.4 we show the overlapping of operations for the sequence of instructions {Load A, Load B, C ← A + B, STO C}. Observe that we have assumed one clock period each for loading (or storing), the ALU operation and the register read/store. The time taken to decode an instruction and access the registers together is also assumed to be one clock period. If no pipelining is used, this sequence of instructions (Figure 12.4) will take 15 clock periods (4 clock periods each for LOAD and STORE, and 3 for ADD). It is clear that pipelining is possible, as these are independent operations which can be overlapped. Further, as a large number of instructions are carried out sequentially, the pipeline efficiency will be quite good.

FIGURE 12.4 Pipeline processing of instructions.
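A minimal sketch in C (ours; it assumes the ideal case of one instruction issued per clock in a 5 stage pipeline) makes the 15-versus-8 clock comparison concrete:

#include <stdio.h>

#define STAGES 5  /* IF, DER, ALU, DM, RS */

/* In an ideal pipeline one instruction is issued per clock, so
 * instruction i (0-based) completes at clock STAGES + i.  With
 * serial execution each instruction finishes all its steps
 * before the next begins.                                       */
int main(void) {
    const char *prog[] = {"LOAD A", "LOAD B", "ADD A,B", "STO C"};
    const int serial_cycles[] = {4, 4, 3, 4};   /* as in the text */
    int n = 4, serial_total = 0;

    for (int i = 0; i < n; i++) {
        serial_total += serial_cycles[i];
        printf("%-8s completes at clock %d (pipelined)\n",
               prog[i], STAGES + i);
    }
    printf("serial total = %d clocks, pipelined total = %d clocks\n",
           serial_total, STAGES + n - 1);
    return 0;
}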

If 100 such sequences, each of 4 operations, are carried out, the time taken without pipelining will be 1500 units. With pipeline processing the time taken will be (5 + 399) = 404 units. The speed up is thus 1500/404 ≈ 3.71, which is not much lower than the ideal speed up of 5. This speed up is quite good.

In practice, however, it may not be possible to sustain such pipeline processing. Pipelining is upset when a branch instruction is encountered. The fact that an instruction is a branch instruction will be found out only during the instruction decode phase. In the meanwhile the next sequential instruction would have been fetched. This instruction has to be ignored, and the instruction determined by the branch has to be fetched. If it is a conditional branch, the address of the next instruction will be known only at the end of the ALU operation. In the meanwhile two more instructions would have been fetched, which have to be discarded if the branch succeeds (Figure 12.5).

FIGURE 12.5 Delay in pipeline due to branches.

EXAMPLE 12.3

A sequence of 1000 instructions has 30% conditional branch instructions. Of these, in 50% of the branches the condition is True. What is the efficiency of pipeline processing?

Of the 1000 instructions, 300 are branches and 150 of them will lead to branching. If the branch condition is true the pipeline is delayed by 2 clock cycles.
Thus total delay = 150 × 2 = 300 clock cycles
Time taken by 1000 instructions if there is no branching = 5 + 999 = 1004 clock cycles
Time taken by 1000 instructions including branching = 1304 clock cycles
Speed up = (5 × 1000)/1304 = 3.83
Speed up if no branch instructions are there = 5000/1004 = 4.98
Efficiency = (3.83/5) × 100% = 76.6%
Efficiency without branching = (4.98/5) × 100% = 99.6%
The loss of efficiency is 23%, which is quite high.
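The arithmetic of this example generalizes. A minimal sketch in C (the names and parameterization are ours) computes the cycle count for any mix of taken branches:

#include <stdio.h>

/* Clock cycles to run n instructions on a k-stage pipeline when a
 * fraction taken_frac of them are taken branches, each costing
 * `penalty` extra cycles (Example 12.3: k = 5, penalty = 2).      */
double cycles_with_branches(double n, int k, double taken_frac,
                            int penalty) {
    return (k + (n - 1.0)) + n * taken_frac * penalty;
}

int main(void) {
    double n = 1000.0;
    double t = cycles_with_branches(n, 5, 0.30 * 0.50, 2);
    printf("cycles = %.0f, speed-up over serial = %.2f\n",
           t, 5.0 * n / t);   /* prints 1304 and 3.83 */
    return 0;
}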

We will discuss pipeline delays in detail in the next section.

12.7 DELAY IN PIPELINE EXECUTION

Delays in the pipelined execution of instructions due to non-ideal conditions are called pipeline hazards. An ideal pipeline assumes a continuous flow of tasks. The non-ideal conditions are:

• The available resources in a processor are limited; two instructions may need the same resource in the same clock cycle.
• Successive instructions are not independent of one another. The result generated by an instruction may be required by the next instruction.
• All programs have branches and loops. Execution of a program is thus not in a 'straight line'.

Each of these non-ideal conditions causes delays in pipeline execution. We can classify them as:

• Delays due to resource constraints, known as structural hazards.
• Delays due to data dependency between instructions, known as data hazards.
• Delays due to branch instructions or control dependency in a program, known as control hazards.

We will discuss each of these in the following subsections and examine how to reduce these delays.

12.7.1 Delay due to Resource Constraints

Pipelined execution may be delayed due to the non-availability of resources when they are required during the execution of an instruction. Referring to Figure 12.4, during clock cycle 4, instruction LOAD A requires a read from the data cache and instruction STO C requires an instruction to be fetched from the instruction cache. If one common cache is used for both data and instructions, only one of these instructions can be carried out and the other has to wait. This is known as a structural hazard. Forced waiting of an instruction in pipeline processing is called a pipeline stall. This is one of the reasons why two different caches (with independent read/write ports) for instructions and data are used in RISC. This is a hardware method of avoiding stalling of pipeline execution.

Pipeline execution may also be delayed if one of the steps in the execution of an instruction takes longer than one clock cycle. Normally a floating point division takes longer than, say, an integer addition. If we assume that a floating point division takes 3 clock cycles to execute, the progress of pipeline execution is as shown in Figure 12.6. Instruction (i + 1) in this example is a floating point division which takes 3 clock cycles. Thus the execution step of instruction (i + 2) cannot start at

clock cycle 5, as the execution unit is busy. It has to wait till clock cycle 7, when the execution unit becomes free to carry it out. This is a pipeline stall due to the non-availability of a required resource (a structural hazard).

FIGURE 12.6 Delay in pipeline due to resource constraints.

How do we avoid this? One way is to speed up floating point division by using extra hardware so that it also takes one clock cycle. This will require more hardware resources. Before deciding whether it is worthwhile doing so, we compute the delay due to this resource constraint, and if it is not too large we may decide to tolerate it. Recall that we calculated the speed-up due to pipelining in the ideal case to be approximately the number of stages in the pipeline. We will now calculate the speed-up when there is a pipeline delay due to a resource constraint.

EXAMPLE 12.4

A sequence of 1000 instructions has 5% floating point division operations. Division delays pipeline execution by 2 clock periods. What is the efficiency of the pipeline with and without speeding up the division operation?

There are on average 1000 × 5/100 = 50 division operations in 1000 instructions.
Total time needed without pipelining = 1000 × 5 = 5000 clock cycles
Total time needed with special hardware to speed up division to 1 clock cycle and with pipelining = 5 + 999 = 1004
Speed-up with pipelining = (5000/1004) = 4.98
Total time needed with no special hardware for division and with pipelining = 5 + 949 + 50 × 2 = 1054
Speed-up with pipelining and no special hardware for division = (5000/1054) = 4.74
Efficiency loss in pipelining = (0.24/4.98) × 100 = 4.82%
As the loss of efficiency is less than 5%, it may not be worthwhile expending considerable resources to speed up floating point execution in this example.

In some pipeline designs, whenever work cannot continue in a particular cycle the pipeline is 'locked'; that is, no step of any instruction is carried out during that cycle. In our example, the hardware detects that the execution unit is busy and thus suspends pipeline execution. Referring to Figure 12.6, which illustrates pipeline delay due to resource shortage, observe that in cycle 5, even though the decode register read step (DER) of instruction (i + 3) could go on (as the decoding unit is free), no work is done during this 'locked' cycle because of the locking of the pipeline. The main advantage of this method of locking the pipeline is the ease of hardware implementation. Also, by such locking we ensure that successive instructions will complete in the order in which they are issued. Many machines have used this technique. However, recent machines do not always lock a pipeline; they let instructions continue if there are no resource constraints or other problems, which we discuss next. In such a case a later instruction may complete before an earlier instruction; that is, the completion of instructions may be 'out of order'. If this is logically acceptable in a program, it should be allowed.

12.7.2 Delay due to Data Dependency

Pipeline execution is also delayed by the fact that successive instructions are not always independent of one another. The result produced by an instruction may be needed by succeeding instructions and may not be ready when needed. Consider the following sequence of instructions, assuming A and B are in registers:

ADD A, B
STO C
MUL C, C

FIGURE 12.7 Delay due to data dependency.

Referring to Figure 12.7, we see that the value of C (the result of the ADD operation) will not be available till clock cycle 6. Thus the MUL operation should not be executed in cycle 5, as the value of C it needs will not yet have been stored in the register file, and is thus not available.

Thus the execution step of the MUL instruction is delayed till cycle 7. This stall is due to data dependency and is called a data hazard, as the required data is not available when needed.

A question which naturally arises is: how can we avoid pipeline delay due to data dependency? There are two methods available to do this. One is a hardware technique and the other is a software technique. The hardware method is called register forwarding. Referring to Figure 12.7, the result of ADD A, B will be in a buffer register in the ALU. Instead of waiting till the RS cycle to store it in the register file, one may provide a path from the ALU buffer register to the ALU input and bypass the DM and RS cycles. This technique is called register forwarding. If register forwarding is done during the ADD instruction, there will be no delay at all in the pipeline: when MUL C, C needs C, it is available at the ALU. Hardware must of course be provided to detect that the next instruction needs the output of the current instruction, so that the output can be fed back to the ALU. Many pipelined processors have register forwarding as a standard feature.

Consider another sequence of instructions:

ADD A, B
STO C
MUL C, C
SUB B, C

In this series of instructions, not only MUL but also SUB needs the value of C. Thus the hardware should have a facility to forward C to SUB also.

Instruction reordering is the software method of avoiding pipeline delays. Compilers have now been developed specifically for RISC machines which will detect the possibility of a pipeline delay and reorder instructions, provided this does not change the meaning of the program. In the following sequence of instructions, ADD D, B and STO F are not dependent on C:

ADD A, B
STO C
MUL C, C
ADD D, B
STO F

Thus the order of the instructions can be changed as shown below without changing the meaning of the program:

ADD A, B
STO C
ADD D, B
STO F
MUL C, C

The pipeline execution of this series of instructions is shown in Figure 12.8. With this reordering there is no delay at all in the pipeline.

FIGURE 12.8 Pipeline processing with out of order execution.

12.7.3 Pipeline Delay due to Branch Instructions

The last and the most important non-ideal condition which affects the performance of pipeline processing is branches in a program. A branch disrupts the normal flow of control. If an instruction is a branch instruction (which is known only at the end of the instruction decode step), the next instruction may be either the next sequential instruction (if the branch is not taken) or the one specified by the branch instruction (if the branch is taken). We have already given an example in Section 12.6. Consider 5 branch instructions: JMP (unconditional jump) and JMI, JEQ, JZE and BCT (conditional jumps). The jump address for JMP is known at the decoding step. For JMI also the jump address could be found during the decode step, as the negative flag would have been set by the previous instruction. For JEQ, JZE and BCT, however, one would know whether to jump or not only after executing the ALU operation. In our design we have shown the branch address assignment at the end of the DM operation uniformly for all branch instructions. Let us now examine the effect of this on pipeline execution using Figure 12.9. In this figure instruction (i + 1) is a branch instruction. The fact that it is a branch will be known to the hardware only at the end of the decode (DER) step. In the meanwhile, the next instruction would have been fetched.

FIGURE 12.9 Delay in pipeline execution due to branch instruction.

In our design, whether the fetched instruction should be executed or not would be known only at the end of the DM cycle of the branch instruction (see Figure 12.9). If the branch is not taken, the fetched instruction can proceed to the DER step. If the branch is taken, the instruction already fetched is not used, and the instruction at the branch address is fetched; this instruction will thus be delayed by two cycles. There is therefore a 3 cycle delay (clock cycles 3, 4, 5 in Figure 12.9) in this case. Thus it is essential to reduce the pipeline delay when branches occur. Let us calculate the reduction of speed-up due to branching.

EXAMPLE 12.5

In our pipeline the maximum ideal speed-up is 5. Let the percentage of unconditional branches in a set of typical programs be 5% and that of conditional branches be 15%. Assume that 80% of the conditional branches are taken in the programs.

No. of cycles per instruction (ideal case) = 1
Average delay cycles due to unconditional branches = 3 × 0.05 = 0.15
Average delay cycles due to conditional branches
= delay due to taken branches + delay due to not taken branches
= 3 × (0.15 × 0.8) + 2 × (0.15 × 0.2) = 0.36 + 0.06 = 0.42
Therefore speed-up with branches = 5/(1 + 0.15 + 0.42) = 3.18
% loss of speed-up due to branches = 36.4%

Observe that the speed-up lost is 36.4%. There are two broad methods of reducing the delay due to branch instructions. One is to tackle the problem at the hardware level and the other is to provide software aids to reduce the delay. We will first examine how the hardware can be modified to reduce delay cycles.

12.7.4 Hardware Modification to reduce Delay due to Branches

The primary idea is to find out the address of the next instruction to be executed as early as possible. The branch address for JMP and JMI can be found at the end of the DER step. If we put a separate ALU in the decode stage of the pipeline to find the effective jump address for the JEQ, JZE and BCT instructions, we can decide whether or not to branch at the end of the decode step. By adding this extra circuitry we reduce the delay to 1 cycle if a branch is taken, and to zero if it is not taken. We will calculate the improvement in speed-up with this hardware.

EXAMPLE 12.6

Assume again 5% unconditional jumps, 15% conditional jumps and 80% of conditional jumps taken.

Average delay cycles with the extra hardware = 1 × 0.05 + 1 × (0.15 × 0.8) = 0.05 + 0.12 = 0.17
Therefore speed-up with branches = 5/(1 + 0.17) = 4.27
% loss of speed-up = 14.6%

Thus we get a gain of 22% in speed-up due to the extra hardware, which is well worth it in our example. In this example we assumed only a small number of branch instruction types, and the extra hardware addition was simple. Commercial processors are more complex and have a variety of branch instructions. It may not be cost effective to add hardware of the type we have illustrated. Other techniques have been used which are more cost-effective in certain processors. We will discuss two methods, both of which depend on predicting the instruction which will be executed immediately after a branch instruction. The prediction is based on the execution time behaviour of the program. The first method we discuss is less expensive in its use of hardware and consequently less effective. It uses a small fast memory called a branch prediction buffer to assist the hardware in selecting the instruction to be executed immediately after a branch instruction. The second method, which is more effective, also uses a fast memory, called a branch target buffer. This memory, however, has to be much larger and requires more control circuitry. Both ideas can, of course, be combined. We will first discuss the use of the branch prediction buffer.

Branch prediction buffer

In this technique, the lower order bits of the addresses of the branch instructions in a program segment are used as addresses of the branch prediction buffer memory. The contents of each location of this buffer memory is the address of the next instruction to be executed if the branch is taken. In addition, two bits count the number of times the branch has been successfully taken in the immediately preceding attempts. While executing an instruction, at the decode step of the pipeline we will know whether the instruction is a branch instruction or not. If it is a branch instruction, the low order bits of its address are used to address the branch prediction buffer memory and the prediction bits are examined. If they are 10 or 11, control jumps to the branch address found in the branch prediction buffer. Otherwise the next sequential instruction is executed. Initially the prediction bits are 00. Every time a branch is taken the prediction bits are incremented by 1, and every time a branch is not taken they are decremented by 1. If the prediction bits are 11 and a branch is taken they remain at 11; if a branch is not taken they are decremented to 10. This is shown in Figure 12.10. Experimental results show that the prediction is correct 90% of the time. With 1000 entries in the branch prediction buffer, it is estimated that the probability of finding a branch instruction in the buffer is 95%.

There are two questions which must have occurred to the reader. The first is 'How many clock cycles do we gain, if at all, with this scheme?' The second is 'Why should there be 2 bits in the prediction field? Would it not be sufficient to have only one bit?'
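The 2 bit scheme described above amounts to a saturating counter per buffer entry. A minimal sketch in C (the table size, names and direct indexing are our assumptions) shows the predict and update rules:

#include <stdio.h>

#define BUF_SIZE 1024  /* indexed by low-order bits of the branch address */

static unsigned char counter[BUF_SIZE]; /* 2-bit counters, initially 00 */

/* Predict taken when the counter is 10 or 11. */
int predict_taken(unsigned addr) {
    return counter[addr % BUF_SIZE] >= 2;
}

/* After the branch resolves: increment on taken, decrement on not
 * taken, saturating at 11 (3) and 00 (0) respectively.             */
void update(unsigned addr, int taken) {
    unsigned char *c = &counter[addr % BUF_SIZE];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void) {
    /* A loop branch taken 9 times and then not taken once: after two
     * warm-up mispredictions the predictor is right until the exit.  */
    unsigned addr = 0x4000;
    for (int trip = 0; trip < 10; trip++) {
        int taken = (trip < 9);
        printf("trip %d: predicted %d, actual %d\n",
               trip, predict_taken(addr), taken);
        update(addr, taken);
    }
    return 0;
}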

FIGURE 12.10 The fields of a branch prediction buffer memory: the address (the low order bits of the branch instruction's address), the contents (the address where the branch will jump) and 2 prediction bits.

We will take up the first question. Remember that the fact that an instruction is a branch will be known only during the decode step. Without any hardware enhancements there would be no gain from this scheme, as the branch address would in any case be found at the decode step of the pipeline. In our example machine with hardware enhancements, two clock cycles will be saved when this buffer memory is used, provided the branch prediction is correct. In many machines where address computation is slow, this buffer will be very useful. As for the second question, a single bit predictor mispredicts branches more often, particularly in loops, than a 2 bit predictor. Thus it has been found more cost-effective to use a 2 bit predictor.

Branch target buffer

Unlike a branch prediction buffer, a branch target buffer is used at the instruction fetch step itself. The fields of the branch target buffer memory (BTB) are shown in Figure 12.11. Observe that the address field has the complete address of each branch instruction.

FIGURE 12.11 The fields of a branch target buffer memory: the address (the full address of the branch instruction), the contents (the address where the branch will jump) and 1 or 2 optional prediction bits.

The contents of the BTB are created dynamically. When a program is executed, whenever a branch statement is encountered its address and branch target address are placed in the BTB. At the end of the execution step, the target address of the branch will be known if the branch is taken; at this time the target address is entered in the BTB and the prediction bit is set to 1. Typically, when a loop is executed for the first time, the branch instruction governing the loop will not be found in the BTB; it is entered in the BTB during this first execution. When the loop is executed the second and subsequent times, the branch target will be found at the instruction fetch phase itself, thus saving the 3 clock cycle delay. In other words, once a BTB entry is made, it can be accessed at the instruction fetching phase itself and the target address found. We explain how the BTB is created and used with a flow chart in Figure 12.12. Observe that the left portion of this flow chart describes how the BTB is created and updated dynamically, and the right portion explains how it is used.
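A minimal sketch in C of the BTB operations just described (the entry layout, table size and the pc + 4 fall-through are our assumptions; a real BTB is an associative memory rather than a simple array):

#include <stdio.h>

#define BTB_ENTRIES 1024  /* the text suggests about 1000 in practice */

struct btb_entry {
    unsigned branch_addr;   /* full address of the branch instruction */
    unsigned target_addr;   /* address the branch jumps to            */
    int      valid;
    int      predict_taken; /* a 1- or 2-bit predictor could go here  */
};

static struct btb_entry btb[BTB_ENTRIES];

/* At instruction fetch: if the PC hits in the BTB and the entry
 * predicts taken, fetch the next instruction from the stored target. */
unsigned next_fetch_addr(unsigned pc) {
    struct btb_entry *e = &btb[pc % BTB_ENTRIES];
    if (e->valid && e->branch_addr == pc && e->predict_taken)
        return e->target_addr;
    return pc + 4;          /* fall through to the next word          */
}

/* At execution: enter or update the entry once the outcome is known. */
void btb_update(unsigned pc, unsigned target, int taken) {
    struct btb_entry *e = &btb[pc % BTB_ENTRIES];
    e->branch_addr   = pc;
    e->target_addr   = target;
    e->valid         = 1;
    e->predict_taken = taken;
}

int main(void) {
    unsigned loop_branch = 0x1000, loop_top = 0x0F00;
    printf("first fetch after branch: 0x%x\n", next_fetch_addr(loop_branch));
    btb_update(loop_branch, loop_top, 1);      /* first loop trip      */
    printf("second fetch after branch: 0x%x\n", next_fetch_addr(loop_branch));
    return 0;
}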

If they do not match, the instruction fetched from the target has to be removed and the next instruction in sequence should be taken for execution.

FIGURE 12.12 Branch Target Buffer (BTB) creation and use. (Left half, creation of BTB entries: at the FI step the fetched instruction's address is looked up in BTB; if it is not found and the instruction turns out at the DE step to be a branch, its address and the branch address are entered in BTB at the EX step; otherwise execution follows the normal flow. Right half, use: if the address is found in BTB, the next instruction is fetched from the predicted branch address; at the EX step the prediction is checked, and if the predicted branch is not the actual branch, the fetched instruction is flushed, the right instruction is fetched and the prediction bit of BTB is modified.)

Observe that we have to search BTB to find out whether the fetched instruction is in it. Thus BTB cannot be very large. About 1000 entries are normally used in practice.
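The creation and use of BTB entries sketched in Figure 12.12 can also be expressed in code. The following is a minimal sketch of our own (names such as btb_lookup and the direct-indexed organization are assumptions): the lookup happens at the fetch step, and an entry is created when a taken branch is first executed.

    #include <stdint.h>
    #include <stdbool.h>

    #define BTB_ENTRIES 1000          /* about 1000 entries are typical */

    typedef struct {
        bool     valid;
        uint32_t branch_addr;         /* complete address of the branch */
        uint32_t target;              /* address where branch will jump */
        uint8_t  prediction;          /* 1 or 2 bits (optional)         */
    } BTBEntry;

    static BTBEntry btb[BTB_ENTRIES];

    /* At the fetch step: if the PC matches a BTB entry, fetch the next
       instruction from the predicted target instead of PC + 4.         */
    bool btb_lookup(uint32_t pc, uint32_t *next_pc) {
        BTBEntry *e = &btb[pc % BTB_ENTRIES];
        if (e->valid && e->branch_addr == pc && e->prediction != 0) {
            *next_pc = e->target;
            return true;              /* predicted taken                */
        }
        return false;                 /* fall through to PC + 4         */
    }

    /* When a taken branch is first executed, create its entry.         */
    void btb_insert(uint32_t pc, uint32_t target) {
        BTBEntry *e = &btb[pc % BTB_ENTRIES];
        e->valid = true;
        e->branch_addr = pc;
        e->target = target;
        e->prediction = 1;            /* prediction bit set to 1        */
    }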

We will now compute the reduction in speed-up due to branches when BTB is employed, using the same data as in Example 12.5.

EXAMPLE 12.7

Assume unconditional branches = 5%, conditional branches = 15% and taken branches = 80% of conditional branches. For simplicity we will assume that branch instructions are found in BTB with probability 0.95. We will assume that in 90% of cases the branch prediction based on BTB is correct. (For unconditional branches there can be no misprediction.) We will also assume that the branch address is put in PC only after the MEM cycle. In other words, there is no special branch address calculation hardware.

By having a BTB, the average delay cycles when unconditional branches are found in BTB = 0. Average delay when an unconditional branch is not found in BTB = 2. As the probability of not being in BTB = 0.05, and 5% of the instructions are unconditional branches, the average delay due to unconditional branches = 2 × 0.05 × 0.05 = 0.005.

Average delay when conditional branches are not found in BTB = (3 × 0.8) + (2 × 0.2) = 2.8, contributing 2.8 × 0.05 = 0.14 cycles per conditional branch. Average delay due to misprediction of conditional branches when found in BTB = 0.1 × 2.8 = 0.28. Average delay due to conditional branches when they are found in BTB = 0.28 × 0.95 = 0.266 (as the probability of a conditional branch being in BTB is 0.95). As 15% of the instructions are conditional branches, the average delay due to conditional branches = 0.15 × (0.14 + 0.266) = 0.061. Thus the average delay due to branches = 0.005 + 0.061 = 0.066.

Therefore speed-up with branches when BTB is used = 5/(1 + 0.066) = 4.70

% loss of speed-up due to branches = 6%

Compare this with the loss of speed-up of 36.4% found in Example 12.5 with no BTB. Thus the use of BTB is extremely useful.
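The arithmetic of Example 12.7 is mechanical enough to check with a short program. The sketch below (the variable names are ours) simply re-evaluates the expressions above and prints the average branch delay of about 0.066 cycles and the resulting speed-up of about 4.7.

    #include <stdio.h>

    int main(void) {
        double f_uncond = 0.05, f_cond = 0.15;  /* instruction mix        */
        double p_btb = 0.95;                    /* found in BTB           */
        double p_wrong = 0.10;                  /* misprediction rate     */
        double p_taken = 0.80;                  /* taken conditionals     */

        /* Penalty when the outcome is only known late: 3 cycles for a
           taken branch, 2 for a not-taken one (expected value 2.8).    */
        double miss_penalty = 3 * p_taken + 2 * (1 - p_taken);

        double d_uncond = f_uncond * (1 - p_btb) * 2;
        double d_cond = f_cond * ((1 - p_btb) * miss_penalty
                                  + p_btb * p_wrong * miss_penalty);
        double delay = d_uncond + d_cond;

        printf("average branch delay = %.3f cycles\n", delay);
        printf("speed-up with BTB    = %.2f\n", 5.0 / (1.0 + delay));
        return 0;
    }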

12.7.5 Software Method to Reduce Delay due to Branches

This method assumes that there is no hardware feature to reduce the delay due to branches. The primary idea is for the compiler to rearrange the statements of the assembly language program in such a way that the statement following the branch statement (called a delay slot) is always executed once it is fetched, without affecting the correctness of the program. This may not always be possible, but analysis of many programs shows that this technique succeeds quite often. The technique is explained with some examples.

EXAMPLE 12.8

    Original Program          Rearranged Program
    ADD  A, B                 ADD  A, B
    STO  C                    STO  C
    ADD  D, F                 SUB  B, C
    SUB  B, C                 JMI  X
    JMI  X                    ADD  D, F
    --------                  --------
    X --------                X --------

FIGURE 12.13 Rearranging compiled code to reduce stalls.

Observe that in the rearranged program the branch statement JMI has been placed before ADD D, F. While the jump statement is being decoded, the ADD statement would have been fetched in the rearranged code and can be allowed to complete without changing the meaning of the program. If no such statement is available, then a No Operation (NOP) statement is used as a filler after the branch so that when it is fetched it does not affect the meaning of the program.

Another technique used by the compiler is to place in the delay slot the target instruction of the branch. This is illustrated in Example 12.9.

EXAMPLE 12.9

FIGURE 12.14 Rearranging code in a loop to reduce stalls.

Observe that the delay slot is filled by the target instruction of the branch. If the probability of the branch being taken is high, then this procedure is very effective. When the branch is not taken, control will fall through, and the compiler should 'undo' the statement executed in the delay slot if it changes the semantics of the program.

Some pipelined computers provide an instruction "No Operation if branch is unsuccessful" so that the delay slot automatically becomes a No Operation and the program's semantics do not change. Forward branching can also be handled by placing the target instruction in the delay slot, as shown in Figure 12.15.

FIGURE 12.15 Hoisting code from target to delay slot to reduce stall.

There are other software techniques, such as unrolling a loop to reduce branch instructions, which we do not discuss here. They are discussed in detail in Rajaraman and Siva Ram Murthy[35].

12.7.6 Difficulties in Pipelining

We have discussed in the previous subsection problems which arise in pipeline execution due to various non-ideal conditions in programs. We will now examine another difficulty which makes the design of pipeline processors challenging. This difficulty is due to the interruption of the normal flow of a program by events such as illegal instruction codes, page faults and I/O calls. We call all of these exception conditions. In Table 12.4 we list a number of exception conditions. For each exception condition we have indicated whether it can occur in the middle of the execution of an instruction and, if yes, during which step of the pipeline. Referring to Table 12.4 we see, for example, that a memory protection violation can occur either during the instruction fetch (IF) step or the load/store (MEM) step of the pipeline. In Table 12.4 we have also indicated whether we could resume the program after the exception condition or not. In the case of an undefined instruction, hardware malfunction or power failure, we have no option but to terminate the program. If the Operating System has a checkpointing feature, one can restart a half-finished program from the last checkpoint.

An important problem faced by an architect is to be able to attend to the exception conditions and resume computation in an orderly fashion whenever it is feasible. The problem of restarting computation is complicated by the fact that several instructions will be in various stages of completion in the pipeline. If the pipeline processing can be stopped when an exception condition is detected in such a way that all instructions which occur before the one causing the exception are completed, and all instructions which were in progress at the instant the exception occurred can be restarted (after attending to the exception) from the beginning, the pipeline is said to have precise exceptions.

Referring to Figure 12.16, assume an exception occurred during the ALU step of instruction (i + 1). If the system supports precise exceptions, then all instructions up to and including i must be completed, and instructions (i + 1), (i + 2), (i + 3) and (i + 4) should be stopped and resumed from scratch after attending to the exception. In other words, whatever actions were carried out by (i + 1), (i + 2) and (i + 3) before the occurrence of the exception should be cancelled.

FIGURE 12.16 Occurrence of an exception during pipeline execution.

When an exception is detected the following actions are carried out:
1. A trap instruction is fetched as the next instruction in the pipeline (Instruction (i + 4) in Figure 12.16).
2. As soon as the exception is detected, write operations are turned off for the current and all subsequent instructions in the pipeline (Instructions (i + 1), (i + 2) and (i + 3) in Figure 12.16).
3. The trap instruction invokes the Operating System, which saves the PC of the instruction which caused the exception condition to enable resumption of the program later, after attending to the exception.

TABLE 12.4 Exception Conditions in a Computer

    Exception Type                                Occurs during pipeline stage?   Which stage   Resume or Terminate
    I/O request                                   No                              -             Resume
    OS request by user program                    No                              -             Resume
    User initiates break point during execution   No                              -             Resume
    User tracing program                          No                              -             Resume
    Arithmetic overflow or underflow              Yes                             ALU           Resume
    Page fault                                    Yes                             IF, DM        Resume
    Misaligned memory access                      Yes                             IF, DM        Resume
    Memory protection violation                   Yes                             IF, DM        Resume
    Undefined instruction                         Yes                             DER           Terminate
    Hardware failure                              Yes                             Any           Terminate
    Power failure                                 Yes                             Any           Terminate

We have briefly discussed the procedure to be followed when exceptions occur. There are many other issues which complicate the problem, particularly when the instruction set is complex and multicycle floating point operations are to be performed. For a detailed discussion, readers should refer to the book by Hennessy and Patterson[36].

12.8 SUPERSCALAR PROCESSORS

In a pipelined processor we assumed that each stage of the pipeline takes the same time, namely, one clock interval. In practice some pipeline stages require less than one clock interval. Thus we may divide each clock cycle into two phases and allocate intervals appropriate for each step in the instruction cycle. Referring to Figure 12.17, we have allocated a full clock cycle to the instruction fetch (IF), execution (ALU) and Data Memory load/store (DM) steps, while half a cycle has been allocated to the decode (DER) and store register (RS) steps, which normally require less time. Further, if the steps are subdivided into two phases such that each phase needs different resources, thereby avoiding resource conflicts, pipeline execution can be made faster, as shown in Figure 12.17.

FIGURE 12.17 Pipeline and superpipeline processing.

Observe that normal pipeline processing takes 7 clock cycles to carry out 3 instructions, whereas in superpipelined mode we need only 5 clock cycles for the same three instructions. In the steady state, one instruction will take half a clock cycle under ideal conditions. This method of pipeline execution is known as superpipelining.
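The cycle counts quoted above follow from a simple timing model. The sketch below assumes, consistent with Figure 12.17, that the normal pipeline has a 5-cycle latency with one issue per cycle, while the superpipelined version has a 4-cycle latency (IF, ALU and DM take a full cycle; DER and RS half a cycle each) with one issue every half cycle; for 3 instructions it prints 7 and 5 cycles.

    #include <stdio.h>

    int main(void) {
        int n = 3;  /* instructions, as in Figure 12.17 */

        /* Normal 5-stage pipeline: the first instruction takes 5
           cycles, then one instruction completes per cycle.        */
        double normal = 5 + (n - 1);

        /* Superpipelined: latency 4 cycles (1 + 0.5 + 1 + 1 + 0.5),
           and a new instruction can be started every half cycle.   */
        double super = 4 + 0.5 * (n - 1);

        printf("normal: %.0f cycles, superpipelined: %.1f cycles\n",
               normal, super);
        return 0;
    }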

All the difficulties associated with pipeline processing (namely, the various dependencies) will also be there in superpipeline processing. As superpipeline processing speeds up the execution of programs, it has been adopted in many commercial high performance processors such as the MIPS R4000.

Another approach to improve the speed of a processor is to combine the temporal parallelism used in pipeline processing with data parallelism by issuing several instructions simultaneously in each cycle. This is called superscalar processing. Pipeline execution with 2 instructions issued simultaneously is shown in Figure 12.18. Observe that 6 instructions are completed in 7 clock cycles. In the steady state, two instructions will be completed every clock cycle under ideal conditions.

For successful superscalar processing the hardware should permit fetching several instructions simultaneously from the instruction memory. If instructions are 32 bits long and we fetch 2 instructions, a 64-bit data path from the instruction memory is required and we need 2 instruction registers. Executing multiple instructions simultaneously also requires multiple execution units to avoid resource conflicts. The minimum extra execution units required will be a floating point arithmetic unit in addition to an integer arithmetic unit, and a separate address calculation arithmetic unit. If many instructions are issued simultaneously, we may need several floating point and integer execution units. The data cache must also have several independent ports for read/write which can be used simultaneously.

FIGURE 12.18 Superscalar processing. (Two instructions are issued in each clock cycle; each pair proceeds together through the IF, DER, ALU, DM and SR stages over clock cycles 0 to 7.)

12.9 VERY LONG INSTRUCTION WORD (VLIW) PROCESSOR

The major difficulty in designing superscalar processors, besides the need to duplicate the instruction register, decoder and arithmetic unit, is that it is difficult to schedule instructions dynamically to reduce pipeline delays. Hardware looks only at a small window of instructions, and scheduling them to use all the available processing units, taking into account dependencies, is sub-optimal. As programs are generally written without any concern for parallel execution, the delay due to data dependency of successive instructions during pipeline execution can become quite high in a superscalar architecture unless there are features available in hardware to allow execution of instructions in arbitrary order while ensuring the correctness of the program. We will not discuss this as it is outside the scope of this book. Readers may refer to the book by Rajaraman and Siva Ram Murthy[35] for a detailed treatment of this problem.

Another approach is to transform the source code by a compiler in such a way that the resulting code has reduced dependency and resource conflicts, so that the parallelism available in hardware is fully utilised. Compilers can take a more global view of the program and rearrange code to better utilize the resources and reduce pipeline delays. An alternative to superscalar architecture has thus been proposed which uses sophisticated compilers to expose sequences of instructions which have no dependency and require different resources of the processor. There are many problems which arise in using this approach, particularly when there is a dependency between successive instructions. In this architecture, a single word incorporates many operations. Typically two integer operations, two floating point operations, two load/store operations and a branch may all be packed into one long instruction word, which may be anywhere between 128 and 256 bits long.
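One way to visualize such an instruction word is as a packed record of operation slots. The C struct below is purely illustrative (the slot widths and field names are our own invention, not the encoding of any real VLIW machine, and bitfield layout is implementation-defined); it shows seven operations packed into 128 bits.

    #include <stdint.h>

    /* An illustrative 128-bit VLIW word: seven 18-bit operation
       slots plus 2 bits of padding. A real format would differ. */
    typedef struct {
        uint32_t int_op1   : 18;  /* integer operation 1         */
        uint32_t int_op2   : 18;  /* integer operation 2         */
        uint32_t fp_op1    : 18;  /* floating point operation 1  */
        uint32_t fp_op2    : 18;  /* floating point operation 2  */
        uint32_t mem_op1   : 18;  /* load/store operation 1      */
        uint32_t mem_op2   : 18;  /* load/store operation 2      */
        uint32_t branch_op : 18;  /* branch operation            */
        uint32_t unused    : 2;   /* padding to 128 bits         */
    } VLIWWord;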

The processor must have enough resources to execute all the operations specified in an instruction word simultaneously. In other words, it must have 2 integer units, 2 floating point units, two data memories and a branch arithmetic unit in this example. The challenging task is to have enough parallelism in a sequence of instructions to keep all these units busy. The main challenges in designing VLIW processors are:

• Lack of sufficient instruction level parallelism in programs.
• Difficulties in building hardware.
• Inefficient use of bits in a very long instruction word.

A major problem is the lack of sufficient parallelism. Assume that the two floating point units (FPUs) are pipelined to speed up floating point operations. If the number of stages in the pipeline of an FPU is 4, we need at least 10 floating point operations to effectively use such a pipelined FPU. With two FPUs we need at least 20 independent floating point operations to be scheduled to get the ideal speed-up. The overall instruction level parallelism in programs may not be this high. Parallelism in programs is exposed by unrolling loops, by scheduling instructions across branches by examining the program globally, and by what is known as trace scheduling. Trace scheduling[37] is based on predicting the path taken by branch operations at compile time using some heuristics or hints given by the programmer. If the branches follow the predicted path, the code becomes 'straight line code', which facilitates packing sequences of instructions into long instruction words and storing them in the instruction memory of the VLIW[35] processor.

Besides this, to sustain two floating point operations per cycle we also need two read ports and two write ports to the register file to retrieve operands and store results. The VLIW hardware also needs much higher memory and register file bandwidth to support wide words and multiple reads/writes to the register files. This requires a large silicon area on the processor chip. Lastly, it is not always possible to pack all 7 operations in each word. On the average about half of each word is found to be empty, which increases the required memory capacity. Binary code compatibility between two generations of VLIWs is also very difficult to maintain, as the structure of the instruction word will invariably change. Overall it looks like VLIW, though an interesting idea, will become popular in commercial high performance processors only if compiling techniques improve and there are hardware features to assist in trace scheduling.

12.10 SOME EXAMPLE COMMERCIAL PROCESSORS

In this section we describe three processors which are examples of a RISC superscalar processor, a CISC high performance processor and a long instruction word processor.

12.10.1 Power PC 620

The Power PC architecture[35] is a RISC processor designed by IBM Corporation in cooperation with Motorola. A block diagram of the processor is given in Figure 12.19. Observe that this processor has three integer ALUs, one floating point ALU, a branch prediction unit, a load/store unit and an instruction fetch and despatch unit, which cooperate in executing a program. Observe also that there are separate instruction and data caches, allowing an instruction and data to be fetched in the same clock cycle.

FIGURE 12.19 Block diagram of Power PC 620. (A 32 KB instruction cache feeds the instruction fetch and despatch unit over a 128-bit path. Instructions are despatched over 64-bit paths to three integer ALUs with integer registers, a floating point ALU with floating point registers, a branch prediction unit and a load/store unit. The load/store unit is connected to a 32 KB data cache by a 64-bit path, and a 128-bit L2/bus interface connects the caches to the rest of the system.)

Power PC 620 allows despatch of up to four instructions simultaneously, one to the floating point unit and three to the three integer units. Thus it is a superscalar processor with hardware level parallelism of 4. In order to sustain this, the processor has a sophisticated, high performance branch prediction unit which incorporates prediction logic and a branch history table of 2048 entries. It can speculatively execute across four unresolved branch instructions with a success rate close to 90%. If the prediction turns out to be correct, the temporary results in the renamed registers are stored in the register file. If the prediction turns out to be incorrect, the processor can recover, as it has renamed registers in the speculative branch which can be flushed without damaging the contents of the register file. Each execution unit has two or more buffer storage areas which temporarily hold dispatched instructions waiting for results from other instructions. Thus there will be no hold up in the instruction dispatch unit.

12.10.2 Pentium Processor

Pentium is a CISC processor with a 2-instruction superscalar pipeline organization. A simplified block diagram of a Pentium processor is given in Figure 12.20. There is a 256-bit path from the instruction cache to the prefetch buffers. Observe that two operations can be carried out simultaneously by the two integer ALUs.

FIGURE 12.20 Block diagram of Pentium Processor. (An 8 KB instruction cache feeds prefetch buffers over a 256-bit path. Instructions pass through two 32-bit queues, A and B, to an instruction decoder and pairing checker, which issues instructions to the two integer ALUs, U and V. A Branch Target Buffer supplies branch destination addresses. The integer register file and an 8 KB data cache connect to a 64-bit bus interface; a floating point unit is also provided.)

Pentium[25] uses a 5-stage pipeline with the following stages (Figure 12.21):

• Prefetch stage: Pentium instructions are of variable length and are stored in a prefetch buffer.
• Decode 1 stage: In this stage the processor decodes the instruction and finds the opcode and addressing information, checks which instructions can be paired for simultaneous execution, and participates in branch address prediction.
• Decode 2 stage: Addresses for memory references are found in this stage.
• Execute stage: In this stage a data cache fetch or an ALU or FPU operation may be carried out.
• Write back stage: Registers and flags are updated based on the results of execution.

Pairing of instructions

Pentium has 2 ALUs, called U and V, and two instructions can be executed simultaneously. There are, however, some constraints to ensure that no potential conflicts exist. Two successive instructions I1 and I2 can be dispatched in parallel to the U and V units provided the following conditions are satisfied:

1. Both I1 and I2 are simple instructions. (An instruction is called simple if it can be carried out in one clock cycle.)
2. I1 and I2 are not flow dependent or output dependent. In other words, the destination register of I1 is not the source or destination of I2, and vice versa.
3. Neither I1 nor I2 contains both a displacement and an immediate operand.
4. Only I1 may contain an instruction prefix. (An instruction prefix is 0 to 4 bytes long and specifies information such as the address size, the operand size, which segment register the instruction should use, whether it should have exclusive use of memory in a multiprocessor environment, etc.[25])

FIGURE 12.21 Pipeline execution in Pentium. (Pairs of instructions flow together through the PF, D1, D2, EX and WB stages of the U and V pipelines.)
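These pairing rules amount to a simple predicate over two decoded instructions. The sketch below is our own formulation (the DecodedInstr fields and the single src register are simplifying assumptions); it tests the four conditions listed above.

    #include <stdbool.h>

    typedef struct {
        bool simple;            /* executes in one clock cycle       */
        bool has_prefix;        /* carries an instruction prefix     */
        bool has_displacement;
        bool has_immediate;
        int  src, dst;          /* register numbers, -1 if unused    */
    } DecodedInstr;

    /* Returns true if I1 (for the U pipe) and I2 (for the V pipe)
       may issue together under the four pairing conditions.         */
    bool can_pair(const DecodedInstr *i1, const DecodedInstr *i2) {
        if (!i1->simple || !i2->simple) return false;           /* 1 */
        if (i1->dst != -1 &&
            (i1->dst == i2->src || i1->dst == i2->dst))
            return false;                                       /* 2 */
        if (i2->dst != -1 && i2->dst == i1->src) return false;  /* 2 */
        if ((i1->has_displacement && i1->has_immediate) ||
            (i2->has_displacement && i2->has_immediate))
            return false;                                       /* 3 */
        if (i2->has_prefix) return false;                       /* 4 */
        return true;
    }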

Branch prediction

Pentium employs a Branch Target Buffer (BTB) of the kind described in Section 12.7.4. A BTB caches information about recently encountered branch instructions. The BTB in Pentium has 256 entries of 66 bits each (32 bits of branch instruction address, 32 bits of branch destination address and 2 history bits). When an instruction in the instruction stream is fetched, the BTB is checked. If the instruction address is already there, it is a taken branch instruction and the history bits are checked to see if a jump is predicted. If yes, the branch target address is used to fetch the next instruction. If an entry for a branch instruction is not in BTB, then the BTB is updated with a new entry.

Instructions in the prefetch buffer are fed into one of two instruction queues, called A and B. Instructions for execution are retrieved from only one of these queues (say, A). When a branch instruction is predicted as taken, the current instruction queue is frozen and instructions are fetched from the branch target address and placed in queue B. If the prediction is correct, queue B becomes operational; else instructions are taken from queue A.

12.10.3 IA-64 Processor Architecture

IA-64 (or Merced) is an architecture of Intel developed in collaboration with Hewlett-Packard as the next generation architecture for the coming decades. It is called Explicitly Parallel Instruction Computing (EPIC) by Intel and is a big change from the Pentium series of architectures. The Pentium series will continue primarily because it is upward compatible with the 80x86 series architecture, which has a large user base.

The IA-64[35] instruction word is 128 bits long and its format is given in Figure 12.22. Observe that each instruction word has three instructions, each of 40 bits, packed in it. An intelligent compiler has been developed which explicitly detects which instructions in a program can be carried out in parallel. This information is put in an 8-bit template field which identifies the instructions in the current bundle of 3 instructions (in the instruction word), and in the following bundle, that can be carried out in parallel. In each instruction a 13-bit field holds the opcode, a 6-bit field defines 64 predicate registers, and three 7-bit GPR fields each specify one of 128 integer or 128 floating point general purpose registers.

FIGURE 12.22 IA-64 instruction format. (A 128-bit word holds three 40-bit instructions and an 8-bit template field; each instruction has a 13-bit opcode, a 6-bit predicate register field and three 7-bit general purpose register fields.)

The objective of the compiler of IA-64 is to get the maximum possible parallelism and execute as much straight line code as possible in many pipelines working in parallel. We have seen that branch instructions delay pipelines, and we used branch prediction techniques to reduce this delay. IA-64 instead uses an innovation called predication by Intel. Instead of predicting which branch will be taken and carrying out instructions along that branch (and later on finding that the prediction was wrong), IA-64 schedules instructions in both the branches till such a point that results are not committed to memory. The predicate registers are used to speculatively evaluate instructions across branches. We explain the working of this idea with the program segment shown in Figure 12.23.

When a predicate instruction (I3 in the program of Figure 12.23) is encountered, the compiler attaches a predicate register, say P1, to that instruction and to each of the instructions along the two branches.

    I1
    I2
    I3: If x then A else B    (Assign Predicate Register P1 to this branch instruction)
    A: I41 (P1 = 1)           B: I40 (P1 = 0)
       I51 (P1 = 1)              I50 (P1 = 0)
       I61 (P1 = 1)              I60 (P1 = 0)

FIGURE 12.23 Program to illustrate use of Predicate Register.

The P1 field is set to 1 along the true branch and 0 along the false branch (see Figure 12.24). When the predicate is evaluated, the P1 register gets a value of 1 or 0 depending on whether x is true or false. If it is true, all instructions with P1 = 0 are discarded and those with P1 = 1 are committed.

FIGURE 12.24 Predicate Register bits for speculative evaluation.

Only one of the branches is valid, depending on the result of the test of the predicate, so the evaluation along the other branch is of course a waste. The compiler has to decide whether the cycles lost due to an incorrect branch prediction would be more than the cycles wasted in evaluating both branches. Remember also that there are only 64 predicate registers and that every sixth instruction in a program is normally a branch. Thus when the compiler has a global look at the program it will assign only selected branches for speculative evaluation.

Another innovation in IA-64 is speculative loading of data from memory to registers. This is to alleviate the memory-CPU speed mismatch. The idea is to reorder instructions and issue a speculative load ahead of a branch instruction. A speculative check instruction is inserted where the original load instruction was present. If the branch is taken as speculated, the speculative check commits the data; otherwise it discards it.
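Predication, described above, is easiest to see on a small if-then-else. The C fragment below is our own rendering of the idea in Figures 12.23 and 12.24: both arms are computed, and a flag playing the role of predicate register P1 selects which result is committed, so no branch disturbs the pipeline.

    /* Branchy form: only one arm executes, at the cost of a branch. */
    int branchy(int x, int a, int b) {
        if (x) return a + 1;    /* arm A */
        else   return b - 1;    /* arm B */
    }

    /* Predicated form: both arms are computed; the predicate p1
       (set by the test, like predicate register P1) selects which
       result is committed.                                          */
    int predicated(int x, int a, int b) {
        int p1 = (x != 0);          /* predicate evaluation          */
        int resA = a + 1;           /* evaluated with P1 = 1         */
        int resB = b - 1;           /* evaluated with P1 = 0         */
        return p1 ? resA : resB;    /* commit only the valid arm     */
    }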

Overall, the primary aim in the IA-64 architecture is to do aggressive compiler transformation, detect parallelism at compile time, carry out parallel instructions speculatively and use the resources in the processor with the greatest possible efficiency.

SUMMARY

1. Over the past decades the architecture of processors has been primarily determined by application requirements, programming convenience and prevailing technology.
2. The five important principles used by processor architects are: upward compatibility; exploiting parallelism; locality of reference of instructions and data in programs; Amdahl's law, which states that to speed up a system all its sub-systems must be uniformly speeded up; and the fact that 20% of the code consumes 80% of the execution time, with the consequent need to pay attention to optimizing this 20%.
3. The modern methodology of design of processor architecture uses empirical data on the frequency of use of machine instructions in a representative set of programs (called benchmarks) to optimize the design.
4. The idea of Reduced Instruction Set Computers emerged in the 80s, based on the results of executing benchmark programs in several applications.
5. RISC processors are characterised by simple uniform length instructions, simple addressing modes, reducing access to main memory to only load and store operations, use of a large set of CPU registers to reduce access to main memory/cache, a small number of instructions, and integration of the entire processor in one VLSI chip.
6. Execution of an instruction in processors can be divided into 5 steps: fetching the instruction from the instruction cache, decoding the instruction, accessing operands from registers, executing the instruction (ALU), and storing the result in the data cache and/or in registers. These steps can be pipelined. Details are explained in Section 12.5.
7. Pipeline processing exploits the idea of temporal parallel processing. It is possible if a task taking T units of time can be divided into n sub-tasks each taking (T/n) units of time, assigning each sub-task to a processing unit connected serially.
8. Ideally an n-stage pipeline will speed up execution n-fold. In practice, however, this ideal speed-up is not possible due to several reasons.
9. Pipeline delays may be caused by resource constraints (e.g., a floating point operation requiring several clock cycles), by data dependency (e.g., an instruction requiring the result of a previous instruction which may not yet be stored), or by branch instructions (as the fetched instruction would not be useful).

10. Such delays may be alleviated by hardware improvements such as providing separate instruction and data caches, increasing the speed of floating point arithmetic, forwarding results to the next instruction before storing them when required, and a branch prediction buffer. Delays may also be reduced by using an 'intelligent' compiler which rearranges the source code without changing its meaning.
11. Superpipelining is a technique in which the operations in each clock cycle are divided into two steps, each taking half a cycle. Thus two instructions are completed in each cycle.
12. A superscalar processor carries out two sequential instructions simultaneously in one clock cycle. In order to do this, hardware resources have to be increased and machine code has to be dynamically rearranged.
13. A Very Long Instruction Word (VLIW) processor packs several instructions in one word. VLIW depends on intelligent compilers which generate machine code that allows several instructions to be packed in one word and carried out simultaneously.
14. Commercial processors have been built using the advanced architectural ideas presented in this chapter. Power PC 620 of IBM is a superscalar pipelined RISC processor. It has several ALUs and address computation units. Besides this, it has a high performance branch prediction unit.
15. IA-64 of Intel and HP is a 128-bit word VLIW processor packing 3 instructions per word. It uses aggressive compiler optimization to select appropriate instructions to pack in a word so that they can be simultaneously executed. It uses many innovations such as speculative loading, predicate registers and a novel treatment of branching instructions.

EXERCISES

1. What is the MIPS rating of a computer? How is it computed? Is it a meaningful measure of CPU performance? If not, what are its drawbacks?
2. What is an instruction mix? How is it determined? What is the difference between an instruction mix and a statement mix?
3. What do you understand by a synthetic benchmark? Name some commonly used synthetic benchmarks.
4. What are SPEC benchmarks? How are these better than synthetic benchmarks?
5. What are TPC benchmarks? Why are they necessary? In what way are they different from SPEC marks? Does a high SPEC mark imply a high TPC rating?

6. (i) What is instruction pipelining? (ii) What is the maximum number of stages an instruction pipeline can have? What are these stages? (iii) Enumerate the situations in which an instruction pipeline can stall.
7. A job consists of 5 tasks. The times taken by the tasks are respectively 25 ns, 40 ns, 35 ns and 45 ns. 80 jobs are to be processed. (i) Find the speed-up obtained by using pipeline processing. (ii) What is the efficiency of the pipeline? (iii) What is the efficiency of the pipeline if 8 jobs are to be processed? (iv) What is the maximum possible efficiency of this system?
8. Assume that a processor uses a 4-stage pipeline. The four stages take respectively 5 ns, 4 ns, 8 ns and 4 ns. What is the minimum number of pairs of operands needed to obtain a pipeline efficiency of 40%?
9. Describe how a floating point add/subtract instruction can be pipelined with a 4-stage pipeline. Assume normalized floating point representation of numbers. Design a pipelined floating point adder by picking the right clock to drive the circuits and the buffer register lengths needed between pipeline stages.
10. Assume that the percentage of unconditional branches is 30%, conditional branches 18% and immediate instructions 8% in programs executed in a CPU with the 5 pipelined stages used in this chapter. Compute the average clock cycles required per instruction.
11. Assuming an 8 ns clock and that 80% of the branches are taken, find the speed-up due to pipelining for the instruction mix of Exercise 10.
12. Repeat Exercise 11 assuming a 4-stage pipeline. What would you suggest as the 4 pipeline stages? Justify your answer.
13. In our model of a CPU we assumed the availability of separate instruction and data caches. If there is a single combined data and instruction cache, discuss how it will affect pipeline execution. Compute the loss of speed-up due to this if a fraction p of the total number of instructions require reference to data stored in the cache.
14. If a pipelined floating point unit is available in a computer and 25% of the arithmetic operations are floating point instructions, calculate the speed-up due to pipeline execution with and without the pipelined floating point unit.
15. Assume 5-stage pipelining with the percentage mix of instructions given in Exercise 10. Compute the average clock cycles required per instruction.
16. We assumed a five-stage pipeline processing for a CPU. Draw the pipeline execution diagram during the execution of the following instructions:

    R7 ← R1 × R2
    R3 ← R2 + R3
    R4 ← R4 + 1
    R4 ← R6 − R3

Find out the delay in pipeline execution due to data dependency of the above instructions.
17. Describe how the delay can be reduced in the execution of the instructions of Exercise 16 by (i) hardware assistance, (ii) software assistance.
18. Repeat Exercise 16 if the MUL instruction takes 3 clock cycles whereas the other instructions take one cycle each.
19. Given the following program:

    LOOP  LD   R3, Temp1         R3 ← Temp1
          SUB  R5, R3, R4        R5 ← R3 − R4
          JMI  XX
          ST   R3, Temp          Temp ← R3
          LD   R4, Temp          R4 ← Temp
          DEC  R1                R1 ← R1 − 1
          JEQ  R1, R0, NEXT      If R1 = R0 goto NEXT
          JMP  LOOP
    XX    ST   R4, Large         Large ← R4
    NEXT

(i) With R1 = 2, R0 = 0, R3 = 3, R4 = 4, draw the pipeline execution diagram. (ii) With R0 = 0, R4 = 6, Temp = 6, Temp1 = 4, draw the pipeline execution diagram.
20. If a program has 18% conditional branch instructions and 4% unconditional branch instructions, and if 7% of the conditional branches are taken branches, calculate the loss in speed-up of a processor with 4 pipeline stages.
21. Explain how branch instructions delay pipeline execution.
22. What is the difference between a branch prediction buffer and a branch target buffer used to reduce the delay due to control dependency?
23. Using the data of Exercise 20, compute the reduction in pipeline delay when a branch prediction buffer is used.
24. Using the data of Exercise 20, compute the reduction in pipeline delay when a branch target buffer is used.
25. How can software methods be used to reduce the delay due to branches? What conditions should be satisfied for the software method to succeed? In particular, what conditions must be satisfied by the statement appearing before a branch instruction so that it can be used in the delay slot?
26. For the program of Exercise 19, is it possible to reduce the pipelining delay due to branch instructions by rearranging the code? If yes, show how it is done.
27. What do you understand by the term precise exception? What is its significance?

When an interrupt occurs during the execution stage of an instruction in a pipelined computer, explain how the system should handle it.
28. What is the difference between superscalar processing and superpipelining? Can one combine the two? If yes, explain how.
29. What extra resources are needed to support superscalar processing?
30. (i) For the following sequence of instructions, develop superscalar pipeline execution diagrams similar to that given in Figure 12.18. Assume there is one floating point and 2 integer execution units.

    Instruction        Number of cycles needed    Arithmetic unit needed
    R2 ← R2 × R6       2                          Floating point
    R3 ← R2 + R1       1                          Integer
    R1 ← R6 + 8        1                          Integer
    R8 ← R2 − R9       1                          Integer
    R5 ← R4 / R8       2                          Floating point
    R6 ← R2 + 4        1                          Integer
    R2 ← R1 + 2        1                          Integer
    R10 ← R9 × R8      2                          Floating point

(ii) If there are 2 integer and 2 floating point execution units, repeat (i). (iii) Is it possible to rename registers to reduce the number of execution cycles? Reschedule the instructions (if possible) to reduce the number of cycles needed to execute this set of instructions.
31. (i) Is Power PC 620 a RISC or a CISC processor? (ii) Is it superscalar or superpipelined? (iii) Is it a VLIW processor? (iv) How many integer and floating point execution units does Power PC 620 have? (v) Show the pipeline execution sequence of the instructions of Exercise 30 on Power PC 620.
32. (i) Define a trace. (ii) How is it used in a VLIW processor? (iii) List the advantages and disadvantages of a VLIW processor.
33. (i) Is the Pentium Processor a RISC or a CISC processor? (ii) Is it superscalar or superpipelined? (iii) What is the degree of superscalar processing in Pentium?
34. (i) Is IA-64 a VLIW processor? (ii) Is it a superscalar processor? (iii) How many instructions can it carry out in parallel? (iv) Explain how it uses predicate registers in processing.

PARALLEL COMPUTERS

13

LEARNING OBJECTIVES

In this chapter we will learn:

• Classification of parallel computers as Single Instruction Stream Multiple Data Stream (SIMD), Multiple Instruction Stream Single Data Stream (MISD) and Multiple Instruction Stream Multiple Data Stream (MIMD) computers.
• How vector computers and array processors are organized.
• How to interconnect processors to build parallel computers.
• Different methods of organizing shared memory parallel computers, namely, a shared bus, and connecting a single main memory to processors with an interconnection network.
• The protocols used to ensure cache coherence of the multiple caches used in shared memory machines.
• How to organize distributed shared memory parallel computers (also known as Non Uniform Memory Access or NUMA computers).
• How to organize message passing parallel computers.
• How a cluster of workstations can be used as a message passing parallel computer.

13.1 INTRODUCTION

In the last chapter we saw how parallelism available at the instruction level can be used extensively in designing processors. This type of parallelism is known as fine grain parallelism, as the smallest grain is an instruction in a program. In this chapter we will examine how processors may be interconnected to build parallel computers which use a coarser grain, for example, a thread or a process executing in parallel.

A parallel computer is defined as an interconnected set of Processing Elements (PEs) which cooperate by communicating with one another to solve large problems fast. Figure 13.1 shows a generalized structure of a parallel computer. The heart of the parallel computer is a set of PEs interconnected by a communication network. We see from this definition that the keywords which define the structure of a parallel computer are PEs, communication and cooperation among the interconnected PEs, the memory available to the PEs to store programs and data, and how memories are connected to the PEs. This general structure can have many variations based on the type of PEs, the type of communication network used, and the technique of allocating tasks to PEs and how they communicate and cooperate. The variations in each of these lead to a rich variety of parallel computers.

FIGURE 13.1 A generalized structure of a parallel computer. (A collection of processing elements PE, each with a local memory M and a communication interface CI, is connected to a communication network and to an I/O system. A PE with a private memory is called a Computing Element (CE).)

13.2 CLASSIFICATION OF PARALLEL COMPUTERS

The vast variety of parallel computer architectures may be classified based on the following criteria:

1. How do instructions and data flow in the system? This idea for classification was proposed by Flynn[38] and is known as Flynn's classification. It is considered important as it is one of the earliest attempts at classification and has been widely used in the literature to describe various parallel computer architectures.
2. What is the coupling between PEs? Coupling refers to the way in which PEs cooperate with one another.
3. How do PEs access memory? Accessing relates to whether data and instructions are accessed from a PE's own private memory, from a memory shared by all PEs, or partly from one's own memory and partly from memory belonging to another PE.
4. What is the quantum of work done by a PE before it communicates with another PE? This is commonly known as the grain size of computation.

In this section we will examine each of these classifications. This will allow us to describe a given parallel computer using adjectives based on the classification.

13.2.1 Flynn's Classification

Flynn classified parallel computers into four categories based on how instructions process data[38]. A computer with a single processor is called a Single Instruction stream Single Data stream (SISD) computer. In such a computer, a single stream of instructions and a single stream of data are accessed by the PE from the main memory, processed, and the results stored back in the main memory.

The next class of computers, which have multiple processors, is known as Single Instruction stream Multiple Data stream (SIMD) computers. A block diagram of such a computer is shown in Figure 13.2. All processors in this structure are given identical instructions and they execute them in a lock-step fashion (simultaneously) using data in their respective memories. The unique characteristic of SIMD computers is that all PEs work synchronously, controlled by a single stream of instructions. An instruction may be broadcast to all PEs, and they can process data items fed to them using this instruction. SIMD computers are used to solve many problems in science which require identical operations to be applied to different data sets synchronously. An example is adding a set of matrices simultaneously, e.g., computing sums such as Σi Σk (aik + bik). Observe that in this structure there is no explicit communication among processors; however, data paths between nearest neighbours are used in some structures. SIMD computers have also been built as a grid with communication between nearest neighbours (Figure 13.3). If, instead of a single instruction, the PEs use identical programs to process different data streams, such a parallel computer is called a Single Program Multiple Data (SPMD) computer.

FIGURE 13.2 Structure of a SIMD computer. (An instruction memory IM broadcasts instructions to processing elements PE1 to PEn; each PEi reads its data from its own data memory DMi and produces a result Ri. Nearest neighbour PEs are connected in some systems.)

FIGURE 13.3 Regular structure of a SIMD computer with data flowing from neighbours. (Computing elements CE1 to CE8, each a processing element with a private data memory, are arranged in a grid producing results R1 and R2; a common instruction memory IM feeds all of them along a single instruction path.)

Such computers are known as array processors, which we will discuss later in this chapter.

The third class of computers according to Flynn's classification is known as Multiple Instruction stream Single Data stream (MISD) computers. This structure is shown in Figure 13.4. Observe that in this structure, different PEs run different programs on the same data. In fact, the pipeline processing of data explained in Section 12.5 is a special case of this mode of computing. In the example we discussed in Section 12.5, the answer books passed from one teacher to the next correspond to the data stored in DM, and the instructions to grade different questions given to the set of teachers are analogous to the contents of IM1 to IMn in the structure of Figure 13.4.

5. M : Data and Instruction Memory M1 PE1 R1 PE : Processing Element R : Result M2 PE2 R2 CN CN : Communication Network Mn PEn Rn FIGURE 13.1. We will devote most of our discussions in this chapter to MIMD computers as these are general purpose parallel computers.5 Structure of a MIMD computer. etc. This type of processor may be generalized using a 2-dimensional arrangement of PEs.434 Computer Organization and Architecture IM1 PE1 DM : Data Memory IM : Instruction Memory PE : Processing Element R : Result R1 IM2 PE2 R2 IM3 PE3 R3 IMn PEn Rn DM FIGURE 13. and DM contents will be input to only PE1. R1 is fed to PE2. namely.. The last and the most general model according to Flynn's classification is Multiple Instructions stream Multiple Data stream (MIMD) computer. .4 Structure of a MISD computer. processing. The structure of such a computer is shown in Figure 13. R2 to PE3. Such a structure is known as a systolic processor. the data processed by PE. Observe that this is similar to the model of parallel computer we gave in Figure 13.

13.2.2 Coupling between Processing Elements

The autonomy enjoyed by the PEs while cooperating with one another during problem-solving determines the degree of coupling between them. For instance, a parallel computer consisting of workstations connected together by a local area network such as an Ethernet is loosely coupled. In this case each of the workstations works independently; if they want to cooperate they exchange messages. Thus logically they are autonomous, and physically they do not share any memory; communication is via I/O channels. A tightly coupled parallel computer, on the other hand, shares a common main memory. Thus communication among PEs is very fast, and cooperation may be even at the level of instructions carried out by each PE, as they share a common memory. In Figure 13.6 we summarize this discussion.

    Coupling          Physical Connection                  Logical Cooperation                Type of Parallel Computer
    Loosely coupled   Processing Elements with private     Compute independently and          Message Passing
                      memory communicate via a network     cooperate by exchanging messages   Multicomputer
    Tightly coupled   Processing Elements share a common   Cooperate by sharing results       Shared Memory
                      memory and communicate via the       stored in a common memory          Multiprocessor
                      shared memory

FIGURE 13.6 Classification as loosely or tightly coupled systems.

13.2.3 Classification Based on Mode of Accessing Memory

In a shared memory computer all the processors share a common global address space. In other words, programs are written for this parallel computer assuming that all processors address a common shared memory. Physically, the main memory may be a single memory bank shared by all processors, or each processor may have its own local memory and may or may not share a common memory. These two models may be called Common Shared Memory (SM) computer and Distributed Shared Memory (DSM) computer respectively. The time to access a word in memory is constant for all processors in the first case; such a parallel computer is said to have Uniform Memory Access (UMA). In a distributed shared memory computer, the time taken by a processor to access a word in its local memory is smaller than the time taken to access a word stored in the memory of another computer or in a common shared memory. Thus such a system (a DSM system) is said to have Non Uniform Memory Access (NUMA).

If a remote memory is accessed by a PE using a communication network, it may be 10 to 1000 times slower than accessing its own local memory. In Figure 13.7 we show the physical organization of these two types of parallel computers.

FIGURE 13.7 UMA and NUMA parallel computer structures. ((a) In a UMA parallel computer the PEs have no local memory, and all of them share a common memory through a communication network. (b) In a NUMA parallel computer each PE has its own memory Mi and a network interface NIi to the communication network; there may or may not be a shared physical memory.)

13.2.4 Classification Based on Grain Size

We saw that the quantum of work done by a PE before it communicates with another processor is called the grain size. The grain size determines the frequency of communication between PEs during the solution of a problem. Grain sizes are classified as very fine grain, fine grain, medium grain and coarse grain. The smallest grain size is a single instruction. We saw in the last chapter how instruction level parallelism is used to make single processors work fast; it is exploited by having multiple functional units and overlapping steps during instruction execution. We will call it very fine grain parallelism. Compilers are very useful in exposing very fine grain parallelism, and processors such as VLIW use it effectively.

In this chapter we are concerned with fine, medium and coarse grain processing as applied to parallel computers which use microprocessors as PEs. In this case fine grain parallelism will correspond to a thread. A thread may typically be the set of instructions belonging to one iteration of a loop, typically 100 machine instructions. This will correspond to the number of instructions carried out by a PE before it sends the result to a cooperating PE. Medium grain parallelism, on the other hand, will correspond to a procedure (or subroutine); it will typically be 1000 machine instructions. Coarse grain will correspond to a complete program. Tightly coupled parallel computers can exploit fine grain parallelism effectively, as the processors cooperate via a shared memory and the time to access the shared memory is small (a few machine cycles).

Loosely coupled parallel computers, on the other hand, can exploit only medium or coarse grain parallelism, as the time to communicate results using a communication network will be a few hundred clock cycles, and thus the quantum of work done by a processor must be much larger than this to obtain any speed-up. The main point is that the compute time must be much larger than the communicating time.

To explain the points made above we will consider an n-processor machine solving a problem. Assume it takes k cycles to solve the problem on a single processor. If n processors are used, ideally each processor (if it computes in a fully overlapped mode) should take k/n cycles to solve the problem. Let k/n = p cycles. Let each processor compute continuously for pi cycles and then spend qi cycles communicating with other processors for cooperatively solving the problem. During these qi cycles the processor is idle. Thus the time taken to compute and communicate once = pi + qi. Each communication event in a loosely coupled system has an overhead with a fixed part (due to the need to take the assistance of the operating system to communicate) and a variable part which depends on the speed of the communication channel. Thus qi = T + si, where T is the fixed overhead. The number of communication events will be inversely proportional to the grain size. If there are m communication events, the time taken by each processor to compute is

    Σ(i=1 to m) (pi + qi) = Σ(i=1 to m) pi + Σ(i=1 to m) qi = p + Σ(i=1 to m) qi

Thus the total time to compute is increased by Σ(i=1 to m) qi cycles. The total communication time is

    Σ(i=1 to m) qi = mT + Σ(i=1 to m) si

Thus the total time taken to compute in parallel = p + mT + Σ(i=1 to m) si.

Therefore,

    Speed-up = k / (p + mT + Σ(i=1 to m) si) = (k/p) / (1 + (m/p)(T + q)) = n / (1 + (m/p)(T + q))

where we have assumed for simplicity that si = q.

To fix our ideas we will substitute some values in the above equation. Let n = 100 (number of processors), and let m, the number of communication events, be 100. Let the overhead T for each communication event be 100 cycles, and let the communication time q be 20 cycles. Let p, the total compute time of each processor, be 50000 cycles. Then

    Speed-up = 100 / (1 + 100(100 + 20)/50000) = 100 / (1 + 12000/50000) = 100 / (1 + 0.24) ≈ 80

Thus there is a 20% loss in speed-up when the grain size is 500 cycles. If the processors are pipelined, one instruction is carried out in each cycle. Thus if the parallel computer uses such processors and each processor communicates once every 500 instructions, the overall efficiency loss is 20%.

In general, if the loss in efficiency is to be less than 10%, then (m/p)(T + q) < 0.1. This implies that the grain size p/m > 10(T + q). If T = 100 and q = 20, then the grain size > 1200 cycles. If one instruction is carried out every cycle, the grain size is 1200 instructions. Thus each processor of a loosely coupled parallel computer may communicate only once every 1200 instructions if the loss of efficiency is to be kept low. On the other hand, in a tightly coupled system T is 0, as the memory is shared and no Operating System call is necessary to write in the memory. Assuming q of the order of 2 cycles, the grain size for a 10% loss of speed-up is 20 cycles. In other words, after every 20 instructions a communication event can take place without excessively degrading performance.

The above calculations are very rough and intended just to illustrate the idea of grain size. In fact we have been very optimistic and made many simplifying assumptions. These points will be reiterated later with a more realistic model.
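The speed-up model above is easy to experiment with. The short program below (a sketch with our own function name) evaluates n/(1 + (m/p)(T + q)) for the loosely coupled numbers used in the text, and for a tightly coupled case with T = 0, q = 2 and a grain size of 20 cycles.

    #include <stdio.h>

    /* Speed-up model from the text: n processors, p compute cycles
       per processor, m communication events, each costing T (fixed
       OS overhead) + q (channel time) cycles.                       */
    double speed_up(double n, double p, double m, double T, double q) {
        return n / (1.0 + m * (T + q) / p);
    }

    int main(void) {
        /* Loosely coupled, the numbers used in the text: about 80. */
        printf("loosely coupled: %.1f\n",
               speed_up(100, 50000, 100, 100, 20));
        /* Tightly coupled: T = 0, q = 2, grain size p/m = 20
           cycles keeps the loss of speed-up near 10% (about 91).   */
        printf("tightly coupled: %.1f\n",
               speed_up(100, 2000, 100, 0, 2));
        return 0;
    }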

We summarize the discussions so far in Figure 13.8, which gives a taxonomy of parallel computers.

FIGURE 13.8 Taxonomy of parallel computers.

13.3 VECTOR COMPUTERS

One of the earliest ideas used to design high performance computers was the use of temporal parallelism (i.e., pipeline processing). Vector computers use temporal processing extensively. The most important unit of a vector computer is the pipelined arithmetic unit. Consider the addition of two floating point numbers x and y. A floating point number consists of two parts: a mantissa and an exponent. Thus x and y may be represented by the tuples (mant x, exp x) and (mant y, exp y) respectively. Let z = x + y. The sum z may be represented by (mant z, exp z). The task of adding x and y can be broken up into the following four steps:

Step 1: Compute (exp x − exp y) = m.
Step 2: If m > 0, shift mant y right by m positions and fill its leading bit positions with zeros; set exp z = exp x. If m < 0, shift mant x right by |m| positions and fill the leading bits of mant x with zeros; set exp z = exp y. If m = 0, do nothing and set exp z = exp x.
Step 3: Add the mantissas. Let mant z = mant x + mant y.
Step 4: If mant z > 1, shift mant z right by 1 bit and add 1 to exp z. If one or more leading significant bits of mant z are 0, shift mant z left until the leading bit of mant z is not zero; if the number of shifts is p, subtract p from exp z.
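The four steps map directly onto the four stages of the pipelined adder discussed next. The sketch below is a toy model of our own (mantissas held as integers with an assumed number of fraction bits, positive normalized operands only); in hardware each step is a separate stage working on a different pair of operands every clock.

    #define FRAC_BITS 24  /* assumed width of the normalized mantissa */

    /* value = mant * 2^(exp - FRAC_BITS); mant normalized to [1, 2). */
    typedef struct { long mant; int exp; } Fp;

    Fp fp_add(Fp x, Fp y) {
        Fp z;
        /* Step 1: compare exponents.                                 */
        int m = x.exp - y.exp;
        /* Step 2: align mantissas; the operand with the smaller
           exponent is shifted right, leading bits filled with zeros. */
        if (m > 0)      { y.mant >>= m;  z.exp = x.exp; }
        else if (m < 0) { x.mant >>= -m; z.exp = y.exp; }
        else            { z.exp = x.exp; }
        /* Step 3: add mantissas.                                     */
        z.mant = x.mant + y.mant;
        /* Step 4: normalize and adjust the exponent.                 */
        if (z.mant >> (FRAC_BITS + 1)) { z.mant >>= 1; z.exp++; }
        while (z.mant && !(z.mant >> FRAC_BITS)) { z.mant <<= 1; z.exp--; }
        return z;
    }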

A block diagram of a pipelined floating point adder is shown in Figure 13.9.

FIGURE 13.9 A four-stage pipelined floating point adder. (Operands x and y flow through the stages: compare exponents, align mantissas, add mantissas, and normalize and adjust exponent, producing z.)

One pair of operands is shifted into the pipeline from the input registers every clock period. Such a pipelined adder can be used very effectively to add two vectors. A vector may be defined as an ordered sequence of numbers. For example, x = (x1, x2, x3, …, xn) is a vector of length n. To add vectors x and y, each of length n, we feed them to a pipelined arithmetic unit as shown in Figure 13.10.

FIGURE 13.10 A four-stage pipelined adder. (The components (xn, …, x2, x1) and (yn, …, y2, y1) stream into the adder and the sums (zn, …, z2, z1) stream out.)

The first component z1 will take time 4T, where T is the period required for each stage, but subsequent components will come out every T seconds. If two vectors, each 64 components long, are fed to the pipelined adder, the time taken to add them will be 67T. Thus the number of clocks per sum is (67/64) ≈ 1.05, which is almost one sum every clock period. The longer the vector, the more efficient the pipeline addition would be. The number of pipeline stages may also be increased, which will reduce T. Observe that pipelined arithmetic operations do not have problems of data dependency, as each component of a vector is independent of the others. There is no control dependency either, as the entire vector is considered as a unit and operations on vectors are implemented without any loop instruction. Vector subtract is no different (it is the complement of addition). Pipelined multiply and divide units are more complicated to build but use the same general principle.

A block diagram of the vector arithmetic unit of a typical vector computer is shown in Figure 13.11[39]. Observe that an important part is a set of vector registers which feed the vector pipelined adders and multipliers. Typically vector registers store 64 floating point numbers which are each 64 bits long. This was the vector register size used by Cray computers, which were the most famous supercomputers of the 80s. Recent vector machines built by NEC of Japan use vector registers which store 128 double precision (64-bit) numbers. We discussed vector add earlier. Observe in Figure 13.11 that the output of the pipelined adder unit may be fed to another pipelined multiply unit. As the components of (A + B), namely (a1 + b1), (a2 + b2), …, (an + bn), are generated, they are multiplied by the components of C, namely c1, c2, …, cn, to produce (a1 + b1)*c1, (a2 + b2)*c2, …, (an + bn)*cn respectively, which are stored in the vector register D. This method of feeding the output of one pipelined arithmetic unit to another is called vector chaining.
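The chained computation of Figure 13.11 corresponds to the vector expression D = (A + B) * C. In scalar code this is the loop below; on a vector computer the whole loop becomes a few vector instructions, with the adder's output streaming directly into the multiplier.

    #define N 64   /* one vector register's worth of elements */

    /* D = (A + B) * C, the operation performed by the chained
       adder and multiplier pipelines of Figure 13.11.          */
    void chained(const double *a, const double *b,
                 const double *c, double *d) {
        for (int i = 0; i < N; i++)
            d[i] = (a[i] + b[i]) * c[i];
    }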

With chaining, the speed of computation is doubled, as an add and a multiply are performed in one clock cycle. Besides vector chaining, many independent pipelined arithmetic units are provided in a vector computer[39]. In Figure 13.11, for example, there is a second pipelined adder unit using vector registers P and Q. These registers and an associated arithmetic unit can be used independently and simultaneously with other pipelined units, provided the program being executed can use this facility.

FIGURE 13.11 Use of pipelined arithmetic in a vector computer. (The main memory feeds vector registers A, B, C, P and Q; an adder computes A + B, a multiplier computes (A + B)*C into vector register D, and a second adder computes P + Q.)

13.4 ARRAY PROCESSORS

A vector computer adds two vectors (a1, a2, a3 ... an) and (b1, b2, b3 ... bn) by streaming them through a pipelined arithmetic unit. Another method of adding vectors would be to have an array of n PEs, where each PE stores a pair of operands (see Figure 13.12). An add instruction is broadcast to all the PEs and they add the operands stored in them simultaneously. If each PE takes T units of time to add, the vector will also be added in T units, as all the PEs work simultaneously. In this example a single instruction, namely ADD, is performed simultaneously on a set of data. This is an example of a Single Instruction Multiple Data (SIMD) architecture[40]. Such an organization of PEs is known as an Array Processor. An array processor uses data parallelism.

FIGURE 13.12 An array processor to add a vector.

An array processor will have, in general, the capability to perform all the arithmetic operations. It is attached to a host computer. The data to be processed is stored in the local memory of each of the PEs in the array. A sequence of instructions to process the data is broadcast to all the PEs by the host. All the PEs independently and simultaneously execute the instructions and process the data stored in each of them. If, instead of broadcasting the instructions from the host, these instructions are stored in the private instruction memory of each of the PEs (making them full fledged Computing Elements), the host has to merely issue a start command and all the CEs would start computing simultaneously and asynchronously, leading to faster computation.

13.5 SHARED MEMORY PARALLEL COMPUTERS

One of the most common parallel architectures using a moderate number of processors (4 to 32) is a shared memory multiprocessor[41]. The processors are connected to the main memory either using a shared bus or using an interconnection network. In both these cases the average access time to the main memory from any processor is the same. Thus this architecture is also known as a Symmetric Multiprocessor, abbreviated SMP. Each processor has a private cache. This architecture is popular as it is easy to program, due to the availability of a globally addressed main memory where the program and all data are stored. A shared bus parallel machine is also inexpensive and easy to expand by adding more processors. Thus many desktop systems use this architecture now, with a single VLSI chip incorporating several CPUs. In this section we will first discuss this architecture.

13.5.1 Synchronization of Processes in Shared Memory Computers

In a shared memory parallel computer, a common program and data are stored in the main memory shared by all PEs. This architecture provides a global address space for writing parallel programs. Each PE can, however, be assigned a different part of the program stored in memory to execute, with data also stored in specified locations. Each PE computes independently. When all PEs finish their assigned tasks they have to rejoin the main program. The main program will execute after all processes created by it finish. This model of parallel processing is called Single Program Multiple Data (SPMD) processing. The main program creates separate processes for each PE and allocates them along with information on the locations where data are stored for each process. Statements are added in a programming language to enable creation of processes and to wait for them to complete. Two statements used for this purpose are:

1. fork: to create a process
2. join: to wait, when the invoking process needs the results of the invoked process(es) to continue
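The fork-join idiom maps directly onto modern threading libraries. Below is a minimal sketch in C using POSIX threads, where pthread_create plays the role of fork and pthread_join the role of join; the process names mirror Procedure 13.1 which follows.

#include <stdio.h>
#include <pthread.h>

/* Process Y: runs concurrently with the invoking Process X */
void *process_y(void *arg) {
    printf("Process Y doing its work\n");
    return NULL;
}

int main(void) {                                  /* Process X */
    pthread_t y;
    pthread_create(&y, NULL, process_y, NULL);    /* fork Y             */
    printf("Process X continues its own work\n");
    pthread_join(y, NULL);                        /* join Y: wait for Y */
    printf("X proceeds only after Y terminates\n");
    return 0;
}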

For example, consider the following statements of Procedure 13.1.

Procedure 13.1
Process X
#
fork Y;
#
join Y;
#

Process Y
#
#
#
end Y.

Process X, when it encounters fork Y, invokes another Process Y. The invoked Process Y starts executing concurrently in another processor. After invoking Process Y, Process X continues doing its work. When Process X reaches the join Y statement, it waits till Process Y terminates. If Y terminates earlier, then X does not have to wait and will continue after executing join Y.

When multiple processes work concurrently and update data stored in a common memory shared by them, special care should be taken to ensure that a shared variable value is not initialised or updated independently and simultaneously by these processes. The following example illustrates this problem. Assume that Sum ← Sum + f(A) + f(B) is to be computed and the program of Procedure 13.2 is written.

Procedure 13.2
Process A
#
fork B;
#
Sum ← Sum + f(A);
#
join B;
#
end A.

Process B
#
Sum ← Sum + f(B);
#
end B.

Suppose Process A loads Sum in its register to add f(A) to it. Before the result is stored back in the main memory by Process A, if Process B also loads Sum in its local register to add f(B) to it, Process A will have Sum + f(A) and Process B will have Sum + f(B) in their respective local registers. Now both Processes A and B will store the result back in Sum. Depending on which process stores Sum first, the value in the main memory will be either Sum + f(A) or Sum + f(B), whereas what was intended was to store Sum + f(A) + f(B) as the result in the main memory. (If Process A stores Sum + f(A) first in Sum and then Process B takes this result and adds f(B) to it, the answer will be correct.) Thus we have to ensure that only one process updates a shared variable at a time. This is done by using a statement called lock <variable name>.
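This lost-update race is easy to reproduce on a real multiprocessor. A minimal sketch in C with POSIX threads follows; because the unprotected update below loses increments, the final sum is typically less than the expected 2,000,000.

#include <stdio.h>
#include <pthread.h>

long sum = 0;                          /* shared variable in common memory */

void *adder(void *arg) {
    for (int i = 0; i < 1000000; i++)
        sum = sum + 1;                 /* load, add, store: not atomic */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, adder, NULL);    /* Process A */
    pthread_create(&b, NULL, adder, NULL);    /* Process B */
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("sum = %ld (expected 2000000)\n", sum);
    return 0;
}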

If a process locks a variable name, no other process can access it till it is unlocked by the process which locked it. In the above case, whichever process reaches the lock Sum statement first will lock the variable Sum, disallowing any other process from accessing it. That process will lock Sum and then update it. It will unlock Sum after updating it. Any other process wanting to update Sum will now have access to it. This method ensures that only one process is able to update a variable at a time. Thus updating a shared variable is serialised. This ensures that the correct value is stored in the shared variable. The updating program can thus be rewritten as given in Procedure 13.3.

Procedure 13.3
Process A
#
fork B;
#
lock Sum;
Sum ← Sum + f(A);
unlock Sum;
#
join B;
#
end A.

Process B
#
lock Sum;
Sum ← Sum + f(B);
unlock Sum;
#
end B.

In order to correctly implement the lock and unlock operations used by the software, we require hardware assistance. Let us first assume there is no hardware assistance. The two assembly language programs given as Procedure 13.4 attempt to implement lock and unlock respectively in a hypothetical computer. The operations carried out by each instruction are evident from the remarks. A locking program locks by setting the variable L = 1 and unlocks by setting L = 0.

Procedure 13.4
lock:    LOAD R1, L     /* C(R1) ← C(L) */
         CMP R1, #0     /* Compare R1 with 0 */
         BNZ lock       /* If R1 ≠ 0 try again */
         STORE L, #1    /* L ← 1 */
         RETURN         /* RETURN */

unlock:  STORE L, #0    /* Store 0 in L */
         RETURN

In the above two programs, L is the lock variable and #0 and #1 are immediate operands. A process trying to obtain the lock should check if L = 0 (i.e., if it is unlocked).
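In a threads library, the lock/unlock pair of Procedure 13.3 is provided ready-made. A minimal sketch in C with a POSIX mutex follows; the functions f_A and f_B are illustrative stand-ins for f(A) and f(B).

#include <stdio.h>
#include <pthread.h>

long sum = 0;
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

long f_A(void) { return 1; }           /* placeholders for f(A) and f(B) */
long f_B(void) { return 2; }

void *process_b(void *arg) {
    pthread_mutex_lock(&sum_lock);     /* lock Sum          */
    sum = sum + f_B();                 /* Sum <- Sum + f(B) */
    pthread_mutex_unlock(&sum_lock);   /* unlock Sum        */
    return NULL;
}

int main(void) {                       /* Process A */
    pthread_t b;
    pthread_create(&b, NULL, process_b, NULL);   /* fork B */
    pthread_mutex_lock(&sum_lock);
    sum = sum + f_A();
    pthread_mutex_unlock(&sum_lock);
    pthread_join(b, NULL);                       /* join B */
    printf("Sum = %ld\n", sum);                  /* always 3 */
    return 0;
}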

If a process finds L = 1, it knows that the lock is closed and will wait for it to become 0 (i.e., unlocked). The lock program of Procedure 13.4 implements a "busy-wait" loop which goes on looping until the value of L is changed by some other process to 0. This program looks fine but will not work correctly in actual practice, for the following reason. Suppose L = 0 originally and two processes P0 and P1 execute the lock code. P0 reads L, finds it to be 0, passes the BNZ statement, and in the next instruction will set L = 1, enter the critical section, and lock the critical section to other processes by making the lock busy. However, if P1 reads L before P0 sets L = 1, it will also think L = 0 and assume the lock is open! This problem has arisen because the sequence of instructions (reading L, testing it to see if it is 0, and changing it to 1) is not atomic. In other words they are separate instructions, and another process is free to carry out any instruction in between them. What is required is an atomic instruction which will load L into a register and store 1 in it. This instruction stores the contents of a location "L" (in memory) in a processor register and stores another value in "L". Test and set is a representative of such an atomic read-modify-write instruction in the instruction set of a processor which one intends to use as a PE of a parallel computer. With this instruction we can rewrite the assembly codes for lock and unlock as follows:

Procedure 13.5
lock:    TST R1, L     /* C(R1) ← L, L ← 1 */
         CMP R1, #0    /* Compare R1 with 0 */
         BNZ lock      /* If R1 ≠ 0 try again */
         RETURN

unlock:  STORE L, #0   /* L ← 0 */
         RETURN

The new instruction TST is called Test and Set in the literature, even though a better name for it would be load and set. It loads the value found in L in the register R1 and sets the value of L to 1. Observe that if L = 1, this program will go on looping, keeping L at 1. If L is at some time set to 0 by another process, R1 will be set to 0, capturing the value of L, and the procedure will return after making L = 1, thereby implementing locking. The process now knows the lock was open, and it enters the critical section, locking it to other processes. Observe that another process cannot come in between and disturb the locking, as test and set (TST) is an atomic instruction and is carried out as a single indivisible operation. After completing its work, the process will use the unlock routine to allow another process to capture the lock variable. Another important primitive called barrier, which ensures that instead of one process all processes complete specified jobs before proceeding further, also requires such an atomic read-modify-write instruction to be implemented correctly. Implementing barrier synchronization using an atomic read-modify-write instruction is left as an exercise to the reader.
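Modern instruction sets expose test-and-set through compiler intrinsics or the C11 atomics library; atomic_flag_test_and_set is exactly the TST of Procedure 13.5. A minimal sketch of a spin lock built on it:

#include <stdatomic.h>
#include <stdio.h>

atomic_flag L = ATOMIC_FLAG_INIT;    /* the lock variable, initially 0 */

void lock(void) {
    /* Atomically: R1 <- L, L <- 1; keep trying while the old value was 1 */
    while (atomic_flag_test_and_set(&L))
        ;                            /* busy-wait */
}

void unlock(void) {
    atomic_flag_clear(&L);           /* L <- 0 */
}

int main(void) {
    lock();                          /* enter critical section */
    printf("in critical section\n");
    unlock();                        /* leave it               */
    return 0;
}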

An important concept in executing multiple processes concurrently is what is known as sequential consistency. Lamport[42] defines sequential consistency as: "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operation of all processors were executed in some sequential order and the operations of each individual processor occurs in this sequence in the order specified by its program". In order to ensure this in hardware, each processor must appear to issue and complete memory operations one at a time, atomically, in program order. We will see in the succeeding sections that in a shared memory computer with a cache coherence protocol, sequential consistency is ensured.

13.5.2 Shared Bus Architecture

A block diagram of a shared bus parallel computer is shown in Figure 13.13. Observe that each PE has a private cache memory.

FIGURE 13.13 A shared memory parallel computer. (The PEs, each with a private cache, are connected by a shared bus to the main memory and the I/O system.)

The private cache memory has two functions. One is to reduce the access time to data and program. The second function is to reduce the need for all processors to access the main memory simultaneously using the bus. Thus bus traffic is reduced during program execution. We will illustrate this with a simple example.

EXAMPLE 13.1
Consider a shared bus parallel computer built using 32-bit RISC processors running at 1 GHz which carry out one instruction per clock cycle. Assume that the average number of clock cycles per instruction is 1, that 15% of the instructions are loads and 10% are stores, and that there is a 0.95 hit rate to the cache for reads, with write-through caches. The bandwidth of the bus is given as 10 GB/s.

1. How many processors can the bus support without getting saturated?
2. If caches are not there, how many processors can the bus support, assuming the main memory is as fast as the cache?

With caches, only read misses and (write-through) stores reach the bus:

No. of transactions to main memory/s = No. of read transactions + No. of write transactions
= 0.15 × 0.05 × 10⁹ + 0.10 × 10⁹ = (7.5 + 100) × 10⁶ = 107.5 × 10⁶
Average bytes of traffic to memory = 4 × 107.5 × 10⁶ = 430 × 10⁶ bytes/s
Bus bandwidth = 10 × 10⁹ bytes/s
Therefore, number of processors which can be supported = 10 × 10⁹/(430 × 10⁶) ≈ 23

If no caches are present, then every load and store instruction requires a main memory access:

Average bytes of traffic to memory = 0.25 × 10⁹ × 4 = 10⁹ bytes/s = 1 GB/s
Therefore, number of processors which can be supported = 10 × 10⁹/10⁹ ≈ 10

This example illustrates the fact that the use of local caches allows more processors on the bus even if the main memory is fast, because the shared bus could become a bottleneck.
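The arithmetic of Example 13.1 is easy to parameterize and check. A small sketch in C, where the workload fractions and bus bandwidth are the values assumed in the example:

#include <stdio.h>

int main(void) {
    double ips = 1e9;            /* instructions per second per processor */
    double loads = 0.15, stores = 0.10;
    double read_miss = 1.0 - 0.95;
    double word = 4;             /* bytes per bus transaction */
    double bus_bw = 10e9;        /* bus bandwidth in bytes/s  */

    /* With caches: only read misses and write-through stores use the bus */
    double with_cache = (loads * read_miss + stores) * ips * word;
    /* Without caches: every load and store goes to main memory */
    double no_cache = (loads + stores) * ips * word;

    printf("with caches   : %4.0f MB/s per CPU -> %2.0f processors\n",
           with_cache / 1e6, bus_bw / with_cache);
    printf("without caches: %4.0f MB/s per CPU -> %2.0f processors\n",
           no_cache / 1e6, bus_bw / no_cache);
    return 0;
}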

13.5.3 Cache Coherence in Shared Bus Multiprocessor

The use of caches is essential, but caches bring with them a problem known as the cache coherence problem. The problem arises because when a PE writes a data value, say 8, into its private cache in address x, the write is not known to the caches of the other PEs. If another PE reads data from the same address x of its own cache, it will read the data stored in x earlier, which may be, say, 6. It is thus essential to keep the data in a given address x the same in all caches to avoid errors in computation. There are many protocols (or rules) to ensure coherence. We will describe a simple protocol for bus based systems. Before describing it we quickly review the protocol used in a single processor system.

In a single processor system, the following actions take place when read/write requests are initiated by the processor. If there is a read request and the data is in the cache, it is delivered to the processor. If the data is not in the cache, the block containing the data in the main memory replaces a block in the cache and the requested data is supplied. If it is a write request, it overwrites the existing data in the cache. If the protocol is a write-now protocol (known usually as write-through), the data in the main memory is also updated. If the protocol is write-later (known usually as write-back), then the data is written back in the memory only when the cache line in which the data is contained is to be replaced by another block from the main memory. The cache contains, besides the data, a tag which indicates the cache line address and the data address within the block. It also has a bit (called a dirty bit) which is set to 1 when new data overwrites the data originally read from memory. If a cache line is 'clean' it need not be written back in the main memory when the block is overwritten.

The situation in a multiprocessor is more complicated because there are many caches, one per processor, and one has to know the status of each of them. A bus based system has the advantage that a transaction involving a cache is broadcast on the bus and other caches can listen to the broadcast. Thus cache coherence protocols are based on the cache controllers of each processor listening (called snooping, which means secretly listening) to the broadcasts on the bus and taking appropriate action. These protocols are therefore known as snoopy cache protocols. Many variations are possible, as there are trade-offs between speed of reading/writing, bandwidth available on the bus, and cache controller complexity. We will describe one simple protocol. The actions to be taken to maintain cache coherence depend on whether the command is to read data from memory to a register in the processor or to write the contents of a register in the processor to a location in memory. Let us first consider a Read instruction. The following cases can occur:

Case 1: Read request by a processor PEk. If the block containing the address is in PEk's cache and is valid (a bit associated with the block should indicate this), it is retrieved from the cache and the data is supplied to PEk from its cache.

Case 2: Read request by PEk. If the block containing the requested address is in PEk's cache but is not valid, then it cannot be taken from PEk's cache. The request is broadcast on the bus and if the block is found in some other PE's cache and is valid, it is read from that cache and sent to PEk. The valid block is also written into the main memory so that the main memory has a valid copy.

Case 3: Read request by PEk, block not in PEk's cache. The request is broadcast and it is found in PEm's cache and is valid. It is read from PEm's cache to PEk's cache and the specified register in PEk.

Case 4: Read request by PEk and block not in any PE's cache. The necessary block is retrieved from the main memory and placed in PEk's cache.

Case 5: Write request from PEk to write data (from a specified register) to an address. If the block with this address is in PEk's cache, the data overwrites the existing one. A broadcast is sent to all other caches to invalidate their copies if they have the specified block. (It is necessary to remember that at least one copy has to be valid.) This protocol is called write-invalidate.

Case 6: Write request by PEk, but the block is not in PEk's cache. In this case the block in the main memory is updated and retrieved from the main memory into PEk's cache. This block replaces an appropriate cache line based on the block replacement policy. An invalidate signal is broadcast to enable all other caches containing this block to invalidate their copies.

Instead of invalidating, PEk can also broadcast the block containing the new value so that it can be copied into all caches which have this block. This protocol is called the write-update protocol (a better term would be write-all-now). As we pointed out earlier, many variations to this protocol are possible. For instance, in case 2 (see Table 13.1) this protocol reads data from a valid cache and does nothing more. Instead, one may replace the cache line in PEk with the valid block and also update the main memory with the valid block. This would increase the delay in reading data but may reduce future read times. The protocol is summarized in Table 13.1.

TABLE 13.1 Decision Table for Maintaining Cache Coherence in a Bus Based Multiprocessor

Cases →                                        1    2    3    4    5    6
Events/conditions
Read from Memory                               Y    Y    Y    Y    —    —
Write to Memory                                —    —    —    —    Y    Y
Block in own cache?                            Y    Y    N    N    Y    N
Block in any one's cache?                      —    Y    Y    N    —    —
Own block valid?                               Y    N    —    —    —    —
Actions
Read from own cache                            X    —    —    —    —    —
Read data from valid cache                     —    X    X    —    —    —
Write valid block in main memory               —    X    —    —    —    X
Replace line in own cache from main memory
  using line replacement policy                —    —    —    X    —    X
Write in own cache                             —    —    —    —    X    X
Invalidate copies in others' caches            —    —    —    —    X    X
Set dirty bit in cache line                    —    —    —    —    X    X

13.5.4 State Transition Diagram for MESI Protocol

We have presented the protocol in the form of a decision table. Usually a state diagram representation is used. We present the protocol used by Pentium processors in the form of a state diagram in Figure 13.14. The Pentium cache coherence protocol is known as the MESI protocol. The expansion of MESI is given in Table 13.2. This protocol invalidates the shared blocks in caches when new information is

written in that block by any PE. (It is a write-invalidate protocol.) When new information is written in any cache line, it is not written immediately in the main memory. In other words, it is a write-back protocol to main memory.

TABLE 13.2 MESI Protocol of Pentium Processor

Cache Block State    Explanation of State
M = Modified         The data in the cache line has been modified and it is the only copy. (Dirty bit set in cache line.) The main memory copy is an old copy.
E = Exclusive        The data in the cache line is valid and is the same as in main memory. No other cache has this copy.
S = Shared           The data in the cache line is valid and is the same as in memory. Other caches may also have valid copies.
I = Invalid          The data in the cache line has been invalidated, as another cache line has a newly written value.

Referring to Figure 13.14, observe that the protocol is explained using two state transition diagrams marked (a) and (b). In Figure 13.14(a), the circles represent the state of a cache line in the processor initiating an action to Read or Write. The solid line shows the transition of the cache line state after the read/write. The action taken by the processor is also shown as A1, A2, etc., which are listed below. In Figure 13.14(b), we show the state transitions of the corresponding cache lines in the other PEs which respond to the broadcast by the initiating PE. We show these transitions using dashed lines and label them with the actions shown in Figure 13.14. For example, referring to Figure 13.14(a), when a cache line is in state S and a write command is given by the PE, the cache line transitions to state M, and A4 is the control action initiated by the system.

Actions when the required data is in the cache of the PE for Read or Write:
A1: Read copy into processor register from cache.
A2: For Read, copy the modified, shared or exclusive copy into own cache line.
A3: Write copy into cache line. Update own copy.
A4: Broadcast intent to write on bus. All caches with a shared copy of the block mark them invalid. Write new data into own cache line.
A5: Broadcast request for valid copy from another cache. Replace copy in cache with valid copy. Invalidate copies in other caches.
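The initiating-processor side of Figure 13.14(a) can be captured compactly in code. The following is a minimal sketch in C of the write transitions, including the S → M transition with action A4; bus interactions are reduced to print statements and the miss cases are only stubbed.

#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } MesiState;

/* Initiating processor: transition of its cache line on a write */
MesiState write_line(MesiState s) {
    switch (s) {
    case SHARED:        /* action A4: invalidate other shared copies */
        printf("broadcast intent to write; others invalidate\n");
        return MODIFIED;
    case EXCLUSIVE:     /* no bus traffic needed; line becomes dirty */
    case MODIFIED:
        return MODIFIED;
    default:            /* INVALID: a write miss, handled separately */
        printf("write miss: run write-miss actions\n");
        return MODIFIED;
    }
}

/* Snooping processor: its copy is invalidated by the broadcast */
MesiState snoop_intent_to_write(MesiState s) {
    (void)s;
    return INVALID;
}

int main(void) {
    MesiState line = SHARED;
    line = write_line(line);      /* S -> M with an invalidate broadcast */
    printf("initiator state: %d (MODIFIED)\n", line);
    printf("snooper state  : %d (INVALID)\n", snoop_intent_to_write(SHARED));
    return 0;
}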

A7: A8: .14 State transition diagram for MESI protocols. The main memory is now read into the processor's cache. that copy is stored in the main memory. A command to read the required block is broadcast. If no processor's cache has this line it is read from the main memory to the requesting processor's cache replacing a line. The cache line status of requesting and supplying processor go to the shared state. If any other processor's cache has this line in exclusive or shared state it supplies the line to the requesting processor's cache. The initiating and supplying caches go to the shared state. The processor's cache line goes to the exclusive state. Read-miss actions: A6: A command to read the required block from main memory is broadcast. A command to read the required line is broadcast.Parallel Computers 451 FIGURE 13. If any other cache is in modified state.

Write-miss actions:

A9: Intent to write is broadcast. If any cache has the requested line in the modified state, it is written to main memory. The line is now read by the requesting processor to its cache and modified. Other copies are invalidated. The state of the requesting cache line is set to modified.

A10: Intent to write is broadcast. If another cache has the requested line in the shared or exclusive state, the line is read from it and updated. The other processor's cache line is invalidated. The requesting cache line will be in the modified state.

When a cache line is to be replaced and the line being replaced is in the modified state, it is written to main memory.

EXAMPLE 13.2
A 4 processor system shares a common memory connected to a bus. Each PE has a cache with a line size of 64 bytes. The main memory size is 256MB and the cache size is 64KB. The word size of the machine is 4 bytes/word. Assume that a cache invalidate command is 1 byte long and that the lines containing the addresses below are initially in all 4 caches. The following sequence of instructions is carried out.

P1: Store R1 in AFFFFFF
P2: Store R2 in AFFEFFF
P3: Store R3 in BFFFFFF
P4: Store R4 in BFFEFFF
P1: Load R1 from BFFEFFF
P2: Load R2 from BFFFFFF
P3: Load R3 from AFFEFFF
P4: Load R4 from AFFFFFF

(i) If the write-invalidate protocol is used, what is the bus traffic?
(ii) If the write-update protocol is used, what is the bus traffic?

(i) Write-invalidate protocol
Writes: 4 invalidate messages for the 4 write transactions.
Bus traffic = (8 + 28) × 4 = 144 bits (8-bit invalidate command, 28-bit address)
Reads: for all 4 reads, valid data is in another cache.
Bus traffic = 4 × (Request + Response) = 4 × (36 + 512) = 2192 bits = 274 bytes
(36 bits for the request transaction and 64 bytes = 512 bits for the response, as the whole cache line is retrieved.)
Total bus traffic = (144 + 2192) bits = 2336 bits = 292 bytes

(ii) Write-update protocol
Writes: 4 × (update command + address + data + line replacement in main memory)
= 4 × (8 + 28 + 32 + 512) = 2320 bits = 290 bytes
Reads: no bus traffic, as all caches have a current copy.
Total bus traffic = 290 bytes
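The bookkeeping in Example 13.2 is easily parameterized. A small sketch in C, where the field widths (8-bit command, 28-bit address, 32-bit word, 512-bit line) are those assumed in the example:

#include <stdio.h>

int main(void) {
    int cmd = 8, addr = 28, word = 32, line = 64 * 8;  /* sizes in bits */
    int writes = 4, reads = 4;

    /* Write-invalidate: one invalidate msg per write; each read refetches a line */
    int wi = writes * (cmd + addr) + reads * (cmd + addr + line);
    /* Write-update: each write broadcasts data and updates memory; reads are free */
    int wu = writes * (cmd + addr + word + line);

    printf("write-invalidate: %d bits = %d bytes\n", wi, wi / 8);
    printf("write-update    : %d bits = %d bytes\n", wu, wu / 8);
    return 0;
}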

This example should not be taken to imply that the write-invalidate and write-update protocols always perform identically. A different pattern of reads/writes will change the comparison; for instance, consider one processor writing many times to a location in memory which is read only once by another processor (see Exercise 16).

13.5.5 A Commercial Shared Bus Parallel Computer

Many multiprocessors use a four processor shared memory parallel computer manufactured by Intel as a building block[41]. A logical diagram of the system is shown in Figure 13.15. The 4 processors are Pentium Pro microprocessors with first level caches, a 256KB second level cache, a translation lookaside buffer, an interrupt controller and a bus interface, all in a single chip which connects to a bus. The bus, called the P-Pro bus, has 64 data lines, 36 address lines and some control lines operating at 66 MHz. Memory transactions are pipelined so that a peak bandwidth of 528 MB/s is obtained. The Pentium Pro modules contain all the logic necessary for ensuring cache consistency and interprocessor communication. A memory controller and interleave unit (MIU) connects the memory bus to multiple banks of Dynamic Random Access Memory of up to 2GB. Display, network, SCSI and I/O connections to the bus are provided by what are known as PCI bridges. Recently (2007) Intel has announced a Pentium chip with four PEs sharing a common on-chip memory, calling it a Quad-Core processor. Larger parallel machines may be built using this 'quad-pack' as the building block.

FIGURE 13.15 Four processor shared memory parallel computer.

13.5.6 Shared Memory Parallel Computer Using an Interconnection Network

Shared bus architecture has limited scalability, as the bandwidth of the bus limits the number of processors which can share the main memory. For shared memory computers with a large number of processors, specially designed interconnection networks are used which connect a set of memories to a set of processors (see Figure 13.16). Many types of interconnection networks have been used[35]. The main advantage of such a system is higher bandwidth availability, which can also be increased as more processors are added. The main disadvantage, however, is the difficulty in ensuring cache coherence. These networks do not have a convenient broadcasting mechanism required by snoopy cache protocols. Thus, a scheme known as the directory scheme[43] is used to ensure cache coherence in these systems.

FIGURE 13.16 A shared memory multiprocessor using an interconnection network.

Each processor has a private cache to reduce load/store delay. Suppose a multiprocessor has M blocks in the main memory, there are N processors, and each processor has a cache. The main purpose of the directory, which is kept in the main memory, is to know which blocks are in caches and what their status is. Each memory block has an N-bit directory entry, shown in Figure 13.17. If the kth processor's cache has this block, then the kth bit in the directory is set to 1. In addition to the N bits for directory entries, two bits are allocated as status bits. Besides this directory in the main memory, each cache line also has a 2-bit status indicator. The cache consistency is maintained by the actions taken by a central memory controller. When an individual cache controller wants to read a cached line, the central controller looks up the directory in the main memory and issues the necessary commands for transferring the cached line from the main memory or from a cache which holds a valid copy of the line.
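A directory entry is essentially a presence bit-vector plus status bits. Below is a minimal sketch in C, assuming (for illustration) N ≤ 64 processors so the vector fits in one 64-bit word; the type and function names are illustrative, not from any particular machine.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define N 64                        /* number of processors (assumed <= 64) */

typedef struct {
    uint64_t present;               /* bit k = 1 if processor k caches the block */
    bool     lock;                  /* status bits: lock bit L and modified bit M */
    bool     modified;
} DirEntry;

/* Processor k fetches the block: mark its cache as a sharer */
void note_sharer(DirEntry *d, int k)    { d->present |=  (1ULL << k); }
/* Processor k's copy is invalidated */
void drop_sharer(DirEntry *d, int k)    { d->present &= ~(1ULL << k); }
bool has_copy(const DirEntry *d, int k) { return (d->present >> k) & 1; }

int main(void) {
    DirEntry d = {0, false, false};
    note_sharer(&d, 3);             /* processors 3 and 7 read the block */
    note_sharer(&d, 7);
    printf("P3 has copy: %d, P5 has copy: %d\n", has_copy(&d, 3), has_copy(&d, 5));
    return 0;
}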

FIGURE 13.17 Directory entries in a multiprocessor. (Part (a) shows a main memory directory entry, one per block in main memory: N presence bits, one bit per processor's cache, plus a lock bit L and a modified bit M. Part (b) shows a cache memory directory entry, one per cache line in the processor: a valid bit V and a modified bit M.)

When a processor wants to write data in its local cache line, it must request exclusive access to the line from the central memory controller. Before granting exclusive access, the controller sends a message to all processors with a cached copy of this line, invalidating their own local copies. After receiving acknowledgements back from these processors, the controller grants exclusive access to the requestor. The requestor writes the data in its cache line and, with a write-through protocol, also in the main memory. For greater details of this protocol, the reader may refer to [44].

13.6 DISTRIBUTED SHARED MEMORY PARALLEL COMPUTERS

A shared memory parallel computer is easy to program, as it gives a user a uniform global address space. Scalability is, however, a problem. As we saw, the number of PEs in a shared memory computer using a shared bus is limited to around 16, due to the saturation of the bus. A shared memory parallel computer using an interconnection network to connect PEs to memory alleviates this problem to some extent by providing multiple paths between PEs and memory. In this case also it is not possible to increase the number of PEs beyond 64, due to the high cost of designing high bandwidth interconnection networks and the difficulty in maintaining cache coherence. This has led to the development of the Distributed Shared Memory (DSM) parallel computer, in which a number of PEs with their own private memories (known as Computing Elements or CEs) are interconnected. In this system, even though memory is physically distributed, the programming model is a shared memory

model, which is easier to program. In other words, DSM parallel computers are scalable while providing a logically shared memory programming model. A general block diagram of a Distributed Shared Memory (DSM) parallel computer is shown in Figure 13.18. In a DSM parallel computer each node is a full fledged computer, i.e., it has a processor, cache and memory. Each computer in the system is connected to an interconnection network using a Network Interface Unit (NIU) (also called a Communication Assist by some authors, as it often has processing capability of its own). Even though each CE has its own private memory and cache, the system has hardware and software assists to give a user the facility to program the machine as though it is a shared memory parallel computer. In other words, a user can program assuming that there is a single global address space. It is clear when we see the structure of the machine that accessing data in the memory attached to a processor is much faster compared to accessing data in the memory belonging to another CE connected to it via the interconnection network. Thus this type of parallel computer is also known as a Non Uniform Memory Access (NUMA) parallel computer. The system, however, allows a much larger number of CEs to be interconnected as a parallel computer at reasonable cost without severely degrading performance. When each processor has its own cache memory, it is also necessary to maintain cache coherence between the individual processors. This is done using

FIGURE 13.18 Distributed shared memory parallel computer.

a directory scheme similar to the method described in Section 13.5.6. Thus a more accurate nomenclature for a Distributed Shared Memory computer in use today is Cache Coherent Non Uniform Memory Access parallel computer, or CC-NUMA parallel computer for short. The programming model for a CC-NUMA computer is the same as for a shared memory parallel computer. Let us examine first how the computers cooperate to solve a problem. If a processor issues a read request (load data from memory), it will specify the address from where data is to be retrieved. The address is split by the system software into two parts: a node number and the address of the location within the node. Thus, if the node number is that of the processor issuing the read command, the data is retrieved from the memory attached to it. Otherwise, the node number is used to send a message via the interconnection network to the appropriate node's memory, from where the data is retrieved and delivered via the interconnection network to the requesting processor. The procedure is summarized in the following procedure:

Procedure 13.6
Command: Load from specified address into a register
Step 1: Translate address: Address ← CE number + memory address in CE.
Step 2: Is CE local?
Step 3: If yes, load from local memory of CE. Else send request via network to remote CE.
Step 4: Retrieve data from remote CE's memory.
Step 5: Transport data via network to requesting CE.
Step 6: Load data in specified CE's processor register.

Observe that if the required data is in the requesting CE's memory, the work is done in Step 3. Otherwise we have to carry out Steps 4, 5 and 6. Typically the time to retrieve data from the local memory is around 10 cycles and from the cache 1 cycle, whereas the time to obtain it from the memory of a remote CE is of the order of 1000 cycles. Thus it is extremely important for a programmer (or a compiler which compiles a program for a NUMA machine) to ensure that the data needed by a CE during computation is available in its local memory. If care is not taken to do this, the execution time of a program will be very large and the advantage of having multiple processors will be lost. Procedure 13.7, given after the sketch below, lists the operations to be carried out for store (i.e., write register contents in a specified location in memory).
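The address split of Step 1 is a simple bit-field extraction. A minimal sketch in C, assuming a 32-bit global address whose top 8 bits name the CE and whose low 24 bits address its 16MB local memory (the same parameters as Example 13.3 below); the function and variable names are illustrative.

#include <stdint.h>
#include <stdio.h>

#define LOCAL_BITS 24                   /* assumed: 16MB per CE, 256 CEs */

static int my_node = 3;                 /* identity of this CE (illustrative) */

uint32_t load_word(uint32_t addr) {
    uint32_t node   = addr >> LOCAL_BITS;               /* Step 1: CE number */
    uint32_t offset = addr & ((1u << LOCAL_BITS) - 1);  /* address within CE */
    if (node == (uint32_t)my_node) {                    /* Step 2: is CE local? */
        /* Step 3: read from local memory (stubbed here) */
        printf("local read at offset %u\n", (unsigned)offset);
    } else {
        /* Steps 4-6: send request packet to node, await data over network */
        printf("remote read: request to CE %u, offset %u\n",
               (unsigned)node, (unsigned)offset);
    }
    return 0;
}

int main(void) {
    load_word((3u << LOCAL_BITS) | 0x100);   /* local access  */
    load_word((7u << LOCAL_BITS) | 0x200);   /* remote access */
    return 0;
}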

Procedure 13.7
Command: Store register in address
Step 1: Translate address: Address ← CE number + memory address in CE.
Step 2: Is CE local?
Step 3: If yes, store contents of register in the specified location in the memory of the local CE. Else send contents of register via network to the remote CE.
Step 4: Store contents of register in the specified location in the memory of the remote CE.
Step 5: Send a message to the CE which issued the store command that the task has been completed.

Observe that storing in memory requires one less step. We have not discussed in the above procedures the format of the request packet to be sent over the network for loads and stores. A load request is transmitted over the network as a packet whose format is:

Source CE address | Destination CE address | Address in memory from where data is to be loaded

The retrieved data is sent over the network to the requesting CE using the following format:

Source CE address | Destination CE address | Data

In the case of a store instruction, the store request has the following format:

Source CE address | Destination CE address | Address in memory where data is to be stored | Data

After storing the data, an acknowledgement packet is sent back to the CE originating the request, in the following format:

Source CE address | Destination CE address | Store successful

Error detecting bits may be added to the above packets. When a request is sent over the interconnection network for retrieving data from a remote memory, the time required for retrieval consists of the following components:

1. A fixed time T needed by the system program at the host node to issue a command over the network. This will include the time needed to decode to which remote node the request is to be sent and the time needed for formatting a packet.
2. The time taken by the load request packet to travel via the interconnection network. This depends on the bandwidth of the network and the packet size. If B is the network bandwidth (in bytes/s) and the packet size is n bytes, this time is (n/B) seconds.
3. The time q taken to retrieve the word from the remote memory.

4. A fixed time T needed by the destination CE system program to access the network.
5. The time taken to transport the reply packet (which contains the data retrieved) over the network. If the size of this packet is m bytes, the time needed is (m/B) s.

Thus the total time is 2T + q + [(n + m)/B]. The time taken to store data in a remote memory is similar. The values of n and m for storing are, however, different from the above case. The above model is a simplified one and does not claim to be exact.

A basic question which will arise in an actual system is: "Can the CE requesting service from a remote CE continue with processing while awaiting the arrival of the required data or acknowledgement?" In other words, can computation and communication be overlapped? The answer depends on the program being executed. Another issue is how frequently service requests to a remote processor are issued by a CE. Yet another question is whether multiple transactions by different CEs can be supported by the network. We will now consider an example to get an idea of the penalty paid if services of remote CEs are needed in solving a problem.

EXAMPLE 13.3
A NUMA parallel computer has 256 CEs. Each CE has 16MB memory. In a set of programs, 10% of the instructions are loads and 15% are stores. The memory access time for a local load/store is 5 clock cycles. An overhead of 20 clock cycles is needed to initiate transmission of a request to a remote CE. The bandwidth of the interconnection network is 100 MB/s. Assume 32-bit words and a clock cycle time of 5 ns. The request packet format is:

Source address (8 bits) | Destination address (8 bits) | Address in memory (24 bits)

If 400,000 instructions are executed, compute:
1. The load/store time if all accesses are to local CEs.
2. Repeat 1 if 25% of the accesses are to a remote CE.

Case 1: No. of load/store instructions = 400,000/4 = 100,000
Time to execute loads/stores locally = 100,000 × 5 × 5 ns = 2500 µs (each load/store takes 25 ns)

Case 2: (We will calculate for loads and stores separately.)
No. of load instructions = 40,000
No. of local loads = 40,000 × 3/4 = 30,000
No. of remote loads = 10,000
Time taken for local loads = 30,000 × 25 ns = 750 µs

(Remember that the clock time is 5 ns, T = 20 clock cycles, and the memory access time is 5 clock cycles.)

Request packet length = 5 bytes
Response packet length = 6 bytes (word length + source address + destination address)

Time taken for remote load of one word
= fixed overhead to initiate request over network (20 × 5 ns = 100 ns)
+ time to transmit request packet over the network (5 bytes/100 MB/s = 50 ns)
+ data retrieval time from memory at remote CE (5 × 5 ns = 25 ns)
+ fixed overhead for remote CE to initiate transmission over network (20 × 5 ns = 100 ns)
+ time to transport response packet over network (6 bytes/100 MB/s = 60 ns)
= 335 ns

Number of requests to remote CEs = 10,000
Total time for remote loads = 10,000 × 335 ns = 3350 µs
Time for local loads = 750 µs
Thus, total time taken for loads = 4100 µs

No. of store instructions = 400,000 × 0.15 = 60,000
No. of local stores = 60,000 × 0.75 = 45,000
No. of remote stores = 15,000
Time taken for local stores = 45,000 × 25 ns = 1125 µs

Time taken for a remote store of one word
= fixed overhead to initiate request over network (100 ns)
+ time to transmit remote store packet (9 bytes/100 MB/s = 90 ns)
+ data store time (25 ns)
+ fixed overhead to initiate acknowledgement over network (100 ns)
+ time to transmit acknowledgement packet (3 bytes/100 MB/s = 30 ns)
= 345 ns

Time to store 15,000 words = 345 × 15,000 ns = 5175 µs
Total time for stores = 1125 + 5175 = 6300 µs
Total time for loads and stores = 4100 + 6300 = 10400 µs
Total time for loads and stores if entirely in local CEs = 2500 µs (Case 1)

Total time for loads and stores (local and remote)/Total time for loads and stores if entirely local = 10400/2500 = 4.16
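Both figures are instances of the 2T + q + (n + m)/B model derived earlier. A small sketch in C with the example's parameters:

#include <stdio.h>

int main(void) {
    double clock = 5e-9;             /* clock cycle time: 5 ns            */
    double T = 20 * clock;           /* network initiation overhead       */
    double q = 5 * clock;            /* remote memory access: 5 cycles    */
    double B = 100e6;                /* network bandwidth in bytes/s      */

    double load  = 2*T + q + (5 + 6) / B;  /* 5-byte request, 6-byte reply */
    double store = 2*T + q + (9 + 3) / B;  /* 9-byte request, 3-byte ack   */

    printf("remote load : %.0f ns\n", load * 1e9);   /* 335 ns */
    printf("remote store: %.0f ns\n", store * 1e9);  /* 345 ns */
    return 0;
}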

Observe that due to remote accesses, the total time taken is over 4 times the time taken if all accesses were local. It is thus clear that it is important for the programmer/compiler of a parallel program for a NUMA parallel computer to (if possible) eliminate remote accesses, by appropriate distribution of program and data, to reduce parallel processing time.

A question which may arise is: what is the effect of increasing the speed of the interconnection network? In the example we have considered, if the bandwidth of the interconnection network is increased from 100 MB/s to 1 GB/s, let us compute the time taken for remote loads and stores.

Time for remote loads = 10,000 × (100 + 5 + 25 + 100 + 6) ns = 10,000 × 236 ns = 2360 µs
Total time for loads (local + remote) = (750 + 2360) = 3110 µs
Similarly, time for remote stores = 15,000 × (100 + 9 + 25 + 100 + 3) ns = 15,000 × 237 ns = 3555 µs
Total time for stores = 1125 + 3555 = 4680 µs
Total time for loads and stores = (3110 + 4680) = 7790 µs
Total time taken if all loads and stores are local = 2500 µs
Ratio of remote/local load and store time = 7790/2500 = 3.11

Observe that the ratio is still high in spite of increasing the speed of the interconnection network ten-fold. The reason is the fixed overhead for each transaction over the network. Unless it is also reduced, merely increasing the network speed will not help. This further reiterates the point that remote accesses should be eliminated, if possible.

13.7 MESSAGE PASSING PARALLEL COMPUTERS

A general block diagram of a message passing parallel computer (also called a loosely coupled distributed memory parallel computer) is shown in Figure 13.19. As was pointed out at the beginning of this section, the programming model for a NUMA machine is the same as for a shared memory parallel machine. In a DSM computer, a programmer assumes

a single flat memory address space and programs it using fork and join commands. A message passing computer, on the other hand, is programmed using Send-Receive primitives.

FIGURE 13.19 Message passing parallel computer. (Each node is a Computing Element (CE) consisting of a Processing Element (PE), cache, local memory (M) and a Network Interface Unit (NIU) on a memory bus (MB); the NIUs connect the nodes to the interconnection network. Observe that each CE has its own private address space.)

There are several types of send-receive used in practice. We discuss a fairly general one in which there are two types of send, namely synchronous send and asynchronous send, and one command to receive a message. A synchronous send command has the general structure:

Synch-Send (source address, n, destination process, tag)

It is a command to send n bytes of data starting at source address to the specified destination process, using tag as the identifier of the message. A synchronous send is initiated by a source node. After sending the message, the sender waits until it gets confirmation from the specified receiver (i.e., destination process). In other words, the sender suspends operation and waits. This is also called a blocked send. Another send command is:

Asynch-Send (source address, n, destination process, tag)

This is called an asynchronous send instruction. It is also initiated by a source node. After sending the message, the sender continues with its processing and does not wait. A receive command to receive and store a message is also required. The command

Receive (buffer-address, n, source process, tag)

receives a message n bytes long from the specified source process with the given tag and stores it in a buffer storage starting at buffer-address.
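These primitives correspond closely to the MPI library widely used on message passing machines and clusters: MPI_Ssend is a synchronous (blocking) send, MPI_Isend an asynchronous one, and MPI_Recv a blocking receive. A minimal sketch in C (run with two processes, e.g. mpirun -np 2):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, data = 42, tag = 7;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Synchronous send: completes only after the matching receive starts */
        MPI_Ssend(&data, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int buf;
        MPI_Recv(&buf, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", buf);
    }
    MPI_Finalize();
    return 0;
}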

A blocked receive may also be implemented. In this case the receiving process will wait until the data from the specified source process is received and stored. If a programmer wants to make sure that an asynchronous send has succeeded, a receiving process may issue a command receive-probe (tag, source process), which checks whether a message with the identifier tag has been sent by the source process. Similarly, to check whether a message has been sent to a receiver and received by it, a command send-probe (tag, destination process) may be issued, which checks whether a message with identifier tag has been received by the destination process. These probe commands may be used with asynchronous send/receive.

Hardware support is provided in message passing systems to implement both the synchronous and asynchronous send commands. The support is through appropriate communication protocols. The protocol used for a synchronous send is given in Figure 13.20. In this protocol, as soon as a source process executes a synch-send command, it sends a request to the destination computer (via the communication network) for permission to send the data. The request message is an 'envelope' containing the source address, destination process, number of bytes to be transmitted, and a tag. After sending this request, the sending process suspends its work and waits. The receiving process waits till a receive command matching the source process and tag is executed by the destination process. When a matching receive is performed, memory to receive the data from the source is allocated, and a message is sent back to the requesting source computer that the destination is ready to receive the data. When this message is received by the source, it "wakes up" and starts sending the n bytes of data to the destination computer. This is called a three phase protocol, as the communication network is used thrice. The major problem with synchronous send is that the sending process is forced to wait. There could also be a deadlock, when two processes send messages to one another and each waits for the other to complete.

FIGURE 13.20 Synchronous message passing protocol. (The source executes Synch-Send (src-addr, n, destination, tag), sends a "request to send message" and waits. The destination, on executing a matching Receive (buff-addr, n, src-process, tag), sets up an area to receive the data and replies "ready to receive message". The source then wakes up and the data transfer takes place.)

Asynchronous send is more popular for programming message passing systems. Hardware support is provided by implementing two communication protocols. One of them is called an optimistic protocol and the other pessimistic. In the optimistic protocol, the source sends a message and assumes correct delivery. This protocol works as follows. The send transaction sends information on the source identity, length of data and tag, followed by the data itself. The destination process (which must have a matching receive instruction) strips off the tag and source identity and matches them against its own internal table to see if a matching receive has been executed by the application process local to it. If yes, the data is delivered to the address specified in the receive command. If no, the destination processor receives the message into a temporary buffer. When the matching receive is executed, the data from the temporary buffer is stored in the destination address specified by the receive command. The temporary buffer is then cleared. In theory an infinite buffer space is assumed. This simple protocol has several problems. A major problem will occur if several processes choose to send data to the same receiver. Several buffers may have to be allocated and one may soon run out of buffer space. Another problem is the need to match the tag and source process in order to find the address where the message is to be stored. This is a slow operation, typically performed by software. A more robust protocol, which alleviates this problem at the receiver, is used by many systems. This protocol is explained using Figure 13.21.

FIGURE 13.21 Asynchronous message passing protocol—a three-phase conservative protocol.

When an asynchronous

send is executed, it sends an 'envelope' containing the source process name, tag and number of bytes to be transmitted to the destination processor. After sending this envelope, the sending process continues processing. When the destination processor receives the envelope, it checks whether the matching receive has been executed. If yes, it sends a ready to receive message to the sender. If the matching receive has not been executed, then the receiving processor checks if enough buffer space is available to temporarily store the received data. If yes, it sends a ready to receive message to the sender. Otherwise it waits till such time that either buffer space becomes available or the receive command is executed, at which time a ready to receive message is sent. When the ready to receive message is received by the sender, it invokes the procedure to send the data and transmits it to the destination. The destination stores the data at the allocated space.

The main problem with this three-phase protocol is the long delays which may be encountered in sending three messages over the communication network. This overhead may be particularly unacceptable if the data to be sent is only a few bytes. In such a case, the single phase protocol (explained earlier) may be used with some safety feature. One such feature is to allocate a buffer space at each processor to receive short messages and subtract from this amount a quantity equal to the number of bytes stored. When the allocated space is exhausted, the system reverts to the three-phase protocol.

13.8 CLUSTER OF WORKSTATIONS

The reduction in the cost of microprocessors and the increase in their speed have drastically reduced the cost of workstations. Workstations typically use a high performance CPU such as the Alpha 21164 or Pentium Pro, work at clock speeds of around 2 GHz, and have a built-in network interface to a Gigabit ethernet. Many organizations have hundreds of workstations. The questions which have been asked are: Is it possible to use an interconnected set of workstations as a high speed parallel computer? Can such an interconnected set of workstations run a compute intensive application as a parallel program in the background while allowing interactive use in the foreground? The answers to both these questions are "yes". A block diagram of such a system, called a Cluster of Workstations (COW), is given in Figure 13.22. Observe that the major difference between Figure 13.22 and Figure 13.19 of a message passing computer is that the NIU is connected to the I/O bus in the cluster of workstations, whereas it is connected to the memory bus in the message passing multicomputer. Thus it is clear that a COW may be programmed using the message passing method. The main reason for using the I/O bus, which is much slower than the memory bus, to transmit messages is that it is standardised and can be used with several manufacturers' workstations. This, however, has an impact on communication speed. When an I/O bus is used, a message transaction will be treated as an I/O transaction requiring assistance from the Operating System, and the time taken to pass messages between nodes will be at least 1000 times higher than in message passing parallel computers. Any transaction requiring Operating System

support will require several thousand clock cycles. Thus, running a parallel program which requires too many send/receive commands on workstations connected by a Local Area Network (LAN) will not give good speed-up. In other words, this structure will only be effective for SPMD style programming or for programs in which the grain size is very coarse.

FIGURE 13.22 A cluster of workstations. (Each node is a workstation consisting of a processor P, memory M and disk, with an NIU connected to the I/O bus through an I/O bridge; the NIUs connect the nodes to an interconnection network such as an Ethernet, FDDI or ATM net.)

As it is a very inexpensive method of building a parallel computer with no special hardware, many laboratories have developed software systems and parallel application programs to use such a system. A recent well publicised COW is called Beowulf[45]. It uses Intel's Pentium based personal computers as computing nodes, the Linux Operating System with a stripped down kernel based on Linux for managing intercomputer communication, a gigabit ethernet connection, and the Message Passing Interface Library for parallel programming. On some programs this COW is able to give gigaflops speed. Another interesting software called CONDOR[46] is widely used to "steal" the processing power of idle workstations in large networks to solve compute intensive problems.

13.9 COMPARISON OF PARALLEL COMPUTERS

We saw in this chapter that there are five basic types of parallel computers. They are—vector computers, array processors, shared memory parallel computers,

distributed shared memory parallel computers and message passing parallel computers. Vector computers use temporal parallelism, whereas array processors use fine grain data parallelism. Shared memory multiprocessors and message passing multicomputers may use any type of parallelism. Individual processors in a shared memory computer or message passing computer can, for instance, themselves use vector processing.

Shared memory computers use either a bus or a multistage interconnection network to connect a common global memory to the processors. As shared memory computers have a single global address space which can be uniformly accessed, parallel programs are easy to write for such systems. Maintaining cache coherence in a shared bus-shared memory parallel computer is relatively easy, as bus transactions can be monitored by the cache controllers of all the processors. Increasing the number of processors in this system beyond about 16 is difficult, as a single bus will saturate, degrading the performance. Such machines are therefore not scalable to a larger number of processors. In a shared memory parallel computer using an interconnection network to connect the global memory to the processors, maintaining cache coherence is more difficult. This structure, however, allows a larger number of processors to be used, because the interconnection network provides multiple paths from processors to memory.

This led to the development of Distributed Shared Memory parallel computers, in which a number of CEs are interconnected by a high speed interconnection network. In this system, even though the memory is physically distributed, the programming model is a shared memory model. Individual processors have their own cache memory and it is essential to maintain cache coherence. DSM parallel computers, also known as CC-NUMA machines, are scalable while providing a logically shared memory programming model. It is, however, expensive and difficult to design massively parallel computers (i.e., parallel machines using hundreds of CEs) using the CC-NUMA architecture, due to the difficulty of maintaining cache coherence.

This led to the development of message passing distributed memory parallel computers. These machines do not require cache coherence, as each CE runs programs using its own memory system and cooperates with other CEs by exchanging messages. These machines require good task allocation to CEs, which reduces the need to send messages to remote CEs. There are two major types of message passing multicomputers. In one of them the interconnection network connects the memory buses of the CEs. This allows faster exchange of messages. The other type, called a cluster of workstations (COW), interconnects CEs via their I/O buses. Heterogeneous workstations can be interconnected as a COW, because I/O buses are standardized. Thousands of CEs distributed on a LAN may be interconnected as a COW. COWs are cheaper and scalable. Message passing is, however, slower, as messages are taken as I/O transactions requiring assistance from the operating system. With the increase in the speed of workstations and LANs, the penalty suffered due to message passing is coming down, making COWs very popular as low cost, high performance parallel computers. In Table 13.3 we give a chart comparing the various architectures discussed in this chapter.

TABLE 13.3 Comparison of Parallel Computer Architectures

Type of Parallel Computer      Parallelism Exploited     Suitable Task Size   Programming Ease                 Scalability
Vector computers               Temporal                  Fine grain           Easy                             Very good
Array processors               Data parallelism (SIMD)   Fine grain           Easy; suitable only for          Very good
                                                                              identical tasks
Shared memory using bus        MIMD                      Medium grain         Relatively easy due to           Poor
                                                                              availability of global
                                                                              address space
Shared memory using            MIMD                      Medium grain         Same as above                    Moderate
interconnection network
Distributed shared memory      MIMD                      Medium grain         Same as above                    Good
(CC-NUMA)
Message passing multicomputer  MIMD                      Coarse grain         Needs good task allocation and   Very good
                                                                              reduced inter-CE communication
Cluster of workstations (COW)  MIMD                      Very coarse, as      Same as above                    Very good
                                                         O.S. assistance                                       (low cost
                                                         needed                                                system)

SUMMARY

1. A parallel computer may be defined as an interconnected set of Processing Elements (PEs) which cooperate by communicating with one another to solve large problems fast.

2. Parallel computers have been classified by Flynn as Single Instruction Single Data stream (SISD, a single processor), Single Instruction Multiple Data stream (SIMD), and Multiple Instruction Multiple Data stream (MIMD) machines. This taxonomy is summarized in Figure 13.8.

3. We can also classify parallel computers by how the processors are coupled, how they access memory, and the quantum of work performed by each PE before it seeks cooperation from another PE. If all the PEs share a common memory it is a shared memory computer, which is tightly coupled. If each computer is independent and cooperates with others by exchanging messages, it is loosely coupled.

4. A parallel computer using temporal parallelism is called a vector computer. It uses several pipelined arithmetic units. Early supercomputers used this idea extensively.

5. An array processor is an SIMD computer. In this organization a large number of PEs, each with its own memory, are connected as an array. Instructions are broadcast to all the PEs in the array. These instructions are carried out simultaneously by all the PEs using data stored locally; the data to be processed by each PE is stored in its local memory.

6. A commonly used parallel computer architecture is called a shared memory parallel computer. In this architecture 4 to 64 PEs share a common memory. This architecture provides a global address space for writing parallel programs, and it is relatively easy to program this machine. In a shared memory parallel computer, individual PEs have their own cache to speed up program execution. It is necessary to make sure that a PE accesses the most recently updated value of a variable from its private cache. This is called the cache coherence problem. Individual caches must therefore know when a shared variable is updated and act accordingly.

7. There are two types of shared memory parallel computers. One of them uses a shared bus to access the common main memory. The other uses an interconnection network for sharing the memory. The cache coherence protocol for a shared bus machine is called a snoopy bus protocol and is easy to design. The cache coherence protocol for a machine which shares a main memory by using an interconnection network employs a directory containing the status of each cache line. This directory is stored in the main memory, and cache coherence is ensured by the main memory controller and the cache controllers of each PE. A directory based protocol is more difficult to design. However, the scalability (i.e., the number of PEs which constitute the parallel machine) when an interconnection network is used is much higher than in a shared bus parallel computer.

8. A shared memory computer with a single common shared memory is called a Uniform Memory Access (UMA) machine. If the memory is distributed among the processors but a programmer has a view of the system as a single shared memory space, it is called a Distributed Shared Memory machine and is said to have Non Uniform Memory Access (NUMA).

9. A Distributed Shared Memory parallel computer consists of independent computers with their own memories and caches, interconnected by a fast interconnection network. The programming model is that of a shared memory machine as the program uses a single global logical memory address space. The access time to data required by a processor depends on whether the data is stored in its own memory or in a memory belonging to a remote processor. This architecture is thus known as a Non Uniform Memory Access (NUMA) shared memory machine. In this case also cache coherence has to be maintained by an appropriate protocol.

10. Another popular parallel computer architecture is known as a message passing parallel computer. In this architecture also, independent computers are interconnected using an interconnection network. However, the programming model is different. There is no single common memory address space. Programs executing in each computer communicate and cooperate with other computers using primitives called send and receive message. Messages are used to send or receive results computed in remote computers. Message passing systems are easy to design but difficult to program.

11. A very popular parallel computer is called a cluster of workstations (COW). In this architecture a set of off-the-shelf workstations (or server processor boards) are interconnected using an interconnection mechanism such as a gigabit Ethernet. It uses a standard operating system such as Linux and standard programming primitives which have been developed for message passing computers. This architecture is very popular as it is scalable to thousands of computers in a single parallel computing structure.

12. A comparative chart of parallel computers is given in Table 13.3 of the text.

EXERCISES

1. Parallel computers can be made using a small number of very powerful processors (e.g., an 8 processor Cray) or a large number of low speed microprocessors (say 10,000 Intel 8086 based microcomputers). What are the advantages and disadvantages of these two approaches?

2. Flynn's classification is based on 4 characteristics. Are they orthogonal? If not, what are the main objectives of this classification method?

3. Give some representative applications in which SIMD processing will be very effective.

4. Give some representative applications in which MISD processing will be very effective.

5. What is the difference between loosely coupled and tightly coupled parallel computers? Give one example of each of these parallel computer structures.

6. When a parallel computer has a single global address space, is it necessarily a Uniform Memory Access computer? If not, explain why it is not necessarily so.

7. What do you understand by grain size of computation? A 4 processor computer with shared memory carries out 100,000 instructions. The time to access the shared main memory is 10 clock cycles and the processors are capable of carrying out 1 instruction every clock cycle. If the loss of speed-up due to communication is to be kept below 15%, what should be the grain size of computation?

8. Repeat Exercise 7 assuming that the processors cooperate by exchanging messages and each message transaction takes 100 cycles.

9. A pipelined floating point adder is to be designed. Assume that exponent matching takes 0.1 ns, mantissa alignment 0.2 ns, adding mantissas 1 ns, and normalizing the result 0.2 ns. What is the highest clock speed which can be used to drive the adder? If two vectors of 100 components each are to be added using this adder, what will be the addition time?

10. Develop a block diagram for a pipelined multiplier to multiply two floating point numbers. Assuming times for each stage similar to the ones used in Exercise 9, determine the time to multiply two 100 component vectors.

11. A vector machine has a 6-stage pipelined arithmetic unit and a 10 ns clock. The time required to interpret and start executing a vector instruction is 60 ns. What should be the length of vectors to obtain 95% efficiency on vector processing?

12. What are the similarities and differences between vector processing and array processing? Give an application of each of these modes of computation in which their unique characteristics are essential.

13. A shared bus parallel computer with 16 PEs is to be designed using 32-bit RISC processors running at 500 MHz. Assume an average of 1.5 clocks per instruction. Each PE has a cache and the hit ratio to the cache is 98%. The main memory needs to be accessed only for inter-processor communication. If 10% of the instructions are loads and 15% are stores, what should be the bandwidth of the bus to ensure processing without bus saturation? Assume reasonable values of other necessary parameters (if any).

14. In Section 13.5 we stated that an atomic read-modify-write machine level instruction would be useful to implement barrier synchronization. Normally a counter is decremented by each process which reaches a barrier, and when the count becomes 0 all processes cross the barrier. Until then processes wait in busy wait loops. Write an assembly language program with a read-modify-write atomic instruction to implement barrier synchronization. (An illustrative sketch appears after this exercise list.)

15. Assume a 4 PE shared memory, shared bus, cache coherent parallel computer. The system is to use a write-invalidate protocol. Explain how lock and unlock primitives work with the write-invalidate cache coherence protocol.

16. A 4 PE shared bus computer executes the streams of instructions given below, where (RPE1) means read from PE1 and (WPE1) means write into PE1 memory:

Stream 1: (RPE1)(RPE2)(WPE1)(WPE2)(RPE3)(WPE3)(RPE4)(WPE4)
Stream 2: (RPE1)(RPE2)(RPE3)(RPE4)(WPE1)(WPE2)(WPE3)(WPE4)
Stream 3: (RPE1)(WPE1)(RPE2)(WPE2)(RPE3)(WPE3)(RPE4)(WPE4)
Stream 4: (RPE2)(WPE2)(RPE1)(RPE3)(WPE1)(RPE4)(WPE3)(WPE4)

All references are to the same location in memory. Assume that all caches are initially empty, that 100 cycles are needed to replace a cache line and 50 cycles for any transaction on the bus, and that if there is a cache hit the read/write time is 1 cycle. Estimate the number of cycles required to execute the streams above for the following two protocols:
(a) MESI protocol
(b) Write update protocol

17. Distinguish between UMA, NUMA and CC-NUMA parallel computer architectures. Give one example block diagram of each of these architectures. What are the advantages and disadvantages of each of these architectures? Do all of these architectures have a globally addressable shared memory? What is the programming model used by these parallel computers?

18. List the advantages and disadvantages of shared memory parallel computers using a bus for sharing memory and an interconnection network for sharing memory.

19. A shared memory parallel computer has 128 PEs and shares a main memory of 1 GB using an interconnection network. The cache block size is 64 B and the word length is 64 bits. What is the size of the directory used to ensure cache coherence? Discuss methods of reducing the directory size.

20. Obtain an algorithm to ensure cache coherence in a CC-NUMA parallel computer. Assume a write-invalidate protocol and write through caches.

21. A NUMA parallel computer has 64 CEs. Each CE has 32 MB memory. The time to read/write in local memory is 2 cycles. An overhead of 10 clock cycles is incurred for every remote transaction. A program has 500,000 instructions, out of which 20% are loads and 15% are stores. If the network bandwidth is 50 MB/s and 30% of the accesses are to remote computers, what is the total load/store time? How much will it be if all accesses are to local memory?

22. Take the data of Example 13.3 given in the text except the bandwidth of the interconnection network. If 15% of the accesses are to remote CEs, what should be the bandwidth of the interconnection network to keep the penalty due to remote access below 2? What should be the fixed overhead if the bandwidth is given as 1 GB/s?

23. What are the major differences between a message passing parallel computer and a NUMA parallel computer?

24. What are the advantages and disadvantages of a COW when compared with a message passing parallel computer?

25. A program is parallelised to be solved on a COW. There are 16 CEs in the COW. Each CE has a 750 MHz clock and an average of 0.75 instructions are carried out each cycle. The program is 800,000 instructions long and each CE carries out 50,000 instructions on the average. A message is sent by each CE once every 10,000 instructions. The average message size is 1,000 bytes and the fixed overhead for each message is 50,000 cycles. The bandwidth of the LAN is 100 MB/s. Assume that all messages do not occur at the same time but are staggered. Calculate the speed-up of this parallel machine.

26. Repeat Exercise 25 assuming that the LAN bandwidth is 1 GB/s. Compare the results with Exercise 25 and comment.

27. What are the advantages of using a NIC (which has a small microprocessor) in designing a COW? Make appropriate assumptions about the NIC and solve Exercise 25. Assume that the network bandwidth remains the same.
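The following sketch relates to Exercise 14. It is a minimal, one-shot barrier assuming IA-32 NASM syntax, with a lock-prefixed decrement serving as the atomic read-modify-write instruction; the counter name count, its initial value of 4, and the label names are hypothetical and not taken from the text.

    ; A one-shot barrier: 'count' is shared by all processes and is
    ; initialized to the number of processes expected at the barrier.
    section .data
    count   dd  4

    section .text
    barrier:
        lock dec dword [count]      ; atomic read-modify-write decrement
    .busy_wait:
        cmp dword [count], 0        ; busy wait until the count becomes 0
        jne .busy_wait
        ret

Since the counter is never reset, this barrier can be crossed only once; a reusable barrier would need a second counter or a sense-reversal flag.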

Appendix A

DECISION TABLE TERMINOLOGY

A decision table defines a logical procedure by means of a set of conditions and their related actions. The construction of a decision table begins with the listing of all the conditions relevant to the procedure and all the actions to be performed. All the conditions relevant to the procedure are listed in the condition stub and all the actions to be performed by the procedure are listed in an action stub. The actions are listed in the order in which they have to be performed. The table is divided into four quadrants by double lines (see Figure A.1).

FIGURE A.1 Decision table notation.

The next step is to determine which conditions, taken together, should lead to which actions. These are recorded on the right half of the decision table as a series of decision rules. The decision table (Table A.1) has 4 rules. The first rule is interpreted as follows: If the operation is add and s(x) = s(y) (i.e., the sign bits of x and y are equal), then add x and y including the sign bit and call the sum z. If z overflows, add the overflow bit

to z and go to Table T2. In Table T2 we go to Rule 1, which states: if the sign bit of the sum z, s(z), equals s(x), declare z as the answer and stop; else declare the answer wrong ("result out of range") and stop.

As another example, we will consider Rule 3. This rule states that if the operation is subtract and s(x) = s(y), then complement y (namely, take the 1's complement of y) and add the complement of y to x including the sign bit. To this sum the carry (if any) is added, and the sum obtained (including the sign bit) is declared as the answer z. In this case there is no need to go to Table T2.

Observe that X as an action entry indicates that the action specified must be carried out. On the other hand, a — as an action entry indicates that the specified action is not carried out. Also observe that in Rule 1 one of the actions is go to Table T2 for further checking.

TABLE A.1 A Decision Table for One's Complement Addition

                                        Rule 1   Rule 2   Rule 3   Rule 4
Operation                               Add      Add      Sub      Sub
s(x) = s(y)                             Y        N        Y        N
Complement y                            —        —        X        X
z = x + y (add including sign bit)      X        X        X        X
Add carry (if any) to z                 X        X        X        X
Declare z as answer                     —        X        X        —
Go To T2                                X        —        —        X
Stop                                    —        X        X        —

T2:                                     Rule 1   Rule 2
s(z) = s(x)                             Y        N
Declare z as answer                     X        —
Answer wrong (result out of range)      —        X
Stop                                    X        X
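The add and end-around-carry actions in Table A.1 map directly onto machine instructions. As a small illustration, the sketch below assumes IA-32 NASM syntax with two 16-bit one's complement numbers already loaded into ax and bx; it illustrates the rule actions and is not code from the text.

    ; z = x + y including the sign bit, then add the carry (if any)
    ; back into z -- the end-around carry of one's complement addition
    add ax, bx        ; sets the carry flag if the 16-bit add overflows
    adc ax, 0         ; adds the end-around carry into the sum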

Appendix B

PREPARATION, PROGRAMMING AND DEVELOPING AN ASSEMBLY LANGUAGE PROGRAM

B.1 PROGRAM DEVELOPMENT

The programs you develop in this course are small in size and they are developed by one person. In many real life applications, software systems are thousands (kilo) of lines of code, referred to as KLOC, and they are developed by a team of people working and sharing code among them. What one programmer writes or develops must be readable and understandable to a fellow team member. Thus, the assumptions and thinking process when developing the program must be explicitly documented. Of late, books are available from the field of software engineering which teach how software can be systematically developed by one person; this process is called the Personal Software Process [47]. This course is the right place for you to start practising good documentation habits.

You can document a variety of information to make your program readable and modifiable at a later time. For example, consider the following:
(1) The algorithm used to solve the problem.
(2) Constraints on the input under which the program will function well.
(3) The list of test cases you have used for testing your program.
(4) Correspondences between the data names in your program and the names of data used in the 'Problem Domain'.
(5) Names of the subroutines, the functions they perform, and the inputs to and outputs from each subroutine, etc.

Every program you write can be viewed as a means to solve 'some problem'. First think and understand clearly what this problem is. One way to clarify your

understanding is to write down the statement of the problem in plain English or French or in your own language. In order to solve this problem we need an algorithm. Generally speaking, there will be many different ways to solve a given problem and many algorithms may exist. When you write down the algorithm you may be making several implicit assumptions regarding the scope of the problem, the size and nature of the data it can handle, the generality or speciality of the solution method chosen, whether you are optimizing anything in your solution method, etc. Think about these things and write them down in a narrative form in small paragraphs. An example is shown in Figure B.1.

• This program is written to find the average marks of the students in the class SOEN 228. It takes marks as integers and computes the average.
• I am assuming that the number of students in the class is known and denoted as 'Numb', and it should be greater than zero.
• Each input has two integers: the student's ID number and his or her marks. The marks cannot be negative. Blanks will be treated as zero.
• The marks and ID numbers are stored in two integer arrays for any other processing for which my program may be modified in future. These arrays are called IDNUM and MARKS.
• The result is called AVER and it is displayed on the screen.

FIGURE B.1 Assumptions made in an algorithm.

B.2 PROGRAM PREPARATION

Once you have selected the algorithm to solve the given problem, it is a good idea to start by identifying the input data needed and the outputs that will be generated, and to give them names. Assembly language supports 'primitive data types'. They may be integers, floating point numbers, strings of characters, linear arrays of such elements, or matrices. You will learn about many other data structures in advanced courses. In the case of assembly language programming with NASM you have an option to decide the data size as well. For example, integers can be half word, full word, or double word. When the program is large, involving several subroutines, you can also describe in words or in a diagrammatic form which subroutine calls which other subroutines, what parameters are passed to the code, what results are returned by the code, and how the caller and called subroutines share the responsibility of maintaining the consistency of the 'context'. The chosen algorithm can be expressed in a step by step manner as a narrative text. It can also be expressed in a graphical or chart form like a 'flow chart'. There are other diagrammatic methods, like UML diagrams, which you will learn later in Software Engineering courses.
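To make the naming step concrete, a NASM data layout for the names used in Figure B.1 could look like the sketch below; the capacity of 100 students is an assumed bound for the illustration, not something fixed by the text.

    section .data
    Numb    dd  35          ; number of students; must be greater than zero
    section .bss
    IDNUM   resd 100        ; student ID numbers, one double word each
    MARKS   resd 100        ; marks as integers, parallel to IDNUM
    AVER    resd 1          ; the computed average, displayed at the end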

As a next step, you will develop a detailed description of the algorithm using the data names you have created. In the detailing process you might require several intermediate data items. You will name them mnemonically in a meaningful manner so as to make your program readable by anybody, or by yourself at a later date. It would be a good idea to describe the detailed algorithm in the form of a 'flow chart'.

As the third step, you will start writing assembly language instructions that will accomplish the steps in your flow chart. This is a creative process. You might find more than one way to solve a problem, and some ways may be better than others in some sense. As an example, consider the following two questions: "Should I use an ADD instruction or an INC instruction when I want to realize x = x + 1?" "I want to initialize SUM to 0. Should I use a 'move instruction' or XOR SUM with SUM, which will make it 0?" (NASM renderings of both alternatives are shown at the end of this section.) At the end of this step you should have a complete assembly language program written. Some people find it easier to use a paper-pencil-eraser method at this stage to create the program, whereas others may use a text editor to create the program using a computer online. If you have followed the former approach, then enter the program using an 'editor software'.

B.3 STORING THE PROGRAM AS A FILE

An entity such as an assembly language program, a collection of data, a text document, or an image in digital form is normally stored as a 'file' in the computer storage. Each file is normally associated with a file extension that denotes the type of that file. There can be hundreds and thousands of files in a computer system. Thus it becomes necessary to organize the collection of files in a hierarchical fashion. This leads to the idea of a 'directory'. There are directories, sub-directories, and sub-sub-directories when the collection is huge. File organization and directory organization are supported by the underlying Operating System that is used with a computer. Linux and Windows are two popular operating systems. The NASM assembler is software that runs under Linux; in this case you will use the Linux operating system. For example, at several universities Redhat Linux is installed. The first few things that you would learn are:
(a) How to get a user account and password from your college administration.
(b) Learn to log-on and log-off.
(c) Learn some of the basic operating system commands, such as: list the files in a directory, make a new directory, change the directory, create a new file, and make a copy of a file.
(d) Learn the basic operations needed in using an editor supported in the Linux environment, such as the 'vi' editor. Some relevant operations that you must learn are: invoke the editor to edit a given file name, save the edited file, quit the editor, move the cursor up, down, left and right, insert a line, delete a line, insert within a line and delete characters within a line, and cut and paste.
(e) Also learn how to invoke the NASM assembler under the Linux version that is installed on your computer.
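Returning to the two questions posed in the third step above, here is a minimal NASM rendering of both alternatives; the choice of the registers eax and ebx is only for illustration.

    add eax, 1        ; x = x + 1 using ADD with the constant 1
    inc eax           ; x = x + 1 using INC

    mov ebx, 0        ; SUM = 0 using a move instruction
    xor ebx, ebx      ; SUM = 0, since xor of a value with itself is 0

One practical difference: ADD updates the carry flag whereas INC leaves it unchanged, and xor ebx, ebx has a shorter encoding than mov ebx, 0.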

B.4 PREPARATIONS TO EXECUTE THE PROGRAM

The program that you have entered and stored in the above step is called the 'assembly language module'. Suppose you want to call that file 'hello'. The file extension for that file name must be '.asm'; therefore, it should be named 'hello.asm'. In the development of an assembly language program there are four stages: (1) Assembling, (2) Linking, (3) Loading and executing, and finally (4) Debugging and Testing. In the following commands, the beginning character $ is the prompt character:

(i) $ nasm -f elf hello.asm
This command invokes the assembling process. In this command line, the file hello.asm is taken as the input; -f and elf are parameters that we won't discuss here, and you are recommended to use them as they are. NASM develops an output file from this assembly process that is by default named hello.o.

(ii) $ ld -o hello hello.o
This command invokes the linking step. It takes the file named hello.o as the input and links it with the other necessary modules to create a file that can be loaded and executed. Once again, -o is a parameter that we do not discuss here but it must be used as is. The resulting file is also called 'hello' and has no file extension. Note that only executable files can be run or put into the execution state.

(iii) $ ./hello
This command directs the system to load the executable file called 'hello' into RAM and transfer execution control to its first instruction. From then on, the instructions are executed as per the program you have written, until termination. When the program terminates, the computer does not come to a halt state; instead, control is transferred to that part of the operating system known as the 'OS kernel'.

For further details, you should refer to the NASM manual. Every time you change the instructions in your program you will have to go through these four stages cyclically. Therefore, the more care you take in the first instance, the less time will be spent in the total development cycle.

B.5 TESTING AND DEBUGGING

When your program assembles without errors, there are no syntactic errors in the program. But there may be semantic or logic errors. Detecting the existence of logic errors, finding the 'bugs' that cause them, and then correcting the bugs so that the final program behaves the way it is expected to behave is commonly known as "testing and debugging". In order to test your program, you need 'test data'. You need to
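For concreteness, here is one possible hello.asm that works with the three commands above. It is a sketch under stated assumptions: a 32-bit Linux system and the elf output format (on a 64-bit installation the link step typically becomes ld -m elf_i386 -o hello hello.o); the message text is arbitrary.

    ; hello.asm -- a minimal program for the assemble-link-run cycle above
    section .data
    msg     db  "Hello, world!", 10      ; the message followed by a newline
    len     equ $ - msg                  ; length of the message in bytes

    section .text
    global _start                        ; entry point expected by ld
    _start:
        mov eax, 4                       ; system call number for write
        mov ebx, 1                       ; file descriptor 1 is the screen
        mov ecx, msg                     ; address of the message
        mov edx, len                     ; number of bytes to write
        int 0x80                         ; transfer control to the OS kernel
        mov eax, 1                       ; system call number for exit
        mov ebx, 0                       ; exit status 0
        int 0x80

When this program terminates, control passes to the OS kernel through the exit system call, exactly as described above.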

create a 'set of test data'. For each test data in the set, you must know the expected behaviour of your program. If the expected behaviour and the actual behaviour of the program match each other, there is no bug; otherwise there is a bug somewhere in your program and you have to find the location of the bug and the cause for it. This human process involves trial and error, intellectual guessing, and some reasoning.

A computer tool called a 'debugger' can assist you in this human process. A popular debugger is called 'gdb', and its GUI version, which is what you will be using, is called 'ddd'.

$ ddd ./testprogram

The above command will invoke the 'ddd' debugger on 'testprogram', so that it can be 'debugged'. Using 'ddd' you can do several things to your program while it is under execution; we mention below just four of those possibilities.
• You can set a 'break point' in your program. When the execution reaches that point in the program, the program will 'pause' its execution and control will be given to the debugger. You can do many things using the 'ddd' commands at that stage.
• From that point onwards, you can execute your program one 'step' at a time and keep examining the registers and key variables to observe whether they change as per your expectations. There are two options: stepi and nexti. 'Stepi' goes one instruction at a time whereas 'nexti' treats a function call as if it were a single instruction.
• Resume execution after the break point.
• View the contents of the registers when the program is paused, or view the contents of selected memory locations corresponding to some key data names.

In Figure B.2, observe the row of tabs in the menu bar (the top row). The tabs are named: File, Edit, View, Program, Commands, Status, Source, Data, Help. When you select one of the tabs, a drop-down menu appears. Explore these tabs and menu options. In Figure B.2, below the menu bar you will notice a row of 'icons'; 'break' is one of them, and it can be used to set break points. Selecting the "Machine Code Window" menu option from the "View" tab brings up a command window. In this command window you will see commands like Run, Interrupt, Stepi, Nexti, Cont, etc. In Figure B.3 we have shown the various windows and sub windows of the DDD debugger console; in Figures B.2 and B.3 the sub windows are labeled for convenience. Learning to use the DDD debugger requires both your exploration ability and some amount of guidance from your tutors in a tutorial class. Learn to use the simple things first, like viewing the register contents and selected memory locations. The reader should explore the various options by examining the 'DDD' manual and the graphical display called 'DDD: Debugger Console'.
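A first session with the debugger might look as follows. The -g option asks NASM to record debugging information (on some NASM versions a debug format option must be added); the remaining lines are standard gdb commands typed into the DDD console, and the break point location is only an example.

    $ nasm -f elf -g hello.asm      # assemble with debugging information
    $ ld -o hello hello.o
    $ ddd ./hello
    (gdb) break _start
    (gdb) run
    (gdb) stepi
    (gdb) info registers

Here break _start sets a break point at the entry label; run starts the program, which pauses at the break point; stepi executes exactly one machine instruction; and info registers displays the register contents.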

FIGURE B.2 DDD: Debugger console showing the tabs.

FIGURE B.3 DDD: Debugger console showing the sub windows.

REFERENCES

[1] Agrawala, A.K., and T.G. Rauscher, Foundations of Microprogramming Architecture, Academic Press, New York, 1976.
[2] Blakeslee, T.R., Digital Design with Standard MSI and LSI, Wiley-Interscience, New York, 1975.
[3] Brey, B.B., The Intel Microprocessors (8086/88, 80186/80188, 80286, 80386, 80486, Pentium and Pentium Pro Processor, Pentium II, Pentium III, Pentium 4): Architecture, Programming and Interfacing, 7th ed., Prentice-Hall of India, New Delhi, 2006.
[4] Carter, Paul, "Assembly Language", http://www.paulcarter.com/.
[5] Clare, C.R., Designing Logic Systems Using State Machines, McGraw-Hill, New York, 1973.
[6] Denning, P.J., "Virtual Memory", ACM Computing Surveys, 2, September 1970, pp. 153–187.
[7] Dhamdhere, D.M., Introduction to System Software, Tata McGraw-Hill, New Delhi, 1990.
[8] Hamacher, V.C., Z.G. Vranesic, and S.G. Zaky, Computer Organization, 5th ed., McGraw-Hill International, New York, 2002.
[9] Hayes, J.P., Computer Architecture and Organization, McGraw-Hill, New York, 1978.
[10] Hennessy, J.L., and D.A. Patterson, Computer Organization and Design: The Hardware/Software Interfaces, Morgan Kauffman, San Mateo, CA, 1994.
[11] Heuring, V.P., and H.F. Jordan, Computer Systems Design and Architecture, 2nd ed., Prentice-Hall of India, New Delhi, 2005.
[12] IA32 Intel Architecture—Software Developer's Manual, http://www.intel.com/design/pentium4/manuals/index_new.htm
[13] IEEE 754 Floating Point Standard.

[14] "Indian Script Code for Information Interchange–ISCII", Electronics Information and Planning (Department of Electronics, Government of India), February 1992, pp. 221–239.
[15] Kohavi, Z., Switching and Finite Automata Theory, 2nd ed., Tata McGraw-Hill, New Delhi, 1970.
[16] Kurtz, R.L., Interfacing Techniques in Digital Design, John Wiley & Sons, New York, 1988.
[17] Lala, P.K., Digital System Design Using Programmable Logic Devices, Prentice-Hall, New Jersey, 1990.
[18] Mano, M.M., and C. Kime, Logic and Computer Design Fundamentals, 2nd ed., Pearson Education, 2001.
[19] PC Guide, http://www.PCGuide.com.
[20] Peterson, W.W., Error Correcting Codes, MIT Press, Cambridge, MA, 1961.
[21] Rajaraman, V., and T. Radhakrishnan, Essentials of Assembly Language Programming for the IBM PC, Prentice-Hall of India, New Delhi, 2000.
[22] Rajaraman, V., Fundamentals of Computers, 4th ed., Prentice-Hall of India, New Delhi, 2004.
[23] Rajaraman, V., Introduction to Information Technology, Prentice-Hall of India, New Delhi, 2000.
[24] Silberschatz, A., and P.B. Galvin, Operating Systems, 5th ed., Addison-Wesley, Reading, MA, 1998.
[25] Stallings, W., Computer Organization and Architecture: Designing for Performance, 7th ed., Prentice-Hall of India, New Delhi, 2006.
[26] Stallings, W., Operating Systems, 4th ed., Prentice-Hall, New Jersey, 2001.
[27] Steinmetz, R., and K. Nahrstedt, Multimedia: Computing, Communication and Applications, Prentice-Hall of India, New Delhi, 2002.
[28] Wakerly, J.F., Digital Design—Principles and Practices, 3rd ed., Prentice-Hall of India, New Delhi, 2000.
[29] Hellerman, H., and T.F. Conroy, Computer Systems Performance, McGraw-Hill, New York, 1975.
[30] Baer, J.L., Computer System Architectures, Computer Science Press, USA, 1980.
[31] Rajaraman, V., and T. Radhakrishnan, Digital Logic and Computer Organization, Prentice-Hall of India, New Delhi, 2006.
[32] Geppert, L., "Transmeta's Magic Show", IEEE Spectrum, 37(5), 2000, pp. 23–26.

[33] Patterson, D.A., "Reduced Instruction Set Computers", Communications of the ACM, 28(1), 1985, pp. 8–21.
[34] See http://www.intel.com/design/intarch/papers/cache6.pdf/
[35] Rajaraman, V., and C. Siva Ram Murthy, Parallel Computers—Architecture and Programming, Prentice-Hall of India, New Delhi, 2006.
[36] Hennessy, J.L., and D.A. Patterson, Computer Architecture—A Quantitative Approach, Morgan Kauffman, San Mateo, CA, 2002.
[37] Fisher, J.A., "Trace Scheduling", IEEE Trans. on Computers, 30(7), 1981, pp. 478–90.
[38] Flynn, M.J., "Some Computer Organizations and Their Effectiveness", IEEE Trans. on Computers, 21(9), 1972, pp. 948–60.
[39] Rajaraman, V., Supercomputers, Universities Press, Hyderabad, 1999.
[40] Hord, M.R., Parallel Supercomputing in SIMD Architectures, CRC Press, 1990.
[41] Culler, D., J.P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware-Software Approach, Morgan Kauffman, San Francisco, CA, 1999.
[42] Lamport, L., "How To Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs", IEEE Trans. on Computers, 28(9), 1979, pp. 690–91.
[43] Censier, L.M., and P. Feautrier, "A New Solution to Cache Coherence Problems in Multiprocessor Systems", IEEE Trans. on Computers, 27(12), 1978, pp. 1112–18.
[44] Stenstrom, P., "A Survey of Cache Coherence Protocols for Multiprocessors", IEEE Computer, 23(6), 1990, pp. 12–29.
[45] For information on Beowulf clusters see: http://dune.mcs.kent.edu/~parallel/equip/beowulf
[46] Litzkow, M., et al., "Condor—A Hunter of Idle Workstations", Proc. IEEE Conf. on Distributed Computer Systems, July 1988, pp. 104–111.
[47] Humphrey, W.S., Introduction to Personal Software Process, Addison-Wesley Longman, Boston, 2002.