
80386

1.Salient features
Some of the limitations of the 80286 microprocessor are that it has only a 16-bit ALU, its maximum segment size is 64 Kbytes, and it cannot easily be switched back and forth between real and protected modes. The Intel 80386 microprocessor was designed to overcome these limits, while maintaining software compatibility with the 80286 and earlier processors. The 80386 has a 32-bit ALU, so it can operate directly on 32-bit data words. The processor can address up to four gigabytes of physical memory and 64 terabytes (2^46 bytes) of virtual memory. 80386 segments can be as large as 4 Gbytes and a program can have as many as 16,384 segments. The virtual address space is therefore 16,384 segments * 4 Gbytes, or about 64 Tbytes. The 80386 also has a virtual 8086 mode which allows it to switch easily back and forth between protected-mode tasks and tasks that run 8086 real-mode programs. The on-chip memory-management facilities of the 80386 include address translation registers, advanced multitasking hardware, a protection mechanism, and paged virtual memory.

The 80386 processor is available in two different versions, the 386DX and the 386SX. The 386DX has a 32-bit address bus and a 32-bit data bus. The 386SX has an architecture identical to the 386DX, with the difference that it has only a 16-bit data bus and a 24-bit address bus. The 80386DX addresses 4 Gbytes of memory with its 32-bit data bus and 32-bit address. The 80386SX, more like the 80286, addresses 16 Mbytes of memory with its 24-bit address bus via its 16-bit data bus. The 80386SX was developed after the 80386DX for applications that do not require the full 32-bit bus version. The 80386 has three processing modes: 1. Protected Mode, 2. Real-Address Mode, 3. Virtual 8086 Mode.

Protected mode is the natural 32-bit environment of the 80386 processor. In this mode all instructions and features are available. Real-address mode (often called just "real mode") is the mode of the processor immediately after RESET. In real mode the 80386 appears to programmers as a fast 8086 with some new instructions. Most applications of the 80386 will use real mode for initialization only. Virtual 8086 mode (also called V86 mode) is a dynamic mode in the sense that the processor can switch repeatedly and rapidly between V86 mode and protected mode. The CPU enters V86 mode from protected mode to execute an 8086 program, then leaves V86 mode and enters protected mode to continue executing a native 80386 program.

2.Architecture of 80386
The internal architecture of the 80386 is divided into three sections:
1. Central processing unit
2. Memory management unit
3. Bus interface unit

Central Processing Unit: The central processing unit is further divided into an execution unit and an instruction unit. The execution unit has 8 general purpose and 8 special purpose registers which are used either for handling data or for calculating offset addresses. The instruction unit decodes the opcode bytes received from the 16-byte instruction code queue and arranges them in a 3-instruction decoded-instruction queue. After decoding, they are passed to the control section for deriving the necessary control signals. The barrel shifter increases the speed of all shift and rotate operations. The multiply/divide logic implements the bit-shift-rotate algorithms to complete the operations in minimum time. Even 32-bit multiplications can be executed within one microsecond by the multiply/divide logic.
Memory Management Unit: The memory management unit consists of a segmentation unit and a paging unit. The segmentation unit allows the use of two address components, viz. segment and offset, for relocatability and sharing of code and data. The segmentation unit allows segments of size 4 Gbytes at maximum. The paging unit organizes the physical memory in terms of pages of 4 Kbytes size each. The paging unit works under the control of the segmentation unit, i.e. each segment is further divided into pages. The virtual memory is also organized in terms of segments and pages by the memory management unit. The segmentation unit provides a 4-level protection mechanism for protecting and isolating the system code and data from those of the application program. The paging unit converts linear addresses into physical addresses. The control and attribute PLA checks the privileges at the page level. Each of the pages maintains the paging information of the task. The limit and attribute PLA checks segment limits and attributes at the segment level to avoid invalid accesses to code and data in the memory segments.
Bus Control Unit: The bus control unit has a prioritizer to resolve the priority of the various bus requests. This controls the access of the bus. The address driver drives the bus enable signals BE0#-BE3# and the address signals A2-A31. The pipeline and dynamic bus sizing unit handles the related control signals. The data buffers interface the internal data bus with the system bus.

3. Signal Descriptions of 80386

CLK2: This input pin provides the basic system clock timing for the operation of the 80386.
D0-D31: These 32 lines act as a bidirectional data bus during different access cycles.
A31-A2: The address bus connections address any of the 4 Gbyte memory locations found in the 80386 memory system. A0 and A1 are encoded in the bank enable signals (BE0# to BE3#) to select any or all four bytes in a 32-bit-wide memory location.

BE0# to BE3#: The bank enable signals select the access of a byte, word, or double word of data. The 32-bit data bus supported by the 80386 and the memory system of the 80386 can be viewed as four 8-bit-wide memory banks. The four byte-bank enable lines BE0# to BE3# may be used for enabling these four banks. Using these four enable signal lines, the CPU may transfer 1, 2, 3, or 4 bytes of data simultaneously.
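As a rough sketch of how the byte count and the low address bits map onto the bank enables (the function and names below are illustrative, not part of any Intel specification), the active-low BE pattern for a transfer inside one aligned doubleword can be modelled in C as follows.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative model only: derive an active-low BE3#..BE0# pattern for a
       transfer of 'size' bytes (1 to 4) starting at byte position 'low2'
       (the value of address bits A1:A0) within one aligned doubleword.
       Bit 0 of the result stands for BE0#. */
    static uint8_t bank_enables(unsigned low2, unsigned size)
    {
        uint8_t selected = 0;
        for (unsigned i = 0; i < size && (low2 + i) < 4; i++)
            selected |= (uint8_t)(1u << (low2 + i));   /* banks taking part   */
        return (uint8_t)(~selected & 0x0Fu);           /* invert: active low  */
    }

    int main(void)
    {
        /* A 16-bit word at byte offset 2: BE3# and BE2# go low, so the
           pattern BE3#..BE0# is 0011 (0x3). */
        printf("BE3#..BE0# = 0x%X\n", bank_enables(2, 2));
        return 0;
    }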

M/IO#: Memory/IO# selects a memory device when a logic 1 or an I/O device when a logic 0.

W/R#: Write/Read indicates that the current cycle is a write when a logic 1 or a read when a logic 0.
ADS#: The address data strobe output pin indicates that the address bus and bus cycle definition pins (W/R#, D/C#, M/IO#, BE0# to BE3#) are carrying the respective valid signals. The 80386 does not have an ALE signal, so this signal may be used for latching the address into external latches.
READY#: The ready signal indicates to the CPU that the previous bus cycle has been terminated and the bus is ready for the next cycle. The signal is used to insert WAIT states in a bus cycle and is useful for interfacing slow devices with the CPU.
RESET: Reset initializes the 80386, causing it to begin executing software at memory location FFFFFFF0H. The 80386 is reset to real mode.
LOCK#: Lock becomes a logic 0 whenever an instruction is prefixed with the LOCK: prefix.
D/C#: Data/Control indicates that the data bus contains data for or from memory or I/O when a logic 1.
VCC: These are the system power supply lines.
VSS: These are the return lines for the power supply.
BS16#: The bus size 16 input pin allows the interfacing of 16-bit devices with the 32-bit-wide 80386 data bus. Successive 16-bit bus cycles may be executed to read a 32-bit data item from a peripheral.
HOLD: The bus hold input pin enables other bus masters to gain control of the system bus if it is asserted.
HLDA: The bus hold acknowledge output indicates that a valid bus hold request has been received and the bus has been relinquished by the CPU.
BUSY#: The busy input signal indicates to the CPU that the coprocessor is busy with the allocated task.
ERROR#: The error input pin indicates to the CPU that the coprocessor has encountered an error while executing its instruction.
PEREQ: The processor extension request output signal indicates to the CPU to fetch a data word for the coprocessor.
INTR: This interrupt pin is a maskable interrupt that can be masked using the IF bit of the flag register.
NMI: A valid request signal at the non-maskable interrupt request input pin internally generates a non-maskable interrupt of type 2.
N/C: No connection pins are expected to be left open while connecting the 80386 in the circuit.
NA#: Next address causes the 80386 to output the address of the next instruction or data in the current bus cycle.

4.Register Organisation

Fig:Register Bank of 80386

Fig:Flag register of 80386

The 80386 has eight 32-bit general purpose registers which may also be used as 8-bit or 16-bit registers. A 32-bit register, known as an extended register, is represented by the register name with the prefix E. Example: the 32-bit register corresponding to AX is EAX, similarly BX is EBX, etc. AX represents the lower 16 bits of the 32-bit register EAX. The 16-bit registers BP, SP, SI and DI of the 8086 are likewise available with their extended size of 32 bits and are named EBP, ESP, ESI and EDI; BP, SP, SI and DI represent the lower 16 bits of their 32-bit counterparts and can be used as independent 16-bit registers. The six segment registers available in the 80386 are CS, SS, DS, ES, FS and GS. CS and SS are the code and the stack segment registers respectively, while DS, ES, FS and GS are four data segment registers. A 16-bit instruction pointer IP is available along with its 32-bit counterpart EIP.
Flag Register of 80386: The flag register of the 80386 is a 32-bit register. Out of the 32 bits, Intel has reserved bits D18 to D31, D5 and D3, while D1 is always set at 1. Two new flags are added to the 80286 flag register to derive the flag register of the 80386. They are the VM and RF flags.
VM - Virtual Mode Flag: If this flag is set, the 80386 enters the virtual 8086 mode within the protection mode. This is to be set only when the 80386 is in protected mode. In this mode, if any privileged instruction is executed an exception 13 is generated. This bit can be set using the IRET instruction or any task switch operation, only in the protected mode.
RF - Resume Flag: This flag is used with the debug register breakpoints. It is checked at the start of every instruction cycle and, if it is set, any debug fault is ignored during the instruction cycle. The RF is automatically reset after successful execution of every instruction, except for IRET and POPF instructions. Also, it is not automatically cleared after the successful execution of JMP, CALL and INT instructions causing a task switch. These instructions are used to set the RF to the value specified by the memory data available at the stack.
Segment Descriptor Registers: These registers are not available to programmers; rather they are used internally to store the descriptor information, like attributes, limit and base addresses of segments. The six segment registers have six corresponding 73-bit descriptor registers. Each of them contains a 32-bit base address, a 32-bit base limit and 9-bit attributes. These are automatically loaded when the corresponding segments are loaded with selectors.
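For reference, the two new flags occupy fixed positions in the 32-bit flag register. The sketch below lists the standard EFLAGS bit assignments relevant to this discussion; the macro names themselves are illustrative.

    /* Standard 80386 EFLAGS bit positions (macro names are illustrative). */
    #define EFLAGS_CF   (1u << 0)    /* carry flag                          */
    #define EFLAGS_ZF   (1u << 6)    /* zero flag                           */
    #define EFLAGS_IF   (1u << 9)    /* interrupt enable, masks INTR        */
    #define EFLAGS_IOPL (3u << 12)   /* I/O privilege level (80286 onward)  */
    #define EFLAGS_NT   (1u << 14)   /* nested task (80286 onward)          */
    #define EFLAGS_RF   (1u << 16)   /* resume flag, new in the 80386       */
    #define EFLAGS_VM   (1u << 17)   /* virtual 8086 mode, new in the 80386 */

    /* Example: a flag image with VM = 1, of the kind a TSS would hold to
       enter virtual 8086 mode; bit 1 of the flag register is always 1. */
    static const unsigned int v86_flag_image = EFLAGS_VM | (1u << 1);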

Control Registers: The 80386 has three 32-bit control registers CR0, CR2 and CR3 to hold global machine status independent of the executed task. Load and store instructions are available to access these registers. CR0 corresponds to the Machine Status Word of the 80286. CR1 is not used in the 80386, but is reserved for future products. CR2 holds the linear page address of the last page accessed before a page fault interrupt. CR3 holds the base address of the page directory. Register CR0 contains a number of special control bits that are defined as follows:
1. PG: Selects page table translation of linear addresses into physical addresses when PG=1.
2. ET: Selects the 80287 coprocessor when ET=0 or the 80387 coprocessor when ET=1.
3. TS: Indicates that the 80386 has switched tasks.
4. EM: The emulate bit is set to cause a type 7 interrupt for each ESC instruction.
5. MP: Is set to indicate that the arithmetic coprocessor is present in the system.
6. PE: Is set to select the protected mode of operation of the 80386.
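A small sketch of where these control bits sit within CR0 (the bit positions are the standard 80386 assignments; the macro names are illustrative):

    /* 80386 CR0 bit positions for the control bits described above
       (macro names are illustrative). */
    #define CR0_PE (1u << 0)    /* protection enable             */
    #define CR0_MP (1u << 1)    /* math (coprocessor) present    */
    #define CR0_EM (1u << 2)    /* emulate coprocessor           */
    #define CR0_TS (1u << 3)    /* task switched                 */
    #define CR0_ET (1u << 4)    /* extension type: 1 = 80387     */
    #define CR0_PG (1u << 31)   /* paging enable                 */

    /* Page translation is in effect only when both PE and PG are set. */
    static int paging_in_effect(unsigned int cr0)
    {
        return (cr0 & CR0_PE) && (cr0 & CR0_PG);
    }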

System Address Registers: Four special registers are defined to refer to the descriptor tables supported by the 80386. The 80386 supports four types of descriptor table, viz. the global descriptor table (GDT), interrupt descriptor table (IDT), local descriptor table (LDT) and task state segment (TSS). The system address registers and system segment registers hold the addresses of these descriptor tables and the corresponding segments. These registers are known as GDTR, IDTR, LDTR and TR respectively. The GDTR and IDTR are called system address registers, while the LDTR and TR are called system segment registers.

Debug and Test Registers: Intel has provided a set of 8 debug registers for hardware debugging. Out of these eight registers DR0 to DR7, two registers, DR4 and DR5, are Intel reserved. The initial four registers DR0 to DR3 store four program-controllable breakpoint addresses, while DR6 and DR7 respectively hold breakpoint status and breakpoint control information. The breakpoint addresses, which may locate an instruction, are constantly compared with the addresses generated by the program. If a match occurs, the 80386 causes a type 1 interrupt (TRAP or debug interrupt) to occur. The debugging addresses are useful in debugging faulty software.

Test Registers: Two more test registers are provided by the 80386 for page caching, namely the test control and test status registers (TR6 and TR7). The test registers are used to test the translation lookaside buffer (TLB). The TLB is used with the paging unit within the 80386. The TLB, which holds the most commonly used page table address translations, reduces the number of memory reads required for looking up page translation addresses in the page translation tables. TR6 holds the tag field (linear address) of the TLB and TR7 holds the physical address of the TLB.

The bits found in TR6 and TR7 are:
V: shows that the entry in the TLB is valid.
D: indicates that the entry in the TLB is invalid or dirty.
U: a bit for the TLB.
W: indicates that the area addressed by the TLB entry is writable.
C: selects a write (0) or immediate lookup (1) for the TLB.
PL: indicates a hit if a logic 1.
REP: selects which block of the TLB is written.

5.Addressing Modes: The 80386 supports eleven addressing modes in all, to facilitate efficient execution of high-level language programs. The 80386 has all the addressing modes which were available in the 80286. In all those modes, the 80386 can now have 32-bit immediate or 32-bit register operands or displacements. Besides these, the 80386 has a family of scaled modes. In the scaled modes, any of the index register values can be multiplied by a valid scale factor to obtain the displacement. The valid scale factors are 1, 2, 4 and 8. The different scaled modes are discussed as follows.
Scaled Indexed Mode: The contents of an index register are multiplied by a scale factor, and a displacement may be added further, to get the operand offset. Example: MOV EBX, LIST[ESI*2]
Based Scaled Indexed Mode: The contents of an index register are multiplied by a scale factor and then added to a base register to obtain the offset. Example: MOV EBX, [EDX*4][ECX]
Based Scaled Indexed Mode with Displacement: The contents of an index register are multiplied by a scaling factor and the result is added to a base register and a displacement to get the offset of an operand. Example: MOV EAX, LIST[ESI*2][EBX+0800]
The displacement may be any 8-bit or 32-bit immediate number. The base and index register may be any general purpose register except ESP.
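In C terms, the offset produced by the scaled modes is just displacement + base + index * scale; a minimal sketch of the last example above (the register values and the LIST displacement are illustrative inputs):

    #include <stdint.h>

    /* Offset computed by MOV EAX, LIST[ESI*2][EBX+0800] in the based scaled
       indexed mode with displacement; scale may be 1, 2, 4 or 8. */
    static uint32_t scaled_offset(uint32_t list_disp, uint32_t ebx,
                                  uint32_t esi, uint32_t scale)
    {
        return list_disp + 0x0800u + ebx + esi * scale;
    }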

6.Data Types:
1. Bit
2. Bit Field: A group of at most 32 bits.
3. Bit String: A string of contiguous bits of maximum 4 Gbytes in length.
4. Signed Byte: Signed byte data.
5. Unsigned Byte: Unsigned byte data.
6. Integer Word: Signed 16-bit data.
7. Long Integer: 32-bit signed data represented in 2's complement form.
8. Unsigned Integer Word: Unsigned 16-bit data.
9. Unsigned Long Integer: Unsigned 32-bit data.
10. Signed Quad Word: A signed 64-bit (four word) data item.
11. Unsigned Quad Word: An unsigned 64-bit data item.
12. Offset: A 16- or 32-bit displacement that references a memory location using any of the addressing modes.
13. Pointer: This consists of a pair of a 16-bit selector and a 16/32-bit offset.
14. Character: An ASCII equivalent of any of the alphanumeric or control characters.
15. Strings: These are sequences of bytes, words or double words. A string may contain a minimum of one byte and a maximum of 4 Gbytes.
16. BCD: Decimal digits from 0-9 represented by unpacked bytes.
17. Packed BCD: This represents two packed BCD digits using a byte, i.e. from 00 to 99.

7.Real Address Mode of 80386: After reset, the 80386 starts from memory location FFFFFFF0H under the real address mode. In the real mode, the 80386 works as a fast 8086 with 32-bit registers and data types. In real mode, the default operand size is 16 bits, but 32-bit operands and addressing modes may be used with the help of override prefixes. The segment size in real mode is 64 Kbytes, hence 32-bit effective addresses must not exceed 0000FFFFH. Real mode initializes the 80386 and prepares it for protected mode.

Memory Addressing in Real Mode: In the real mode, the 80386 can address at most 1 Mbyte of physical memory using address lines A0-A19. The paging unit is disabled in real address mode, and hence the real addresses are the same as the physical addresses. To form a physical memory address, the appropriate segment register contents (16 bits) are shifted left by four bit positions and then added to the 16-bit offset address formed using one of the addressing modes, in the same way as in the 8086 real address mode. A segment in 80386 real mode can be read, written or executed, i.e. no protection is available. Any fetch or access past the end of the segment limit generates exception 13 in real address mode. The segments in 80386 real mode may be overlapped or non-overlapped. The interrupt vector table of the 80386 has been allocated 1 Kbyte of space starting from 00000H to 003FFH.
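A minimal sketch of the real-mode address calculation described above (segment value shifted left by four bit positions, then added to the offset):

    #include <stdint.h>

    /* Real-mode physical address = (segment << 4) + offset. */
    static uint32_t real_mode_address(uint16_t segment, uint16_t offset)
    {
        return ((uint32_t)segment << 4) + offset;
    }

    /* Example: segment 2000H with offset 0100H gives physical address 20100H. */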

8.Protected Mode of 80386
All the capabilities of the 80386 are available for use in its protected mode of operation. The 80386 in protected mode supports all the software written for the 80286 and 8086 to be executed under the control of the memory management and protection abilities of the 80386. The 80386 can address 4 Gbytes of physical memory and 64 terabytes of virtual memory per task.
ADDRESSING IN PROTECTED MODE: In this mode, the contents of the segment registers are used as selectors to address descriptors which contain the segment limit, base address and access rights byte of the segment. The effective address (offset) is added to the segment base address to calculate the linear address. This linear address is used directly as the physical address if the paging unit is disabled; otherwise the paging unit converts the linear address into the physical address. The paging unit is a memory management unit enabled only in protected mode. The paging mechanism allows handling of large segments of memory in terms of pages of 4 Kbyte size.
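A rough sketch of the protected-mode address path just described, assuming a hypothetical descriptor structure with only the fields needed here (the paging step itself is sketched separately in section 10.2):

    #include <stdint.h>

    /* Hypothetical descriptor holding only the fields used in this sketch. */
    struct segment_descriptor {
        uint32_t base;    /* 32-bit segment base address */
        uint32_t limit;   /* segment limit               */
    };

    /* The selector chooses a descriptor; linear address = base + offset.
       With the paging unit disabled, this linear address is used directly
       as the physical address.  Limit and access-rights checks are omitted. */
    static uint32_t linear_address(const struct segment_descriptor *desc,
                                   uint32_t offset)
    {
        return desc->base + offset;
    }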

9.Segmentation
Segmentation is a way of offering protection to different types of data and code.
DESCRIPTOR TABLES: These descriptor tables and registers are manipulated by the operating system to ensure the correct operation of the processor, and hence the correct execution of programs. The three types of 80386 descriptor tables are the GLOBAL DESCRIPTOR TABLE (GDT), the LOCAL DESCRIPTOR TABLE (LDT) and the INTERRUPT DESCRIPTOR TABLE (IDT).
DESCRIPTORS: The 80386 descriptors have a 20-bit segment limit and a 32-bit segment address. The descriptors of the 80386 are 8-byte quantities containing access rights or attribute bits along with the base and limit of the segments.
Descriptor Attribute Bits: The A (accessed) attribute bit indicates whether the segment has been accessed by the CPU or not.

The TYPE field decides the descriptor type and hence the segment type.
The S bit decides whether it is a system descriptor (S=0) or a code/data segment descriptor (S=1).
The DPL field specifies the descriptor privilege level.
The D bit specifies the code segment operation size. If D=1, the segment is a 32-bit operand segment; else, it is a 16-bit operand segment.
The P (present) bit signifies whether the segment is present in the physical memory or not. If P=1, the segment is present in the physical memory.
The G (granularity) bit indicates whether the segment is page addressable. The zero bit must remain zero for compatibility with future processors. If G=0, the number stored in the limit is interpreted directly as a limit, allowing it to contain any limit between 00000H and FFFFFH for a segment size up to 1 Mbyte. If G=1, the number stored in the limit is interpreted as 00000XXXH-FFFFFXXXH, where XXX is any value between 000H and FFFH.
The AVL (available) field specifies whether the descriptor is for the user or for the operating system.
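The effect of the G bit on the limit field can be written out directly; a small sketch, assuming the raw 20-bit limit value from the descriptor as input:

    #include <stdint.h>

    /* Effective segment limit from the 20-bit limit field and the G bit.
       G = 0: the limit is taken in bytes (00000H..FFFFFH, up to 1 Mbyte).
       G = 1: the limit is taken in 4 Kbyte units, so the low 12 bits read
              as ones (00000XXXH..FFFFFXXXH), allowing a 4 Gbyte segment. */
    static uint32_t effective_limit(uint32_t limit20, int g_bit)
    {
        return g_bit ? ((limit20 << 12) | 0xFFFu) : limit20;
    }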

The 80386 has five types of descriptors, listed as follows:
1. Code or Data Segment Descriptors
2. System Descriptors
3. Local Descriptors
4. TSS (Task State Segment) Descriptors
5. GATE Descriptors
The 80386 provides a four-level protection mechanism exactly in the same way as the 80286 does.

10.Paging
10.1 PAGING OPERATION: Paging is one of the memory management techniques used in virtual memory multitasking operating systems. The segmentation scheme may divide the physical memory into variable-size segments, but paging divides the memory into fixed-size pages. The segments are supposed to be the logical segments of the program, but the pages do not have any logical relation with the program. The pages are just fixed-size portions of the program module or data. The advantage of the paging scheme is that the complete segment of a task need not be in the physical memory at any time. Only the few pages of the segments which are currently required for execution need to be available in the physical memory. Thus the memory requirement of the task is substantially reduced, relinquishing the available memory for other tasks. Whenever the other pages of a task are required for execution, they may be fetched from secondary storage. The previously executed pages need not remain in the memory, and hence the space occupied by them may be relinquished for other tasks.
Paging Unit: The paging unit of the 80386 uses a two-level table mechanism to convert the linear addresses provided by the segmentation unit into physical addresses. The paging unit converts the complete map of a task into pages, each of size 4 Kbytes. The task is then handled in terms of its pages rather than segments. The paging unit handles every task in terms of three components, namely the page directory, the page tables and the pages themselves.

Paging Descriptor Base Register: The control register CR2 is used to store the 32-bit linear address at which the previous page fault was detected. CR3 is used as the page directory physical base address register, to store the physical starting address of the page directory. The lower 12 bits of CR3 are always zero to ensure a page-size-aligned directory. A move operation to CR3, or a task switch that reloads CR3, automatically refreshes the page table entry cache so that stale translations are not used.
Page Directory: This is at most 4 Kbytes in size. Each directory entry is of 4 bytes, thus a total of 1024 entries are allowed in a directory. The upper 10 bits of the linear address are used as an index to the corresponding page directory entry. The page directory entries point to page tables.
Page Tables: Each page table is 4 Kbytes in size and may contain a maximum of 1024 entries. The page table entries contain the starting address of the page and statistical information about the page. The upper 20-bit page frame address is combined with the lower 12 bits of the linear address. The linear address bits A12-A21 are used to select one of the 1024 page table entries. A page table can be shared between tasks. The P bit of the above entries indicates whether the entry can be used in address translation. If P=1, the entry can be used in address translation, otherwise it cannot be used. The P bit of the currently executed page is always high. The accessed bit A is set by the 80386 before any access to the page. If A=1, the page has been accessed, else it is unaccessed. The D (dirty) bit is set before a write operation to the page is carried out. The D bit is undefined for page directory entries. The OS-reserved bits are defined by the operating system software.

The User / Supervisor (U/S) bit and read/write bit are used to provide protection. These bits are decoded to provide protection under the 4 level protection model.The level 0 is supposed to have the highest privilege, while the level 3 is supposed to have the least privilege.

10.2 Conversion of a Linear address to a physical address:-

The paging unit receives a 32-bit linear address from the segmentation unit. For optimizing the conversion process, a 32-entry page table cache is provided which stores the 32 most recently accessed page table entries; this page table cache is known as the translation lookaside buffer (TLB). Whenever a linear address is to be converted to a physical address, the upper 20 linear address bits (A12-A31) are first compared with the entries in the TLB to check whether any of them matches. If a match is found, the 32-bit physical address is calculated from the matching TLB entry and placed on the address bus. If the page table entry is not in the TLB, the 80386 reads the appropriate page directory entry. It then checks the P bit of the directory entry. If P=1, it indicates that the page table is in memory. The 80386 then refers to the appropriate page table entry and sets the accessed bit A. If P=1 in the page table entry, the page is available in the memory. The processor then updates the A and D bits and accesses the page. The upper 20 bits of the linear address, read from the page table, are stored in the TLB for possible future access. If P=0, the processor generates a page fault, exception number 14.
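The two-level translation described above can be sketched in C. The read_dword() helper standing in for a physical memory read is assumed, and the TLB lookup and the A/D bit updates are left out for brevity.

    #include <stdint.h>

    extern uint32_t read_dword(uint32_t phys_addr);   /* assumed memory read */

    #define P_BIT 0x1u   /* present bit of a directory or page table entry */

    /* Illustrative two-level 80386 page walk.  Returns 0 and fills *phys on
       success, or 14 (the page fault exception number) if a P bit is clear. */
    static int translate(uint32_t cr3, uint32_t linear, uint32_t *phys)
    {
        uint32_t dir_index   = (linear >> 22) & 0x3FFu;   /* upper 10 bits  */
        uint32_t table_index = (linear >> 12) & 0x3FFu;   /* bits A12-A21   */
        uint32_t page_offset =  linear        & 0xFFFu;   /* lower 12 bits  */

        uint32_t pde = read_dword((cr3 & 0xFFFFF000u) + dir_index * 4);
        if (!(pde & P_BIT))
            return 14;                                    /* page table absent */

        uint32_t pte = read_dword((pde & 0xFFFFF000u) + table_index * 4);
        if (!(pte & P_BIT))
            return 14;                                    /* page absent       */

        *phys = (pte & 0xFFFFF000u) | page_offset;        /* frame + offset    */
        return 0;
    }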

11. Virtual 8086 Mode:-

In its protected mode of operation, the 80386DX provides a virtual 8086 operating environment to execute 8086 programs. The real mode can also be used to execute 8086 programs, along with the capabilities of the 80386 like protection and a few additional instructions. However, once the 80386 enters protected mode from real mode, it cannot return back to real mode without a reset operation. Thus, the virtual 8086 mode of operation of the 80386 offers the advantage of executing 8086 programs while in protected mode. The address forming mechanism in virtual 8086 mode is exactly identical to that of 8086 real mode. In virtual mode, an 8086 program can address 1 Mbyte of physical memory that may be anywhere in the 4 Gbyte address space of the protected mode of the 80386. Like 80386 real mode, the addresses in virtual 8086 mode lie within 1 Mbyte of memory. In virtual mode, the paging mechanism and protection capabilities are available at the service of the programmer. The 80386 supports multiprogramming, hence more than one program may use the CPU at a time. The paging unit need not necessarily be enabled in virtual mode, but may be needed in order to run 8086 programs which require more than 1 Mbyte of memory, or for memory management functions. In virtual mode, the paging unit allows only 256 pages, each of 4 Kbyte size. Each of the pages may be located anywhere within the maximum 4 Gbyte physical memory. The virtual mode allows multiprogramming of 8086 applications. The virtual 8086 mode executes all programs at privilege level 3, so any of the other programs may deny access to the virtual mode programs or data. However, the real mode programs are executed at the highest privilege level, i.e. level 0. The virtual mode may be entered using an IRET instruction at CPL=0, or by a task switch at any CPL executing a task whose TSS holds a flag image with the VM flag set to 1. The IRET instruction may be used to set the VM flag and consequently enter the virtual mode. The PUSHF and POPF instructions are unable to read or set the VM bit, as they do not access it. Even in the virtual mode, all the interrupts and exceptions are handled by the protected mode interrupt handlers.

To return to the protected mode from the virtual mode, any interrupt or exception may be used. As a part of the interrupt service routine, the VM bit may be reset to zero to pull the 80386 back into protected mode. The main difference between 80386 protected mode and virtual 8086 mode is the way the segment registers are interpreted by the microprocessor.

12.Enhanced Instruction Set: The instruction set of the 80386 contains all the instructions supported by the 80286. The 80286 instructions are designed to operate with 8-bit or 16-bit data, while the same mnemonics in the 80386 instruction set may be executed over 32-bit operands, besides 8-bit and 16-bit operands. The newly added instructions may be categorized into the following functional groups.

1. Bit scan instructions: BSF (bit scan forward), BSR (bit scan reverse). They scan the operand for a 1 bit without actually rotating it. BSF scans from right to left (bit 0 upwards) and BSR from left to right. If a 1 bit is found, the zero flag is cleared and its position is stored in the destination operand; if the operand contains no 1 bit, the zero flag is set (a behavioural sketch follows this list).
2. Bit test instructions: BT (test a bit), BTC (test a bit and complement), BTR (test a bit and reset) and BTS (test a bit and set). These instructions test the bit position in the destination operand specified by the source operand and copy it into the carry flag: if the selected bit is 1, the carry flag is set. BTC, BTR and BTS then complement, clear or set that bit respectively.
3. Conditional set byte instructions: Set the byte operand to 1 if the condition specified by the mnemonic is true, and clear it to 0 otherwise. Examples are SETO (set on overflow), SETNO (set on no overflow), and so on.
4. Shift double instructions: Shift a specified number of bits from the source operand into the destination operand. SHLD (shift left double) shifts the specified number of bits from the MSB end of the source into the LSB end of the destination; SHRD (shift right double) shifts from the LSB end of the source into the MSB end of the destination.
5. Control transfer via gates instructions: CALL and JUMP instructions in protected mode, which transfer control at the same or a different privilege level.
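As a behavioural sketch (not Intel's implementation) of the bit scan forward instruction described in item 1, BSF can be modelled as:

    #include <stdint.h>

    /* Behavioural model of BSF (bit scan forward): scan from bit 0 upward.
       Returns the zero flag value: 1 if the source is all zeros (destination
       left unchanged), 0 if a set bit was found and its index stored. */
    static int bsf_model(uint32_t src, uint32_t *dest)
    {
        if (src == 0)
            return 1;                     /* ZF set: no 1 bit found       */
        uint32_t index = 0;
        while (!(src & (1u << index)))
            index++;
        *dest = index;                    /* position of the lowest 1 bit */
        return 0;                         /* ZF cleared                   */
    }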

80486
The 80486 is a highly integrated device, containing well over 1.2 million transistors. It contains a memory management unit (MMU), a complete numeric coprocessor and a high-speed cache memory of 8 Kbytes. The 80486 is available as the 80486DX and the 80486SX. The only difference is that the 80486SX does not contain the numeric coprocessor, which reduces its price. The 80486DX is the first CPU with an on-chip floating-point unit. For fast execution of the complex instructions of the x86 family, the 80486 introduced a five-stage pipeline. Two out of the five stages are used for decoding the complex instructions of the x86 architecture. This feature, which has been used widely in RISC architectures, results in very fast instruction execution. The 80486 is also the first among the x86 processors to have an on-chip cache. This 8 Kbyte cache is a unified data and code cache and acts on physical addresses.

Architecture of 80486

The 32-bit pipelined architecture of Intel's 80486 is shown in the figure. The internal architecture of the 80486 can be broadly divided into three sections, namely the bus interface unit, the execution and control unit and the floating point unit. The bus interface unit is mainly responsible for coordinating all the bus activities. The address driver interfaces the internal 32-bit address output of the cache unit with the system bus. The data bus transceivers interface the internal 32-bit data bus with the system bus. The 4 x 80 write data buffer is a queue of four 80-bit registers which hold the 80-bit data to be written to memory. The bus control and request sequencer handles signals like ADS#, W/R#, D/C#, M/IO#, PCD, PWT, RDY#, LOCK#, PLOCK#, BOFF#, A20M#, BREQ, HOLD, HLDA, RESET, INTR, NMI, FERR# and IGNNE#, which basically control the bus access and operations. The burst control signal BRDY# informs the processor that the burst data is ready. The BLAST# output indicates to the external system that the previous burst cycle is over. The bus size control signals BS16# and BS8# are used for dynamic bus sizing. The cache control signals KEN#, FLUSH#, AHOLD and EADS# control and maintain the on-chip cache in coordination with the cache control unit. The parity generation and control unit maintains the parity and carries out the related checks during processor operation. The boundary scan control unit, which is built into the 50 MHz and advanced versions only, subjects the processor operation to boundary scan tests to ensure the correct operation of the various components of the circuit on the motherboard, provided the TCK input is not tied high.

The prefetcher unit fetches the code from memory ahead of execution time and arranges it in a 32-byte code queue. The instruction decoder gets the code from the code queue and then decodes it sequentially. The output of the decoder drives the control unit to derive the control signals required for the execution of the decoded instructions. But prior to execution, the protection unit checks if there is any violation of protection norms. If any protection norm is violated, an appropriate exception is generated. The control ROM stores a microprogram for deriving control signals for the execution of different instructions. The register bank and ALU are used for their conventional purposes. The barrel shifter helps in implementing the shift and rotate algorithms. The segmentation unit, descriptor registers, paging unit, translation lookaside buffer and limit and attribute PLA work together to manage the virtual memory of the system and provide adequate protection to the code or data in the physical memory. The floating-point unit with its register bank communicates with the bus interface unit under the control of the memory management unit, via its 64-bit internal data bus. The floating-point unit is responsible for carrying out mathematical data processing at a higher speed as compared to the ALU, with its built-in floating-point algorithms.

Signal Descriptions of 80486

Timing Signal CLK: This input provides the basic system timing for the operation of the 80486.
Address Bus A31-A2: These are the address lines of the microprocessor, and are used for selecting memory and I/O devices. However, for memory/IO addressing we also need another set of signals known as the byte enable signals BE0#-BE3#. These active-low byte enable signals indicate which bytes of the 32-bit data bus are active during the read or write cycle.
Data Bus D0-D31: This is a bidirectional data bus with D0 as the least and D31 as the most significant data bit.
Data Parity Group: The pins of this group of signals are important because they are used to detect parity errors during memory read and write operations.
DP0-DP3: These four data parity input/output pins represent the individual parity of the four bytes (32 bits) of the data bus. Data parity I/O provides even parity for a write operation and checks parity for a read operation.
BRDY#: The burst ready input is used to signal the microprocessor that a burst cycle is complete.
BREQ: The bus request output indicates that the 80486 has generated an internal bus request.
BLAST#: The burst last output shows that the burst bus cycle is complete on the next activation of the BRDY# signal.
M/IO#: This output pin differentiates between memory and I/O operations.
D/C#: This output pin differentiates between data and control operations.
W/R#: This output pin differentiates between read and write bus cycles.
PLOCK#: This pseudo-lock pin indicates that the current operation may require more than one bus cycle for its completion. The bus is to be locked until then.
LOCK#: This output pin indicates that the current bus cycle is locked.

ADS#: The address status output pin indicates that a valid bus cycle definition and address are currently available on the corresponding pins.
RDY#: This input pin acts as a ready signal for the current non-burst cycle.
BRDY# and BLAST#: Refer to the Architecture of 80486 section.
RESET: This input pin resets the processor if it goes high.
INTR: This is a maskable interrupt input that is controlled by the IF bit in the flag register.
NMI: This is a non-maskable interrupt input, of type 2.
BREQ: This active-high output indicates that the 80486 has generated a bus request.
HOLD: This pin acts as a local bus hold input, to be activated by another bus master, like a DMA controller, in order to gain control of the system bus.
HLDA: This is an output that acknowledges the receipt of a valid HOLD request.
BOFF#: When a CPU requests access to the bus, and the bus is granted to it, the bus master which is currently in charge of the bus is asked to back off, i.e. release the bus.
AHOLD: The address hold request input pin enables other bus masters to use the 80486 address bus during a cache invalidation cycle.
EADS#: The external address input signal indicates that a valid address for an external bus cycle is available on the address bus.
KEN#: The cache enable input pin is used to determine whether the current cycle is cacheable or not.
FLUSH#: The cache flush input, if activated, clears the cache contents and validity bits.
PCD, PWT: The page cache disable and page write-through output pins reflect the status of the corresponding bits in the page table or page directory entry.
FPU Error Group:
FERR#: The FERR# output pin is activated if the floating point unit reports any error.
IGNNE#: If the ignore numeric processor extension input pin is activated, the 80486 ignores numeric processor errors and continues executing non-control floating-point instructions.
BS8# and BS16#: The bus size-8 and bus size-16 inputs are used for the dynamic bus sizing feature of the 80486. These two pins enable the 80486 to be interfaced with 8-bit or 16-bit devices though the CPU has a bus width of 32 bits.

A20M#: If this input pin is activated, the 80486 masks the physical address line A20 before carrying out any memory or cache cycle.
Vcc: In all, 24 pins are allocated for the power supply.
Vss: These act as return lines for the power supply. In all, 28 pins are allocated for the power supply return lines.
N/C: No connection pins are expected to be left open while connecting the 80486 in the circuit.

Features of 80486:
1. Parity checker/generator: Parity is often used to determine whether data has been correctly read from a memory location. To facilitate this, an internal parity generator is added. Parity is generated by the 80486 during each write cycle. Parity is generated as even parity, and a parity bit is provided for each byte of memory. The parity check bits appear on pins DP0-DP3, which are parity inputs as well as outputs. On a read, the microprocessor checks parity and, if a parity error occurs, signals it on the PCHK# pin. A parity error causes no change in processing unless the user applies the PCHK# signal to an interrupt input.
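Even parity per byte simply means that the parity bit is chosen so that the eight data bits plus the parity bit contain an even number of ones; a minimal sketch:

    #include <stdint.h>

    /* Even-parity bit for one data byte, as carried on DP0-DP3 (one parity
       bit per byte of the 32-bit bus): chosen so that the 8 data bits plus
       the parity bit together hold an even number of ones. */
    static unsigned even_parity_bit(uint8_t data)
    {
        unsigned ones = 0;
        for (int i = 0; i < 8; i++)
            ones += (data >> i) & 1u;
        return ones & 1u;   /* 1 when the data byte has an odd count of ones */
    }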

2. Cache Memory:-

The cache memory system caches data used by a program and also the instructions of the program. Control register 0 (CR0) is used to control the cache with two new control bits: cache disable (CD) and no write-through (NW). If the CD bit is 1, all cache operations are inhibited. The NW bit is used to inhibit cache write-through operations.
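The two cache control bits sit near the top of CR0 on the 80486; the positions below are the usual assignments (macro names are illustrative):

    /* 80486 cache control bits in CR0 (macro names are illustrative). */
    #define CR0_NW (1u << 29)   /* no (cache) write-through */
    #define CR0_CD (1u << 30)   /* cache disable            */

    /* Per the description above: CD = 1 inhibits cache operations,
       NW = 1 inhibits cache write-through. */
    static int cache_inhibited(unsigned int cr0)
    {
        return (cr0 & CR0_CD) != 0;
    }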

MODULE 2: REDUCED INSTRUCTION SET COMPUTERS

1.Instruction execution characteristics: One of the most visible forms of evolution associated with computers is that of programming languages. As the cost of hardware has dropped, the relative cost of software has risen. High-level languages (HLLs) allow the programmer to express algorithms more concisely, take care of much of the detail, and often support naturally the use of structured programming or object-oriented design. This solution, however, gave rise to another problem, known as the semantic gap: the difference between the operations provided in HLLs and those provided in computer architecture. Symptoms of this gap are alleged to include execution inefficiency, excessive machine program size, and compiler complexity. Designers responded with architectures whose key features include large instruction sets, dozens of addressing modes, and various HLL statements implemented in hardware. Such complex instruction sets are intended to ease the task of the compiler writer, to improve execution efficiency (because complex sequences of operations can be implemented in microcode), and to provide support for even more complex and sophisticated HLLs.
The instruction characteristics of interest are:
1. Operations performed: These determine the functions to be performed by the processor and its interaction with memory.
2. Operands used: The types of operands and the frequency of their use determine the memory organization for storing them and the addressing modes for accessing them.
3. Execution sequencing: This determines the control and pipeline organization.
a. Operations: Assignment statements predominate, suggesting that the simple movement of data is of high importance. There is also a preponderance of conditional statements (IF, LOOP). These statements are implemented in machine language with some sort of compare and branch instruction.
b. Operands: Most operand references are to simple scalar variables, and more than 80% of the scalars were local (to the procedure) variables. In addition, references to arrays/structures require a previous reference to their index or pointer, which again is usually a local scalar. Thus, there is a preponderance of references to scalars, and these are highly localized. A prime candidate for optimization is the mechanism for storing and accessing local scalar variables.
c. Procedure Calls: Procedure calls and returns are an important aspect of HLL programs; they are among the most time-consuming operations in compiled HLL programs, so it is profitable to consider ways of implementing these operations efficiently. Two aspects are significant: the number of parameters and variables that a procedure deals with, and the depth of nesting. The studies also show that most programs do not do a lot of calls followed by lots of returns, and that most variables are local.

THE USE OF A LARGE REGISTER FILE
The reason that register storage is indicated is that it is the fastest available storage device, faster than both main memory and cache. The register file is physically small, on the same chip as the ALU and control unit, and employs much shorter addresses than addresses for cache and memory. Thus, a strategy is needed that will allow the most frequently accessed operands to be kept in registers and to minimize register-memory operations. Two basic approaches are possible, one based on software and the other on hardware. The software approach is to rely on the compiler to maximize register usage. The compiler will attempt to allocate registers to those variables that will be used the most in a given time period. This approach requires the use of sophisticated program-analysis algorithms. The hardware approach is simply to use more registers so that more variables can be held in registers for longer periods of time.

Register Windows
The use of a large set of registers should decrease the need to access memory. Because most operand references are to local scalars, the obvious approach is to store these in registers, with perhaps a few registers reserved for global variables. The problem is that the definition of local changes with each procedure call and return, operations that occur frequently. On every call, local variables must be saved from the registers into memory, so that the registers can be reused by the called program. Furthermore, parameters must be passed. On return, the variables of the parent program must be restored (loaded back into registers) and results must be passed back to the parent program. The solution depends on the number of parameters and variables that a procedure deals with, and the depth of nesting. To exploit these properties, multiple small sets of registers are used, each assigned to a different procedure. A procedure call automatically switches the processor to use a different fixed-size window of registers, rather than saving registers in memory. Windows for adjacent procedures are overlapped to allow parameter passing. The concept is illustrated in Figure 13.1. At any time, only one window of registers is visible and is addressable as if it were the only set of registers (e.g., addresses 0 through N - 1). The window is divided into three fixed-size areas. Parameter registers hold parameters passed down from the procedure that called the current procedure and hold results to be passed back up. Local registers are used for local variables, as assigned by the compiler. Temporary registers are used to exchange parameters and results with the next lower level (the procedure called by the current procedure). The temporary registers at one level are physically the same as the parameter registers at the next lower level. This overlap permits parameters to be passed without the actual movement of data.

The register windows can be used to hold the few most recent procedure activations. Older activations must be saved in memory and later restored when the nesting depth decreases. Thus, the actual organization of the register file is as a circular buffer of overlapping windows.

The circular organization is shown in Figure 13.2, which depicts a circular buffer of six windows. The buffer is filled to a depth of 4 (A called B; B called C; C called D) with procedure D active. The current-window pointer (CWP) points to the window of the currently active procedure. Register references by a machine instruction are offset by this pointer to determine the actual physical register. The saved-window pointer (SWP) identifies the window most recently saved in memory. If procedure D now calls procedure E, arguments for E are placed in D's temporary registers (the overlap between w3 and w4) and the CWP is advanced by one window. If procedure E then makes a call to procedure F, the call cannot be made with the current status of the buffer. This is because F's window overlaps A's window. If F begins to load its temporary registers, preparatory to a call, it will overwrite the parameter registers of A (A.in). Thus, when the CWP is incremented (modulo 6) so that it becomes equal to the SWP, an interrupt occurs, and A's window is saved. Only the first two portions (A.in and A.loc) need be saved. Then, the SWP is incremented and the call to F proceeds. A similar interrupt can occur on returns.
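A toy sketch of the call-path bookkeeping described above, for a circular buffer of six overlapping windows; the actual saving of a window's registers to memory is left abstract, and the return/underflow path would be handled symmetrically.

    /* Toy model of the window-overflow check made on a procedure call.
       CWP and SWP arithmetic is modulo the number of windows. */
    #define N_WINDOWS 6

    static int cwp = 0;   /* current-window pointer                 */
    static int swp = 0;   /* saved-window pointer (see text above)  */

    extern void save_window_to_memory(int window);   /* assumed helper */

    static void procedure_call(void)
    {
        if ((cwp + 1) % N_WINDOWS == swp) {   /* overflow: trap               */
            save_window_to_memory(swp);       /* save the .in and .loc parts  */
            swp = (swp + 1) % N_WINDOWS;
        }
        cwp = (cwp + 1) % N_WINDOWS;          /* switch to the callee window  */
    }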

Global Variables
The window scheme does not address the need to store global variables, those accessed by more than one procedure. Two options suggest themselves. First, variables declared as global in an HLL can be assigned memory locations by the compiler, and all machine instructions that reference these variables will use memory-reference operands. However, for frequently accessed global variables, this scheme is inefficient. An alternative is to incorporate a set of global registers in the processor. These registers would be fixed in number and available to all procedures. A unified numbering scheme can be used to simplify the instruction format.

Large Register File versus Cache


1. The window-based register file holds all the local scalar variables (except in the rare case of window overflow) of the most recent N-1 procedure activations; the cache holds a selection of recently used scalar variables.
2. The register file should save time, because all local scalar variables are retained; the cache may make more efficient use of space, because it is reacting to the situation dynamically.
3. The register file contains only those variables in use; data are read into the cache in blocks of memory.
4. The register file holds compiler-assigned global variables; the cache holds recently used global variables.
5. With the register file, the movement of data between registers and memory is determined by the procedure nesting depth; with the cache, save/restore is based on the cache replacement algorithm.
6. The register file uses register addressing; the cache uses memory addressing.

To reference a local scalar in a window-based register file, a virtual register number and a window number are used. These can pass through a relatively simple decoder to select one of the physical registers. To reference a memory location in cache, a full-width memory address must be generated. The complexity of this operation depends on the addressing mode. In a set associative cache, a portion of the address is used to read a number of words and tags equal to the set size. Another portion of the address is compared with the tags, and one of the words that were read is selected.

COMPILER-BASED REGISTER OPTIMIZATION

Assume that only a small number (e.g., 16 to 32) of registers is available on the target RISC machine. In this case, optimized register usage is the responsibility of the compiler. A program written in a high-level language has, of course, no explicit references to registers. The objective of the compiler is to keep the operands for as many computations as possible in registers rather than main memory, and to minimize load-and-store operations. Each program quantity that is a candidate for residing in a register is assigned to a symbolic or virtual register. The compiler then maps the unlimited number of symbolic registers onto a fixed number of real registers. Symbolic registers whose usage does not overlap can share the same real register. If, in a particular portion of the program, there are more quantities to deal with than real registers, then some of the quantities are assigned to memory locations. Load-and-store instructions are used to position quantities in registers temporarily for computational operations. The essence of the optimization task is to decide which quantities are to be assigned to registers at any given point in the program. The technique most commonly used in RISC compilers is known as graph coloring: given a graph consisting of nodes and edges, assign colors to nodes such that adjacent nodes have different colors, and do this in such a way as to minimize the number of different colors. First, the program is analyzed to build a register interference graph. The nodes of the graph are the symbolic registers. If two symbolic registers are live during the same program fragment, then they are joined by an edge to depict interference. An attempt is then made to color the graph with n colors, where n is the number of registers. Nodes that share the same color can be assigned to the same register. If this process does not fully succeed, then those nodes that cannot be colored must be placed in memory, and loads and stores must be used to make space for the affected quantities when they are needed. Figure 13.4 is a simple example of the process. Assume a program with six symbolic registers to be compiled into three actual registers. Figure 13.4a shows the time sequence of active use of each symbolic register. The dashed horizontal lines indicate successive instruction executions. Figure 13.4b shows the register interference graph (shading and cross-hatching are used instead of colors). A possible coloring with three colors is indicated. Because symbolic registers A and D do not interfere, the compiler can assign both of these to physical register R1. Similarly, symbolic registers C and E can be assigned to register R3. One symbolic register, F, is left uncolored and must be dealt with using loads and stores.
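The coloring step itself can be sketched with a simple greedy assignment; real compilers use more elaborate heuristics, and the data layout below is purely illustrative.

    /* Greedy sketch of register allocation by graph coloring.
       interferes[i][j] != 0 means symbolic registers i and j are live at the
       same time.  color[i] receives a real register number 0..n_colors-1,
       or -1 meaning the quantity is spilled to memory (loads/stores needed). */
    #define MAX_SYM 64

    void color_graph(int n_sym, int n_colors,
                     const char interferes[MAX_SYM][MAX_SYM],
                     int color[MAX_SYM])
    {
        for (int i = 0; i < n_sym; i++) {
            char used[MAX_SYM] = {0};          /* colors taken by neighbours */
            for (int j = 0; j < i; j++)
                if (interferes[i][j] && color[j] >= 0)
                    used[color[j]] = 1;
            color[i] = -1;                     /* assume spill                */
            for (int c = 0; c < n_colors; c++)
                if (!used[c]) { color[i] = c; break; }   /* first free color */
        }
    }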

REDUCED INSTRUCTION SET ARCHITECTURE


Certain characteristics are common to all reduced instruction set architectures:
One instruction per cycle
Register-to-register operations
Simple addressing modes
Simple instruction formats

The first characteristic listed is that there is one machine instruction per machine cycle. A machine cycle is defined to be the time it takes to fetch two operands from registers, perform an ALU operation, and store the result in a register. A second characteristic is that most operations should be register to register, with only simple LOAD and STORE operations accessing memory. This design feature simplifies the instruction set and therefore the control unit. A third characteristic is the use of simple addressing modes. Almost all RISC instructions use simple register addressing. Several additional modes, such as displacement and PC-relative, may be included. Other, more complex modes can be synthesized in software from the simple ones. Again, this design feature simplifies the instruction set and the control unit. A final common characteristic is the use of simple instruction formats. Generally, only one or a few formats are used. Instruction length is fixed and aligned on word boundaries. Field locations, especially the opcode, are fixed. This design feature has a number of benefits. With fixed fields, opcode decoding and register operand accessing can occur simultaneously. Simplified formats simplify the control unit. Instruction fetching is optimized because word-length units are fetched.
ADVANTAGES:

First, more effective optimizing compilers can be developed. With more primitive instructions, there are more opportunities for moving functions out of loops, reorganizing code for efficiency, maximizing register utilization, and so forth. A second point, already noted, is that most instructions generated by a compiler are relatively simple anyway. It would seem reasonable that a control unit built specifically for those instructions and using little or no microcode could execute them faster than a comparable CISC. A third point relates to the use of instruction pipelining. RISC researchers feel that the instruction pipelining technique can be applied much more effectively with a reduced instruction set. Finally, RISC processors are more responsive to interrupts because interrupts are checked between rather elementary operations. Architectures with complex instructions either restrict interrupts to instruction boundaries or must define specific interruptible points and implement mechanisms for restarting an instruction.
The following are considered typical of a classic RISC:
1. A single instruction size.
2. That size is typically 4 bytes.
3. A small number of data addressing modes, typically less than five. This parameter is difficult to pin down. In the table, register and literal modes are not counted and different formats with different offset sizes are counted separately.
4. No indirect addressing that requires you to make one memory access to get the address of another operand in memory.
5. No operations that combine load/store with arithmetic (e.g., add from memory, add to memory).
6. No more than one memory-addressed operand per instruction.
7. Does not support arbitrary alignment of data for load/store operations.
8. Maximum number of uses of the memory management unit (MMU) for a data address in an instruction.
9. Number of bits for the integer register specifier equal to five or more. This means that at least 32 integer registers can be explicitly referenced at a time.
10. Number of bits for the floating-point register specifier equal to four or more. This means that at least 16 floating-point registers can be explicitly referenced at a time.

RISC PIPELINING
Most instructions are register to register, and an instruction cycle has the following two stages:
I: Instruction fetch.
E: Execute. Performs an ALU operation with register input and output.
For load and store operations, three stages are required:
I: Instruction fetch.
E: Execute. Calculates the memory address.
D: Memory. Register-to-memory or memory-to-register operation.

Figure 13.6a depicts the timing of a sequence of instructions using no pipelining. Clearly, this is a wasteful process. Even very simple pipelining can substantially improve performance. Figure 13.6b shows a two-stage pipelining scheme, in which the I and E stages of two different instructions are performed simultaneously. The two stages of the pipeline are an instruction fetch stage, and an execute/memory stage that executes the instruction, including register-to-memory and memory-to-register operations. Thus we see that the instruction fetch stage of the second instruction can be performed in parallel with the first part of the execute/memory stage. However, the execute/memory stage of the second instruction must be delayed until the first instruction clears the second stage of the pipeline. This scheme can yield up to twice the execution rate of a serial scheme. Two problems prevent the maximum speedup from being achieved. First, we assume that a single-port memory is used and that only one memory access is possible per stage. This requires the insertion of a wait state in some instructions. Second, a branch instruction interrupts the sequential flow of execution. To accommodate this with minimum circuitry, a NOOP instruction can be inserted into the instruction stream by the compiler or assembler.

Pipelining can be improved further by permitting two memory accesses per stage. This yields the sequence shown in Figure 13.6c. Now, up to three instructions can be overlapped, and the improvement is as much as a factor of 3. Again, branch instructions cause the speedup to fall short of the maximum possible. Also, note that data dependencies have an effect: if an instruction needs an operand that is altered by the preceding instruction, a delay is required. Again, this can be accomplished with a NOOP.

The pipelining discussed so far works best if the three stages are of approximately equal duration. Because the E stage usually involves an ALU operation, it may be longer. In this case, we can divide it into two substages: register file read, and ALU operation and register write. Because of the simplicity and regularity of a RISC instruction set, the division into three or four stages is easily accomplished. Figure 13.6d shows the result with a four-stage pipeline.
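As a rough check on these speedup figures, the following minimal Python sketch compares sequential execution with ideally overlapped k-stage execution. It is an idealization that ignores the wait states, branches, and data dependencies discussed above; the instruction count is arbitrary.

def sequential_cycles(n_instructions, n_stages):
    # Without pipelining, each instruction occupies every stage in turn.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # With ideal overlap, the first instruction needs n_stages cycles and
    # each later instruction completes one cycle after its predecessor.
    return n_stages + (n_instructions - 1)

n = 100   # arbitrary instruction count
for k in (2, 3):
    speedup = sequential_cycles(n, k) / pipelined_cycles(n, k)
    print(f"{k}-stage pipeline, {n} instructions: speedup ~ {speedup:.2f}")
# The speedup approaches 2 for the two-stage scheme of Figure 13.6b
# and 3 for the three-stage scheme of Figure 13.6c.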

Optimization of Pipelining
DELAYED BRANCH To compensate for these dependencies, code reorganization techniques have been developed. First, let us consider branching instructions. Delayed branch, a way of increasing the efficiency of the pipeline, makes use of a branch that does not take effect until after execution of the following instruction (hence the term delayed). The instruction location immediately following the branch is referred to as the delay slot. This strange procedure is illustrated in Table 13.8. In the column labeled normal branch, we see a normal symbolic instruction machine-language program. After 102 is executed, the next instruction to be executed is 105. To regularize the pipeline, a NOOP is inserted after this branch. However, increased performance is achieved if the instructions at 101 and 102 are interchanged. Figure 13.7 shows the result. Figure 13.7a shows the traditional approach to pipelining, of the type discussed in Chapter 12 (e.g., see Figures 12.11 and 12.12).

The JUMP instruction is fetched at time 3. At time 4, the JUMP instruction is executed at the same time that instruction 103 (ADD instruction) is fetched. Because a JUMP occurs, which updates the program counter, the pipeline must be cleared of instruction 103; at time 5, instruction 105, which is the target of the JUMP, is loaded. Figure 13.7b shows the same pipeline handled by a typical RISC organization. The timing is the same. However, because of the insertion of the NOOP instruction, we do not need special circuitry to clear the pipeline; the NOOP simply executes with no effect. Figure 13.7c shows the use of the delayed branch. The JUMP instruction is fetched at time 2, before the ADD instruction, which is fetched at time 3. Note, however, that the ADD instruction is fetched before the execution of the JUMP instruction has a chance to alter the program counter. Therefore, during time 4, the ADD instruction is executed at the same time that instruction 105 is fetched. Thus, the original semantics of the program are retained but one less clock cycle is required for execution.

This interchange of instructions will work successfully for unconditional branches, calls, and returns. For conditional branches, this procedure cannot be blindly applied. If the condition that is tested for the branch can be altered by the immediately preceding instruction, then the compiler must refrain from doing the interchange and instead insert a NOOP.
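The effect of this interchange can be checked with a small simulation. The Python sketch below is a hypothetical interpreter with a one-instruction branch delay slot; the two programs are illustrative and only mirror the pattern described above (a branch at 102 whose target is the STORE), not the exact contents of Table 13.8. Moving the ADD into the delay slot removes the NOOP while leaving the stored result unchanged.

def run(program, start=100):
    """Tiny interpreter with a one-instruction branch delay slot."""
    regs = {"rA": 0}
    mem = {"X": 7}                 # illustrative memory contents
    pc = start
    delayed_target = None          # set by JUMP, applied after the delay slot
    executed = 0
    while pc in program:
        op, *args = program[pc]
        # Decide where control goes after this instruction: a pending branch
        # target takes effect only now, i.e. after the delay-slot instruction.
        if delayed_target is not None:
            next_pc, delayed_target = delayed_target, None
        else:
            next_pc = pc + 1
        if op == "LOAD":           # LOAD src, reg
            regs[args[1]] = mem[args[0]]
        elif op == "ADD":          # ADD immediate, reg
            regs[args[1]] += args[0]
        elif op == "JUMP":         # delayed branch: the next instruction still runs
            delayed_target = args[0]
        elif op == "STORE":        # STORE reg, dst
            mem[args[1]] = regs[args[0]]
        elif op == "NOOP":
            pass
        executed += 1
        pc = next_pc
    return mem, executed

# Delayed branch padded with a NOOP in the delay slot.
padded = {100: ("LOAD", "X", "rA"), 101: ("ADD", 1, "rA"),
          102: ("JUMP", 105), 103: ("NOOP",),
          105: ("STORE", "rA", "Z")}

# Optimized delayed branch: the ADD is moved into the delay slot.
optimized = {100: ("LOAD", "X", "rA"), 101: ("JUMP", 104),
             102: ("ADD", 1, "rA"),
             104: ("STORE", "rA", "Z")}

mem1, n1 = run(padded)
mem2, n2 = run(optimized)
assert mem1["Z"] == mem2["Z"] == 8
print("same result; instructions executed:", n1, "vs", n2)   # 5 vs 4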

MIPS R4000
It has substantially the same architecture and instruction set as the earlier MIPS designs, the R2000 and R3000. The most significant difference is that the R4000 uses 64 rather than 32 bits for all internal and external data paths and for addresses, registers, and the ALU. The use of 64 bits has a number of advantages over a 32-bit architecture. It allows a bigger address space, large enough for an operating system to map more than a terabyte of files directly into virtual memory for easy access. With 1-terabyte and larger disk drives now common, the 4-gigabyte address space of a 32-bit machine becomes limiting. Also, the 64-bit capacity allows the R4000 to process data such as IEEE double-precision floating-point numbers and character strings of up to eight characters in a single action.

The R4000 processor chip is partitioned into two sections, one containing the CPU and the other containing a coprocessor for memory management. The processor has a very simple architecture. It supports thirty-two 64-bit registers and provides for up to 128 Kbytes of high-speed cache, half each for instructions and data. The relatively large cache (the IBM 3090 provides 128 to 256 Kbytes of cache) enables the system to keep large sets of program code and data local to the processor.

Instruction Set

All processor instructions are encoded in a single 32-bit word format. All data operations are register to register; the only memory references are pure load/store operations. The R4000 makes no use of condition codes. If an instruction generates a condition, the corresponding flags are stored in a general-purpose register. This avoids the need for special logic to deal with condition codes as they affect the pipelining mechanism and the reordering of instructions by the compiler. As with most RISC-based machines, the MIPS uses a single 32-bit instruction length. This single instruction length simplifies instruction fetch and decode, and it also simplifies the interaction of instruction fetch with the virtual memory management unit (i.e., instructions do not cross word or page boundaries). The three instruction formats (Figure 13.9) share common formatting of opcodes and register references, simplifying instruction decode. The effect of more complex instructions can be synthesized at compile time.
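As an illustration of how a fixed 32-bit instruction word with common field positions simplifies decoding, the following Python sketch extracts the fields of the three classic MIPS formats (I, J, and R). The field widths used are the standard MIPS layout and are assumed here for illustration; Figure 13.9 gives the exact R4000 encodings.

def decode(word):
    """Split a 32-bit MIPS instruction word into its fields.

    Assumes the standard MIPS layout: a 6-bit opcode in bits 31-26,
    5-bit register specifiers, a 16-bit immediate (I-format), a 26-bit
    jump target (J-format), and shamt/funct fields (R-format, opcode 0).
    """
    opcode = (word >> 26) & 0x3F
    if opcode in (0x02, 0x03):                 # J-format: j, jal
        return {"format": "J", "opcode": opcode,
                "target": word & 0x03FFFFFF}
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    if opcode == 0x00:                         # R-format: register-to-register ALU ops
        return {"format": "R", "opcode": opcode, "rs": rs, "rt": rt,
                "rd": (word >> 11) & 0x1F,
                "shamt": (word >> 6) & 0x1F,
                "funct": word & 0x3F}
    # I-format: loads, stores, branches, immediate arithmetic
    return {"format": "I", "opcode": opcode,
            "rs": rs, "rt": rt, "immediate": word & 0xFFFF}

# 0x8C820004 is lw $v0, 4($a0) in the standard MIPS encoding.
print(decode(0x8C820004))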

Instruction Pipeline
To improve on the performance of a scalar pipeline, which completes at most one instruction per clock cycle, two classes of processors have evolved to offer execution of multiple instructions per clock cycle: superscalar and superpipelined architectures. In essence, a superscalar architecture replicates each of the pipeline stages so that two or more instructions at the same stage of the pipeline can be processed simultaneously. A superpipelined architecture is one that makes use of more, and more fine-grained, pipeline stages. With more stages, more instructions can be in the pipeline at the same time, increasing parallelism. Both approaches have limitations. With superscalar pipelining, dependencies between instructions in different pipelines can slow down the system. Also, overhead logic is required to coordinate these dependencies. With superpipelining, there is overhead associated with transferring instructions from one stage to the next. The MIPS R4000 is a good example of a RISC-based superpipelined architecture.

The R4000 has eight pipeline stages, meaning that as many as eight instructions can be in the pipeline at the same time. The pipeline advances at the rate of two stages per clock cycle. The eight pipeline stages are as follows:
1. Instruction fetch, first half: The virtual address is presented to the instruction cache and the translation lookaside buffer (TLB).
2. Instruction fetch, second half: The instruction cache outputs the instruction and the TLB generates the physical address.
3. Register file: Three activities occur in parallel: the instruction is decoded and a check is made for interlock conditions (i.e., this instruction depends on the result of a preceding instruction); the instruction cache tag check is made; and operands are fetched from the register file.
4. Instruction execute: One of three activities can occur: if the instruction is a register-to-register operation, the ALU performs the arithmetic or logical operation; if the instruction is a load or store, the data virtual address is calculated; if the instruction is a branch, the branch target virtual address is calculated and the branch conditions are checked.
5. Data cache, first half: The virtual address is presented to the data cache and the TLB.
6. Data cache, second half: The TLB generates the physical address, and the data cache outputs the data.
7. Tag check: Cache tag checks are performed for loads and stores.
8. Write back: The instruction result is written back to the register file.
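To make the two-stages-per-clock behaviour concrete, the following minimal Python sketch (an idealized model; the stage mnemonics simply abbreviate the list above) prints which stage each instruction occupies at every half-cycle. Each instruction enters the pipe one half-cycle after its predecessor, so up to eight instructions are in flight at once and the pipeline advances two stages per external clock.

STAGES = ["IF1", "IF2", "RF", "EX", "DC1", "DC2", "TC", "WB"]   # abbreviations of the stages above

def stage_of(instr_index, half_cycle):
    """Stage occupied by instruction instr_index at a given half-cycle, or None."""
    s = half_cycle - instr_index        # each instruction starts one half-cycle later
    return STAGES[s] if 0 <= s < len(STAGES) else None

n_instructions = 10
n_half_cycles = n_instructions + len(STAGES) - 1
for t in range(n_half_cycles):
    row = [stage_of(i, t) or "---" for i in range(n_instructions)]
    print(f"half-cycle {t:2d} (clock {t // 2}): " + " ".join(f"{s:>3}" for s in row))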