Architecture:
The Pentium processor is a complex machine with many interlocking parts. At the heart of the
processor are two integer pipelines, the U pipeline and the V pipeline. These pipelines are
responsible for executing 80x86 instructions. A floating-point unit is included on the chip to execute
instructions previously handled by the external 80x87 math coprocessors. During execution, the U and V
pipelines can execute two integer instructions at the same time (under special conditions) or
one floating-point instruction.
The Pentium communicates with the outside world via a 32-bit address bus and a 64-bit data
bus. The bus unit is capable of performing burst reads and writes of 32 bytes to memory, and through
bus cycle pipelining, allows two bus cycles to be in progress simultaneously.
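As a quick check of the burst figure above, the number of data-bus transfers in one burst follows directly from the bus width. The constant names below are illustrative only.

```python
# Worked arithmetic for one burst cycle on the Pentium data bus.
DATA_BUS_BITS = 64   # data bus width stated in the text
BURST_BYTES = 32     # burst size stated in the text

bytes_per_transfer = DATA_BUS_BITS // 8                   # 8 bytes move per transfer
transfers_per_burst = BURST_BYTES // bytes_per_transfer   # transfers needed for one burst

print(transfers_per_burst)  # prints 4
```

A 32-byte burst therefore occupies four consecutive 64-bit transfers.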
An 8KB instruction cache is used to provide quick access to frequently used instructions.
When an instruction is not found in the instruction cache, it is read from the external data bus and a
copy placed into the instruction cache for future references. The branch target buffer and prefetch
buffers work together with the instruction cache to fetch instructions as fast as possible. The prefetch
buffers maintain a copy of the next 32 bytes of prefetched instruction code, and can be loaded from the
cache in a single clock cycle, due to the 256-bit wide data output of the instruction cache.
A separate 8KB data cache stores a copy of the most frequently accessed memory data. Since a
memory access takes significantly longer than a processor clock cycle, it pays to keep a copy of memory
data in a fast cache. The data and instruction caches can each be enabled or disabled by
hardware or software. Each also uses a translation lookaside buffer (TLB), which converts linear
addresses into physical addresses when virtual memory (paging) is employed.
The Pentium uses a technique called branch prediction to maintain a steady flow of
instructions into the pipelines. To support branch prediction, the branch target buffer records, for
recently executed branches, the address in a different part of the program to which the branch
jumps (the branch target), so that instructions at the target can be fetched ahead of time.
The floating-point unit of the Pentium maintains a set of floating-point registers and provides
80-bit precision when performing high-speed math operations. The floating-point unit uses hardware in
the U and V pipelines to perform the initial work of a floating-point instruction (such as fetching a
64-bit operand) and then uses its own pipeline to complete the operation. Since both integer pipelines
are used, only one floating-point instruction may be executed at a time.
Superscalar Architecture and Pipelining
In the Pentium processor, integer instructions traverse a five-stage pipeline. The pipeline
stages are as follows:
PF – Prefetch
D1 – Instruction Decode
D2 – Address Generate
EX – Execute – ALU and Cache Access
WB – Write-Back
The Pentium processor is a superscalar machine, capable of executing two instructions in parallel.
Its two five-stage pipelines operate in parallel, allowing integer instructions to execute in a single
clock in each pipeline. The pipelines are called the U and V pipes, and the
process of issuing two instructions in parallel is termed two-issue superscalar execution. The
Pentium has two execution units, and instruction pairing allows each unit to complete the
execution of an instruction at the same time.
Figure 1.2 depicts how ten instructions move through the pipeline of the Pentium processor.
Each instruction pair takes five clock cycles to pass through the five pipeline stages. During
clock 1, a pair of instructions is prefetched (PF stage) from the on-chip code
cache. During clock 2, this first pair is issued in parallel to the U and V pipelines for decoding
(D1 stage), while another pair is being prefetched (PF stage).
During clock 3, the first instruction pair moves to the decode 2 (D2) stage, the second pair is
issued to the decode 1 (D1) stage of both pipelines, and the third pair of instructions is
being prefetched (PF stage). In this way, each pair of instructions can proceed to the next stage of
the pipeline with each cycle of the processor clock (PCLK). During clock 5, the first
instruction pair completes its execution. In the CLK5 column of the figure, the first pair is in the
last stage (WB) of the pipeline, the second pair is in the fourth stage (EX),
the third pair is in the third stage (D2), and so on. Thus, ten different
instructions occupy the various pipeline stages during a single clock cycle. After clock
cycle 5, each succeeding clock cycle completes another instruction pair.
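The clock-by-clock walk-through above can be reproduced with a minimal sketch. It assumes an ideal flow (every pair is pairable and nothing stalls); the function names `stage_of` and `occupancy` are hypothetical and chosen only for illustration.

```python
# Ideal (stall-free) model of the five-stage integer pipeline.
STAGES = ["PF", "D1", "D2", "EX", "WB"]

def stage_of(pair, clock):
    """Stage occupied by instruction pair `pair` during `clock` (both 1-based).

    Returns None if the pair has not entered the pipeline yet or has left it.
    """
    index = clock - pair  # pair n enters PF during clock n
    if 0 <= index < len(STAGES):
        return STAGES[index]
    return None

def occupancy(clock, num_pairs):
    """Map each in-flight pair number to its stage during `clock`."""
    result = {}
    for pair in range(1, num_pairs + 1):
        stage = stage_of(pair, clock)
        if stage is not None:
            result[pair] = stage
    return result

clk5 = occupancy(5, 8)
print(clk5)            # {1: 'WB', 2: 'EX', 3: 'D2', 4: 'D1', 5: 'PF'}
print(2 * len(clk5))   # 10 instructions in flight during clock 5
```

Five pairs in flight, two instructions per pair, gives the ten instructions mentioned above.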
1. Prefetch (PF) Stage:
The Pentium has two prefetch buffers (queues), only one of which is active at a time.
The active queue fetches instruction code from the on-chip cache or from memory until the
branch prediction logic predicts that a branch will be taken when the branch instruction
reaches the execution stage. During normal pipeline operation, the active queue
supplies two consecutive instructions to the U and V pipelines.
2. Decode 1 (D1) Stage:
The two instructions issued to the U and V pipelines are decoded in the D1 stage. They are also
checked for pairability, and branch prediction is performed.
▪ Instruction Pairing:
Two instructions are pairable only if they satisfy the following conditions:
a) Both instructions in the pair must be simple. Simple instructions are those that are
completely hardwired. They do not require any microcode control
and execute in one, two, or at most three clock cycles.
b) There must be no register dependencies or contention between them.
If the two instructions are not pairable, the second instruction (I2) is removed from the V
pipeline's D1 stage and shifted to the D1 stage of the U pipeline when the first instruction (I1)
moves to the D2 stage of the U pipeline.
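The two pairing conditions can be sketched as a simple predicate. This is a simplified model, not the full Pentium rule set: it treats "register dependency" as a read-after-write or write-after-write conflict, ignores flags and implicit operands, and the `Instr` class, its fields, and `can_pair` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    text: str                          # assembly text, for display only
    simple: bool                       # True if hardwired (no microcode)
    reads: frozenset = frozenset()     # registers read
    writes: frozenset = frozenset()    # registers written

def can_pair(i1, i2):
    """Return True if i1 (U pipe) and i2 (V pipe) satisfy both conditions."""
    if not (i1.simple and i2.simple):       # condition (a): both simple
        return False
    read_after_write = i1.writes & i2.reads
    write_after_write = i1.writes & i2.writes
    # condition (b): no register dependency between the two instructions
    return not read_after_write and not write_after_write

mov_eax = Instr("mov eax, 1", True, frozenset(), frozenset({"eax"}))
add_ebx = Instr("add ebx, ecx", True, frozenset({"ebx", "ecx"}), frozenset({"ebx"}))
add_dep = Instr("add ecx, eax", True, frozenset({"ecx", "eax"}), frozenset({"ecx"}))

print(can_pair(mov_eax, add_ebx))  # True: independent simple instructions
print(can_pair(mov_eax, add_dep))  # False: second instruction reads eax
```

When `can_pair` returns False, the second instruction would issue alone through the U pipeline one clock later, as described above.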
▪ Branch Prediction:
The Pentium processor includes branch prediction logic, allowing it to avoid pipeline stalls
if it correctly predicts whether or not the branch will be taken when the branch
instruction is executed. When a branch operation is correctly predicted, no performance
penalty is incurred. However, when the prediction is incorrect, a three-cycle
penalty is incurred if the branch is executed in the U pipeline and a four-cycle penalty if
the branch is executed in the V pipeline.
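To see what these penalties mean in practice, the sketch below totals the cycles lost to mispredicted branches. The penalty values come from the text above; the function name and the sample counts are made up for illustration.

```python
# Misprediction penalties, in clock cycles, as stated in the text.
PENALTY_CYCLES = {"U": 3, "V": 4}

def misprediction_cycles(mispredicted_in_u, mispredicted_in_v):
    """Total cycles lost, given the number of mispredicted branches per pipe."""
    return (mispredicted_in_u * PENALTY_CYCLES["U"]
            + mispredicted_in_v * PENALTY_CYCLES["V"])

# Hypothetical run: 10 mispredictions in the U pipe, 5 in the V pipe.
print(misprediction_cycles(10, 5))  # 10*3 + 5*4 = 50
```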
3. Decode 2 or D2 Stage:
The D1 stage is followed by the D2 stage, in which the instructions are further decoded and the
addresses of memory-resident operands are calculated, including the segmentation part of
address formation. Address calculation in this stage is faster than on earlier x86
processors: the Pentium requires only a single clock cycle to calculate the address for
instructions that contain both a displacement and an immediate operand, or that use base
and index addressing. During the D2 stage, the processor also performs the segmentation
protection checks required when it forms memory addresses in protected mode.
4. Execution or EX-Stage:
Each pipeline's execution stage contains an arithmetic logic unit (ALU). The U pipeline's ALU
incorporates a barrel shifter, while the V pipeline's does not, so the
U pipeline can handle some instructions that the V pipeline cannot. When
necessary, data cache accesses (on a cache hit) or memory accesses (on a cache miss)
are performed in this stage. The U pipeline and the V pipeline can access the data cache
simultaneously. Both instructions enter the execution stage at the same time.
If the instruction in the V pipeline stalls, the U pipeline instruction is still permitted to
proceed to the write-back stage (the last stage of the integer pipeline). However, if the U
pipeline instruction stalls, the V pipeline instruction cannot proceed to the write-back
stage.
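The asymmetric stall rule at the end of the EX stage can be written as a small truth function. This is only a restatement of the rule in the paragraph above; the function name `advance_to_wb` is hypothetical.

```python
def advance_to_wb(u_stalled, v_stalled):
    """Return (u_advances, v_advances) for a pair sitting in the EX stage."""
    u_advances = not u_stalled
    # The V-pipe instruction may never overtake the U-pipe instruction.
    v_advances = (not v_stalled) and (not u_stalled)
    return u_advances, v_advances

for u_stalled in (False, True):
    for v_stalled in (False, True):
        print(u_stalled, v_stalled, advance_to_wb(u_stalled, v_stalled))
```

The rule keeps the two instructions of a pair in program order: the U-pipe instruction is the earlier one, so it may finish first, but never second.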
5. Write-Back or WB Stage:
This is the final stage of integer instruction execution. In the WB stage, the processor state is
modified by updating the target registers and the EFLAGS register (if necessary).
Most floating-point instructions are issued singly to the U pipeline and cannot be paired
with integer instructions. The floating-point pipeline consists of eight stages. The first four
stages are shared with the integer pipeline, and the last four reside within the floating-point
unit itself.
i. FP instructions are normally issued to the U pipeline singly, as they are not paired
with integer instructions. However, a limited pairing of two FP instructions is
possible.
ii. Pairing can occur only if the first instruction, issued to the U pipeline, belongs to the
simple set F and the second instruction is the floating-point exchange instruction,
FXCH. The set F (simple) instructions are FLD single/double precision, FLD ST(i),
and all forms of FADD, FSUB, FMUL, FDIV, FCOM, FUCOM, FABS, and FCHS.
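The 8 pipeline stages are:
PF – Prefetch
D1 – Instruction Decode
D2 – Address Generate
EX – Memory and register read
X1 – Floating-point execute, stage one
X2 – Floating-point execute, stage two
WF – Round and write the result to the register file
ER – Error reporting and status word update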
Inside the FPU, the resources can be allocated to one or more instructions at a time. This
permits pipelined execution within the FPU, as the following examples show:
i. An FDIV instruction cannot be executed with any other instruction, since FDIV requires
all of the FPU resources.
ii. Similarly, two consecutive FMUL instructions cannot be executed simultaneously.
iii. An FMUL instruction can be executed in parallel with one or two FADD instructions.
iv. Three FADD instructions can be executed simultaneously.
The Intel x86 architecture's register set is subdivided into the following groups:
IV. Segment registers (6x16 bit)
Six 16-bit segment registers CS, SS, DS, ES, FS, and GS hold segment selector values
identifying the currently addressable memory segments. The selector in CS indicates
the current code segment, the selector in SS indicates the current stack segment, and
the selectors in DS, ES, FS and GS indicate the current four data segments.
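A 16-bit selector value packs three fields. The sketch below splits a selector into them, assuming the standard protected-mode layout from Intel's architecture documentation (index in bits 15..3, table indicator in bit 2, requested privilege level in bits 1..0); that layout is not spelled out in these notes, and `decode_selector` is a hypothetical helper name.

```python
def decode_selector(selector):
    """Split a 16-bit protected-mode segment selector into its fields."""
    if not 0 <= selector <= 0xFFFF:
        raise ValueError("selector must fit in 16 bits")
    return {
        "index": selector >> 3,                        # descriptor table entry number
        "table": "LDT" if selector & 0x4 else "GDT",   # table indicator (TI) bit
        "rpl": selector & 0x3,                         # requested privilege level
    }

print(decode_selector(0x0023))  # {'index': 4, 'table': 'GDT', 'rpl': 3}
```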
2. System registers
3. Floating-point registers
The on-chip FPU includes eight 80-bit data registers (R0 to R7), a 16-bit tag word, a 16-bit control
register, a 16-bit status register, a 48-bit instruction pointer, and a 48-bit data pointer.
I. Data registers
II. Tag word
III. Status word
IV. Instruction and data pointers
4. Debug registers
The base architecture and floating-point registers are accessible to application programs. The
system and debug registers are accessible only to system programs (such as the OS) running at the
highest privilege level.
The flags register is a 32-bit register called EFLAGS. Specific bits and bit fields of EFLAGS
control a number of operations and indicate the status of the processor. The lower 16 bits of
EFLAGS, called FLAGS, are used when executing 8086 or 80286 code.
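The relationship between EFLAGS and FLAGS is a simple mask, as the sketch below illustrates. The individual bit positions (CF bit 0, ZF bit 6, SF bit 7, IF bit 9, IOPL bits 12..13) are taken from Intel's architecture documentation rather than from these notes, and `decode_eflags` is a hypothetical helper name.

```python
def decode_eflags(eflags):
    """Extract the 16-bit FLAGS view and a few well-known fields from EFLAGS."""
    return {
        "FLAGS": eflags & 0xFFFF,      # lower 16 bits, as seen by 8086/80286 code
        "CF": eflags & 1,              # carry flag
        "ZF": (eflags >> 6) & 1,       # zero flag
        "SF": (eflags >> 7) & 1,       # sign flag
        "IF": (eflags >> 9) & 1,       # interrupt enable flag
        "IOPL": (eflags >> 12) & 3,    # I/O privilege level (2-bit field)
    }

fields = decode_eflags(0x00043246)
print(hex(fields["FLAGS"]))        # 0x3246
print(fields["ZF"], fields["IOPL"])  # 1 3
```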
Memory Management:
Protected mode uses segmented memory to give each task its own separate
memory space, which is protected from access by other tasks. A segment can be from 1
byte to 4 GB long. Segments can start at any base address in memory, and segments are
allowed to overlap.
Each segment has a segment descriptor associated with it. The segment descriptor is 8 bytes
long and contains information about the segment, such as its base address, its limit (size),
and its access rights.
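The 8-byte descriptor can be unpacked as sketched below. The byte layout used here (limit and base split across several bytes, access byte at offset 5, granularity bit in byte 6) follows Intel's architecture documentation and is an assumption relative to these notes; `decode_descriptor` and the dictionary keys are hypothetical names.

```python
def decode_descriptor(raw):
    """Decode an 8-byte code/data segment descriptor (little-endian byte order)."""
    if len(raw) != 8:
        raise ValueError("a segment descriptor is exactly 8 bytes long")
    limit = raw[0] | (raw[1] << 8) | ((raw[6] & 0x0F) << 16)        # 20-bit limit
    base = raw[2] | (raw[3] << 8) | (raw[4] << 16) | (raw[7] << 24)  # 32-bit base
    access = raw[5]
    page_granular = bool(raw[6] & 0x80)   # G bit: limit counted in 4 KB units
    size = (limit + 1) * 4096 if page_granular else limit + 1
    return {
        "base": base,
        "limit": limit,
        "present": bool(access & 0x80),
        "dpl": (access >> 5) & 0x3,       # descriptor privilege level
        "size_bytes": size,
    }

# A flat 4 GB ring-0 code segment, as commonly placed in a GDT.
flat = decode_descriptor(bytes([0xFF, 0xFF, 0x00, 0x00, 0x00, 0x9A, 0xCF, 0x00]))
print(flat["base"], flat["dpl"], flat["size_bytes"] == 4 * 1024**3)  # 0 0 True
```

With the granularity bit clear and a limit of 0, the same arithmetic yields a 1-byte segment, which matches the 1 byte to 4 GB range given above.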
In protected mode, the Intel Architecture provides a protection mechanism that operates
at both the segment level and the page level. This protection mechanism provides the
ability to limit access to certain segments or pages based on privilege levels (four
privilege levels for segments and two privilege levels for pages). For example, critical
operating-system code and data can be protected by placing them in more privileged
segments than those that contain applications code. The processor’s protection
mechanism will then prevent application code from accessing the operating-system code
and data in any but a controlled, defined manner.
Segment and page protection can be used at all stages of software development to assist
in localizing and detecting design problems and bugs. It can also be incorporated into
end-products to offer added robustness to operating systems, utilities software, and
applications software. When the protection mechanism is used, each memory reference
is checked to verify that it satisfies various protection checks. All checks are made before
the memory cycle is started; any violation results in an exception. Because checks are
performed in parallel with address translation, there is no performance penalty.
The protection checks that are performed fall into the following categories:
• Limit checks.
• Type checks.
• Privilege level checks.
• Restriction of addressable domain.
• Restriction of procedure entry-points.
• Restriction of instruction set.
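Two of the checks listed above, the limit check and the privilege level check, can be sketched for a data-segment access. This is a simplified model that assumes an ordinary expand-up data segment and the usual rule that lower numbers mean more privilege; the function names are hypothetical.

```python
def limit_ok(offset, size, effective_limit):
    """Limit check: the last byte accessed must not lie beyond the segment limit."""
    return offset + size - 1 <= effective_limit

def privilege_ok(cpl, rpl, dpl):
    """Privilege check for a data segment (0 = most privileged, 3 = least).

    Access is allowed only if both the current privilege level (CPL) and the
    selector's requested privilege level (RPL) are at least as privileged as
    the descriptor privilege level (DPL).
    """
    return max(cpl, rpl) <= dpl

print(limit_ok(offset=0xFFC, size=4, effective_limit=0xFFF))  # True: last byte is 0xFFF
print(limit_ok(offset=0xFFD, size=4, effective_limit=0xFFF))  # False: runs past the limit
print(privilege_ok(cpl=3, rpl=3, dpl=0))  # False: application code, OS data
print(privilege_ok(cpl=0, rpl=0, dpl=3))  # True: OS code, application data
```

A failed check corresponds to the exception mentioned above; the sketch only reports the outcome as a Boolean.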
Access to the I/O address space is protected by two mechanisms:
1. The IOPL field in the EFLAGS register defines the right to use I/O-related instructions.
2. The I/O permission bit map of a TSS segment defines the right to use ports in the I/O
address space.
These mechanisms operate only in protected mode, including virtual 8086 mode; they do not
operate in real mode. In real mode, there is no protection of the I/O space; any procedure can
execute I/O instructions, and any I/O port can be addressed by the I/O instructions.
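The way the two mechanisms combine can be sketched as follows. The details assumed here come from Intel's architecture documentation, not from these notes: access is granted outright when CPL is numerically less than or equal to IOPL, otherwise (and always in virtual 8086 mode) the bit map is consulted, where a 0 bit permits the port and a 1 bit denies it. The sketch handles single-byte ports only, and `io_allowed` is a hypothetical name.

```python
def io_allowed(cpl, iopl, port, bitmap, v86=False):
    """Decide whether a one-byte access to `port` is permitted in protected mode."""
    if not v86 and cpl <= iopl:
        return True                       # privileged enough: bit map not consulted
    byte_index, bit_index = port >> 3, port & 7
    if byte_index >= len(bitmap):
        return False                      # ports beyond the bit map are denied
    return ((bitmap[byte_index] >> bit_index) & 1) == 0

# Hypothetical bit map covering ports 0..127: everything denied except port 0x60.
bitmap = bytearray([0xFF] * 16)
bitmap[0x60 >> 3] &= ~(1 << (0x60 & 7))

print(io_allowed(cpl=3, iopl=0, port=0x60, bitmap=bitmap))  # True: bit is clear
print(io_allowed(cpl=3, iopl=0, port=0x61, bitmap=bitmap))  # False: bit is set
print(io_allowed(cpl=0, iopl=0, port=0x61, bitmap=bitmap))  # True: CPL <= IOPL
```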
Task Management in x86:
The x86 architecture was particularly designed for efficient handling of tasks in a
multitasking environment. A task can be defined as an instance of the execution of a program. A
very important attribute of any multitasking, multi-user OS is the ability to switch rapidly
between tasks. The x86 supports the task switching operation in hardware. The task switch
operation saves the entire state of the machine (all the registers, the address space, and a link
to the previous task), loads a new execution state, performs protection checks, and begins
execution of the new task.
The task switch operation is invoked by executing an intersegment JMP or CALL instruction
that refers to a task state segment (TSS) or to a task gate descriptor in the GDT or LDT.
Conclusion: