AMD-K6™

MMX™ Enhanced Processor

x86 Code Optimization
Application Note
Publication # 21828 Rev: A Issue Date: August 1997 Amendment/0 This document contains information on a product under development at Advanced Micro Devices (AMD). The information is intended to help you evaluate this product. AMD reserves the right to change or discontinue work on this proposed product without notice.

© 1997 Advanced Micro Devices, Inc. All rights reserved. Advanced Micro Devices, Inc. ("AMD") reserves the right to make changes in its products without notice in order to improve design or performance characteristics. The information in this publication is believed to be accurate at the time of publication, but AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication or the information contained herein, and reserves the right to make changes at any time, without notice. AMD disclaims responsibility for any consequences resulting from the use of the information included in this publication. This publication neither states nor implies any representations or warranties of any kind, including but not limited to, any implied warranty of merchantability or fitness for a particular purpose. AMD products are not authorized for use as critical components in life support devices or systems without AMD’s written approval. AMD assumes no liability whatsoever for claims associated with the sale or use (including the use of engineering samples) of AMD products except as provided in AMD’s Terms and Conditions of Sale for such product.

Trademarks AMD, the AMD logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. RISC86 is a registered trademark, and K86, AMD-K5, AMD-K6, and the AMD-K6 logo are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks, and Windows NT is a trademark of Microsoft Corporation. Pentium is a registered trademark and MMX is a trademark of the Intel Corporation. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Contents
1 Introduction 1
Purpose. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 The AMD-K6™ Family of Processors . . . . . . . . . . . . . . . . . . . . . . . . . . 1 The AMD-K6™ MMX™ Enhanced Processor. . . . . . . . . . . . . . . . . . . . 2

2

The AMD-K6™ Processor RISC86® Microarchitecture

3

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 RISC86® Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3

AMD-K6™ Processor Execution Units and Dependency Latencies

9

Execution Unit Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Six-Stage Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Integer and Multimedia Execution Units . . . . . . . . . . . . . . . . . . . . . . 11 Load Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Store Unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Branch Condition Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Floating Point Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Latencies and Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Resource Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Code Sample Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 5

Instruction Dispatch and Execution Timing x86 Optimization Coding Guidelines

23 51

General x86 Optimization Techniques . . . . . . . . . . . . . . . . . . . . . . . . 51 General AMD-K6™ Processor x86 Coding Optimizations. . . . . . . . . 53 AMD-K6™ Processor Integer x86 Coding Optimizations . . . . . . . . . 57 AMD-K6™ Processor Multimedia Coding Optimizations . . . . . . . . . 61 AMD-K6™ Processor Floating-Point Coding Optimizations . . . . . . . 62

6

Considerations for Other Processors

67
iii

Contents

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

iv

Contents

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

List of Tables
Table 1. Table 2. Table 3. Table 4. Table 5. Table 6. Table 7. Table 8. Table 9. Table 10. Table 11. Table 12. Table 13. Table 14. RISC86® Execution Latencies and Throughput . . . . . . 16 Sample 1 – Integer Register Operations . . . . . . . . . . . . 19 Sample 2 – Integer Register and Memory Load Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Sample 3 – Integer Register and Memory Load/Store Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 Sample 4 – Integer, MMX™, and Memory Load/Store Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Integer Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 MMX™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Floating-Point Instructions . . . . . . . . . . . . . . . . . . . . . . . 46 Decode Accumulation and Serialization . . . . . . . . . . . . 54 Specific Optimizations and Guidelines for AMD-K6™ and AMD-K5™ Processors . . . . . . . . . . . . . . . . . . . . . . . 67 AMD-K6™ Processor Versus Pentium® Processor-Specific Optimizations and Guidelines . . . . 69 AMD-K6™ Processor and Pentium® Processor with Optimizations for MMX™ Instructions . . . . . . . . . . . . . 71 AMD-K6™ Processor and Pentium® Pro Processor-Specific Optimizations. . . . . . . . . . . . . . . . . . 71 AMD-K6™ Processor and Pentium® Pro with Optimizations for MMX™ Instructions . . . . . . . . . . . . . 73

List of Tables

v

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

vi

List of Tables

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

List of Figures
Figure 1. AMD-K6™ MMX™ Enhanced Processor Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Figure 2. AMD-K6™ Processor Pipeline . . . . . . . . . . . . . . . . . . . . . 11 Figure 3. Integer/Multimedia Execution Unit . . . . . . . . . . . . . . . . . 12 Figure 4. Load Execution Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Figure 5. Store Execution Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

List of Figures

vii

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

viii

List of Figures

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Revision History
Date August 1997 Rev A Initial Release Description

Revision History

ix

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

x

Revision History

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

1
Introduction
Purpose
The AMD-K6™ MMX™ enhanced processor is the newest microprocessor in the AMD K86™ family of microprocessors. The AMD-K6 processor can efficiently execute code written for previous-generation x86 processors. However, there are many ways to get higher performance from the AMD-K6 processor. This document contains information to assist programmers in creating optimized code for the AMD-K6 processor. This document is targeted at compiler/assembler designers and assembly language programmers writing high-performance code sequences. It is assumed that the reader possesses an in-depth knowledge of the x86 microarchitecture.

The AMD-K6™ Family of Processors
Processors in the AMD-K6™ family use a decoupled instruction decode and execution superscalar microarchitecture, including s t a t e -o f -t h e -a r t R I S C d e s i g n t e ch n i q u e s , t o d e l ive r sixth-generation performance with full x86 binary software Chapter 1 Introduction 1

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

compatibility. An x86 binary-compatible processor implements the industry-standard x86 instruction set by decoding and executing the x86 instruction set as its native mode of operation. Only this native mode permits delivery of maximum performance when running PC software.

The AMD-K6™ MMX™ Enhanced Processor
The AMD-K6 MMX enhanced processor, the first in the AMD-K6 family, brings superscalar RISC performance to desktop systems running industry-standard x86 software. This processor implements advanced design techniques such as instruction pre-decoding, multiple x86 opcode decoding, single-cycle internal RISC operations, multiple parallel execution units, out-of-order execution, data-forwarding, register renaming, and dynamic branch prediction. In other words, the AMD-K6 processor is capable of issuing, executing, and retiring multiple x86 instructions per cycle, resulting in superior scaleable performance. Although the AMD-K6 processor is capable of extracting code parallelism out of off-the-shelf, commercially available x86 software, specific code optimizations for the AMD-K6 processor can result in even higher delivered performance. This document describes the RISC86 ® microarchitecture in the AMD-K6 processor and makes recommendations for optimizing execution of x86 software on the processor. The coding techniques for achieving peak performance on the AMD-K6 processor include, but are not limited to, those recommended for the Pentium® and Pentium Pro processors. However, many of these optimizations are not necessary for the AMD-K6 processor to achieve maximum performance. Due to the more flexible pipeline control of the AMD-K6 microarchitecture, the AMD-K6 processor is not as sensitive to instruction selection and the scheduling of code. This flexibility is one of the distinct advantages of the AMD-K6 processor microarchitecture.

2

Introduction

Chapter 1

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

2
The AMD-K6™ Processor RISC86® Microarchitecture
Overview
Whe n d i sc u s s in g p ro c e s so r d e si g n , it is i m p o r t a n t t o understand the terms architecture, microarchitecture, and design implementation. The term architecture refers to the instruction set and features of a processor that are visible to software p rog ra ms r unni ng on the proces s o r. The archi te c t u re determines w hat softwa re the processor can run. The a rch i t e c t u re o f t h e A M D -K 6 M M X p r o c e s s o r i s t h e industry-standard x86 instruction set. The term microarchitecture refers to the design techniques used in the processor to reach the target cost, performance, and functionality goals. The AMD-K6 processor is based on a sophisticated RISC core known as the enhanced RISC86 microarchitecture. The enhanced RISC86 microarchitecture is an advanced, second-order decoupled decode/execution design approach that enables industry-leading performance for x86-based software. The term design implementation refers to the actual logic and circuit designs from which the processor is created according to the microarchitecture specifications.

Chapter 2

The AMD-K6™ Processor RISC86® Microarchitecture

3

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

RISC86® Microarchitecture
The e nh a n c e d R I S C8 6 m i c ro a rchi t e c t u re d e f i n e s t he characteristics of the AMD-K6 MMX enhanced processor. The innovative RISC86 microarchitecture approach implements the x86 instruction set by internally translating x86 instructions into RISC86 operations. These RISC86 operations were specially designed to include direct support for the x86 instruction set while observing the RISC performance principles of fixed-length encoding, regularized instruction fields, and a large register set. The enhanced RISC86 microarchitecture used in the AMD-K6 enables higher processor core performance and promotes straightforward extendibility in future designs. Instead of executing complex x86 instructions, which have lengths of 1 to 15 bytes, the AMD-K6 processor executes the simpler fixed-length RISC86 opcodes, while maintaining the instruction coding efficiencies found in x86 programs. Th e AM D -K6 proces s or i nc lud es para l l el decode rs , a centralized scheduler, and seven execution units that support superscalar operation— multiple decode, execution, and retirement— of x86 instructions. These elements are packed into an aggressive and very efficient six-stage pipeline. Decoding of the x86 instructions into RISC86 operations begins when the on-chip level-one instruction cache is filled. Predecode logic determines the length of an x86 instruction on a byte-by-byte basis. This predecode information is stored, alongside the x86 instructions, in a dedicated level-one predecode cache to be used later by the decoders. Up to two x86 instructions are decoded per clock on-the-fly, resulting in a maximum of four RISC86 operations per clock with no additional latency. The AMD-K6 processor categorizes x86 instructions into three types of decodes — short, long, and vector. The decoders process either two short, one long, or one vectored decode at a t i m e . Th e t h re e t y p e s o f d e c o d e s h ave t h e fo l l ow i n g characteristics:
s

Short decode—common x86 instructions less than or equal to 7 bytes in length that produce one or two RISC86 operations. Chapter 2

4

The AMD-K6™ Processor RISC86 ® Microarchitecture

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

s

s

Long decode—more complex and somewhat common x86 instructions less than or equal to 11 bytes in length that produce up to four RISC86 operations. Vectored decode—complex x86 instructions requiring long sequences of RISC86 operations.

Short and long decodes are processed completely within the decoders. Vectored decodes are started by the decoders with the generation of an initial set of four RISC86 operations, and then completed by fetching a sequence of additional operations from an on-chip ROM (at a rate of four operations per clock). RISC86 operations, whether produced by decoders or fetched from ROM, are then sent to a buffer in the centralized scheduler for dispatch to the execution units. The internal RISC86 instruction set consists of the following six categories or types of operations (the execution unit that handles each type of operation is displayed in parenthesis):
s s s s s s

Memory load operations (load) Memory store operations (store) Integer register operations (alu/alux) MMX register operations (meu) Floating-point register operations (float) Branch condition evaluations (branch)

The following example shows a series of x86 instructions and the corresponding decoded RISC86 operations.
x86 Instructions MOV ADD CMP JZ CX, [SP+4] AX,BX CX,[AX] foo RISC86 Operations Load Alu (Add) Load Alu (Sub) Branch

The MOV instruction converts to a RISC86 load that requires indirect data to be loaded from memory. The ADD instruction converts to an alu function that can be sent to either of the integer units. The CMP instruction converts into two RISC86 instructions. The first RISC86 load operation requires indirect data to be loaded from memory. That value is then compared (alu function) with CX.

Chapter 2

The AMD-K6™ Processor RISC86® Microarchitecture

5

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Once the RISC86 operations are in the centralized scheduler buffer they are ready for the scheduler to issue them to the appropriate execution unit. The AMD-K6 processor contains seven execution units — Integer X, Integer Y, Multimedia, Load, Store, Branch, and Floating-Point. Figure 1 shows a block diagram of these units.
Predecode Logic Level-One Instruction Cache (32KByte + Predecode) 16-Byte Fetch Level-One Cache Controller Dual Instruction Decoders x86 to RISC86 Four RISC86 Decode Scheduler Buffer (24 RISC86) Instruction Control Unit 64-Entry ITLB Branch Logic (8192-Entry BHT) (16-Entry BTC) (16-Entry RAS)

Socket 7 Bus Interface

Out-of-Order Execution Engine Six RISC86 Operation Issue

Load Unit

Store Unit

Integer X (Register) Unit

Multimedia Unit

Integer Y (Register) Unit

Floating-Point Unit

Branch (Resolving) Unit

Store Queue

Level-One Dual-Port Data Cache (32KByte)

128-Entry DTLB

Figure 1. AMD-K6™ MMX™ Enhanced Processor Block Diagram The centralized scheduler buffer, in conjunction with the instruction control unit (ICU/scheduler), buffers and manages up to 24 RISC86 operations at a time (which equates with up to 12 x86 instructions). This buffer size (24) is well matched to the processor’s six-stage RISC86 pipeline and seven parallel execution units. On every clock, the centralized scheduler buffer can accept up to four RISC86 operations from the decoders and issue up to six RISC86 operations to corresponding execution units. (Six RISC86 operations can be issued at a time because the alux and multimedia execution units share the same pipeline.) When managing the 24 RISC86 operations, the scheduler uses 4 8 p hy s i c a l re g i s t e r s c o n t a i n e d w i t h i n t h e R I S C 8 6 6 The AMD-K6™ Processor RISC86® Microarchitecture Chapter 2

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

microarchitecture. The 48 physical registers are located in a general register file and are grouped as 24 committed or architectural registers plus 24 rename registers. The 24 architectural registers consist of 16 scratch registers and eight re g i st e rs t h a t c o r re s p o n d t o t h e x 8 6 g e n e ra l-pu rp o se registers— EAX, EBX, ECX, EDX, EBP, ESP, ESI, and EDI. The AMD-K6 processor offers sophisticated dynamic branch logic that includes the following elements:
s s s

Branch history/prediction table Branch target cache Return address stack

These components serve to minimize or eliminate the delays due to the branch instructions (jumps, calls, returns) common in x86 software. The AMD-K6 processor implements a two-level branch prediction scheme based on an 8192-entry branch history table. The branch history table stores prediction information that is used for predicting the direction of conditional branches. The target addresses of conditional and unconditional branches are not predicted, but instead are calculated on-the-fly during instruction decode by special branch target address ALUs. The branch target cache augments performance of taken branches by avoiding a one-cycle cache-fetch penalty. This specialized target cache does this by supplying the first 16 bytes of target instructions to the decoders when a branch is taken. The return address stack serves to optimize CALL/RET instruction pairs by remembering the return address of each CALL within a nested series of subroutines and supplying it as the predicted target address of the corresponding RET instruction. As shown in Figure 1 on page 6, the high-performance, out-of-order execution engine is mated to a split 64-Kbyte (Harvard architecture) writeback level-one cache with 32 Kbytes of instruction cache and 32 Kbytes of data cache. The level-one instruction cache feeds the decoders and, in turn, the decoders feed the scheduler. The ICU controls the issue and retirement of RISC86 operations contained in the centralized scheduler buffer. The level-one data cache satisfies most memory reads and writes by the load and store execution units. Chapter 2 The AMD-K6™ Processor RISC86® Microarchitecture 7

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

The store queue temporarily buffers memory writes from the store unit until they can safely be committed into the cache (that is, when all preceding operations have been found to be free of faults and branch mispredictions). The system bus i n t e r f a c e i s a n i n d u s t ry -s t a n d a rd 6 4 -b i t Pe n t i u m processor-compatible demultiplexed address/data system bus. Th e A M D -K 6 p ro c e s s o r u s e s t h e l a t e s t i n p ro c e s s o r microarchitecture techniques to provide the highest x86 performance for today’s PC. In short, the AMD-K6 processor offers true sixth-generation performance and full x86 binary software compatibility.

8

The AMD-K6™ Processor RISC86® Microarchitecture

Chapter 2

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

3
AMD-K6™ Processor Execution Units and Dependency Latencies
The AMD-K6 MMX enhanced processor contains seven specialized execution units—store, load, integer X, integer Y, multimedia, floating-point, and branch condition. Each unit operates independently and handles a specific group of the RISC86 instruction set. This chapter describes the operation of these units, their execution latencies, and how concurrent dependency chains affect those latencies. A dependency occurs when data needed in one execution unit is being processed in another unit (or the same unit). Additional latencies can occur because the dependent execution unit must wait for the data. Table 1 on page 16 provides a summary of the execution units, the operations performed within these units, the operation latency, and the operation throughput.

Chapter 3

AMD-K6™ Processor Execution Units and Dependency Latencies

9

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Execution Unit Terminology
The execution units operate on two different types of register values — operands and results. There are three types of operands and two types of results. Operands The three types of operands are as follows:
s

Address register operands—used for address calculations of load and store operations Data register operands—used for register operations Store data register operands—used for memory stores

s s

Results

The two types of results are as follows:
s s

Data register results—from load or register operations Address register results—from Lea or Push operations

The following examples illustrate the operand and result definitions:
Add AX, BX

The Add operation has two data register operands (AX, and BX) and one data register result (AX).
Load BX, [SP+4·CX+8]

The Load operation has two address register operands (SP and CX as base and index registers, respectively) and a data register result (BX).
Store [SP+4·CX+8], AX

The Store operation has a store data register operand (AX) and two address register operands (SP and CX as base and index registers, respectively).
Lea SI, [SP+4·CX+8]

The Lea operation (a type of store operation) has address register operands (SP and CX as base and index registers, respectively), and an address register result.

10

AMD-K6™ Processor Execution Units and Dependency Latencies

Chapter 3

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Six-Stage Pipeline
To help visualize the operations within the AMD-K6 processor, Figure 2 illustrates the six-stage pipeline design. This is a simplified illustration in that the AMD-K6 contains multiple parallel pipelines (starting after common instruction fetch and x86 decode pipe stages), and these pipelines often execute operations out-of-order with respect to each other. This view of the AMD-K6 execution pipeline illustrates the effect of execution latencies for various types of operations. For register operations that only require one execution cycle, this pipeline is effectively shorter due to the absence of execution stage 2. The sam ple s st arting on page 19 assume t ha t the x86 instructions have already been fetched, decoded, and placed in the centralized scheduler buffer. The RISC86 operations are waiting to be dispatched to the appropriate execution units.

Instruction x86—>RISC86 Fetch Decode

RISC86 Issue

Execution Stage 1

Execution Stage 2

Retire

Figure 2. AMD-K6™ Processor Pipeline

Integer and Multimedia Execution Units
The integer X execution unit can execute all ALU operations, multiplies and divides (signed and unsigned), shifts, and rotates. Data register results are available after one clock of execution latency. The multimedia execution unit (meu) executes all MMX operations and shares pipeline control with the integer X execution unit (an integer X operation and an MMX operation cannot be dispatched simultaneously). In most cases, data register results are available after one clock and after two clocks for PMULH and PMADD operations.

Chapter 3

AMD-K6™ Processor Execution Units and Dependency Latencies

11

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

The integer Y execution unit can execute the basic word and doubleword ALU operations (ADD, AND, CMP, OR, SUB and XOR) and zero and sign-extend operations. Data register results are available after one clock. Figure 3 shows the architecture of the single-stage integer execution pipeline. The operation issue and fetch stages that precede this execution stage are not part of the execution pipeline. The data register operands are received at the end of the operand fetch pipe stage, and the data register result is produced near the end of the execution pipe stage.

Data Register Operands

Execution Stage (Integer X, Integer Y, Multimedia)

Data Register Result

Figure 3. Integer/Multimedia Execution Unit

Load Unit
The load unit is a two-stage pipelined design that performs data memory reads. This unit uses two address register operands and a memory data value as inputs and produces a data register result. The load unit has a two-clock latency from the time it receives the address register operands until it produces a data register result. Memory read data can come from either the data cache or the store queue entry for a recent store. If the data is forwarded from the store queue, there is a zero additional execution latency. This means that a dependent load operation can complete its execution one clock after a store operation completes execution.

12

AMD-K6™ Processor Execution Units and Dependency Latencies

Chapter 3

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Figure 4 shows the architecture of the two-stage load execution pipeline. The operation issue and fetch stages that precede this execution stage are not part of the execution pipeline. The address register operands are received at the end of the operand fetch pipe stage, and the data register result is produced near the end of the second execution pipe stage.

Address Register Operands (Base and Index)

Execution Stage 1 Address Calculation Stage

Memory data from Data Cache or Store Queue

Execution Stage 2 Data Cache/ Store Queue Lookup

Data Register Result

Figure 4. Load Execution Unit

Store Unit
The store execution unit is a two-stage pipelined design that performs data memory writes and/or, in some cases, produces an address register result. For inputs, the store unit uses two address register operands and, during actual memory writes, a store data register operand. This unit also produces an address register result for some store unit operations. For most store operations, which actually write to memory, the store unit produces a physical memory address and the associated bytes of data to be written. After execution completes, these results are entered in a new store queue entry. The store unit has a one-clock execution latency from the time it receives address register operands until the time it produces an address register result. The most common examples are the Load Effective Address (Lea) and Store and Update (Push) RISC86 operations, which are produced from the x86 LEA and PUSH instructions, respectively. Most store operations do not Chapter 3 AMD-K6™ Processor Execution Units and Dependency Latencies 13

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

produce an address register result and only perform a memory write. The Push operation is unique because it produces both an address register result and performs a memory write. The store unit has a one-clock execution latency from the time it receives address register operands until it enters a store memory address and data pair into the store queue. The store unit also has a three-clock latency occurring from the time it receives address register operands and a store data register operand until it enters a store memory address and data pair into the store queue. Note: Address register operands are required at the start of execution, but register data is not required until the end of execution. Figure 5 shows the architecture of the two-stage store execution pipeline. The operation issue and fetch stages that precede this execution stage are not part of the execution pipeline. The address register operands are received at the end of the operand fetch pipe stage, and the new store queue entry is created upon completion of the second execution pipe stage.
Address Register Operands (Base and Index)

Execution Stage 1 Address Calculation Stage Address Register Result Execution Stage 2 Store Data Register Operand Address Data Store Queue Entry

Figure 5. Store Execution Unit

14

AMD-K6™ Processor Execution Units and Dependency Latencies

Chapter 3

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Branch Condition Unit
The branch condition unit is separate from the branch prediction logic, which is utilized at x86 instruction decode time. This unit resolves conditional branches, such as JCC and LOOP instructions, at a rate of up to one per clock cycle.

Floating Point Unit
The floating-point unit handles all register operations for x87 instructions. The execution unit is a single-stage design that takes data register operands as inputs and produces a data register result as an output. The most common floating-point instructions have a two clock execution latency from the time it receives data register operands until the time it produces a data register result.

Latencies and Throughput
Table 1 on page 16 summarizes the latencies and throughput of each execution unit.

Chapter 3

AMD-K6™ Processor Execution Units and Dependency Latencies

15

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 1.
Execution Unit

RISC86® Execution Latencies and Throughput
Operations Integer ALU Latency Throughput 1 2–3 1 1 1 1/2 2 1 2 0 1 1 3 1 2 2 1 2–3 1 1 1 1/2 2 1 1 1 1 1 1 1 2 2

Integer X

Integer Multiply Integer Shift MMX ALU MMX Shifts, Packs, Unpack MMX Multiply Low/High MMX Multiply-Accumulate Basic ALU (16– and 32– bit operands) From Address Register Operands to Data Register Result Memory Read Data from Data Cache/Store Queue to Data Register Result From Address Register Operands to Address Register Result From Store Data Register Operands to Store Queue Entry From Address Register Operands to Store Queue Entry Resolves Branch Conditions FADD, FSUB FMUL

Multimedia

Integer Y Load

Store Branch FPU
Note:

No additional latency exists between execution of dependent operations. Bypassing of register results directly from producing execution units to the operand inputs of dependent units is fully supported. Similarly, forwarding of memory store values from the store queue to dependent load operations is supported.

16

AMD-K6™ Processor Execution Units and Dependency Latencies

Chapter 3

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Resource Constraints
To optimize code effectively, consider not only the latencies of critical dependencies, but also execution resource constraints. Due to a fixed number of execution units, only so many operations can be issued in each cycle (up to 6 RISC86 operations per cycle), even though, based on dependencies, more execution parallelism may be possible. For example, if code contains three consecutive integer operations that do not have co-dependencies, they cannot execute in parallel because there are only 2 integer execution units. The third operation is delayed by one cycle. Contention for execution resources causes delays in the issuing and execution of instructions. In addition, stalls due to resource constraints can combine with dependency latencies to cause or exacerbate stalls due to dependencies. In general, constraints that delay non-critical instructions do not impact performance because such stalls typically overlap with the execution of critical operations.

Chapter 3

AMD-K6™ Processor Execution Units and Dependency Latencies

17

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Code Sample Analysis
The samples in this section show the execution behavior of severa l se ries o f instructio ns as a functio n of decode constraints, dependencies, and execution resource constraints. The sample tables show the x86 instructions, the RISC86 operation equivalents, the clock counts, and a description of the events occurring within the processor. The following nomenclature is used to describe the current location of a RISC86 operation (RISC86op):
s s s s s s s s s s s s s s s

D IX

— —

Decode stage Issue stage of integer X unit Operand fetch stage of integer X unit Execution stage 1 of integer X unit Issue stage of integer Y unit Operand fetch stage of integer Y unit Execution stage 1 of integer Y unit Issue stage of load unit Operand fetch stage of load unit Execution stage 1 of load unit Execution stage 2 of load unit Issue stage of store unit Operand fetch stage of store unit Execution stage 1 of store unit Execution stage 2 of store unit

OX — EX1 — IY —

OY — EY1 — IL OL — —

EL1 — EL2 — IS OS — —

ES1 — ES2 —

Note: Instructions execute more efficiently (that is, without delays) when scheduled apart by suitable distances based on dependencies. In general, the samples in this section show poorly scheduled code in order to illustrate the resultant effects.

18

AMD-K6™ Processor Execution Units and Dependency Latencies

Chapter 3

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 2.

Sample 1 – Integer Register Operations
Clocks Instruction IMUL EAX, EBX RISC86op alux alux alux 1 D 2 D 3 IX 4 OX IX D D D D IY D D IX OY IY OX IX OY IY EX1 OX EY1 OY EY1 EX1 IY 5 EX1 OX IX OY EX1 OX EY1 EX1 6 7 8 9

Instruction Number 1

2 3 4 5 6 7

INC MOV SHL OR ADD SUB

ESI EAX, 8 EAX, 0x0F ESI, EDX EDI, ECX

alu alux alu alu alu

EDI, 0x07F4 limm

Comments for Each Instruction Number 1 It takes two decode cycles because IMUL is vector decoded. The IMUL instruction is executable only in the integer X unit. It is a non-pipelined 2–3 cycle latency register operation that is equivalent to three serially dependent register operations (the result of the second and third operations are AX and DX, respectively) This simple alu operation ends up in the Y pipe. A load immediate (limm) RISC86 operation does not require execution. The result value is immediately available to dependent operations Shift instructions are only executable in the integer X unit. Issue is delayed by preceding IMUL operations due to a resource constraint of the integer X unit. The register operation is ‘bumped’ out of the integer Y unit in clock 6 because it must wait for more than one cycle for its dependencies to resolve. It is re-issued in the next cycle to the integer X unit (just in time for availability of its operands) This add alu falls through to the integer Y unit right behind the first issuance of operation #5 without delay (as a result of operation #5 being bumped out of the way). The issuance of the subtract register operation is delayed in clock 6 due to the resource constraints of the integer Y unit.

2 3 4 5

6 7

Chapter 3

AMD-K6™ Processor Execution Units and Dependency Latencies

19

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 3.

Sample 2 – Integer Register and Memory Load Operations
Clocks Instruction DEC MOV SUB EDX EDI, [ECX] EAX, [EDX+20] RISC86op alu load load alu 1 D D 2 IX IL D 3 4 5 6 7 8 9 10 11 OX EX1 OL EL1 EL2 IL OL EL1 EL2 IX OX IX OX EX1 IX OX IX OX EX1 D IL IY D D D OL EL1 EL2 OY IY OY EY1 IY OY EY1 IL OL EL1 EL2 IL OL IX OL EL1 EL2 OX OX OX EX1

Instruction Number 1 2 3

4 5

SAR ADD

EAX, 5 ECX, [EDI+4]

alux load alu

D

6 7 8

AND MOV OR

EBX, 0x1F ESI, [0x0F100]

alu load

ECX, [ESI+EAX*4+8] load alu

Comments for Each Instruction Number 1 2 3 This simple alu operation ends up in the X pipe. This operation will occupy the load execution unit. The register operand for the load operation is bypassed, without delay, from the result of instruction #1’s register operand. In clock 4, the register operation is ‘bumped’ out of the integer X unit while waiting for the previous load operation result to complete. It is re-issued just in time to receive the bypassed result of the load. Shift instructions are only executable in the integer X unit. The register operation is bumped in clock 5 while waiting for the result of the receding instruction #3. The register operand for the load operation is bypassed, without delay, from the result of instruction #2’s register operand. Note how this and most surrounding load operations are generated by instruction decoders, and issued and executed by the load unit “smoothly” at a rate of one clock per cycle. In clock 5, the register operation is bumped out of the integer Y unit while waiting for the previous load operation result to complete. The register operation falls through into the integer Y unit right behind instruction #5’s register operation. This operation falls into the load unit behind the load in instruction #5 The operand fetch for the load operation is delayed because it needs the result of the immediately preceding load operation #7 as well as the results from earlier instructions #3 and #4.

4 5

6 7 8

20

AMD-K6™ Processor Execution Units and Dependency Latencies

Chapter 3

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 4.

Sample 3 – Integer Register and Memory Load/Store Operations
Clocks Instruction MOV ADD EDX, [0xA0008F00] [EDX+16], 7 RISC86op load load alu store 1 D 2 IL D 3 IL IX IS D 4 5 6 7 8 9 10 11

Instruction Number 1 2

OL EL1 EL2 OL EL1 EL2 OX IX OX EX1 OS OS ES1 ES2 ES2 IL IL OL EL1 EL2 EL2 IX D IS D D IY OX IS OX EX1 OS ES1 ES2 ES2 ES2 IS OS OS OS ES1 ES2 OY OY OY OY OY EY1 IX IX OX

3

SUB

EAX, [EDX+16]

load alu

4 5 6 1 2

PUSH EAX LEA MOV

store

EBX, [ECX+EAX*4+3] store EDI, EBX alu

Comments for Each Instruction Number This operation will occupy the load unit. This long decoded ADD instruction takes a single clock to decode. The operand fetch for the load operation is delayed waiting for the result of the previous load operation from instruction #1. The store operation completes concurrent with the register operation. The result of the register operation is bypassed directly into a new store queue entry created by the store operation. The issue of the load operation is delayed because the operand fetch of the preceding load operation from instruction #2 was delayed. The completion of the load operation is held up due to a memory dependency on the preceding store operation of instruction #2. The load operation completes immediately after the store operation, with the store data being forwarded from a new store queue entry. Completion of the store operation is held up due to a data dependency on the preceding instruction #3. The store data is bypassed directly into a new store queue entry from the result of instruction #3’s register operation. The Lea RISC86 operation is executed by the store unit. The operand fetch is delayed waiting for the result of instruction #3. The register result value is produced in the first execution stage of the store unit. This simple alu operation is stalled due to the dependency of the BX result in instruction #5.

3

4

5 6

Chapter 3

AMD-K6™ Processor Execution Units and Dependency Latencies

21

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 5.

Sample 4 – Integer, MMX™, and Memory Load/Store Operations
Clocks Instruction MOVQ PSUBSW MM0, [EAX] MM0, [EAX+16] RISC86op mload mload alux 1 D 2 IL D 3 IL IX D IY D D 4 5 6 7 8 9 10 OL EL1 EL2 OL EL1 EL2 OX OX OX EX1 OY EY1 IX IS D D IX IX OX EX1 OS ES1 ES2 IX OX EX1 EX1 IY OY EY1 D D IS IY OS ES1 ES2 ES2 OY EY1

Instruction Number 1 2

3 4 5 6 7 8 9

ADD PADDSW PUSH

EBX, ECX MM1, MM2 EBX

alu alux store alux alu mstore alu

PMADDWD MM0, MM1 ADD MOVQ ADD EAX, 32 [EDI], MM0 EDI, 8

Comments for Each Instruction Number 1 2 This multimedia operation will occupy the load unit. Instruction #2 could not be decoded along with the preceding instruction because only MMX instructions can be decoded in the first decode position. The MMX register operation is executable only by the integer X unit. The operand fetch is delayed because of the dependency of the load. This instruction can be decoded in parallel with instruction #2 because it is not an MMX instruction. It is issued to the integer Y unit in parallel with the issuing of the preceding MMX register operation in instruction #2. This instruction is only executable in the integer X unit. The issue of this MMX instruction is delayed due to the delay of the operand fetch of the preceding MMX register operation. This instruction stores the contents of BX in memory. Instruction #6 is only executable in the integer X unit. This non-pipelined unit has a two-clock execution latency for this instruction, and it is delayed due to ‘stacking-up’ behind the preceding MMX operations. This instruction is issued to the integer Y unit in parallel with the series of preceding MMX register operations being issued to the integer X unit. Completion of this store operation is held up due to a data dependency on the preceding MMX register operation from instruction #6. The store data is bypassed directly into a new store queue entry from the result of the MMX operation. This instruction is issued to the integer Y unit in parallel with the series of preceding MMX register operations being issued to the integer X unit.

3

4 5 6 7 8

9

22

AMD-K6™ Processor Execution Units and Dependency Latencies

Chapter 3

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

4
Instruction Dispatch and Execution Timing
This chapter describes the RISC86 operations executed by each instruction. Three separate tables define the integer, MMX, and floating-point instructions. The first column in these tables indicates the instruction mnemonic and operand types with the following notations:
s

s

s

s

s s s s s s s s s

reg8—byte integer register defined by instruction byte(s) or bits 5, 4, and 3 of the modR/M byte mreg8—byte integer register defined by bits 2, 1, and 0 of the modR/M byte reg16/32—word and doubleword integer register defined by instruction byte(s) or bits 5, 4, and 3 of the modR/M byte mreg16/32—word and doubleword integer register defined by bits 2, 1, and 0 of the modR/M byte mem8—byte memory location mem16/32—word or doubleword memory location mem32/48—doubleword or 6-byte memory location mem48—48-bit integer value in memory mem64—64-bit value in memory imm8/16/32—8-bit, 16-bit or 32-bit immediate value disp8—8-bit displacement value disp16/32—16-bit or 32-bit displacement value disp32/48—32-bit or 48-bit displacement value Instruction Dispatch and Execution Timing 23

Chapter 4

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

s s s s s s

eXX—register width depending on the operand size mem32real—32-bit floating-point value in memory mem64real—64-bit floating-point value in memory mem80real—80-bit floating-point value in memory mmreg—MMX register mmreg1—MMX register defined by bits 5, 4, and 3 of the modR/M byte mmreg2—MMX register defined by bits 2, 1, and 0 of the modR/M byte

s

The second and third columns list all applicable encoding opcode bytes. The fourth column lists the modR/M byte when used by the instruction. The modR/M byte defines the instruction as register or memory form. If mod bits 7 and 6 are documented as mm (memory form), mm can only be 10b, 01b, or 00b. The fifth column lists the type of instruction decode — short, long, or vectored. The AMD-K6 MMX enhanced processor decode logic can process two short, one long, or one vectored decode per clock . In addition, two short integer, one short integer and one short MMX, or one short integer and one short FPU instruction can be decoded simultaneously. Note: In order to simultaneously decode an integer with a floating-point or MMX instruction, the floating-point or MMX instruction must precede the integer instruction. The sixth column lists the type of RISC86 operation(s) required for the instruction. The operation types and corresponding execution units are as follows:
s s s s s s s s

load, fload, mload—load unit store, fstore, mstore—store unit alu—either of the integer execution units alux—integer X execution unit only branch—branch condition unit float—floating-point execution unit meu—multimedia execution unit limm—load immediate, instruction control unit

24

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

The operation(s) of most instructions form a single dependency chain. For instructions whose operations form two parallel dependency chains, the RISC86 operations and execution latency for each dependency chain is shown on a separate row. Table 6. Integer Instructions
First Second Byte Byte 37h D5h D4h 3Fh 10h 10h 11h 11h 12h 12h 13h 13h 14h 15h 80h 80h 81h 81h 83h 83h 00h 00h 01h 01h 02h 02h 03h 03h 04h 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx xx-xxx-xxx xx-xxx-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx xx-xxx-xxx 0Ah 0Ah modR/M Byte Decode Type vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector short long short long short short short short short alux load, alux, store alu load, alu, store alux load, alux alu load, alu alux RISC86® Opcodes

Instruction Mnemonic AAA AAD AAM AAS ADC mreg8, reg8 ADC mem8, reg8 ADC mreg16/32, reg16/32 ADC mem16/32, reg16/32 ADC reg8, mreg8 ADC reg8, mem8 ADC reg16/32, mreg16/32 ADC reg16/32, mem16/32 ADC AL, imm8 ADC EAX, imm16/32 ADC mreg8, imm8 ADC mem8, imm8 ADC mreg16/32, imm16/32 ADC mem16/32, imm16/32 ADC mreg16/32, imm8 (signed ext.) ADC mem16/32, imm8 (signed ext.) ADD mreg8, reg8 ADD mem8, reg8 ADD mreg16/32, reg16/32 ADD mem16/32, reg16/32 ADD reg8, mreg8 ADD reg8, mem8 ADD reg16/32, mreg16/32 ADD reg16/32, mem16/32 ADD AL, imm8

Chapter 4

Instruction Dispatch and Execution Timing

25

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 6.

Integer Instructions (continued)
First Second Byte Byte 05h 80h 80h 81h 81h 83h 83h 20h 20h 21h 21h 22h 22h 23h 23h 24h 25h 80h 80h 81h 81h 83h 83h 63h 63h 62h 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh BCh BCh BDh BDh C8h C9h CAh modR/M Byte xx-xxx-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx xx-xxx-xxx xx-xxx-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-xxx-xxx mm-xxx-xxx xx-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Decode Type short short long short long short long short long short long short short short short short short short long short long short long vector vector vector vector vector vector vector long long long alu alu alu alu alux load, alux, store alu load, alu, store alux load, alux, store alux load, alux, store alu load, alu, store alux load, alux alu load, alu alux alu alux load, alux, store alu load, alu, store alux load, alux, store RISC86® Opcodes

Instruction Mnemonic ADD EAX, imm16/32 ADD mreg8, imm8 ADD mem8, imm8 ADD mreg16/32, imm16/32 ADD mem16/32, imm16/32 ADD mreg16/32, imm8 (signed ext.) ADD mem16/32, imm8 (signed ext.) AND mreg8, reg8 AND mem8, reg8 AND mreg16/32, reg16/32 AND mem16/32, reg16/32 AND reg8, mreg8 AND reg8, mem8 AND reg16/32, mreg16/32 AND reg16/32, mem16/32 AND AL, imm8 AND EAX, imm16/32 AND mreg8, imm8 AND mem8, imm8 AND mreg16/32, imm16/32 AND mem16/32, imm16/32 AND mreg16/32, imm8 (signed ext.) AND mem16/32, imm8 (signed ext.) ARPL mreg16, reg16 ARPL mem16, reg16 BOUND BSF reg16/32, mreg16/32 BSF reg16/32, mem16/32 BSR reg16/32, mreg16/32 BSR reg16/32, mem16/32 BSWAP EAX BSWAP ECX BSWAP EDX

26

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 6.

Integer Instructions (continued)
First Second Byte Byte 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 9Ah E8h FFh FFh FFh 98h F8h FCh FAh 0Fh F5h 38h 11-xxx-xxx 06h 11-011-xxx 11-010-xxx mm-010-xxx CBh CCh CDh CEh CFh A3h A3h BAh BAh BBh BBh BAh BAh B3h B3h BAh BAh ABh ABh BAh BAh 11-xxx-xxx mm-xxx-xxx 11-100-xxx mm-100-xxx 11-xxx-xxx mm-xxx-xxx 11-111-xxx mm-111-xxx 11-xxx-xxx mm-xxx-xxx 11-110-xxx mm-110-xxx 11-xxx-xxx mm-xxx-xxx 11-101-xxx mm-101-xxx modR/M Byte Decode Type long long long long long vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector short vector vector vector vector vector vector vector vector vector short alux store alu alu alu alu alu RISC86® Opcodes

Instruction Mnemonic BSWAP EBX BSWAP ESP BSWAP EBP BSWAP ESI BSWAP EDI BT mreg16/32, reg16/32 BT mem16/32, reg16/32 BT mreg16/32, imm8 BT mem16/32, imm8 BTC mreg16/32, reg16/32 BTC mem16/32, reg16/32 BTC mreg16/32, imm8 BTC mem16/32, imm8 BTR mreg16/32, reg16/32 BTR mem16/32, reg16/32 BTR mreg16/32, imm8 BTR mem16/32, imm8 BTS mreg16/32, reg16/32 BTS mem16/32, reg16/32 BTS mreg16/32, imm8 BTS mem16/32, imm8 CALL full pointer CALL near imm16/32 CALL mem16:16/32 CALL near mreg32 (indirect) CALL near mem32 (indirect) CBW/CWDE EAX CLC CLD CLI CLTS CMC CMP mreg8, reg8

Chapter 4

Instruction Dispatch and Execution Timing

27

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 6.

Integer Instructions (continued)
First Second Byte Byte 38h 39h 39h 3Ah 3Ah 3Bh 3Bh 3Ch 3Dh 80h 80h 81h 81h 83h 83h A6h A7h A7h 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 99h 27h 2Fh 48h 49h 4Ah 4Bh 4Ch B0h B0h B1h B1h C7h C7h A2h 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx modR/M Byte mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx xx-xxx-xxx xx-xxx-xxx 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx Decode Type short short short short short short short short short short short short long short short vector vector vector vector vector vector vector vector vector vector vector vector vector short short short short short alu alu alu alu alu alu load, alu alux load, alux alu load, alu alux alu alux load, alux alu load, alu load, alu load, alu RISC86® Opcodes load, alux

Instruction Mnemonic CMP mem8, reg8 CMP mreg16/32, reg16/32 CMP mem16/32, reg16/32 CMP reg8, mreg8 CMP reg8, mem8 CMP reg16/32, mreg16/32 CMP reg16/32, mem16/32 CMP AL, imm8 CMP EAX, imm16/32 CMP mreg8, imm8 CMP mem8, imm8 CMP mreg16/32, imm16/32 CMP mem16/32, imm16/32 CMP mreg16/32, imm8 (signed ext.) CMP mem16/32, imm8 (signed ext.) CMPSB mem8,mem8 CMPSW mem16, mem32 CMPSD mem32, mem32 CMPXCHG mreg8, reg8 CMPXCHG mem8, reg8 CMPXCHG mreg16/32, reg16/32 CMPXCHG mem16/32, reg16/32 CMPXCH8B EDX:EAX CMPXCH8B mem64 CPUID CWD/CDQ EDX, EAX DAA DAS DEC EAX DEC ECX DEC EDX DEC EBX DEC ESP

28

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 6.

Integer Instructions (continued)
First Second Byte Byte 4Dh 4Eh 4Fh FEh FEh FFh FFh F6h F6h F7h F7h F6h F6h F7h F7h 69h 69h 69h 6Bh 6Bh 6Bh F6h F6h F7h F7h 0Fh 0Fh 40h 41h 42h 43h AFh AFh 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-110-xxx mm-110-xx 11-110-xxx mm-110-xx 11-111-xxx mm-111-xx 11-111-xxx mm-111-xx 11-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-101-xxx mm-101-xx 11-101-xxx mm-101-xx 11-xxx-xxx mm-xxx-xxx modR/M Byte Decode Type short short short vector long vector long vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector short short short short alu alu alu alu load, alu, store load, alux, store alu alu alu RISC86® Opcodes

Instruction Mnemonic DEC EBP DEC ESI DEC EDI DEC mreg8 DEC mem8 DEC mreg16/32 DEC mem16/32 DIV AL, mreg8 DIV AL, mem8 DIV EAX, mreg16/32 DIV EAX, mem16/32 IDIV mreg8 IDIV mem8 IDIV EAX, mreg16/32 IDIV EAX, mem16/32 IMUL reg16/32, imm16/32 IMUL reg16/32, mreg16/32, imm16/32 IMUL reg16/32, mem16/32, imm16/32 IMUL reg16/32, imm8 (sign extended) IMUL reg16/32, mreg16/32, imm8 (signed) IMUL reg16/32, mem16/32, imm8 (signed) IMUL AX, AL, mreg8 IMUL AX, AL, mem8 IMUL EDX:EAX, EAX, mreg16/32 IMUL EDX:EAX, EAX, mem16/32 IMUL reg16/32, mreg16/32 IMUL reg16/32, mem16/32 INC EAX INC ECX INC EDX INC EBX

Chapter 4

Instruction Dispatch and Execution Timing

29

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 6.

Integer Instructions (continued)
First Second Byte Byte 44h 45h 46h 47h FEh FEh FFh FFh 0Fh 0Fh 70h 71h 71h 73h 74h 75h 76h 77h 78h 79h 7Ah 7Bh 7Ch 7Dh 7Eh 7Fh E3h 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 80h 81h 82h 83h 84h 85h 08h 01h mm-111-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx modR/M Byte Decode Type short short short short vector long vector long vector vector short short short short short short short short short short short short short short short short vector short short short short short short branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch load, alu, store load, alux, store alu alu alu alu RISC86® Opcodes

Instruction Mnemonic INC ESP INC EBP INC ESI INC EDI INC mreg8 INC mem8 INC mreg16/32 INC mem16/32 INVD INVLPG JO short disp8 JB/JNAE short disp8 JNO short disp8 JNB/JAE short disp8 JZ/JE short disp8 JNZ/JNE short disp8 JBE/JNA short disp8 JNBE/JA short disp8 JS short disp8 JNS short disp8 JP/JPE short disp8 JNP/JPO short disp8 JL/JNGE short disp8 JNL/JGE short disp8 JLE/JNG short disp8 JNLE/JG short disp8 JCXZ/JEC short disp8 JO near disp16/32 JNO near disp16/32 JB/JNAE near disp16/32 JNB/JAE near disp16/32 JZ/JE near disp16/32 JNZ/JNE near disp16/32

30

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 6.

Integer Instructions (continued)
First Second Byte Byte 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh E9h EAh EBh EFh EFh FFh FFh 9Fh 0Fh 0Fh C5h 8Dh C9h C4h 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh B4h 01h B5h 01h 00h 00h 01h mm-011-xxx 11-010-xxx mm-010-xxx 11-100-xxx mm-010-xxx mm-xxx-xxx 02h 02h 11-xxx-xxx mm-xxx-xxx mm-xxx-xxx mm-xxx-xxx 11-101-xxx mm-101-xxx 11-100-xxx mm-100-xxx 86h 87h 88h 89h 8Ah 8Bh 8Ch 8Dh 8Eh 8Fh modR/M Byte Decode Type short short short short short short short short short short short vector short vector vector vector vector vector vector vector vector short long vector vector vector vector vector vector vector vector alu store alu load, alu RISC86® Opcodes branch branch branch branch branch branch branch branch branch branch

Instruction Mnemonic JBE/JNA near disp16/32 JNBE/JA near disp16/32 JS near disp16/32 JNS near disp16/32 JP/JPE near disp16/32 JNP/JPO near disp16/32 JL/JNGE near disp16/32 JNL/JGE near disp16/32 JLE/JNG near disp16/32 JNLE/JG near disp16/32 JMP near disp16/32 (direct) JMP far disp32/48 (direct) JMP disp8 (short) JMP far mreg32 (indirect) JMP far mem32 (indirect) JMP near mreg16/32 (indirect) JMP near mem16/32 (indirect) LAHF LAR reg16/32, mreg16/32 LAR reg16/32, mem16/32 LDS reg16/32, mem32/48 LEA reg16/32, mem16/32 LEAVE LES reg16/32, mem32/48 LFS reg16/32, mem32/48 LGDT mem48 LGS reg16/32, mem32/48 LIDT mem48 LLDT mreg16 LLDT mem16 LMSW mreg16

Chapter 4

Instruction Dispatch and Execution Timing

31

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 6.

Integer Instructions (continued)
First Second Byte Byte 0Fh ACh ADh ADh E2h E1h E0h 0Fh 0Fh 0Fh 0Fh 0Fh 88h 88h 89h 89h 8Ah 8Ah 8Bh 8Bh 8Ch 8Ch 8Eh 8Eh A0h A1h A2h A3h B0h B1h 03h 03h B2h 00h 00h 11-xxx-xxx mm-xxx-xxx mm-xxx-xxx 11-011-xxx mm-011-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 01h modR/M Byte mm-100-xxx Decode Type vector long long long short vector vector vector vector vector vector vector short short short short short short short short long vector vector vector short short short short short short load load store store limm limm alux store alu store alux load alu load load load alu load alu load alu branch alu RISC86® Opcodes

Instruction Mnemonic LMSW mem16 LODSB AL, mem8 LODSW AX, mem16 LODSD EAX, mem32 LOOP disp8 LOOPE/LOOPZ disp8 LOOPNE/LOOPNZ disp8 LSL reg16/32, mreg16/32 LSL reg16/32, mem16/32 LSS reg16/32, mem32/48 LTR mreg16 LTR mem16 MOV mreg8, reg8 MOV mem8, reg8 MOV mreg16/32, reg16/32 MOV mem16/32, reg16/32 MOV reg8, mreg8 MOV reg8, mem8 MOV reg16/32, mreg16/32 MOV reg16/32, mem16/32 MOV mreg16, segment reg MOV mem16, segment reg MOV segment reg, mreg16 MOV segment reg, mem16 MOV AL, mem8 MOV EAX, mem16/32 MOV mem8, AL MOV mem16/32, EAX MOV AL, imm8 MOV CL, imm8

32

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 6.

Integer Instructions (continued)
First Second Byte Byte B2h B3h B4h B5h B6h B7h B8h B9h BAh BBh BCh BDh BEh BFh C6h C6h C7h C7h A4h 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx modR/M Byte Decode Type short short short short short short short short short short short short short short short long short long long RISC86® Opcodes limm limm limm limm limm limm limm limm limm limm limm limm limm limm limm store limm store load, store alu alu load, store alu alu load, store alu alu alu load, alu alu load, alu alu load, alu alu

Instruction Mnemonic MOV DL, imm8 MOV BL, imm8 MOV AH, imm8 MOV CH, imm8 MOV DH, imm8 MOV BH, imm8 MOV EAX, imm16/32 MOV ECX, imm16/32 MOV EDX, imm16/32 MOV EBX, imm16/32 MOV ESP, imm16/32 MOV EBP, imm16/32 MOV ESI, imm16/32 MOV EDI, imm16/32 MOV mreg8, imm8 MOV mem8, imm8 MOV reg16/32, imm16/32 MOV mem16/32, imm16/32 MOVSB mem8,mem8

MOVSD mem16, mem16

A5h

long

MOVSW mem32, mem32 MOVSX reg16/32, mreg8 MOVSX reg16/32, mem8 MOVSX reg32, mreg16 MOVSX reg32, mem16 MOVZX reg16/32, mreg8 MOVZX reg16/32, mem8 MOVZX reg32, mreg16

A5h 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh BEh BEh BFh BFh B6h B6h B7h 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx

long short short short short short short short

Chapter 4

Instruction Dispatch and Execution Timing

33

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 6.

Integer Instructions (continued)
First Second Byte Byte 0Fh F6h F6h F7h F7h F6h F6h F7h F7h 90h F6h F6h F7h F7h 08h 08h 09h 09h 0Ah 0Ah 0Bh 0Bh 0Ch 0Dh 80h 80h 81h 81h 83h 83h 07h 17h 1Fh 11-010-xxx mm-010-xx 11-010-xxx mm-010-xx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx xx-xxx-xxx xx-xxx-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx B7h modR/M Byte mm-xxx-xxx 11-100-xxx mm-100-xx 11-100-xxx mm-100-xx 11-011-xxx mm-011-xx 11-011-xxx mm-011-xx Decode Type short vector vector vector vector short vector short vector short short vector short vector short long short long short short short short short short short long short long short long vector vector vector alux load, alux, store alu load, alu, store alux load, alux alu load, alu alux alu alux load, alux, store alu load, alu, store alux load, alux, store alu limm alux alu alux RISC86® Opcodes load, alu

Instruction Mnemonic MOVZX reg32, mem16 MUL AL, mreg8 MUL AL, mem8 MUL EAX, mreg16/32 MUL EAX, mem16/32 NEG mreg8 NEG mem8 NEG mreg16/32 NEG mem16/32 NOP (XCHG AX, AX) NOT mreg8 NOT mem8 NOT mreg16/32 NOT mem16/32 OR mreg8, reg8 OR mem8, reg8 OR mreg16/32, reg16/32 OR mem16/32, reg16/32 OR reg8, mreg8 OR reg8, mem8 OR reg16/32, mreg16/32 OR reg16/32, mem16/32 OR AL, imm8 OR EAX, imm16/32 OR mreg8, imm8 OR mem8, imm8 OR mreg16/32, imm16/32 OR mem16/32, imm16/32 OR mreg16/32, imm8 (signed ext.) OR mem16/32, imm8 (signed ext.) POP ES POP SS POP DS

34

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 6.

Integer Instructions (continued)
First Second Byte Byte 0Fh 0Fh 58h 59h 5Ah 5Bh 5Ch 5Dh 5Eh 5Fh 8Fh 8Fh 61h 9Dh 06h 0Eh 0Fh 0Fh 16h 1Eh 50h 51h 52h 53h 54h A0h A8h 11-000-xxx mm-000-xxx A1h A9h modR/M Byte Decode Type vector vector short short short short short short short short short long vector vector long vector vector vector vector long short short short short short load, store store store store store store load, store load alu load alu load alu load alu load alu load alu load alu load alu load alu load, store alu RISC86® Opcodes

Instruction Mnemonic POP FS POP GS POP EAX POP ECX POP EDX POP EBX POP ESP POP EBP POP ESI POP EDI POP mreg POP mem POPA/POPAD POPF/POPFD PUSH ES PUSH CS PUSH FS PUSH GS PUSH SS PUSH DS PUSH EAX PUSH ECX PUSH EDX PUSH EBX PUSH ESP

Chapter 4

Instruction Dispatch and Execution Timing

35

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 6.

Integer Instructions (continued)
First Second Byte Byte 55h 56h 57h 6Ah 68h FFh FFh 60h 9Ch C0h C0h C1h C1h D0h D0h D1h D1h D2h D2h D3h D3h C0h C0h C1h C1h D0h D0h D1h D1h D2h D2h D3h D3h 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx 11-110-xxx mm-110-xxx modR/M Byte Decode Type short short short long long vector long vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector load, store RISC86® Opcodes store store store store store

Instruction Mnemonic PUSH EBP PUSH ESI PUSH EDI PUSH imm8 PUSH imm16/32 PUSH mreg16/32 PUSH mem16/32 PUSHA/PUSHAD PUSHF/PUSHFD RCL mreg8, imm8 RCL mem8, imm8 RCL mreg16/32, imm8 RCL mem16/32, imm8 RCL mreg8, 1 RCL mem8, 1 RCL mreg16/32, 1 RCL mem16/32, 1 RCL mreg8, CL RCL mem8, CL RCL mreg16/32, CL RCL mem16/32, CL RCR mreg8, imm8 RCR mem8, imm8 RCR mreg16/32, imm8 RCR mem16/32, imm8 RCR mreg8, 1 RCR mem8, 1 RCR mreg16/32, 1 RCR mem16/32, 1 RCR mreg8, CL RCR mem8, CL RCR mreg16/32, CL RCR mem16/32, CL

36

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 6.

Integer Instructions (continued)
First Second Byte Byte C2h C3h CAh CBh C0h C0h C1h C1h D0h D0h D1h D1h D2h D2h D3h D3h C0h C0h C1h C1h D0h D0h D1h D1h D2h D2h D3h D3h 9Eh C0h C0h C1h C1h 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx modR/M Byte Decode Type vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector short vector short vector alu alux RISC86® Opcodes

Instruction Mnemonic RET near imm16 RET near RET far imm16 RET far ROL mreg8, imm8 ROL mem8, imm8 ROL mreg16/32, imm8 ROL mem16/32, imm8 ROL mreg8, 1 ROL mem8, 1 ROL mreg16/32, 1 ROL mem16/32, 1 ROL mreg8, CL ROL mem8, CL ROL mreg16/32, CL ROL mem16/32, CL ROR mreg8, imm8 ROR mem8, imm8 ROR mreg16/32, imm8 ROR mem16/32, imm8 ROR mreg8, 1 ROR mem8, 1 ROR mreg16/32, 1 ROR mem16/32, 1 ROR mreg8, CL ROR mem8, CL ROR mreg16/32, CL ROR mem16/32, CL SAHF SAR mreg8, imm8 SAR mem8, imm8 SAR mreg16/32, imm8 SAR mem16/32, imm8

Chapter 4

Instruction Dispatch and Execution Timing

37

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 6.

Integer Instructions (continued)
First Second Byte Byte D0h D0h D1h D1h D2h D2h D3h D3h 18h 18h 19h 19h 1Ah 1Ah 1Bh 1Bh 1Ch 1Dh 80h 80h 81h 81h 83h 83h AEh AFh AFh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 90h 90h 91h 91h 92h 92h 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx modR/M Byte 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx xx-xxx-xxx xx-xxx-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx Decode Type short vector short vector short vector short vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector alu alux alu RISC86® Opcodes alux

Instruction Mnemonic SAR mreg8, 1 SAR mem8, 1 SAR mreg16/32, 1 SAR mem16/32, 1 SAR mreg8, CL SAR mem8, CL SAR mreg16/32, CL SAR mem16/32, CL SBB mreg8, reg8 SBB mem8, reg8 SBB mreg16/32, reg16/32 SBB mem16/32, reg16/32 SBB reg8, mreg8 SBB reg8, mem8 SBB reg16/32, mreg16/32 SBB reg16/32, mem16/32 SBB AL, imm8 SBB EAX, imm16/32 SBB mreg8, imm8 SBB mem8, imm8 SBB mreg16/32, imm16/32 SBB mem16/32, imm16/32 SBB mreg8, imm8 (signed ext.) SBB mem8, imm8 (signed ext.) SCASB AL, mem8 SCASW AX, mem16 SCASD EAX, mem32 SETO mreg8 SETO mem8 SETNO mreg8 SETNO mem8 SETB/SETNAE mreg8 SETB/SETNAE mem8

38

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 6.

Integer Instructions (continued)
First Second Byte Byte 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh C0h C0h C1h C1h D0h 93h 93h 94h 94h 95h 95h 96h 96h 97h 97h 98h 98h 99h 99h 9Ah 9Ah 9Bh 9Bh 9Ch 9Ch 9Dh 9Dh 9Eh 9Eh 9Fh 9Fh 01h 01h modR/M Byte 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx mm-000-xxx mm-001-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-100-xxx Decode Type vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector short vector short vector short alux alu alux RISC86® Opcodes

Instruction Mnemonic SETNB/SETAE mreg8 SETNB/SETAE mem8 SETZ/SETE mreg8 SETZ/SETE mem8 SETNZ/SETNE mreg8 SETNZ/SETNE mem8 SETBE/SETNA mreg8 SETBE/SETNA mem8 SETNBE/SETA mreg8 SETNBE/SETA mem8 SETS mreg8 SETS mem8 SETNS mreg8 SETNS mem8 SETP/SETPE mreg8 SETP/SETPE mem8 SETNP/SETPO mreg8 SETNP/SETPO mem8 SETL/SETNGE mreg8 SETL/SETNGE mem8 SETNL/SETGE mreg8 SETNL/SETGE mem8 SETLE/SETNG mreg8 SETLE/SETNG mem8 SETNLE/SETG mreg8 SETNLE/SETG mem8 SGDT mem48 SIDT mem48 SHL/SAL mreg8, imm8 SHL/SAL mem8, imm8 SHL/SAL mreg16/32, imm8 SHL/SAL mem16/32, imm8 SHL/SAL mreg8, 1

Chapter 4

Instruction Dispatch and Execution Timing

39

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 6.

Integer Instructions (continued)
First Second Byte Byte D0h D1h D1h D2h D2h D3h D3h C0h C0h C1h C1h D0h D0h D1h D1h D2h D2h D3h D3h 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh F9h FDh A4h A4h A5h A5h ACh ACh ADh ADh 00h 00h 01h 01h modR/M Byte mm-100-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-000-xxx mm-000-xxx 11-100-xxx mm-100-xxx Decode Type vector short vector short vector short vector short vector short vector short vector short vector short vector short vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector alu alux alu alux alu alux alu alux alu RISC86® Opcodes

Instruction Mnemonic SHL/SAL mem8, 1 SHL/SAL mreg16/32, 1 SHL/SAL mem16/32, 1 SHL/SAL mreg8, CL SHL/SAL mem8, CL SHL/SAL mreg16/32, CL SHL/SAL mem16/32, CL SHR mreg8, imm8 SHR mem8, imm8 SHR mreg16/32, imm8 SHR mem16/32, imm8 SHR mreg8, 1 SHR mem8, 1 SHR mreg16/32, 1 SHR mem16/32, 1 SHR mreg8, CL SHR mem8, CL SHR mreg16/32, CL SHR mem16/32, CL SHLD mreg16/32, reg16/32, imm8 SHLD mem16/32, reg16/32, imm8 SHLD mreg16/32, reg16/32, CL SHLD mem16/32, reg16/32, CL SHRD mreg16/32, reg16/32, imm8 SHRD mem16/32, reg16/32, imm8 SHRD mreg16/32, reg16/32, CL SHRD mem16/32, reg16/32, CL SLDT mreg16 SLDT mem16 SMSW mreg16 SMSW mem16 STC STD

40

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 6.

Integer Instructions (continued)
First Second Byte Byte FBh AAh ABh ABh 0Fh 0Fh 28h 28h 29h 29h 2Ah 2Ah 2Bh 2Bh 2Ch 2Dh 80h 80h 81h 81h 83h 83h 84h 84h 85h 85h A8h A9h F6h F6h F7h 11-000-xxx mm-000-xx 11-000-xxx 00h 00h 11-001-xxx mm-001-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx xx-xxx-xxx xx-xxx-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx modR/M Byte Decode Type vector long long long vector vector short long short long short short short short short short short long short long short long short vector short vector long long long long long alux alu alux load, alux alu alu alux load, alux, store alu load, alu, store alux load, alux alu load, alu alux alu alux load, alux, store alu load, alu, store alux load, alux, store alux store alux store alux store alux RISC86® Opcodes

Instruction Mnemonic STI STOSB mem8, AL STOSW mem16, AX STOSD mem32, EAX STR mreg16 STR mem16 SUB mreg8, reg8 SUB mem8, reg8 SUB mreg16/32, reg16/32 SUB mem16/32, reg16/32 SUB reg8, mreg8 SUB reg8, mem8 SUB reg16/32, mreg16/32 SUB reg16/32, mem16/32 SUB AL, imm8 SUB EAX, imm16/32 SUB mreg8, imm8 SUB mem8, imm8 SUB mreg16/32, imm16/32 SUB mem16/32, imm16/32 SUB mreg16/32, imm8 (signed ext.) SUB mem16/32, imm8 (signed ext.) TEST mreg8, reg8 TEST mem8, reg8 TEST mreg16/32, reg16/32 TEST mem16/32, reg16/32 TEST AL, imm8 TEST EAX, imm16/32 TEST mreg8, imm8 TEST mem8, imm8 TEST mreg8, imm16/32

Chapter 4

Instruction Dispatch and Execution Timing

41

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 6.

Integer Instructions (continued)
First Second Byte Byte F7h 0Fh 0Fh 0Fh 0Fh 9Bh 0Fh 0Fh 0Fh 0Fh 86h 86h 87h 87h 90h 91h 92h 93h 94h 95h 96h 97h D7h 30h 30h 31h 31h 32h 32h 33h 33h 34h 35h 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx xx-xxx-xxx xx-xxx-xxx C0h C0h C1h C1h 11-100-xxx mm-100-xxx 11-101-xxx mm-101-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 00h 00h 00h 00h modR/M Byte mm-000-xx 11-100-xxx mm-100-xxx 11-101-xxx mm-101-xxx Decode Type long vector vector vector vector vector vector vector vector vector vector vector vector vector short long long long long long long long vector short long short long short short short short short short alux load, alux, store alu load, alu, store alux load, alux alu load, alu alux alu limm alu, alu, alu alu, alu, alu alu, alu, alu alu, alu, alu alu, alu, alu alu, alu, alu alu, alu, alu RISC86® Opcodes load, alu

Instruction Mnemonic TEST mem8, imm16/32 VERR mreg16 VERR mem16 VERW mreg16 VERW mem16 WAIT XADD mreg8, reg8 XADD mem8, reg8 XADD mreg16/32, reg16/32 XADD mem16/32, reg16/32 XCHG reg8, mreg8 XCHG reg8, mem8 XCHG reg16/32, mreg16/32 XCHG reg16/32, mem16/32 XCHG EAX, EAX XCHG EAX, ECX XCHG EAX, EDX XCHG EAX, EBX XCHG EAX, ESP XCHG EAX, EBP XCHG EAX, ESI XCHG EAX, EDI XLAT XOR mreg8, reg8 XOR mem8, reg8 XOR mreg16/32, reg16/32 XOR mem16/32, reg16/32 XOR reg8, mreg8 XOR reg8, mem8 XOR reg16/32, mreg16/32 XOR reg16/32, mem16/32 XOR AL, imm8 XOR EAX, imm16/32

42

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 6.

Integer Instructions (continued)
First Second Byte Byte 80h 80h 81h 81h 83h 83h modR/M Byte 11-110-xxx mm-110-xxx 11-110-xxx mm-110-xxx 11-110-xxx mm-110-xxx Decode Type short long short long short long RISC86® Opcodes alux load, alux, store alu load, alu, store alux load, alux, store

Instruction Mnemonic XOR mreg8, imm8 XOR mem8, imm8 XOR mreg16/32, imm16/32 XOR mem16/32, imm16/32 XOR mreg16/32, imm8 (signed ext.) XOR mem16/32, imm8 (signed ext.)

Table 7.

MMX™ Instructions
Instruction Mnemonic Prefix First Byte(s) Byte 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 77h 6Eh 6Eh 7Eh 7Eh 6Fh 6Fh 7Fh 7Fh 6Bh 6Bh 63h 64h 67h 67h FCh FCh FEh FEh ECh ECh EDh 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx modR/M Byte Decode Type vector short short short short short short short short short short short short short short short short short short short short short store, mload mload mstore, load mstore meu mload meu mstore meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu RISC86® Opcodes

EMMS MOVD mmreg, mreg32 MOVD mmreg, mem32 MOVD mreg32, mmreg MOVD mem32, mmreg MOVQ mmreg1, mmreg2 MOVQ mmreg, mem64 MOVQ mmreg1, mmreg2 MOVQ mem64, mmreg PACKSSDW mmreg1, mmreg2 PACKSSDW mmreg, mem64 PACKSSWB mmreg1, mmreg2 PACKSSWB mmreg, mem64 PACKUSWB mmreg1, mmreg2 PACKUSWB mmreg, mem64 PADDB mmreg1, mmreg2 PADDB mmreg, mem64 PADDD mmreg1, mmreg2 PADDD mmreg, mem64 PADDSB mmreg1, mmreg2 PADDSB mmreg, mem64 PADDSW mmreg1, mmreg2

Chapter 4

Instruction Dispatch and Execution Timing

43

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 7.

MMX™ Instructions (continued)
Instruction Mnemonic Prefix First Byte(s) Byte 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh EDh DCh DCh DDh DDh FDh FDh DBh DBh DFh DFh 74h 74h 76h 76h 75h 75h 64h 64h 66h 66h 65h 65h F5h F5h E5h E5h D5h D5h EBh EBh F1h F1h modR/M Byte mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx 11-xxx-xxx Decode Type short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short RISC86® Opcodes mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu

PADDSW mmreg, mem64 PADDUSB mmreg1, mmreg2 PADDUSB mmreg, mem64 PADDUSW mmreg1, mmreg2 PADDUSW mmreg, mem64 PADDW mmreg1, mmreg2 PADDW mmreg, mem64 PAND mmreg1, mmreg2 PAND mmreg, mem64 PANDN mmreg1, mmreg2 PANDN mmreg, mem64 PCMPEQB mmreg1, mmreg2 PCMPEQB mmreg, mem64 PCMPEQD mmreg1, mmreg2 PCMPEQD mmreg, mem64 PCMPEQW mmreg1, mmreg2 PCMPEQW mmreg, mem64 PCMPGTB mmreg1, mmreg2 PCMPGTB mmreg, mem64 PCMPGTD mmreg1, mmreg2 PCMPGTD mmreg, mem64 PCMPGTW mmreg1, mmreg2 PCMPGTW mmreg, mem64 PMADDWD mmreg1, mmreg2 PMADDWD mmreg, mem64 PMULHW mmreg1, mmreg2 PMULHW mmreg, mem64 PMULLW mmreg1, mmreg2 PMULLW mmreg, mem64 POR mmreg1, mmreg2 POR mmreg, mem64 PSLLW mmreg1, mmreg2 PSLLW mmreg, mem64

44

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 7.

MMX™ Instructions (continued)
Instruction Mnemonic Prefix First Byte(s) Byte 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 71h F2h F2h 72h F3h F3h 73h E1h E1h 71h E2h E2h 72h D1h D1h 71h D2h D2h 72h D3h D3h 73h F8h F8h FAh FAh E8h E8h E9h E9h D8h D8h D9h modR/M Byte 11-110-xxx 11-xxx-xxx 11-xxx-xxx 11-110-xxx 11-xxx-xxx 11-xxx-xxx 11-110-xxx 11-xxx-xxx 11-xxx-xxx 11-100-xxx 11-xxx-xxx 11-xxx-xxx 11-100-xxx 11-xxx-xxx 11-xxx-xxx 11-010-xxx 11-xxx-xxx 11-xxx-xxx 11-010-xxx 11-xxx-xxx 11-xxx-xxx 11-010-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx Decode Type short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short RISC86® Opcodes meu meu meu meu meu meu meu meu meu meu meu meu meu meu meu meu meu meu meu meu meu meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu

PSLLW mmreg, imm8 PSLLD mmreg1, mmreg2 PSLLD mmreg, mem64 PSLLD mmreg, imm8 PSLLQ mmreg1, mmreg2 PSLLQ mmreg, mem64 PSLLQ mmreg, imm8 PSRAW mmreg1, mmreg2 PSRAW mmreg, mem64 PSRAW mmreg, imm8 PSRAD mmreg1, mmreg2 PSRAD mmreg, mem64 PSRAD mmreg, imm8 PSRLW mmreg1, mmreg2 PSRLW mmreg, mem64 PSRLW mmreg, imm8 PSRLD mmreg1, mmreg2 PSRLD mmreg, mem64 PSRLD mmreg, imm8 PSRLQ mmreg1, mmreg2 PSRLQ mmreg, mem64 PSRLQ mmreg, imm8 PSUBB mmreg1, mmreg2 PSUBB mmreg, mem64 PSUBD mmreg1, mmreg2 PSUBD mmreg, mem64 PSUBSB mmreg1, mmreg2 PSUBSB mmreg, mem64 PSUBSW mmreg1, mmreg2 PSUBSW mmreg, mem64 PSUBUSB mmreg1, mmreg2 PSUBUSB mmreg, mem64 PSUBUSW mmreg1, mmreg2

Chapter 4

Instruction Dispatch and Execution Timing

45

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 7.

MMX™ Instructions (continued)
Instruction Mnemonic Prefix First Byte(s) Byte 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh 0Fh D9h F9h F9h 68h 68h 69h 69h 6Ah 6Ah 60h 60h 61h 61h 62h 62h EFh EFh modR/M Byte mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Decode Type short short short short short short short short short short short short short short short short short RISC86® Opcodes mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu meu mload, meu

PSUBUSW mmreg, mem64 PSUBW mmreg1, mmreg2 PSUBW mmreg, mem64 PUNPCKHBW mmreg1, mmreg2 PUNPCKHBW mmreg, mem64 PUNPCKHWD mmreg1, mmreg2 PUNPCKHWD mmreg, mem64 PUNPCKHDQ mmreg1, mmreg2 PUNPCKHDQ mmreg, mem64 PUNPCKLBW mmreg1, mmreg2 PUNPCKLBW mmreg, mem64 PUNPCKLWD mmreg1, mmreg2 PUNPCKLWD mmreg, mem64 PUNPCKLDQ mmreg1, mmreg2 PUNPCKLDQ mmreg, mem64 PXOR mmreg1, mmreg2 PXOR mmreg, mem64

Table 8.

Floating-Point Instructions
First Second Byte Byte D9h D9h D8h D8h DCh DCh DEh DFh DFh D9h E0h F0h F1h 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-100-xxx mm-110-xxx modR/M Byte Decode Type short short short short short short short vector vector short float RISC86® Opcodes float float float fload, float float fload, float float

Instruction Mnemonic F2XM1 FABS FADD ST(0), ST(i) FADD ST(0), mem32real FADD ST(i), ST(0) FADD ST(0), mem64real FADDP ST(i), ST(0) FBLD FBSTP FCHS

46

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 8.

Floating-Point Instructions (continued)
First Second Byte Byte DBh D8h D8h DCh D8h D8h DCh DEh D9h D9h D8h D8h D8h DCh DCh DCh D8h DCh DEh D8h DCh D8h DCh DEh DDh DAh DEh DAh DEh DAh DEh DAh DEh FFh F6h 11-110-xxx 11-110-xxx 11-110-xxx 11-111-xxx 11-111-xxx 11-111-xxx mm-110-xxx mm-110-xxx 11-111-xxx 11-110-xxx 11-111-xxx mm-111-xxx mm-111-xxx 11-110-xxx 11-000-xxx mm-000-xxx mm-000-xxx mm-010-xxx mm-010-xxx mm-011-xxx mm-011-xxx mm-110-xxx mm-110-xxx E2h 11-010-xxx mm-010-xxx mm-010-xxx 11-011-xxx mm-011-xxx mm-011-xxx 11-011-001 modR/M Byte Decode Type vector short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short float fload, float fload, float float fload, float fload, float float float float float float float float float float fload, float fload, float float float float fload, float fload, float float float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float RISC86® Opcodes

Instruction Mnemonic FCLEX FCOM ST(0), ST(i) FCOM ST(0), mem32real FCOM ST(0), mem64real FCOMP ST(0), ST(i) FCOMP ST(0), mem32real FCOMP ST(0), mem64real FCOMPP FCOS ST(0) FDECSTP FDIV ST(0), ST(i) (single precision) FDIV ST(0), ST(i) (double precision) FDIV ST(0), ST(i) (extended precision) FDIV ST(i), ST(0) (single precision) FDIV ST(i), ST(0) (double precision) FDIV ST(i), ST(0) (extended precision) FDIV ST(0), mem32real FDIV ST(0), mem64real FDIVP ST(0), ST(i) FDIVR ST(0), ST(i) FDIVR ST(I), ST(0) FDIVR ST(0), mem32real FDIVR ST(0), mem64real FDIVRP ST(i), ST(0) FFREE ST(I) FIADD ST(0), mem32int FIADD ST(0), mem16int FICOM ST(0), mem32int FICOM ST(0), mem16int FICOMP ST(0), mem32int FICOMP ST(0), mem16int FIDIV ST(0), mem32int FIDIV ST(0), mem16int

Chapter 4

Instruction Dispatch and Execution Timing

47

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 8.

Floating-Point Instructions (continued)
First Second Byte Byte DAh DEh DFh DBh DFh DAh DEh D9h DBh DFh DBh DFh DBh DFh DAh DEh DAh DEh D9h D9h DDh DBh D9h D9h D9h D9h D9h D9h D9h D9h D9h D8h DCh EAh E9h ECh EDh EBh EEh 11-001-xxx 11-001-xxx E8h mm-101-xxx mm-100-xxx F7h E3h mm-010-xxx mm-010-xxx mm-011-xxx mm-011-xxx mm-111-xxx mm-100-xxx mm-100-xxx mm-101-xxx mm-101-xxx 11-000-xxx mm-000-xxx mm-000-xxx mm-101-xxx modR/M Byte mm-111-xxx mm-111-xxx mm-000-xxx mm-000-xxx mm-101-xxx mm-001-xxx mm-001-xxx Decode Type short short short short short short short short vector short short short short short short short short short short short short vector short vector short short short short short short short short short fload, float float float float float float float float float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float RISC86® Opcodes fload, float fload, float fload, float fload, float fload, float fload, float fload, float

Instruction Mnemonic FIDIVR ST(0), mem32int FIDIVR ST(0), mem16int FILD mem16int FILD mem32int FILD mem64int FIMUL ST(0), mem32int FIMUL ST(0), mem16int FINCSTP FINIT FIST mem16int FIST mem32int FISTP mem16int FISTP mem32int FISTP mem64int FISUB ST(0), mem32int FISUB ST(0), mem16int FISUBR ST(0), mem32int FISUBR ST(0), mem16int FLD ST(i) FLD mem32real FLD mem64real FLD mem80real FLD1 FLDCW FLDENV FLDL2E FLDL2T FLDLG2 FLDLN2 FLDPI FLDZ FMUL ST(0), ST(i) FMUL ST(i), ST(0)

48

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 8.

Floating-Point Instructions (continued)
First Second Byte Byte D8h DCh DEh D9h D9h D9h D9h D9h D9h DDh DDh D9h D9h D9h D9h D9h D9h D9h DDh DDh D9h D9h D9h DDh D9h DDh DFh DDh D8h DCh D8h DCh DEh E0h mm-111-xxx mm-100-xxx mm-100-xxx 11-100-xxx 11-101-xxx 11-101-xxx FDh FEh FBh FAh FAh FAh mm-010-xxx mm-010-xxx 11-010xxx mm-111-xxx mm-110-xxx mm-011-xxx mm-011-xxx mm-111-xxx 11-011-xxx D0h F2h F3h F8h F5h FCh mm-100-xxx mm-110-xxx modR/M Byte mm-001-xxx mm-001-xxx 11-001-xxx Decode Type short short short short vector short short short short vector vector short short vector short short short short short short vector vector short short vector short vector vector short short short short short fload, float fload, float float float float float fstore fstore float float float fstore fstore fstore float float float float float float RISC86® Opcodes fload, float fload, float float float

Instruction Mnemonic FMUL ST(0), mem32real FMUL ST(0), mem64real FMULP ST(0), ST(i) FNOP FPTAN FPATAN FPREM FPREM1 FRNDINT FRSTOR FSAVE FSCALE FSIN FSINCOS FSQRT (single precision) FSQRT (double precision) FSQRT (extended precision) FST mem32real FST mem64real FST ST(i) FSTCW FSTENV FSTP mem32real FSTP mem64real FSTP mem80real FSTP ST(i) FSTSW AX FSTSW mem16 FSUB ST(0), mem32real FSUB ST(0), mem64real FSUB ST(0), ST(i) FSUB ST(i), ST(0) FSUBP ST(0), ST(I)

Chapter 4

Instruction Dispatch and Execution Timing

49

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 8.

Floating-Point Instructions (continued)
First Second Byte Byte D8h DCh D8h DCh DEh D9h DDh DDh DAh D9h D9h D9h D9h D9h 9Bh F4h F1h F9h E9h E5h 11-001-xxx E4h 11-100-xxx 11-101-xxx modR/M Byte mm-101-xxx mm-101-xxx 11-100-xxx 11-101-xxx 11-100-xxx Decode Type short short short short short short short short short short short vector short short vector float float RISC86® Opcodes fload, float fload, float float float float float float float float float float

Instruction Mnemonic FSUBR ST(0), mem32real FSUBR ST(0), mem64real FSUBR ST(0), ST(I) FSUBR ST(i), ST(0) FSUBRP ST(i), ST(0) FTST FUCOM FUCOMP FUCOMPP FXAM FXCH FXTRACT FYL2X FYL2XP1 FWAIT

50

Instruction Dispatch and Execution Timing

Chapter 4

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

5
x86 Optimization Coding Guidelines
General x86 Optimization Techniques
This section describes general code optimization techniques specific to superscalar processors (that is, techniques common to the AMD-K6 MMX enhanced processor, AMD-K5™ processor, and Pentium-family processors). In general, all optimization techniques used for the AMD-K5 processor, Pentium, and Pentium Pro processors either improve the performance of the AMD-K6 processor or are not required and have no effect (due to fewer coding restrictions with the AMD-K6 processor). Short Forms—Use shorter forms of instructions to increase the effective number of instructions that can be examined for decoding at any one time. Use 8-bit displacements and jump offsets where possible. Simple Instructions—Use simple instructions with hardwired decode (pairable, short, or fast) because they perform more efficiently. This includes “register ← register op memory” as well as “register←register op register” forms of instructions. Dependencies—Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.

Chapter 5

x86 Optimization Coding Guidelines

51

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Memory Operands — Instructions that operate on data in memory (load/op/store) can inhibit parallelism. The use of separate move and ALU instructions allows better code scheduling for independent operations. However, if there are no opportunities for parallel execution, use the load/op/store forms to reduce the number of register spills (storing values in memory to free registers for other uses). Register Operands — Maintain frequently used values in registers rather than in memory. Stack References—Use ESP for stack references so that EBP remains available. Stack Allocation —When allocating space for local variables and/or outgoing parameters within a procedure, adjust the stack pointer and use moves rather than pushes. This method of allocation allows random access to the outgoing parameters so that they can be set up when they are calculated instead of being held somewhere else until the procedure call. This method also reduces ESP dependencies and uses fewer execution resources. Data Embedding — When data is embedded in the code segment, align it in separate cache blocks from nearby code. This technique avoids some overhead when maintaining coherency between the instruction and data caches. Loops— Unroll loops to get more parallelism and reduce loop overhead, even with branch prediction. Inline small routines to avoid procedure-call overhead. For both techniques, however, consider the cost of possible increased register usage, which might add load/store instructions for register spilling. Code Alignment—Aligning at 0-mod-16 improves performance (ideally at 0-mod-32). However, there is a trade-off between execution speed and code size.

52

x86 Optimization Coding Guidelines

Chapter 5

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

General AMD-K6™ Processor x86 Coding Optimizations
This section describes general code optimization techniques specific to the AMD-K6 MMX enhanced processor. Use short-decodeable instructions — To increase decode bandwidth and minimize the number of RISC86 operations per x86 instruction, use short-decodeable x86 instructions. See “Instruction Dispatch and Execution Timing” on page 23 for the list of short-decodeable instructions. Pair short-decodeable instructions—Two short-decodeable x86 instructions can be decoded per clock, using the full decode bandwidth of the AMD-K6 processor. Avoid using complex instructions — The more complex and uncommon instructions are vector decoded and can generate a larger ratio of RISC86 operations per x86 instruction than short-decodeable or long-decodeable instructions. 0Fh prefix usage—0Fh does not count as a prefix. Avoid long instruction length — Use x86 instructions that are less than eight bytes in length. An x86 instruction that is longer than seven bytes cannot be short-decoded. Align branch targets—Keep branch targets away from the end of a cache line. 16-byte alignment is preferred for branch targets, while 32-byte alignment is ideal. Use read-modify-write instructions over discrete equivalent— No advantage is gained by splitting read-modify-write instructions into a load-execute-store instruction group. Both read-modif y-w rite instructions and load-execute-store instruction groups decode and execute in one cycle but read-modify-write instructions promote better code density. Move rarely used code and data to separate pages— Placing code, such as exception handlers, in separate pages and data, such as error text messages, in separate pages maximizes the use of the TLBs and prevents pollution with rarely used items.

Chapter 5

x86 Optimization Coding Guidelines

53

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Avoid multiple and accumulated prefixes — In order to accomplish an instruction decode, the decoders require sufficient predecode information. When an instruction has multiple prefixes and this cannot be deduced by the decoders (due to a lack of data in the instruction decode buffer), the first decoder retires and accumulates one prefix at a time until the instruction is completely decoded. Table 9 shows when prefixes are accumulated and decoding is serialized. Table 9.
Instruction Instruction Instruction Prefix PrefixA Instruction Prefix

Decode Accumulation and Serialization
Decoder #2 Results Single instruction decoded Dual instruction decode Single instruction decode and prefix is accumulated

Decode #1

Instruction No prefix accumulation and single instruction (modified by Prefix) is decoded PrefixB Accumulate PrefixA and cancel decode of the second prefix If a prefix has already been accumulated in the previous decode cycle, accumulate PrefixB and cancel instruction decode, wait for next decode cycle to decode the instruction

PrefixB

Instruction

Avoid mixing code size types — Size prefixes that affect the length of an instruction can sometimes inhibit dual decoding. Always pair CALL and RETURN—If CALLs and RETs are not paired, the return address stack gets out of synchronization, increasing the latency of returns and decreasing performance. Exploit parallel execution of integer and floating-point multiplies — The AMD-K6 MMX enhanced processor allows simultaneous integer and floating-point multiplies using separate, low-latency multipliers. Avoid more than 16 levels of nesting in subroutines—More than 16 levels of nested subroutine calls overflow the return address stack, leading to lower performance. While this is not a problem for most code, recursive subroutines might easily exceed 16 levels of subroutine calls. If the recursive subroutine is tail recursive, it can usually be mechanically transformed into an iterative version, which leads to increased performance. 54 x86 Optimization Coding Guidelines Chapter 5

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Place frequently used stack data within 128 bytes of the EBP— The statically most-referenced data items in a function’s stack frame should be located from –128 to +127 bytes from EBP. This technique improves code density by enabling the use of an 8-bit sign-extended displacement instead of a 32-bit displacement. Avoid superset dependencies — Using the larger form of a register immediate after an instruction uses the smaller form creates a superset dependency and prevents parallel execution. For example, avoid the following type of code:
Avoid OR AND AH,055h EAX,1555555h

Avoid excessive loop unrolling or code inlining—Excessive loop unrolling or code inlining increases code size and reduces locality, which leads to lower cache hit rates and reduced performance. Avoid splitting a 16-bit memory access in 32-bit code — No advantage is gained by splitting a 16-bit memory access in 32-bit code into two byte-sized accesses. This technique avoids the operand size override. Avoid data dependent branches around a single instruction— Data dependent branches acting upon basically random data cause the branch prediction logic to mispredict the branch about 50% of the time. Design branch-free alternative code sequences that can replace straight forward code with data dependent branches. The effect is shorter average execution time. The following examples illustrate this concept:
s

Signed integer ABS function (x = labs(x)) Static Latency: 4 cycles
MOV MOV SAR XOR SUB MOV ECX, EBX, ECX, EBX, EBX, [x], [x] ECX 31 ECX ECX EBX ;load value

;1’s complement if x<0, else don’t modify ;2’s complement if x<0, else don’t modify ;save labs result

Chapter 5

x86 Optimization Coding Guidelines

55

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

s

Unsigned integer min function (z = x < y ? x : y) Static Latency: 4 cycles
MOV MOV SUB SBB AND ADD MOV EAX, EBX, EAX, ECX, ECX, ECX, [z], [x] [y] EBX ECX EAX EBX ECX ;load x value ;load y value ;set carry flag if y is greater than x ;get borrow out from previous SUB ;if x > y, ECX = x–y, else 0 ;if x > y, return x–y+y = x, else y ;save min (x,y)

s

Hexadecimal to ASCII conversion (y=x < 10 ? x + 0x30: x + 0x41) Static Latency: 4 cycles
MOV CMP SBB DAS MOV AL, [x] AL, 10 AL, 69h [y],AL ;load x value ;if x is less than 10, set carry flag ;0..9 –> 96h, Ah..Fh –> A1h...A6h ;0..9: subtract 66h, Ah..Fh: Subtract 60h ;save conversion in y

The [ESI] addressing mode—When using [ESI] as an indirect memory addressing mode, explicitly code [ESI] to be [ESI+0]. Doing so improves decode bandwidth. For an example, see “Floating-Point Code Sample” on page 64.

56

x86 Optimization Coding Guidelines

Chapter 5

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

AMD-K6™ Processor Integer x86 Coding Optimizations
This section describes integer code optimization techniques specific to the AMD-K6 MMX enhanced processor. Neutral code filler — Use the XCHG EAX, EAX or NOP instruction when aligning instructions. XCHG EAX, EAX consumes decode slots but requires no execution resources. Essentially, the scheduler absorbs the equivalent RISC86 operation without requiring any of the execution units. Inline REP String with low counts — Expand REP String instructio ns into e quival ent seq uence s of sim ple x 86 instructions. This technique eliminates the setup overhead of these instructions and increases instruction throughput. Use ADD reg, reg instead of SHL reg, 1 — This optimization technique allows the scheduler to use either of the two integer adders rather than the single shifter and effectively increases overall throughput. The only difference between these two instructions is the setting of the AF flag. Access 16-bit memory data using the MOVSX and MOVZX instructions — The AMD-K6 processor has direct hardware support for extending word size operands to doubleword length. Use load-execute integer instructions — Most load-execute integer instructions are short-decodeable and can be decoded at the rate of two per cycle. Splitting a load-execute instruction into two separate instructions—a load instruction and a reg, reg instruction — reduces decoding bandwidth and increases register pressure. Use AL, AX, and EAX to improve code density—In many cases, instructions using AL and EAX can be encoded in one less byte than using a general purpose register. For example, ADD AX, 0x5555 should be encoded 05 55 55 and not 81 C0 55 55. Clear registers using MOV reg, 0 instead of XOR reg, reg— Executing XOR reg, reg requires additional overhead due to register dependency checking and flag generation. Using MOV reg, 0 produces a limm (load immediate) RISC86 operation that is completed when placed in the scheduler and does not consume execution resources. Chapter 5 x86 Optimization Coding Guidelines 57

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

U s e 8 -b i t s i g n -ex t e n d e d i m m e d i a t e s — U s i n g 8 -b i t sign-extended immediates improves code density with no negative effects on the AMD-K6 processor. For example, ADD BX, –55 should be encoded 83 C3 FB and not 81 C3 FF FB. Use 8-bit sign-extended displacements for conditional branches—Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative effects on the AMD-K6 processor. Use integer multiply over shift-add sequences when it is advantageous — The AMD-K6 MMX enhanced processor features a low-latency integer multiplier. Therefore, almost any shift-add sequences can have higher latency than MUL or IMUL ins t ru ctio ns . An ex ce pti on is a tr ivi al ca se invo lv ing multiplication by powers of two by means of left shifts. In general, replacements should be made if the shift-add sequences have a latency greater than or equal to 3 clocks. Carefully choose the best method for pushing memory data — To reduce register pressure and code dependency, use PUSH [mem] rather than MOV EAX, [mem], PUSH EAX. Balance the use of CWD, CBW, CDQ, and CWDE — These instructions require special attention to avoid either decreased decode or execution bandwidth. The following code illustrates the possible trade-offs:
s

The following code replacement trades decode bandwidth (CWD is vector decoded, but with only one RISC86 operation) with execution bandwidth (SAR requires two RISC86 operations, including a shift):
Replace:CWD With: MOV SAR DX,AX DX,15

s

Th e fo l l ow i n g c o d e re p l a c e m e n t i m p rove s d e c o d e bandwidth (CBW is vector decoded while MOVSX is short decoded):
Replace:CBW With: MOVSX AX,AL

s

The following code replacement trades decode bandwidth (CDQ is vector deco de d, but with o nly t wo RI SC86 operations) with execution bandwidth (SAR requires two RISC86 operations, including a shifter):
Replace:CDQ With: MOV SAR EDX,EAX EDX,31

58

x86 Optimization Coding Guidelines

Chapter 5

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

s

Th e fo l l ow i n g c o d e re p l a c e m e n t i m p rove s d e c o d e bandwidth (CWDE is vector decoded while MOVSX is short decoded):
Replace:CWDE With: MOVSX EAX, AX

Replace integer division by constants with multiplication by the reciprocal—This is a commonly used optimization on RISC CPUs. Because the AMD-K6 processor has an extremely fast integer multiply (two cycles) and the integer division delivers only two bits of quotient per cycle (approximately 18 cycles for 32-bit divides), the equivalent code is much faster. The following examples illustrate the use of integer division by constants:
s

Unsigned division by 10 using multiplication by reciprocal Static Latency: 5 cycles
; IN: EAX = dividend ; OUT:EDX = quotient MOV EDX, 0CCCCCCCDh MUL EDX SHR EDX, 3

;0.1 * 2^32 * 8 rounded up ;divide by 2^32 * 8

s

Unsigned division by 3 using multiplication by reciprocal Static Latency: 5 cycles
; IN: EAX = dividend ; OUT:EDX = quotient MOV EDX, 0AAAAAAABh MUL EDX SHR EDX, 1

;1/3 * 2^32 * 2 rounded up ;divide by 2^32 * 2

s

Signed division by 2 Static Latency: 3 cycles
; IN: EAX = dividend ; OUT:EAX = quotient CMP EAX, 800000000h SBB EAX, –1 SAR EAX, 1

;CY = 1, if dividend >=0 ;increment dividend if it is <0 ;perform a right shift

s

Signed division by 2^n Static Latency: 5 cycles
; IN: EAX = dividend ; OUT:EAX = quotient MOV EDX, EAX SAR EDX, 31 AND EDX, (2^n–1) ADD EAX, EDX SAR EAX, (n)

;sign extend into EDX ;EDX = 0xFFFFFFFF if dividend < 0 ;mask correction (use divisor –1) ;apply correction if necessary ;perform right shift by log2 (divisor)

Chapter 5

x86 Optimization Coding Guidelines

59

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

s

Signed division by –2 Static Latency: 4 cycles
; IN: EAX = dividend ; OUT:EAX = quotient CMP EAX, 800000000h SBB EAX, –1 SAR EAX, 1 NEG EAX

;CY = 1, if dividend >=0 ;increment dividend if it is <0 ;perform right shift ;use (x/–2) = = – (x/2)

s

Signed division by –(2^n) Static Latency: 6 cycles
; IN: EAX = dividend ; OUT:EAX = quotient MOV EDX, EAX SAR EDX, 31 AND EDX, (2^n–1) ADD EAX, EDX SAR EAX, (n) NEG EAX

;sign extend into EDX ;EDX = 0xFFFFFFFF if dividend < 0 ;mask correction (–divisor –1) ;apply correction if necessary ;right shift by log2(–divisor) ;use (x/–(2^n)) = = (– (x/2^n))

s

Remainder of signed integer 2 or (–2) Static Latency: 4 cycles
; IN: EAX = dividend ; OUT:EDX = quotient MOV EDX, EAX SAR EDX, 31 AND EDX, 1 XOR EAX, EDX SUB EAX, EDX MOV [quotient], EAX

;sign extend into EDX ;EDX = 0xFFFFFFFF if dividend < 0 ;compute remainder ;negate remainder if ;dividend was < 0

s

Remainder of signed integer (2^n) or (–(2^n))) Static Latency: 6 cycles
; IN: EAX = dividend ; OUT:EDX = quotient MOV EDX, EAX SAR EDX, 31 AND EDX, (2^n–1) ADD EAX, EDX AND EAX, (2^n–1) SUB EAX, EDX MOV [quotient], EAX

;sign extend into EDX ;EDX = 0xFFFFFFFF if dividend < 0 ;mask correction (abs(divison)–1) ;apply pre-correction ;mask out remainder (abs(divison)–1) ;apply pre-correction if necessary

60

x86 Optimization Coding Guidelines

Chapter 5

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

AMD-K6™ Processor Multimedia Coding Optimizations
This se cti on de scr ibe s multime dia co de o ptimiza tio n techniques specific to the AMD-K6 MMX enhanced processor. Pair MMX instructions with short-decodeable instructions— M M X i n s t r u c t i o n s a re s h o r t -d e c o d e a b l e a n d c a n b e simultaneously decoded with any other short-decodeable instruction. This technique requires that MMX instructions be arranged as the first of a pair of short-decodeable instructions. Avoid using MMX registers to move double-precision floating-point data—Although using an MMX register to move floating-point data appears fast, using MMX registers requires the use of the EMMS instruction when switching from MMX to floating-point instructions. Avoid switching between MMX and FPU instructions—Because the MMX registers are mapped on to the floating-point stack, the EMMS instruction must be executed after using MMX code and prior to the use of the floating-point unit. Group or partition MMX code away from FPU code so that the use of the EMMS instruction is minimized. Also, the actual penalty from the use of the EMMS instruction occurs not at the time of execution but when the first floating-point instruction is encountered.

Chapter 5

x86 Optimization Coding Guidelines

61

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

AMD-K6™ Processor Floating-Point Coding Optimizations
This section describes floating-point code optimization techniques specific to the AMD-K6 MMX enhanced processor. Avoid vector decoded floating-point instructions — Most floating-point instructions are short decodeable. A few of the less common instructions are vector decoded. Additionally if a short decodeable instruction straddles a cache line, it will become vector decoded. This adds unnecessary overheard that can be avoided by inserting NOPs in strategic locations within the code. Pair floating-point with short-decodeable instructions— Most floating-point instructions (also known as ESC instructions) are short-decodeable and are limited to the first decoder. The short-decodeable floating-point instructions can be paired with other short-decodeable instructions. This technique requires that floating-point instructions be arranged as the first of a pair of short-decodeable instructions. Avoid FXCH usage — Pairing FXCH with other floating-point instructions does not increase performance. Avoid switching between MMX and FPU instructions—Because the MMX registers are mapped on to the floating-point stack, the EMMS instruction must be executed after using MMX code and prior to the use of the floating-point unit. Group or partition MMX code away from FPU code so that the use of the EMMS instruction is minimized. Also, the actual penalty from the use of the EMMS instruction occurs not at the time of execution but when the first floating-point instruction is encountered. Avoid using MMX registers to move double-precision floating-point data—Although using an MMX register to move floating-point data appears fast, using MMX registers requires the use of the EMMS instruction when switching from MMX to floating-point instructions.

62

x86 Optimization Coding Guidelines

Chapter 5

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Exploit parallel execution of integer and floating-point multiplies — The AMD-K6 MMX enhanced processor allows simultaneous integer and floating-point multiplies using separate, low-latency multipliers. Avoid splitting floating-point instructions with integer instructions—No penalty is incurred when using arithmetic or comparison floating-point instructions that use integer o p e ra n d s , s u ch a s F I A D D o r F I C O M . S p l i t t i n g t h e s e instructions into discrete load and floating-point instructions decreases performance. Replace FDIV instructions with FMUL where possible— The FMUL instruction latency is much less than the FDIV instruction. When possible, replace floating-point divisions with floating-point multiplication of the reciprocal. Use integer instructions to move floating-point data — A floating-point load and store instruction pair requires a minimum of four cycles to complete (two-cycle latency for each instruction). The AMD-K6 processor can perform one integer load and one store per cycle. Therefore, moving single-precision data requires one cycle, moving double-precision data requires two cycles, and moving extended-precision data only requires three cycles when using integer loads and stores. The example below shows how to translate the C-style code when moving double-precision floating-point data:
double temp1, temp2; temp2 = temp1; Avoid:FLD FSTP QWORD PTR [temp1];Use: QWORD PTR [temp2]; MOV MOV MOV MOV EAX, [temp1]; [temp2], EAX; EAX, [temp1+4]; [temp2+4], EAX;

Scheduling of floating-point instructions is unnecessary—The A M D -K 6 p ro c e s s o r h a s a l ow -l a t e n cy, n o n -p i p e l i n e d floating-point execution unit. Therefore, no scheduling between floating-point instructions is necessary. Use load-execute floating-point instructions — The use of a load-execute instruction (such as, FADD DWORD PRT [mem]) is preferable to the use of a load floating-point instruction followed by a FP reg, reg instruction. For the AMD-K6 processor, load-execute arithmetic and compare instructions Chapter 5 x86 Optimization Coding Guidelines 63

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

are identical in throughput to FP reg, reg instructions. Because common floating-point instructions execute in two cycles each and the floating-point unit is not pipelined, code executes more efficiently if the minimum possible number of floating-point instructions are generated.

Floating-Point Code Sample
The following code sample uses three important rules to optimize this matrix multiply routine. The first rule is to force [ESI] to be [ESI+0]. The second rule is the insertion of NOPs to avoid cache line straddles. The third rule used is avoiding vector decoded instructions.
MATMUL db FMUL FLD FMUL FLD FMUL FLD FMUL FADDP FADDP FADDP FSTP NOP db FMUL FLD FMUL FLD NOP FMUL FLD FMUL FADDP FADDP FADDP NOP FSTP db FMUL FLD FMUL NOP FLD MACRO 0d9h, 046h, 00h DWORD PTR [EBX] DWORD PTR [ESI+4] DWORD PTR [EBX+4] DWORD PTR [ESI+8] DWORD PTR [EBX+8] DWORD PTR [ESI+12] DWORD PTR [EBX+12] ST(3), ST ST(2), ST ST(1), ST DWORD PTR [EDI] ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; FLD DWORD PTR [ESI+00] ;;x a11*x y a21*y z a31*z w a41*w a41*w+a31*z a41*w+a31*z+a21*y a41*w+a31*z+a21*y+a11*x store rx make sure it does not straddle across a cache line FLD DWORD PTR [ESI+00] ;; x a12*x y a22*y z make sure it does not straddle across a cache line a32*z w a42*w a42*w+a32*z a42*w+a32*z+a22*y a42*w+a32*z+a22*y+a12*x make sure it does not straddle across a cache line store ry FLD DWORD PTR [ESI+00] ;; x a13*x y a23*y make sure it does not straddle across a cache line z

0d9h, DWORD DWORD DWORD DWORD

046h, 00h PTR [EBX+16] PTR [ESI+4] PTR [EBX+20] PTR [ESI+8]

DWORD PTR [EBX+24] DWORD PTR [ESI+12] DWORD PTR [EBX+28] ST(3), ST ST(2), ST ST(1), ST

DWORD 0d9h, DWORD DWORD DWORD

PTR [EDI+4] 046h, 00h PTR [EBX+32] PTR [ESI+4] PTR [EBX+36]

DWORD PTR [ESI+8]

64

x86 Optimization Coding Guidelines

Chapter 5

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

FMUL FLD FMUL FADDP FADDP FADDP FSTP db FMUL FLD FMUL FLD FMUL FLD FMUL FADDP NOP FADDP FADDP FSTP ENDM

DWORD PTR [EBX+40] DWORD PTR [ESI+12] DWORD PTR [EBX+44] ST(3), ST ST(2), ST ST(1), ST DWORD PTR [EDI+8] 0d9h, 046h, 00h DWORD PTR [EBX+48] DWORD PTR [ESI+4] DWORD PTR [EBX+52] DWORD PTR [ESI+8] DWORD PTR [EBX+56] DWORD PTR [ESI+12] DWORD PTR [EBX+60] ST(3), ST

ST(2), ST ST(1), ST DWORD PTR [EDI+12]

;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;; ;;

a33*z w a43*w a43*w+a33*z a43*w+a33*z+a23*y a43*w+a33*z+a23*y+a13*x store rz FLD DWORD PTR [ESI+00] ;; x a14*x y a24*y z a34*z w a44*w a44*w+a34*z make sure it does not straddle across a cache line a44*w+a34*z+a24*y a44*w+a34*z+a24*y+a14*x store rw

Chapter 5

x86 Optimization Coding Guidelines

65

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

66

x86 Optimization Coding Guidelines

Chapter 5

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

6
Considerations for Other Processors
The tables in this chapter contain information describing how AMD-K6 MMX enhanced processor-specific optimization techniques affect other processors, including the AMD-K5 processor. Table 10. Specific Optimizations and Guidelines for AMD-K6™ and AMD-K5™ Processors
AMD-K5 Processor Guideline/Event Jumps and Loops AMD-K5 Processor Details JCXZ requires 1 cycle (correctly predicted) and therefore is faster than a TEST/JZ. All forms of LOOP take 2 cycles (correctly predicted). Although there is only one shifter, certain shifts can be done using other execution units. For example, shift left 1 by adding a value to itself. Use LEA index scaling to shift left by 1, 2, or 3. Independent IMULs can be pipelined at one per cycle with 4-cycle latency. (MUL has the same latency, although the implicit AX usage of MUL prevents independent, parallel MUL operations.) Usage/Effect on AMD-K6 Processors Different AMD-K6 Processor Details JCXZ takes 2 cycles when taken and 7 cycles when not taken. LOOP takes 1 cycle. Shifts are short decodeable and converted to a single RISC86 shift operation that executes only in the Integer X unit. LEA is executed in the store unit. 2– or 3–cycle throughput and latency (3 cycles only if the upper half of the product is produced)

Shifts

Same

Multiplies

Different

Chapter 6

Considerations for Other Processors

67

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 10. Specific Optimizations and Guidelines for AMD-K6™ and AMD-K5™ Processors (continued)
AMD-K5 Processor Guideline/Event AMD-K5 Processor Details Load-balancing (that is, selecting instructions for parallel decode) is still important, but to a lesser extent than on the Pentium processor. In particular, arrange instructions to avoid execution-unit dispatching conflicts. Usage/Effect on AMD-K6 Processors AMD-K6 Processor Details

Dispatch Conflicts

Same

For byte operations, the high and low bytes of AX, BX, CX, and DX are effectively independent registers that can be Byte Operations operated on in parallel. For example, reading AL does not have a dependency on an outstanding write to AH. Floating-Point Top-of-Stack Bottleneck Move and Convert Indexed Addressing Instruction Prefixes Floating-Point Execution Parallelism The AMD-K5 processor has a pipelined floating-point unit. Greater parallelism can be achieved by using FXCH in parallel with floating-point operations to alleviate the top-of-stack bottleneck, as in the Pentium. MOVZX, MOVSX, CBW, CWDE, CWD, CDQ all take 1 cycle (2 cycles for memory-based input). There is no penalty for base + index addressing in the AMD-K5 processor. There is no penalty for instruction prefixes, including combinations such as segment-size and operand-size prefixes. This is particularly important for 16-bit code. The AMD-K5 processor permits integer operations (ALU, branch, load/store) in parallel with floating-point operations.

Same

Register dependency is checked on a byte boundary.

Not required

Loads and stores are performed in parallel with floating-point instructions. Zero and sign extension are short-decodeable with 1 cycle execution.

Same Same

Possible

A penalty can only occur during accumulated prefix decoding. In addition, the AMD-K6 processor allows two integer, a branch, a load, and a store. Branch targets should be placed on 0 mod 16 alignment for optimal performance.

Same

Performance can be sensitive to code alignment, especially in tight loops. Locating Branch Locating branch targets to the first 17 Targets bytes of the 32-byte cache line maximizes the opportunity for parallel execution at the target. NOPs The AMD-K5 processor executes NOPs (opcode 90h) at the rate of two per cycle. Adding NOPs is even more effective if they execute in parallel with existing code.

Optional

Same

NOPs are short-decodeable and consume decode bandwidth but no execution resources.

68

Considerations for Other Processors

Chapter 6

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 10. Specific Optimizations and Guidelines for AMD-K6™ and AMD-K5™ Processors (continued)
AMD-K5 Processor Guideline/Event Branch Prediction AMD-K5 Processor Details There are two branch prediction bits in a 32-byte instruction cache line. For effective branch prediction, code should be generated with one branch per 16-byte line half. BSF and BSR take 1 cycle (2 cycles for memory-based input), compared to the Pentium’s data-dependent 6 to 34 cycles. BT, BTS, BTR, and BTC take 1 cycle for register-based operands, and 2 or 3 cycles for memory-based operands with immediate bit-offset. Register-based bit-offset forms on the AMD-K5 processor take 5 cycles. Usage/Effect on AMD-K6 Processors AMD-K6 Processor Details

Not required

This optimization has a neutral effect on the AMD-K6 processor.

Bit Scan

Different

A multi-cycle operation, but faster than Pentium.

Bit Test

Different

Bit test latencies are similar to the Pentium.

Table 11. AMD-K6™ Processor Versus Pentium® Processor-Specific Optimizations and Guidelines
Pentium Guideline/Event Instruction Fetches Across Two Cache Lines Mispredicted Conditional Branch Executed in U pipe Mispredicted Conditional Branch Executed in V pipe Mispredicted Calls Mispredicted Unconditional Jumps FXCH Optimizing Index Versus Base Register No Penalty Pentium Effect Usage/Effect on AMD-K6 Processors Possible AMD-K6 Processor Details Decode penalty only if there is not sufficient information to decode at least one instruction. Mispredicted branches have a 1– to 4–cycle penalty. Mispredicted branches have a 1– to 4–cycle penalty.

3-cycle penalty

Different

4-cycle penalty 3-cycle penalty 3-cycle penalty Pairs with most FP instructions and effectively hides FP stack manipulations. 1-cycle penalty to calculate the effective address when an index register is used.

Different None None None

None

Chapter 6

Considerations for Other Processors

69

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 11. AMD-K6™ Processor Versus Pentium® Processor-Specific Optimizations and Guidelines
Pentium Guideline/Event Address Generation Interlock Due to Explicit Register Usage Address Generation Interlock Due to Implicit Register Usage Pentium Effect 1-clock penalty when instructions are not scheduled apart by at least one instruction. 1-clock penalty when instructions are not scheduled apart by at least one instruction. Usage/Effect on AMD-K6 Processors None AMD-K6 Processor Details However, it is best to schedule apart the dependency. However, it is best to schedule apart the dependency.

None None Same Possible None Yes None Similar Similar None

Instructions with an 1-cycle penalty Immediate Displacement Carry & Borrow Instructions Issue Limits Prefix Decode Penalty 0Fh Prefix Penalty MOVZX Acceleration Unpairability Due to Register Dependencies Shifts with Immediates Issue Limitations Floating-Point Ops Issue Limitation Issued to U pipe only 1-clock delay 1-clock delay No, incurs 4-cycle latency Incurred during flow and output dependency. Issued to U pipe only Issued to U pipe only

Issued to Integer X unit only. Delays can occur due to prefix accumulation. Short-decodeable, 1 cycle. Dependencies do not affect instruction decode. Issued to the Integer X unit only. Issued to dedicated floating-point unit. Conditional code such as JCCs are short decodeable and pairable. The AMD-K6 processor has a separate floating-point execution unit. Long and vector decodeable only. 1-clock delay.

Conditional Code Pairing Special pairing case Integer Execution Delay Due to Transcendental Operation Instructions Greater Than 7 Bytes Misaligned Data Penalty

Issue to U pipe is stalled Issued to U pipe only 3-clock delay

None Similar Partial

70

Considerations for Other Processors

Chapter 6

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 12. AMD-K6™ Processor and Pentium® Processor with Optimizations for MMX™ Instructions
Pentium/MMX Guideline/Event 0Fh Prefix Penalty Three-clock Stalls for Dependent MMX Multiplies Two-clock Stalls for Writing Then Storing an MMX Register U Pipe: Integer/MMX Pairing None Dependent instruction must be scheduled two instruction pairs following the multiply. Requires scheduling the store two cycles after writing (updating) the MMX register. MMX instruction that access either memory or integer registers cannot be executed in the V pipe. V pipe integer instruction must be pairable. Pentium/MMX Effect Usage/Effect on AMD-K6 Processor None Different MULL - 1 clock MULH - 2 clocks MADD - 2 clocks AMD-K6 Processor Details

None Pairing requires short-decodeable integer instruction as the second instruction. Pairing requires short-decodeable integer instruction as the second instruction.

Different

U Pipe: MMX/Integer Pairing

Similar

Cannot pair two MMX multiplies, two MMX shifts, or Pairing Two MMX Instructions MMX instructions in V pipe with U pipe dependency. 66h or 67h Prefix Penalty Three clocks.

None None

Table 13. AMD-K6™ Processor and Pentium® Pro Processor-Specific Optimizations
Pentium Pro Guideline/Event Pentium Pro Effect Avoid reading a large register after writing a smaller version of the same register. This causes the P6 to stall the issuing of instructions that reference the full register and all subsequent instructions until after the partial write has retired. If the partial register update is adjacent to a subsequent full register read, the stall lasts at least seven clock cycles with respect to the decoder outputs. On the average, such a stall can prevent from 3 to 21 micro-ops from being issued. Usage/Effect on AMD-K6 AMD-K6 Processor Detail Processor The AMD-K6 processor performs register dependency checking on a byte granularity. Due to shorter pipelines, execution latency, and commitment latency, instruction issuing is not affected. However, execution is stalled.

Partial-Register Stalls

Different

Chapter 6

Considerations for Other Processors

71

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

Table 13. AMD-K6™ Processor and Pentium® Pro Processor-Specific Optimizations (continued)
Pentium Pro Guideline/Event Branches Avoid Self-Modifying Code Code Alignment Predicted Branch Penalty Mispredicted Branch Misaligned Data Penalty 2-Byte Data Alignment Pentium Pro Effect Exploit the P6 return stack by using a RET rather than a JMP at the end of a subroutine. Code that alters itself can cause the P6 to flush the processor’s pipelines and can invalidate code resident in caches. 16-byte block BTB suffers 1-cycle delay Minimum 9, typically 10 to 15 clocks More than 3 clocks 4-byte boundary Usage/Effect on AMD-K6 AMD-K6 Processor Detail Processor Same The AMD-K6 processor contains a Call/Return stack.

Same Same None Different Different Same The AMD-K6 processor uses parallel adders for on-the-fly address generation. 1 to 4 clocks. 1-clock maximum delay. Note, the misalignment penalty is only a 1-clock delay. Note, the misalignment penalty is only a 1-clock delay. Note, the misalignment penalty is only a 1-clock delay. Long-decodeable and vector-decodeable. Delays can sometimes occur due to prefix accumulation. Short-decodeable, 1 cycle. 3 clocks.

4-Byte Data Alignment

4-byte boundary

Same

8-Byte Data Alignment Instruction Lengths Greater Than 7 Bytes Prefix Penalty 0Fh Prefix Penalty MOVZX Acceleration Static Prediction Penalty

8-byte boundary Issued one at a time 1-clock delay None Yes 6 clocks

Same Different Possible None Yes Different

72

Considerations for Other Processors

Chapter 6

21828A/0—August 1997

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

Table 14. AMD-K6™ Processor and Pentium® Pro with Optimizations for MMX™ Instructions
Pentium Pro Guideline/Event Three Clock Stalls for Dependent MMX Multiplies Pentium Pro Effect Dependent instruction must be scheduled two instruction pairs following the multiply. Usage/Effect on AMD-K6 Processor AMD-K6 Processor Details None MULL - 1 clock MULH - 2 clocks MADD - 2 clocks

Cannot pair two MMX multiplies, two Pairing Two MMX Instructions MMX shifts, or MMX instructions in V-pipe with U-pipe dependency. Predicted Branches not in the ~5-cycle latency BTB

None Different 1-cycle latency for BTB miss.

Chapter 6

Considerations for Other Processors

73

AMD-K6™ MMX™ Enhanced Processor x86 Code Optimization

21828A/0—August 1997

74

Considerations for Other Processors

Chapter 6