Why 64-Bit Computing?
The question of why we need 64-bit computing is often asked but rarely answered in a satisfactory manner. There are good reasons for the confusion surrounding the question.

So, first of all, let's look at the users who need 64-bit addressing and 64-bit calculations today:

Users of CAD, design systems, and simulators need more than 4 GB of RAM. There are ways around this limit (for example, Intel PAE), but they cost performance. Xeon processors, for instance, support a 36-bit addressing mode in which they can address up to 64 GB of RAM. In this mode the RAM is divided into segments, and an address consists of a segment number and a location inside the segment. This approach causes an almost 30% performance loss in memory operations. Besides, programming is much simpler and more convenient with a flat memory model in the 64-bit address space: because the address space is so large, every location has a simple address that is processed in one pass. Many design offices use quite expensive workstations based on RISC processors, where 64-bit addressing and large memory sizes have been in use for a long time.

Users of databases. Any big company has a huge database, and extending the maximum memory size and making it possible to address data directly in the database is very costly. Although the 32-bit IA-32 architecture can address up to 64 GB of memory in special modes, a transition to the flat memory model in the 64-bit space is much more advantageous in terms of speed and ease of programming.

Scientific calculations. Memory size, a flat memory model, and no limit on the size of processed data are the key factors here. Besides, some algorithms take a much simpler form in a 64-bit representation.

Cryptography and security applications benefit greatly from 64-bit integer calculations.

What is 64-bit computing?
The labels "16-bit," "32-bit" or "64-bit," when applied to a microprocessor, characterize the processor's data stream. Although you may have heard the term "64-bit code," this designates code that operates on 64-bit data.

In more specific terms, the labels "64-bit," "32-bit," etc. designate the number of bits that each of the processor's general-purpose registers (GPRs) can hold. So when someone uses the term "64-bit processor," what they mean is "a processor with GPRs that store 64-bit numbers." And in the same vein, a "64-bit instruction" is an instruction that operates on 64-bit numbers.

In the diagram above black boxes are code, white boxes are data, and gray boxes are results. The instruction and code "sizes" are not to be taken literally, since they're intended to convey a general feel for what it means to "widen" a processor from 32 bits to 64 bits.

Not all the data either in memory, the cache, or the registers is 64-bit data. Rather, the data sizes are mixed, with 64 bits being the widest.

Note that in the 64-bit CPU pictured above, the width of the code stream has not changed; the same-sized opcode could theoretically represent an instruction that operates on 32-bit numbers or an instruction that operates on 64-bit numbers, depending on what the opcode's default data size is. On the other hand, the width of the data stream has doubled. In order to accommodate the wider data stream, the sizes of the processor's registers and the sizes of the internal data paths that feed those registers must be doubled.

Now let's take a look at two programming models, one for a 32-bit processor and another for a 64-bit processor.

The registers in the 64-bit CPU pictured above are twice as wide as those in the 32-bit CPU, but the size of the instruction register (IR) that holds the currently executing instruction is the same in both processors. Again, the data stream has doubled in size, but the instruction stream has not. Finally, the program counter (PC) has also doubled in size.

For the simple processor pictured above, the two types of data that it can process are integer data and address data. Ultimately, addresses are really just integers that designate a memory address, so address data is just a special type of integer data. Hence, both data types are stored in the GPRs and both integer and address calculations are done by the ALU.
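To make the point that an address is just an integer concrete, here is a minimal sketch. The function name, the base address, and the element size are hypothetical values chosen for illustration; they do not come from any particular processor.

```python
# Hypothetical illustration: an address is an ordinary integer, so finding an
# array element is the same kind of integer arithmetic an ALU performs on data.

def element_address(base: int, index: int, element_size: int) -> int:
    """Return the byte address of element `index` of an array starting at `base`."""
    return base + index * element_size


if __name__ == "__main__":
    base = 0x1000  # assumed starting address of the array
    # Address of the fourth 8-byte element (index 3).
    print(hex(element_address(base, index=3, element_size=8)))
```

The same add and multiply operations would serve equally well for plain integer data, which is why both kinds of value can share the GPRs and the ALU.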

Many modern processors support two additional data types: floating-point data and vector data. Each of these two data types has its own set of registers and its own execution unit(s). The following table compares all four data types in 32-bit and 64-bit processors:

Data Type             Integer   Address   Floating Point*   Vector
Register Type         GPR       GPR       FPR               VR
x86 width (bits)      32        32        64                128
x86-64 width (bits)   64        64        64                128

*x87 uses 80-bit registers to do double-precision floating-point. The floats themselves are 64-bit, but the processor converts them to an internal, 80-bit format for increased precision when doing computations.

The table above shows that the move to 64 bits changes only the integer and address hardware. The floating-point and vector hardware stays the same.

Now that we know what 64-bit computing is, let's take a look at the benefits of increased integer and data sizes.

Dynamic range
The main thing that a wider integer gives you is increased dynamic range.

In the base-10 number system to which we're all accustomed, you can represent a maximum of ten integers (0 to 9) with a single digit. This is because base-10 has ten different symbols with which to represent numbers. To represent more than ten integers you need to add another digit, using a combination of two symbols chosen from among the set of ten to represent any one of 100 integers (00 to 99). The general formula that you can use to compute the number of integers (dynamic range, or DR) that you can represent with an n-digit base-ten number is:

DR = 10^n

So a 1-digit number gives you 10^1 = 10 possible integers, a 2-digit number 10^2 = 100 integers, a 3-digit number 10^3 = 1000 integers, and so on.

The base-2, or "binary," number system that computers use has only two symbols with which to represent integers: 0 and 1. Thus, a single-digit binary number allows you to represent only two integers, 0 and 1. With a two-digit (or "2-bit") binary, you can represent four integers by combining the two symbols (0 and 1) in any of the following four ways:

00 = 0
01 = 1
10 = 2
11 = 3

Similarly, a 3-bit binary number gives you eight possible combinations, which you can use to represent eight different integers. As you increase the number of bits, you increase the number of integers you can represent. In general, n bits will allow you to represent 2^n integers in binary. So a 4-bit binary number can represent 2^4 or 16 integers, an 8-bit number gives you 2^8 = 256 integers, and so on.

So in moving from a 32-bit GPR to a 64-bit GPR, the range of integers that a processor can manipulate goes from 2^32 = 4.3e9 to 2^64 = 1.8e19. The dynamic range, then, increases by a factor of 4.3 billion. Thus a 64-bit integer can represent a much larger range of numbers than a 32-bit integer.
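The dynamic range formula is easy to check numerically. The sketch below generalizes it to any base; the function name is ours, chosen for illustration.

```python
# Dynamic range: the count of distinct integers an n-digit number in a given
# base can represent. Function name is illustrative, not from the text.

def dynamic_range(base: int, digits: int) -> int:
    """Number of distinct values representable with `digits` digits in `base`."""
    return base ** digits


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n}-digit decimal: {dynamic_range(10, n)} integers")
    for bits in (4, 8, 32, 64):
        print(f"{bits}-bit binary: {dynamic_range(2, bits)} integers")
    # Growth factor when widening a register from 32 to 64 bits.
    print(dynamic_range(2, 64) // dynamic_range(2, 32))
```

The last line prints the factor by which the range grows, which is itself 2^32, or roughly 4.3 billion.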

The benefits of increased dynamic range, or, how the existing 64-bit computing market uses 64-bit integers
Since addresses are just special-purpose integers, an ALU and register combination that can handle more possible integer values can also handle that many more possible addresses. With all the recent press coverage that 64-bit architectures have garnered, it's fairly common knowledge that a 32-bit processor can address at most 4GB of memory. (Remember our 2^32 = 4.3 billion number? That 4.3 billion bytes is about 4GB.) A 64-bit architecture could theoretically, by contrast, address up to 18 million terabytes.
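As a rough check of these figures, the following sketch converts an address width into a memory size, assuming byte-addressable memory (one address per byte). It also covers the 36-bit case mentioned earlier for PAE. The helper name is illustrative.

```python
# Addressable memory for a given address width, assuming one address per byte.

GIB = 2 ** 30          # binary gigabyte
TERABYTE = 10 ** 12    # decimal terabyte, as used in the "18 million" figure


def addressable_bytes(address_bits: int) -> int:
    """Total bytes reachable with addresses of the given width."""
    return 2 ** address_bits


if __name__ == "__main__":
    for bits in (32, 36, 64):
        size = addressable_bytes(bits)
        print(f"{bits}-bit: {size // GIB} GiB, {size / TERABYTE:.3e} TB")
```

A 32-bit address reaches 4 GiB, a 36-bit address 64 GiB, and a 64-bit address about 18.4 million decimal terabytes.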

So, what do you do with over 4GB of memory? Well, caching a very large database in it is a start. Back-end servers for mammoth databases are one place where 64 bits have long been a requirement, so it's no surprise to see upcoming 64-bit offerings billed as capable database platforms.

On the media and content creation side of things, folks who work with very large 2D image files also appreciate the extra RAM. A related and very interesting application domain where large amounts of memory come in handy is simulation and modeling. Under this heading you could put various CAD tools and 3D rendering programs, as well as things like weather and scientific simulations, and even real-time 3D games. Though the current crop of 3D games wouldn't benefit from greater than 4GB of RAM, it is quite possible that we'll see a game that benefits from greater than 4GB RAM within the next five years.

Some applications, mostly in the realm of scientific computing (MATLAB, Mathematica, MAPLE, etc.) and simulations, require 64-bit integers because they work with numbers outside the dynamic range of 32-bit integers. When the result of a calculation exceeds the range of possible integer values, you get a situation called either overflow (i.e. the result was greater than the highest positive integer) or underflow (i.e. the result was less than the lowest negative integer). When this happens, the number you get in the register isn't the right answer. There's a bit in the x86's processor status word that allows you to check whether an integer operation has just exceeded the processor's dynamic range, so you know that the result is bogus. Such situations are rare in integer applications.
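The following sketch models what happens when a signed 32-bit addition leaves the representable range. Python integers never overflow, so the wraparound is simulated with a mask. The function names are ours, and the returned flag only mimics the idea of a status bit; it is not a model of the actual x86 flag logic.

```python
# Simulated signed 32-bit addition with an overflow indicator.
# Two's-complement wraparound is assumed.

MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def add32(a: int, b: int) -> tuple[int, bool]:
    """Return (value left in a 32-bit register, whether the true sum did not fit)."""
    exact = a + b
    wrapped = to_signed32(exact)
    return wrapped, wrapped != exact


if __name__ == "__main__":
    print(add32(1, 2))                # fits in 32 bits
    print(add32(2_147_483_647, 1))    # too large: wraps to a negative number
    print(add32(-2_147_483_648, -1))  # too small: wraps to a positive number
```

In the second and third calls the value left in the "register" is not the true sum, and the flag tells the program so.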

Programmers who run into integer overflow or underflow problems on a 32-bit platform do have the option of using a 64-bit integer construct provided by a higher level language like C. In such cases, the compiler uses two registers per integer, one for each half of the integer, to do 64-bit calculations in 32-bit hardware. This has obvious performance drawbacks, making it less desirable than a true 64-bit integer implementation.
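A minimal sketch of the two-register approach: a 64-bit addition built from 32-bit pieces, with the carry out of the low half fed into the high half. The helper names are hypothetical, and Python integers stand in for the 32-bit registers.

```python
# 64-bit unsigned addition on "32-bit hardware": each operand occupies two
# 32-bit registers (low half, high half), and the add takes two steps.

MASK32 = 0xFFFFFFFF


def split64(x: int) -> tuple[int, int]:
    """Split a 64-bit value into (low 32 bits, high 32 bits)."""
    return x & MASK32, (x >> 32) & MASK32


def join64(lo: int, hi: int) -> int:
    """Reassemble a 64-bit value from its two 32-bit halves."""
    return (hi << 32) | lo


def add64_on_32bit(a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> tuple[int, int]:
    """Add two 64-bit values held as 32-bit halves; result wraps at 64 bits."""
    lo_sum = a_lo + b_lo                  # first 32-bit add
    carry = lo_sum >> 32                  # carry out of the low half (0 or 1)
    lo = lo_sum & MASK32
    hi = (a_hi + b_hi + carry) & MASK32   # second 32-bit add, with carry in
    return lo, hi


if __name__ == "__main__":
    a, b = 0x00000001FFFFFFFF, 1
    lo, hi = add64_on_32bit(*split64(a), *split64(b))
    print(hex(join64(lo, hi)))
```

Two dependent additions replace what a 64-bit ALU does in one, which is the performance drawback the paragraph describes.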

Finally, there is another application domain for which 64-bit integers can offer real benefits: cryptography. Most popular encryption schemes rely on the multiplication and factoring of very large integers and the larger the integers the more secure the encryption.
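To illustrate why word size matters for very large integers, the sketch below multiplies two large numbers one machine word at a time and counts the word-sized multiplications. It assumes the simple schoolbook method, which is our choice for illustration; real cryptographic libraries often use faster algorithms. All names are hypothetical.

```python
# Schoolbook multiplication of big integers split into machine-word "limbs".
# Counts how many word-by-word multiplies are needed for a given word size.

def to_limbs(x: int, word_bits: int, count: int) -> list[int]:
    """Split x into `count` limbs of `word_bits` bits, least significant first."""
    mask = (1 << word_bits) - 1
    return [(x >> (i * word_bits)) & mask for i in range(count)]


def schoolbook_multiply(a: int, b: int, bits: int, word_bits: int) -> tuple[int, int]:
    """Return (product, number of limb multiplications) for `bits`-bit operands."""
    count = -(-bits // word_bits)  # ceiling division: limbs per operand
    a_limbs = to_limbs(a, word_bits, count)
    b_limbs = to_limbs(b, word_bits, count)
    product = 0
    multiplies = 0
    for i, x in enumerate(a_limbs):
        for j, y in enumerate(b_limbs):
            product += (x * y) << ((i + j) * word_bits)
            multiplies += 1
    return product, multiplies


if __name__ == "__main__":
    a = 2**2047 + 12345
    b = 2**2000 + 6789
    for word_bits in (32, 64):
        _, n = schoolbook_multiply(a, b, bits=2048, word_bits=word_bits)
        print(f"{word_bits}-bit words: {n} word multiplications")
```

With this method, doubling the word size halves the number of limbs per operand and so quarters the number of word multiplications.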

64-bit integer code runs slowly on a 32-bit machine, because the 64-bit computations have to be split apart and processed as two separate 32-bit computations. So you could say that there's a performance penalty for running 64-bit integer code on a 32-bit machine; this penalty is absent when running the same code on a 64-bit machine, since the computation doesn't have to be split in two. The take-home point here is that only applications that require and use 64-bit integers will see a performance increase on 64-bit hardware that is due solely to a 64-bit processor's wider registers and increased dynamic range.

64-bit Architectures

Let’s discuss the 64-bit architectures from the two leading processor manufacturers, AMD and Intel: AMD’s Opteron and Intel’s Itanium.

Intel 64-bit architecture (IA-64)
Intel's IA-64 uses a technique called VLIW, which stands for “Very Long Instruction Word”. Processors that use this technique fetch long program words from memory, with several instructions packed into each word. In the case of IA-64, three instructions are packed into each 128-bit word. As each instruction has 41 bits, 5 bits are left over, and these indicate the kinds of instructions that were packed. Figure 1 shows the instruction packaging scheme. This packaging reduces the number of memory accesses and leaves to the compiler the task of grouping the instructions to get the best out of the architecture.

Figure 1: Instruction packaging used in the IA-64 architecture.

As already mentioned, the 5-bit field, named “pointer” (Intel's documentation calls it the template), indicates the kinds of instructions that are packed. Those 5 bits allow 32 possible kinds of packaging, of which only 24 are in fact defined, since 8 are not used. Each instruction uses one of the CPU's execution unit types, which are listed below and can be identified in the figure below.

Unit I: integer data
Unit F: floating-point operations
Unit M: memory access
Unit B: branch prediction
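The 128-bit packaging described above (three 41-bit instructions plus a 5-bit field) can be sketched as simple bit packing. The bit positions used here, with the 5-bit field in the lowest bits followed by the three instruction slots, are an assumption made for illustration, and the function names are ours.

```python
# Packing three 41-bit instruction slots and a 5-bit field into one 128-bit word.
# Assumed layout: 5-bit field in bits 0-4, then slots 0, 1, 2 in ascending order.

SLOT_BITS = 41
FIELD_BITS = 5
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = FIELD_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 5 + 3 * 41 = 128


def pack_bundle(field: int, slots: list[int]) -> int:
    """Combine the 5-bit field and three 41-bit slots into one 128-bit integer."""
    if not 0 <= field < (1 << FIELD_BITS):
        raise ValueError("field must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = field
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (FIELD_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int) -> tuple[int, list[int]]:
    """Recover the 5-bit field and the three 41-bit slots from a bundle."""
    field = bundle & ((1 << FIELD_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (FIELD_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return field, slots


if __name__ == "__main__":
    bundle = pack_bundle(0b10000, [1, 2, 3])
    print(BUNDLE_BITS, unpack_bundle(bundle))
```

One memory fetch of such a word delivers three instructions at once, together with the information about which kinds of execution units they need.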

The architecture that Intel proposes for executing these instructions, called Itanium, is versatile and promises performance through the simultaneous (parallel) execution of up to 6 instructions. The figure below shows a block diagram of this architecture, which uses a 10-stage pipeline.

Block diagram of the Itanium CPU (IA-64 architecture).

The basic structural unit of the Itanium looks like the picture above. The data bus can, according to Intel, cope with a data rate of 2.1GB/sec. The Itanium processor contains 4 integer ALUs, 4 multimedia ALUs, 2 AGUs, 3 branching units and 4 FPUs for floating-point arithmetic. The processor is theoretically capable of performing 20 operations in one clock cycle by loading 16 operands and evaluating 4 ALU operations. This should not be confused with the number of instructions possible within one clock cycle, namely six. The instructions are retrieved from memory and bundled by a process called bundle rotation; this prepares the execution of parallel instructions at the hardware level. The instructions are fetched from the cache speculatively. All this is implemented with the help of 128 floating-point registers, 128 integer registers and 8 branch registers, all of which explicitly support 64 bits.

The IA-64 architecture goes by the acronym EPIC, which stands for “Explicitly Parallel Instruction Computing”. With this name, Intel wants to say that the compiler carries most of the responsibility for identifying and exposing the parallelism present in the instructions to be executed. EPIC is a combination of concepts called speculation, predication and explicit parallelism.

Next, we will briefly study each one of them.

Explicit parallelism:

Instruction-level parallelism (ILP) is the ability to execute multiple instructions at the same time. As we have seen, the IA-64 architecture allows independent instructions to be packed together for parallel execution and can handle multiple packs in each clock period. Thanks to the large number of parallel resources, registers, and execution units, the compiler can manage and schedule the parallel computation. Compilers for traditional architectures are limited in how far they can speculate, because there is not always a way to be sure that the processor will handle the speculation correctly. The IA-64 architecture allows the compiler to exploit speculative information without sacrificing the correct execution of an application.

The IA-64 architecture has mechanisms called instruction pointer, branch hints, and cache hints, which allow the compiler to pass to the processor information obtained at compile time. That information minimizes the penalties that come from branches and cache misses.

Speculation:

The Itanium can load instructions and data onto the CPU before they're actually needed or even if they prove not to be needed, effectively using the processor itself as a cache. Presumably, this early loading is done when the processor is otherwise idle. The advantage gained by speculation limits the effects of memory latency by allowing loading of data before it is needed, thus making it ready to go the moment the processor can use it.

There are two kinds of speculation: data and control. With speculation, the compiler moves an operation earlier so that its latency (time spent) is removed from the critical path. Speculation is a way of letting the compiler keep slow operations from spoiling the parallelism of the instructions. Control speculation is the execution of an operation before the branch that precedes it. Data speculation, on the other hand, is the execution of a memory load before a store that precedes it and with which it may conflict.

Speculation benefits:
Reduces the impact of memory latency.
Performance improvement of 79% when combined with predication.
Greatest improvement to code with many cache accesses: large databases and operating systems.
Scheduling flexibility enables new levels of performance headroom.

Predication:

Branch prediction is currently used in today's processors. However, much processor time is taken up by calculations for branches that end up being unneeded. Predication is a compiler-based technique of looking ahead to make more accurate predictions of which code branches will actually be used, thus limiting unneeded calculations.

With predication, all the arms of a conditional branch are marked with predicates and then sent for execution in parallel, but only the necessary ones take effect. It is therefore possible to prepare the execution of the instructions even before the conditional branch has been resolved. Besides removing branches by means of predicates, the IA-64 architecture has a series of mechanisms that should reduce both the error in predicting branches and the cost when such an error happens.

Predication benefits:
Reduces branches and mispredict penalties.
Parallel compares further reduce critical paths.
Greatly improves code with hard-to-predict branches: large server apps (capacity limited); sorting and data mining (large database apps); data compression.
Traditional architectures’ “bolt-on” approach can’t efficiently approximate predication. Cmove: 39% more instructions, 30% lower performance. Instructions must all be speculative.
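The idea of replacing a branch with predicated operations can be sketched in ordinary code. This is only a software model of the concept: both arms are computed and the predicate selects which result is kept. The function names are ours, and real predicated hardware discards the results of instructions whose predicate is false rather than multiplying by zero.

```python
# Branching version versus a "predicated" version of the same computation.

def branchy_abs_diff(a: int, b: int) -> int:
    """Conventional form: a conditional branch picks one arm to execute."""
    if a > b:
        return a - b
    return b - a


def predicated_abs_diff(a: int, b: int) -> int:
    """Branch-free form: both arms run, a predicate decides which result counts."""
    p_then = a > b          # the compare produces a predicate...
    p_else = not p_then     # ...and its complement
    r_then = a - b          # arm guarded by p_then
    r_else = b - a          # arm guarded by p_else
    # Keep only the result whose predicate is true (True counts as 1, False as 0).
    return r_then * p_then + r_else * p_else


if __name__ == "__main__":
    for a, b in [(10, 3), (3, 10), (5, 5)]:
        print(a, b, branchy_abs_diff(a, b), predicated_abs_diff(a, b))
```

Because the second version contains no branch, there is nothing to mispredict; the price is that work for both arms is issued.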

The IA-64 architecture has a great number of registers. There are 128 integer registers, 128 floating-point registers, 64 one-bit predicate registers, and many other registers for configuration, management and monitoring of the CPU’s performance.

Rotating Registers
On top of the frames, there's register rotation, a feature that helps loop unrolling more than parameter passing. With rotation, Itanium can shift up to 96 of its general-purpose registers (the first 32 are still fixed and global) by one or more apparent positions. Why? So that iterative loops that hammer on the same register(s) time after time can all be dispatched and executed at once without stepping on each other. Each instance of the loop actually targets different physical registers, allowing them all to be in flight at once.
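A simplified model of register rotation follows. It assumes that all 96 upper registers rotate together and that the rotation base steps down by one per loop iteration, so a value written to register 32 in one iteration is visible as register 33 in the next. The class and method names are hypothetical, and details of the real mechanism are left out.

```python
# Simplified model: logical register names map to physical registers through
# a rotating base. Registers 0-31 are fixed; 32-127 rotate.

FIXED_REGS = 32
ROTATING_REGS = 96


class RotatingRegisterFile:
    def __init__(self) -> None:
        self.physical = [0] * (FIXED_REGS + ROTATING_REGS)
        self.base = 0  # rotation base, changed once per loop iteration

    def _physical_index(self, logical: int) -> int:
        """Map a logical register number to the physical register behind it."""
        if logical < FIXED_REGS:
            return logical  # fixed registers are never renamed
        offset = (logical - FIXED_REGS + self.base) % ROTATING_REGS
        return FIXED_REGS + offset

    def write(self, logical: int, value) -> None:
        self.physical[self._physical_index(logical)] = value

    def read(self, logical: int):
        return self.physical[self._physical_index(logical)]

    def rotate(self) -> None:
        """Shift every rotating register up by one apparent position."""
        self.base = (self.base - 1) % ROTATING_REGS


if __name__ == "__main__":
    regs = RotatingRegisterFile()
    regs.write(32, "iteration 0")
    regs.rotate()
    regs.write(32, "iteration 1")  # same name, different physical register
    print(regs.read(33), "|", regs.read(32))
```

Each iteration writes to the same register name yet lands in a different physical register, which is what lets several iterations be in flight without overwriting one another.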

If this sounds a lot like register renaming, it is. Itanium's register-rotation feature is less generic than all-purpose register renaming like Athlon's, so it's easier to implement and faster to execute. Chip-wide register renaming like Athlon's adds gobs of multiplexers, adders, and routing, one of the big drawbacks of a massively out-of-order machine. On a smaller scale, ARM used this trick with its ill-fated Piccolo DSP coprocessor. At the high end, Cydrome also used this technique, a favorite feature that Cydrome alumnus and Itanium team member Bob Rau apparently brought with him.

So IA-64 has two levels of indirection for its own registers: the logical-to-virtual mapping of the frames and the virtual-to-physical mapping of the rotation. All this means that programs usually aren't accessing the physical registers they think they are, but that's nothing new to high-end microprocessors. Arcane as it seems, this method still uses less hardware trickery than the full register renaming of Athlon, Pentium III, or P4.

Intel promises compatibility with 32-bit (IA-32) software. Such software should run without any change, since the operating system and the firmware have features for that. It should be possible to run software in real mode (16 bits), protected mode (32 bits) and virtual 8086 mode (16 bits). This means that the CPU will be able to operate in IA-64 mode or IA-32 mode. There are special instructions to go from one mode to the other, as shown in Figure 3.

Figure 3: Model of instruction set transitions.

The three instructions that make the transition between the instruction sets are:

JMPE (IA-32): jumps to a 64-bit instruction and changes to IA-64 mode;

br.ia (IA-64): branches to a 32-bit instruction and changes to IA-32 mode;

rfi (IA-64): returns from an interruption; the return goes to either IA-32 or IA-64 code, depending on the state at the moment the interruption was invoked.

Interruptions switch the processor to IA-64 mode, so that all interruption conditions are handled there.

Athlon 64 and AMD's 64-bit technology

64-bit architecture introduction
To get a first idea of how the 64-bit architecture works, and of how it differs significantly from a 32-bit implementation, it is useful to consider one definition first:

"A 64-bit processor is a microprocessor with a word size of 64 bits, a requirement for memory and data intensive applications such as computer-aided design (CAD) applications, database management systems, technical and scientific applications, and high-performance servers. 64-bit computer architecture provides higher performance than 32-bit architecture by handling twice as many bits of information in the same clock cycle."

The key phrases in this definition give a rough idea that one can now process not only 2^32 = 4294967296 basic units of information, but 2^64 = 18446744073709551616 units. The numbers are quite impressive and show that the architecture has to be updated accordingly.

Several companies have implemented 64-bit processors, but the two main ones are AMD and Intel. Other enterprises certainly have their place in the development of 64-bit processors too, but the mainstream market is going to see the products from AMD and Intel. It is therefore reasonable to explain how those two companies designed their 64-bit processors; only details need to be considered when translating the two particular layouts and implementations to the general concept. The two companies differ considerably in how they chose to make 32-bit programs work with the 64-bit architecture, and those differences will be outlined in the 32-bit part of this document; the following part outlines the structure of a "pure" 64-bit architecture. Not much public information is available about the physical structure of current 64-bit processors, because neither AMD nor Intel wants to give crucial information to its rival. It is therefore useful to focus on the instruction set architecture (ISA) and the general differences between a 32-bit processor and the new 64-bit one.

With the successful introduction of the Opteron processor, AMD completed one half of its forecast entry into the 64-bit processing world. Based on an evolution of the x86 instruction set used by current 32-bit processors made by Intel and AMD, the Opteron is targeted at the high to mid-range server and workstation market.

The second processor released under the AMD64 architecture will be the Athlon 64, formerly known as 'ClawHammer,' which aims to bring 64-bit computing power to the desktop and mobile markets. The Athlon 64 will be a slightly hobbled version of the Opteron and, with its built-in compatibility with current software and operating systems, will attempt to bridge the gap easily between 32-bit and 64-bit computing environments.

We will focus on the Athlon 64 and what it will offer to home users and PC enthusiasts, as well as covering the important details of the AMD64 platform. The Opteron and the Athlon 64 share an identical base architecture.

AMD has positioned the Opteron as the solution to many system needs, with the primary goal of providing a 64-bit physical architecture while supplying high-end performance for both 64- and 32-bit software. This translates into architectural advantages such as 64-bit data and address pathways, upgraded physical and virtual memory addressing, and a true 64-bit internal design.

The other main innovation has been to move key Northbridge functions from the system chipset directly into the Opteron core. These include a memory controller, multiprocessing control, and data flow, along with a bridge to peripheral data traffic. Traditional Southbridge and AGP components are still present in the Opteron architecture, but AMD's eighth-generation processor has absconded with the main performance and CPU-centric duties.

Opteron Microarchitecture
The Opteron core resembles the basic design of the Athlon XP, but the move to a 64-bit architecture has brought some inherent advantages. Both the Opteron and Athlon XP contain a few similar features, such as 64K apiece of Level 1 data and instruction cache and three apiece of integer and floating-point units, but there have been some noted improvements elsewhere. In terms of basic features, the Opteron includes a full 1MB of Level 2 cache on the inside, along with an integrated heat spreader and new Socket 940 packaging on the outside.

Looking a bit deeper, AMD has improved on its seventh-generation design in other ways. A processor's registers are like miniature cache areas where crucial data is stored and retrieved; the Opteron features eight more general-purpose registers, and these have been extended to 64 bits. AMD has also added eight 128-bit Streaming SIMD Extension (SSE) registers for multimedia instructions, as well as compatibility with the SSE2 instructions that premiered in Intel's Pentium 4.

The chip's translation look-aside buffers are larger and offer lower latencies than those of the Athlon XP. Branch prediction is also enhanced, including an increase to 16K bimodal/history counters, or four times the level found on the Athlon XP.

This last note is important, because in order to provide higher frequencies and better scalability, AMD has extended the Opteron pipelines. The Opteron features a 12-stage integer operation pipeline (versus 10 stages for the Athlon XP) and a 17-stage floating-point operation pipeline (versus 15 for the Athlon XP). While this pays dividends on higher potential clock speeds, it also incurs a risk of increased prediction misses, so AMD has adjusted the architecture to provide even higher pipeline efficiencies than the Athlon XP.

The Opteron also has built-in core logic to support multiprocessor systems without the need for a Northbridge chip. Internal CPU data traffic is all routed through a crossbar (XBAR) communications architecture, which shuttles command and data information between the CPU, memory controller, and three HyperTransport links. This is a huge technological leap for multiprocessor workstation and server designs, as it provides a true standard for OEMs to work with, and takes the Northbridge component out of the equation.

Dual-Channel Memory, More Or Less
The AMD Opteron includes an integrated memory controller, capable of supporting DDR200 through DDR333 speeds and a maximum of eight DIMM memory modules per processor. The controller provides up to 5.3GB/sec of memory bandwidth (with 333MHz DDR), yielding higher memory performance, lower memory latencies, and performance levels that can scale to processor frequencies.

Since each CPU has its own memory controller, memory bandwidth will also scale in multiprocessor systems. For example, a 2-way Opteron workstation will yield 10.6GB/sec of memory bandwidth, while a 4-way Opteron server will double this again to an incredible 21.3GB/sec, along with supporting up to 32 DDR DIMMs.
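The bandwidth figures can be reproduced with simple arithmetic. The sketch assumes a 128-bit data path per processor and roughly 333 million transfers per second for DDR333; both values, like the function name, are assumptions chosen to match the numbers quoted in this section.

```python
# Peak memory bandwidth = transfers per second * bytes per transfer * processors.

DDR333_TRANSFERS_PER_SECOND = 333.33e6  # assumed effective rate of DDR333
DATA_PATH_BITS = 128                    # assumed width of each CPU's memory path


def peak_bandwidth_gb_per_s(processors: int = 1) -> float:
    """Aggregate peak bandwidth in GB/sec when each CPU has its own controller."""
    bytes_per_transfer = DATA_PATH_BITS // 8
    return DDR333_TRANSFERS_PER_SECOND * bytes_per_transfer * processors / 1e9


if __name__ == "__main__":
    for cpus in (1, 2, 4):
        print(f"{cpus}-way: {peak_bandwidth_gb_per_s(cpus):.1f} GB/sec")
```

Because every processor brings its own controller, the total scales linearly with the processor count; these are peak figures, not sustained throughput.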

The Opteron's integrated memory controller has been referred to as a dual-channel design, but this isn't the exact truth. It certainly delivers double the bandwidth of a single-channel controller, but does so by taking two 64-bit DDR modules and viewing them as a single 128-bit DIMM with a corresponding 128-bit data path. This is similar to the design of Intel's dual-channel DDR chipsets such as the E7205 and 875P, but different than the true dual-channel memory architecture of the NVIDIA nForce2.

This is actually a smart call when it comes to building an integrated memory controller, as for all intents and purposes, the bandwidth and performance are equivalent, but the 128-bit memory bus is more streamlined. In the Opteron architecture, there is no need for an arbiter chip to handle traffic along the dual physical memory channels, and no requirement for extra controller hardware. Of course, due to the "single-channel 128-bit" memory architecture, the pairs of DDR modules must be matched in size, speed, and chip-count, though not necessarily in manufacturer.

AMD's 64-bit platform
To access an area in the computer's physical memory (RAM) to store or retrieve data, the processor needs the address of that location, which is an integer number representing one byte of memory storage.

Suddenly, having 64-bit registers makes sense: while a 32-bit processor can access up to 4.3 billion memory addresses (2^32), for a total of about 4GB of physical memory, a 64-bit processor could conceivably access over 18 exabytes of physical memory. This is the one area that clearly shows why 64-bit processors are the future of computing, as demanding applications such as databases have long been scraping on the 4GB memory ceiling.

If you are a business with a database of a terabyte or more of information, 64-bit processors look pretty good right now.

Formerly known as X86-64, the AMD64 architecture is AMD's method of implementing 64-bit processors.

AMD64 is massively different from Intel's approach to 64-bit processors as seen in their Itanium line. While Intel used a completely different architecture for the Itanium chips, forcing software developers to relearn in order to program for them, or to use emulation, which slowed down performance, AMD decided to simply extend the existing x86 architecture (the foundation of all PCs since Intel developed the 8086 processor in 1978) to accommodate 64-bit registers as mentioned above.

There are several advantages to this. First, obviously, reworking code for AMD 64-bit processors should be considerably easier, since the basis is the same. Secondly, the AMD64-based Opteron and Athlon 64 are fully compatible with 32-bit applications.

A system based on either of these processors can use a 32-bit operating system and software without a hitch, providing a stress free upgrade path for businesses and opening up the desktop market to 64-bit processors, and more specifically, AMD's Athlon 64.

AMD accomplishes this by enabling the AMD64 processors to run in one of two modes, Legacy mode and Long mode. Legacy mode removes all 64-bit support and enables the processor to run strictly in 32-bit mode, which is necessary for running most current operating systems, including Windows. Long mode comprises two submodes, Compatibility mode and 64-bit mode.

Compatibility mode is designed for a 64-bit operating system such as Microsoft's impending 64-bit versions of XP and Server 2003, due late this year or early in the next, but running 32-bit software such as current databases. The advantage of this is that each 32-bit application, though still limited by the 4GB memory limit, can have all of that 4GB to itself with no overhead for the operating system, since that will use 64-bit addressing and can thus access additional memory space.

This provides some improved performance for demanding 32-bit apps before they are ported over to 64-bit. 64-bit mode is intended for a pure 64-bit environment, operating system and software, and offers one huge advantage.....

AMD - Instruction Set Architecture:
The most basic groups of instructions are specified as follows (see the AMD manual again, pages 38/39):

1. General Purpose Instructions: The basic integer instructions, which are used nearly everywhere. Often referred to as the x86 instruction set, they are easily illustrated by examples such as integer addition, moves, loads, stores, shifts and so on.

2. 128-Bit Media Instructions: Named for their primary application, these instructions operate on vectors of packed data (e.g. in video, scientific applications, games, etc.). Moreover, they operate in parallel, meaning that a single instruction processes multiple data elements at once (SIMD). These instructions are designed for speed in one special field of applications and are therefore not suited to arbitrary tasks.

3. 64-Bit Media Instructions: Also SIMD instructions, and not much different in use from the 128-bit instructions.

4. Floating Point Instructions: As the general purpose instructions only work on integers, these instructions provide a suitable tool for floating point operations.
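To make the idea of "one instruction, several data elements" concrete, here is a minimal Python sketch that emulates a 128-bit register holding four 32-bit integer lanes and adds two such registers lane by lane. It is an emulation for illustration only, not how the hardware is programmed; the helper names `pack`, `unpack` and `packed_add` are hypothetical.

```python
LANE_BITS = 32                      # width of one packed element
LANES = 4                           # 4 x 32 bits = one 128-bit register
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack four unsigned 32-bit values into one 128-bit integer (lane 0 lowest)."""
    if len(values) != LANES:
        raise ValueError("expected exactly four lanes")
    word = 0
    for index, value in enumerate(values):
        word |= (value & LANE_MASK) << (index * LANE_BITS)
    return word


def unpack(word):
    """Split a 128-bit integer back into its four 32-bit lanes."""
    return [(word >> (index * LANE_BITS)) & LANE_MASK for index in range(LANES)]


def packed_add(a, b):
    """Add two packed registers lane by lane; each lane wraps around on overflow."""
    sums = [(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))]
    return pack(sums)


if __name__ == "__main__":
    left = pack([1, 2, 3, 4])
    right = pack([10, 20, 30, 40])
    print(unpack(packed_add(left, right)))
```

The loop inside `packed_add` is what the hardware performs in a single step, which is where the speed advantage for media-style workloads comes from.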

Long mode is normally activated by the operating system, and the processor indicates this through its LMA (long mode active) flag. With LMA set and 64-bit code running, the processor is in the stage we would like to call "pure" 64-bit mode. Such a mode can be recognized in both architectures, the one described here from AMD and the Intel IA64 described later on this page. For the following part of the analysis we assume that LMA is set and the processor is in "pure" 64-bit mode, which is not to be confused with legacy mode or the compatibility submode of long mode; these are features to support the transition from 32-bit machines and software to the new architecture and belong in the 32-bit section.

The default operand size in 64-bit mode is 32 bits. A REX prefix, a byte placed in front of the opcode, specifies whether an instruction accepts this default or uses 64-bit operands, in which case the registers are used in their extended 64-bit form (for example RAX instead of EAX). The same prefix gives access to the 8 new GPRs R8-R15. Because the byte values used for the prefix previously encoded other instructions, some opcodes had to be redefined. Nevertheless, these are only minor changes, and most of the opcode map is carried over from the 32-bit processor.

The memory is a single flat address space starting at address 0 and extending linearly over 64 bits. The operating system can specify several levels of access protection for the address space. In 64-bit mode the base addresses of the code, data and stack segments are treated as 0, so the segment registers no longer take part in forming ordinary addresses. This is a real simplification compared to 32-bit processing and all the compatibility modes offered by AMD: it is just plain memory addressing from 0 to 2^64 - 1 without any special cases. This concept shows on the micro level what the goal of the complete architecture is: the search for more simplicity, more raw computing power and readiness for large amounts of data. Another cornerstone of this path is that the whole 64-bit virtual address space is translated into physical memory in a single translation process: paging is performed on the virtual address directly.

Bytes are stored in little-endian order, and this holds for data and instructions alike. The instructions do not really "change" in the sense of a structural redesign; the size of the operands is the crucial factor. Consider for example this instruction: 48 B8 1234567812345678. The 48 is a REX prefix that specifies the length of the operands: 64 bits! The opcode B8 is also used in the 32-bit architecture, and the remaining part is simply an 8-byte immediate value, and with that we are computing with a 64-bit processor.
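As a rough illustration of this encoding, the following Python sketch splits a REX prefix byte into its four flag bits and decodes a move of an 8-byte immediate into a 64-bit register. The function and table names are hypothetical, and the sketch assumes that the digits of the example above are the bytes of the immediate in the order they appear in memory, so that little-endian ordering reverses them when the value is read.

```python
# The 16 general purpose registers of 64-bit mode, indexed by register number.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_rex(byte):
    """Split a REX prefix (byte values 0x40-0x4F) into its W, R, X and B bits."""
    if byte & 0xF0 != 0x40:
        raise ValueError("not a REX prefix: 0x%02X" % byte)
    return {
        "W": (byte >> 3) & 1,  # 1 selects 64-bit operand size
        "R": (byte >> 2) & 1,  # extends the ModRM reg field
        "X": (byte >> 1) & 1,  # extends the SIB index field
        "B": byte & 1,         # extends the register encoded in the opcode
    }


def decode_mov_imm64(code):
    """Decode 'REX.W B8+r imm64', i.e. a move of an 8-byte immediate into a register."""
    if len(code) < 10:
        raise ValueError("need prefix, opcode and 8 immediate bytes")
    rex = decode_rex(code[0])
    opcode = code[1]
    if rex["W"] != 1 or not 0xB8 <= opcode <= 0xBF:
        raise ValueError("not a 64-bit immediate move")
    register = (rex["B"] << 3) | (opcode - 0xB8)
    immediate = int.from_bytes(code[2:10], "little")  # little-endian byte order
    return REG64[register], immediate


if __name__ == "__main__":
    name, value = decode_mov_imm64(bytes.fromhex("48B81234567812345678"))
    print(name, hex(value))
```

Changing the prefix from 48 to 49 sets the B bit, and the same opcode byte then addresses one of the new registers R8-R15 instead.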

There exist five addressing modes:

Absolute Address: given as a displacement from the segment base (in 64-bit mode the base is just 0)

Instruction-Relative Address: given relative to the instruction pointer (IP), also known as the program counter (PC)

Stack Address: using the stack pointer

String Addresses

Mod R/M Address

And again one realizes that there are no real differences in structure compared to non-64-bit ISAs. The program counter, the stack and absolute addressing simply carry over with more bits. The RIP (the 64-bit instruction pointer, i.e. the program counter) keeps its function, but in 64-bit mode it can also serve as the base for relative addressing, which provides a more efficient way to directly reach code and data near the current instruction. This is one reason for the speed increase of the AMD 64-bit architecture: direct access to program code.
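How an instruction-relative address is formed can be shown with a small calculation. In this Python sketch the function name and the sample addresses are hypothetical; it assumes the usual rule that the signed 32-bit displacement is added to the address of the next instruction.

```python
MASK64 = (1 << 64) - 1


def rip_relative_target(instruction_address, instruction_length, displacement32):
    """Compute the address reached by an instruction-relative operand.

    The displacement is a 32-bit two's-complement value and is applied to the
    address of the *following* instruction.
    """
    displacement32 &= 0xFFFFFFFF
    if displacement32 & 0x80000000:          # negative displacement
        displacement32 -= 1 << 32
    next_instruction = instruction_address + instruction_length
    return (next_instruction + displacement32) & MASK64


if __name__ == "__main__":
    # A 7-byte instruction at 0x401000 referring 0x10 bytes ahead and behind.
    print(hex(rip_relative_target(0x401000, 7, 0x00000010)))
    print(hex(rip_relative_target(0x401000, 7, 0xFFFFFFF0)))
```

Because only the distance to the target is encoded, the same instruction bytes keep working when the program is loaded at a different address.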

Absolute addressing gets even easier thanks to the common base of 0. The same holds for pointers in general. Since segmentation no longer plays a role, the concept of far pointers, which store a segment address in addition to the usual address, is no longer needed, because the memory is just one linear chunk. Near pointers are enough, and for 64-bit applications on the AMD architecture one can return to the general term "pointer", as it is obvious that it can only point into one data segment. Immediates and displacements generally remain 32 bits in size and are sign-extended to 64 bits where needed.
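The contrast between a far pointer and a flat pointer, and the widening of a 32-bit value, can be sketched as follows. The Python helper names are hypothetical; the far-pointer calculation uses the classic 16-bit real-mode rule (segment times 16 plus offset, with 8086-style wrap-around at 1 MB) purely as a familiar example of segmented addressing.

```python
def real_mode_linear(segment, offset):
    """Turn a 16-bit segment:offset far pointer into a 20-bit linear address."""
    return (((segment & 0xFFFF) << 4) + (offset & 0xFFFF)) & 0xFFFFF


def flat_linear(pointer):
    """In a flat 64-bit address space the pointer already is the linear address."""
    return pointer & 0xFFFFFFFFFFFFFFFF


def sign_extend_32_to_64(value):
    """Widen a 32-bit two's-complement value to its 64-bit bit pattern."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:                   # top bit set: negative number
        value |= 0xFFFFFFFF00000000          # copy the sign into the upper half
    return value


if __name__ == "__main__":
    print(hex(real_mode_linear(0xB800, 0x0010)))
    print(hex(flat_linear(0x00007FFF12345678)))
    print(hex(sign_extend_32_to_64(0xFFFFFFF0)))
```

With a flat model the first function disappears entirely, which is the simplification the text describes.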

This finishes the broad outline of the instruction set architecture for AMD, based on the document mentioned above. AMD's philosophy of keeping things simple and easy becomes apparent, but this is only true for AMD, not for 64-bit processors in general. Other designs might demand more sophisticated instruction sets and might not build on established concepts to the same degree. Some further technical details are not emphasized here: the new registers must be taken into account, so the number of possible combinations for addressing and declaring operands correctly rises, but the level of complexity does not rise significantly for AMD. Outlining the new instructions for every new register would be tedious and cumbersome work, and it would be valid only for AMD's ISA.

Memory Controllers and HyperTransport

Both the Opteron and the Athlon 64 contain 8 extra registers usable only in 64-bit mode, which should increase application performance significantly.

One of the largest problems in modern computer design is the presence of bottlenecks, or areas of low performance which slow an otherwise fast system down.

In most modern computers, data intended for the video and main memory needs to be passed to and through the Northbridge chip on the motherboard, and data from other sources like USB connections, PCI slots or hard-drives must pass through the Southbridge chip, then the Northbridge.

With the amount of information that needs to be squeezed through the various data buses into the processor to be operated on, bottlenecks inevitably develop, where the processor is waiting for the necessary bits to be delivered by the I/O subsystem feeding it.

As processors get consistently faster every few months, while data bus breakthroughs are irregular, the issue perpetuates itself.

AMD has attempted to get around this constant problem by equipping its 64-bit processors with two advantages: internal DDR memory controllers and HyperTransport links. AMD has built the memory controller (normally a part of the motherboard to which the processor is attached) directly into its Opteron and Athlon 64 CPUs.

As you can imagine, this considerably reduces the time it takes the processor to access memory. Data still needs to travel between the processor and the physical memory, but communication with the controller that arranges the data flow no longer has to leave the processor, which reduces the number of computing cycles lost while waiting for the memory.

Another benefit is that memory traffic no longer needs to run between the processor and the Northbridge chip on the motherboard, which traditionally provides the memory controller, reducing bottlenecks. The second part of the package is support for HyperTransport input/output technology.

HyperTransport™ technology

HyperTransport™ technology is a high-speed, low latency, point-to-point link designed to increase the communication speed between integrated circuits in computers, servers, embedded systems, and networking and telecommunications equipment up to 48 times faster than some existing technologies. HyperTransport™ technology helps reduce the number of buses in a system, which can reduce system bottlenecks and enable today's faster microprocessors to use system memory more efficiently in high-end multiprocessor systems.

HyperTransport™ technology is designed to:

Provide significantly more bandwidth than current technologies

Use low-latency responses and low pin counts

Maintain compatibility with legacy PC buses while being extensible to new SNA (Systems Network Architecture) buses.

Appear transparent to operating systems and offer little impact on peripheral drivers.

Conclusion

With this article and the previous one, which cover the 64-bit architectures by Intel and AMD, we have finished our discussion of the processors for the beginning of the millennium. It is also worth mentioning that there are already computers running 64-bit versions of Windows and Linux. Now, more than performance, our biggest concern is compatibility with our present programs: we have to verify how compatible these 64-bit architectures really are with our 32- or 16-bit programs. We hope to have the answer to this question in less than a year. To close this part on 64-bit CPUs, it is very good to see how the two companies compete in the market for high-performance processors, since this gives us access to ever cheaper and better computers.

To conclude, we would like to point out how much room there still is for the evolution of electronics and, consequently, of computers. More important than the creation of supercomputers, this new age will see computers become pervasive. It will be the time of invisible computers, present in nearly all modern devices. At the moment they inhabit our TV sets, microwave ovens, cars, watches, stereos, DVD players and so on. In the near future they will invade the refrigerator, the toaster, the air conditioner and all everyday appliances. We have gone beyond the age of cheap electronics and are entering the age of cheap intelligence.


References for this part are basically placed in the appropriate positions - this list gives an overview:

- ,,sid10_gci498697,00.html
- Hammer Review A1- Electronics:
- Article X86-64 Hardware site:
- AMD Developer's Manual X86-64:
- Article IA-64 Hardware site:
- Presentation IA-64:
- Software Developer's Manual Itanium:
- Hardware Developer's Manual Itanium:
- AMD Opteron video:
- Article 64-bit computing: c't 12/99 page 28
- Basic notations, definitions and concepts are taken from "Computer Organization and Design", Hennessy and Patterson
