1. INTRODUCTION
1.1 Emergence of Crusoe processors

Mobile computing has been the buzzword for quite a long time. Mobile computing devices like laptops, webslates and notebook PCs are becoming common nowadays. The heart of every PC, whether a desktop or a mobile PC, is the microprocessor. Several microprocessors are available in the market for desktop PCs from companies like Intel, AMD, Cyrix etc. The mobile computing market, however, has never had a microprocessor specifically designed for it: the microprocessors used in mobile PCs are optimized versions of desktop PC microprocessors.

Mobile computing makes very different demands on processors than desktop computing, yet up until now, mobile x86 platforms have simply made do with the same old processors originally designed for desktops. Those processors consume lots of power, and they get very hot. When you're on the go, a power-hungry processor means you have to pay a price: run out of power before you've finished, run more slowly and lose application performance, or run through the airport with pounds of extra batteries. A hot processor also needs fans to cool it, making the resulting mobile computer bigger, chunkier and noisier.

A newly designed microprocessor with low power consumption will still be rejected by the market if its performance is poor, so any attempt in this regard must strike a proper 'performance-power' balance to ensure commercial success. It must also be fully x86 compatible, that is, it should run x86 applications just like conventional x86 microprocessors, since most of the presently available software has been designed to work on the x86 platform.

Crusoe is the new microprocessor which has been designed specially for the mobile computing market, with the above-mentioned constraints in mind. It was developed by a small Silicon Valley startup company called Transmeta Corp. after five years of secret toil at an expenditure of $100 million.
The concept of Crusoe is well understood from the simple sketch of the processor architecture, called 'amoeba'. In this concept, the x86 architecture is an ill-defined amoeba containing features like segmentation, ASCII arithmetic, variable-length instructions etc. The amoeba explained how a traditional microprocessor was, in their design, to be divided up into hardware and software. Thus Crusoe was conceptualized as a hybrid microprocessor, that is, it has a software part and a hardware part, with the software layer surrounding the hardware unit. The role of the software is to act as an emulator that translates x86 binaries into native code at run time. Crusoe is a 128-bit microprocessor fabricated using the CMOS process. The chip's design is based on a technique called VLIW to ensure design simplicity and high performance. Besides this, it also uses Transmeta's two patented technologies, namely Code Morphing Software and LongRun Power Management. It is a highly integrated processor available in different versions for different market segments.

Crusoe is a family of microprocessors from Transmeta. They use a VLIW hardware "core", upon which runs a software abstraction layer, or virtual machine, known as the Code Morphing Software (CMS). The CMS translates machine code instructions received from programs running on the chip into native instructions for the core. In this way, the chips can emulate the instruction set of other computer architectures. Currently, this is used to allow the chips to emulate the Intel x86 instruction set. In theory, it is possible for the CMS to be modified to handle other instruction streams (i.e. to emulate other microprocessors).

The addition of an abstraction layer between the x86 instruction stream and the hardware means that the hardware architecture can change without breaking compatibility, just by modifying the CMS. For example, Efficeon, the second-generation design, has a 256-bit-wide VLIW core versus 128-bit in the first generation. Crusoe performs in software some of the functionality traditionally implemented in hardware (e.g. instruction re-ordering), resulting in simpler hardware with fewer transistors. The relative simplicity of the hardware means that Crusoe consumes less power (and therefore generates less heat) than other x86-compatible microprocessors running at the same frequency. The name is taken from the novel Robinson Crusoe.


1.2 Crusoe processor family

- TM5400: 500-700MHz, 256K L2 cache
- TM5600: 500-700MHz, 512K L2 cache
- TM5500: 0.13μ, 667-800MHz, 256K L2 cache
- TM5800: 0.13μ, 667-800MHz, 512K L2 cache
- SE TM55E/TM58E: 0.13μ, 800MHz, 256K L2 cache; embedded version of the Crusoe processor

1.3 Technology Perspective

The Transmeta designers have decoupled the x86 instruction set architecture (ISA) from the underlying processor hardware, which allows this hardware to be very different from a conventional x86 implementation. For the same reason, the underlying hardware can be changed radically without affecting legacy x86 software: each new CPU design only requires a new version of the Code Morphing software to translate x86 instructions to the new CPU's native instruction set.

For the initial Transmeta products, models TM3120 and TM5400, the hardware designers opted for minimal space and power. By eliminating roughly three quarters of the logic transistors that would be required for an all-hardware design of similar performance, the designers have likewise reduced power requirements and die size. However, future hardware designs can emphasize different factors and accordingly use different implementation techniques. Finally, the Code Morphing software, which resides in standard Flash ROMs, itself offers opportunities to improve performance without altering the underlying hardware.

2 ARCHITECTURE

The Crusoe processor incorporates integer and floating point execution units, separate instruction and data caches, a level-2 write-back cache, a memory management unit, and multimedia instructions. In addition to these traditional processor features, the device integrates a DDR SDRAM memory controller, SDR SDRAM memory controller, PCI bus controller and serial ROM interface controller. These additional units are usually part of the core system logic that surrounds the microprocessor. The VLIW processor, in combination with Code Morphing software and the additional system core logic units, allows the Crusoe processor to provide a highly integrated, ultra-low power, high performance platform solution for the x86 mobile market. The Crusoe processor block diagram is shown in Figure 1.

(Fig. 1 Crusoe processor architecture)

2.1 Crusoe Processor Model TM5800 (Various units)

- CPU Core
- Integer unit
- Floating point unit
- MMU
- L1 Instruction Cache: 64K, 8-way set associative
- L1 Data Cache: 64K, 16-way set associative
- Unified TLB: 256 entries, 4-way set associative
- DDR SDRAM Controller
- SDR SDRAM Controller
- PCI Controller & Southbridge Interface
- DMA
- Multimedia Instructions
- Serial ROM Interface
- Bus Interface

The Crusoe processor core architecture is relatively simple by conventional standards. It is based on a Very Long Instruction Word (VLIW) 128-bit instruction set. Within this VLIW architecture, the control logic of the processor is kept very simple and software is used to control the scheduling of instructions. This allows a simplified and very straightforward hardware implementation with an in-order 7-stage integer pipeline and a 10-stage floating point pipeline. By streamlining the processor hardware and reducing the control logic transistor count, the performance-to-power consumption ratio can be greatly improved over traditional x86 architectures.

The Crusoe processor includes a 64K-byte 8-way set-associative Level 1 (L1) instruction cache, and a 64K-byte 16-way set-associative L1 data cache. The TM5800 model also includes an integrated 512K-byte Level 2 (L2) write-back cache for improved effective memory bandwidth and enhanced performance. This cache architecture assures maximum internal memory bandwidth for performance intensive mobile applications, while maintaining the same low-power implementation that provides a superior performance-to-power consumption ratio relative to previous x86 implementations.

Other than having execution hardware for logical, arithmetic, shift, and floating point instructions, as in conventional processors, the Crusoe processor has very distinctive features from traditional x86 designs.

To ease the translation process from x86 to the core VLIW instruction set, the hardware generates the same condition codes as conventional x86 processors and operates on the same 80-bit floating point numbers. Also, the Translation Look-aside Buffer (TLB) has the same protection bits and address mapping as x86 processors. The software component of this solution is used to emulate all other features of the x86 architecture. The software that converts x86 programs into the core VLIW instructions is called Code Morphing software. The combination of Code Morphing software and the VLIW core together act as an x86-compatible solution.

2.2 Basic principles of VLIW Architecture

VLIW stands for Very Long Instruction Word. VLIW is a method that combines multiple standard instructions into one long instruction word. This word contains instructions that can be executed at the same time on separate chips or different parts of the same chip. This provides explicit parallelism. By using VLIW you enable the compiler, not the chip, to determine which instructions can be run concurrently. This is an advantage because the compiler knows more information about the program than the chip does by the time the code gets to the chip.

VLIW has been described as a natural successor to RISC, because it moves complexity from the hardware to the compiler, allowing simpler, faster processors. The objective of VLIW is to eliminate the complicated instruction scheduling and parallel dispatch that occurs in most modern microprocessors. In theory, a VLIW processor should be faster and less expensive than a comparable RISC chip.

Trace scheduling is an important technique in VLIW processing. Trace scheduling is when the compiler processes the code and determines which path is most frequently travelled. The compiler then optimizes this path. The basic blocks that compose the path are separated from the other basic blocks. The path is then optimized and rejoined with the other basic blocks. The rejoining includes special split and rejoin blocks that help align the converted code with the original code.

Dynamic scheduling is another important method when compiling VLIW code. The process, called split-issue, splits the code into two phases, phase one and phase two. This allows for multiple instructions to execute at the same time. Thus, instructions that have certain delays associated with them can be run concurrently, and out-of-order execution is possible. Hardware support is needed to implement this technique and requires delay buffers and temporary variable space in the hardware. The temporary variable space is needed to store results when they come in. The results computed in phase two are stored in temporary variables and are loaded into the appropriate phase one register when they are needed.

VLIW architectures are a suitable alternative for exploiting instruction-level parallelism (ILP) in programs, that is, for executing more than one basic (primitive) instruction at a time. These processors contain multiple functional units, fetch from the instruction cache a Very Long Instruction Word containing several primitive instructions, and dispatch the entire VLIW for parallel execution. These capabilities are exploited by compilers which generate code that has grouped together independent primitive instructions executable in parallel. The processors have relatively simple control logic because they do not perform any dynamic scheduling or reordering of operations (as is the case in most contemporary superscalar processors).

The instruction set for a VLIW architecture tends to consist of simple (RISC-like) instructions. The compiler must assemble many primitive operations into a single "instruction word" such that the multiple functional units are kept busy, which requires enough instruction-level parallelism (ILP) in a code sequence to fill the available operation slots. Such parallelism is uncovered by the compiler through scheduling code speculatively across basic blocks, performing software pipelining, and reducing the number of operations executed, among others.
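The idea of grouping independent primitive operations into one long word can be sketched in a few lines of Python. This is only an illustrative in-order packer, not trace scheduling and not Transmeta's scheduler: the `Op` record, the `pack_vliw` function and the register names are hypothetical, and the sketch assumes that every operation in a word reads its sources before any operation in that word writes its result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str      # mnemonic, for display only
    dest: str      # register written
    srcs: tuple    # registers read


def pack_vliw(ops, width=4):
    """Pack ops, in program order, into long words of at most `width` ops.

    Assumption: all ops in one word read their sources before any of them
    writes its result, so an op may not read or overwrite a register that
    an earlier op in the same word writes.
    """
    words, current, written = [], [], set()
    for op in ops:
        conflict = op.dest in written or any(s in written for s in op.srcs)
        if current and (len(current) == width or conflict):
            words.append(current)          # close this word, start a new one
            current, written = [], set()
        current.append(op)
        written.add(op.dest)
    if current:
        words.append(current)
    return words


program = [
    Op("add", "a", ("x", "y")),
    Op("sub", "b", ("x", "y")),
    Op("mul", "c", ("a", "b")),   # needs a and b, so it starts a new word
    Op("inc", "d", ("x",)),
]
for word in pack_vliw(program):
    print(" ; ".join(op.name for op in word))
```

A real VLIW compiler would also reorder operations (for example, moving `inc` into the first word) to fill more slots; the sketch keeps program order to stay short.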

2.3 VLIW in Crusoe Microprocessor

With the Code Morphing software handling x86 compatibility, Transmeta hardware designers created a very simple, high-performance VLIW engine with two integer units, a floating point unit, a memory (load/store) unit, and a branch unit. A Crusoe processor long instruction word, called a molecule, can be 64 bits or 128 bits long and contain up to four RISC-like instructions, called atoms. All atoms within a molecule are executed in parallel, and the molecule format directly determines how atoms get routed to functional units; this greatly simplifies the decode and dispatch hardware. Figure 2 shows a sample 128-bit molecule and the straightforward mapping from atom slots to functional units. Molecules are executed in order, so there is no complex out-of-order hardware. To keep the processor running at full speed, molecules are packed as fully as possible with atoms. In a later section, we describe how the Code Morphing software accomplishes this.

Fig. 2 A molecule can contain up to four atoms, which are executed in parallel.

The integer register file has 64 registers, %r0 through %r63. By convention, the Code Morphing software allocates some of these to hold x86 state while others contain state internal to the system, or can be used as temporary registers, e.g., for register renaming in software. In the assembly code examples in this paper, we write one molecule per line, with atoms separated by semicolons. The destination register of an atom is specified first; a ".c" opcode suffix designates an operation that sets the condition codes. Where a register holds x86 state, we use the x86 name for that register (e.g., %eax instead of the less descriptive %r0).

Superscalar out-of-order x86 processors, such as the Pentium II and Pentium III processors, also have multiple functional units that can execute RISC-like operations (micro-ops) in parallel. Figure 3 depicts the hardware these designs use to translate x86 instructions into micro-ops and schedule (dispatch) the micro-ops to make best use of the functional units. Since the dispatch unit reorders the micro-ops as required to keep the functional units busy, a separate piece of hardware, the in-order retire unit, is needed to effectively reconstruct the order of the original x86 instructions, and ensure that they take effect in proper order. Clearly, this type of processor hardware is much more complex than the Crusoe processor's simple VLIW engine.

Fig. 3 Conventional superscalar out-of-order CPUs use hardware to create and dispatch micro-ops that can execute in parallel.

Because the x86 instruction set is quite complex, the decoding and dispatching hardware requires large quantities of power-hungry logic transistors; the chip dissipates heat in rough proportion to their numbers.
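The constraint that a molecule holds at most four atoms, each routed to one of the functional units listed above, can be illustrated with a small sketch. The unit names (`alu`, `fpu`, `mem`, `br`), the `pack_molecules` function and the sample mnemonics are hypothetical; the real molecule encodings are not described in this report. The atoms are assumed to be already independent of one another, so only slot availability is checked.

```python
from collections import Counter

# Functional units named in the text: two integer units, one floating
# point unit, one load/store unit, one branch unit (labels are made up).
UNIT_SLOTS = {"alu": 2, "fpu": 1, "mem": 1, "br": 1}
MAX_ATOMS = 4


def pack_molecules(atoms):
    """Group (mnemonic, unit) pairs, in order, into molecules.

    Assumption: the atoms are mutually independent, so a molecule is
    closed only when it is full or the needed unit is already taken.
    """
    molecules, current, used = [], [], Counter()
    for mnemonic, unit in atoms:
        if len(current) == MAX_ATOMS or used[unit] == UNIT_SLOTS[unit]:
            molecules.append(current)
            current, used = [], Counter()
        current.append(mnemonic)
        used[unit] += 1
    if current:
        molecules.append(current)
    return molecules


atoms = [("add", "alu"), ("ld", "mem"), ("sub", "alu"), ("or", "alu"), ("br", "br")]
for molecule in pack_molecules(atoms):
    # One molecule per line, atoms separated by semicolons, as in the text.
    print("; ".join(molecule))
```

The third integer atom (`or`) cannot share a molecule with `add` and `sub` because only two integer units exist, so it opens a second molecule together with the branch.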

3 HIERARCHY MODEL

3.1 Crusoe Processor Software Hierarchy

- VLIW Processor
- Code Morphing Software
- x86 Operating System (Windows ME, Windows 2000, Linux, etc.)
- x86 BIOS
- x86 Applications

Fig. 4 Hierarchy of Crusoe processor software

4 INSTRUCTION SET

4.1 Registers

The processor has 64 GPRs, with the following specialized semantics:

- %r63 (%zero) always reads 0 when used as a source operand
- %r62 (%sink) is a discarded destination (e.g., for compares); it is never read
- %r59 (%from) saved return address
- %r58 (%link) return address
- %r47 (%sp) is the current stack pointer
- %r0 (%eax) for current x86 machine state
- %r1 (%ecx) for current x86 machine state
- %r2 (%edx) for current x86 machine state
- %r3 (%ebx) for current x86 machine state

The lower 48 of these GPRs are backed by shadowed GPRs: whenever a bundle has its commit bit set, the Commit stage latches the current values of the GPRs into the 'known good' shadow GPRs. The processor also includes 32 80-bit floating point registers and 16 FP shadow registers. There are also a wide variety of special purpose registers (SPRs), including the condition codes, profiling registers, power control settings and so on.
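The working/shadow register pairing can be pictured with the following minimal sketch. The class and method names are hypothetical, only the 48 shadowed GPRs are modelled, and the `rollback` operation is an assumption about how the 'known good' copy would be used (restoring it after a failed speculation); the text above only describes the commit step.

```python
class ShadowedRegisters:
    """Toy model of working GPRs backed by 'known good' shadow copies."""

    def __init__(self, count=48):
        self.working = [0] * count   # registers the executing code writes
        self.shadow = [0] * count    # last committed values

    def write(self, index, value):
        self.working[index] = value

    def commit(self):
        # Latch the current working values as the new known-good state.
        self.shadow = list(self.working)

    def rollback(self):
        # Assumed use of the shadow copy: discard uncommitted changes.
        self.working = list(self.shadow)


regs = ShadowedRegisters()
regs.write(0, 5)     # %r0 (%eax) in the naming used above
regs.commit()
regs.write(0, 9)     # speculative update that is later abandoned
regs.rollback()
print(regs.working[0])
```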

4.2 Execution, Decoding and Scheduling

Current superscalar x86 processors, such as the Intel Pentium III, also have multiple functional units that can execute RISC-like operations in parallel; however, due to the out-of-order execution, a separate piece of hardware is needed to reconstruct the sequence of the original x86 instructions and make sure that they execute in order. As a result, this type of processor is much more complex than the Crusoe processor's VLIW engine.

As mentioned above, due to a conventional x86 processor's out-of-order execution, it requires additional hardware to make sure that x86 instructions are executed in the correct order. Also, since a conventional out-of-order processor has to re-translate and schedule an instruction each time it executes, it must do so very quickly. In contrast, the Code Morphing software translates instructions once and stores the result in a 'translation cache'. As a result, the next time the same code is executed, the processor can immediately run the existing translation from the translation cache.

This software translation opens up new possibilities. Code Morphing can translate an entire group of x86 instructions at once, whereas a conventional x86 processor translates each instruction separately. Also, the translation process can be optimized by looking at the generated code and minimizing the number of instructions executed. In other words, Code Morphing can speed up execution and reduce power consumption at the same time.

With Code Morphing, all code is first recompiled quickly, but not necessarily efficiently, to keep the program and the processor running. Then, when a section of code is run again, it moves up in the hierarchy and is scheduled for optimization. It will probably require multiple passes because the code optimization is done in steps. As a result, performance might not be at its peak after the first iteration of the optimization process; in other words, that section might take several passes to get fully optimized. However, sections that only occur once usually don't get optimized.

In essence this is no different from programming in a higher-level language, for example C++. Programming in C++ gets our program up and running sooner than writing it all in assembly, but the extra performance gained by using assembly where needed can really speed up processing. After we've compiled and executed our program, one of the things we're keen on determining is where the performance bottlenecks are. We then use assembly to speed up just those sections, and thus optimize the program's overall execution.
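The translate-once, optimize-when-hot behaviour described in this section can be sketched as follows. This is a toy model, not the actual Code Morphing implementation: the `TranslationCache` class, the `hot_threshold` value and the string stand-ins for native code are all hypothetical, and the sketch performs a single re-optimization step where the text describes several passes.

```python
class TranslationCache:
    """Toy translation cache with one re-optimization step for hot code."""

    def __init__(self, translate, optimize, hot_threshold=3):
        self.translate = translate          # quick, unoptimized translator
        self.optimize = optimize            # slower optimizing pass
        self.hot_threshold = hot_threshold  # hypothetical tuning value
        self.cache = {}                     # guest address -> native code
        self.runs = {}                      # guest address -> execution count
        self.optimized = set()

    def fetch(self, address):
        if address not in self.cache:
            # First encounter: translate quickly to keep the program running.
            self.cache[address] = self.translate(address)
        self.runs[address] = self.runs.get(address, 0) + 1
        if self.runs[address] >= self.hot_threshold and address not in self.optimized:
            # Frequently executed code is worth the extra optimization effort.
            self.cache[address] = self.optimize(self.cache[address])
            self.optimized.add(address)
        return self.cache[address]


tc = TranslationCache(
    translate=lambda addr: f"quick@{addr:#x}",
    optimize=lambda code: code.replace("quick", "optimized"),
)
for _ in range(4):
    print(tc.fetch(0x100))
```

Code that runs only once or twice stays in its quick form, which matches the observation above that sections occurring once usually do not get optimized.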

5 Characteristics and Performance

5.1 Characteristics of Crusoe processor

- 4 Instruction Issue.
- 128-Bit VLIW Engine.
- Up to four instructions issued per clock cycle.
- Advanced Code Morphing Software (CMS): unique software-based architecture is key to reducing power consumption and enabling future scalability and flexibility.
- LongRun Dynamic Power Management: enables low power operation by dynamically adjusting operating frequency and voltage to match the performance requirements of application workloads.
- Provides higher performance within smaller, thermally constrained environments.
- Enables fanless designs for quieter and more reliable systems.
- Fully x86 compatible.
- MMX multimedia extensions.
- 512 KB L2 cache.
- Integrated Northbridge Core Logic.
- On-chip SDR and DDR-266 memory interfaces.
- On-chip 32-bit, 33 MHz PCI bus controller.
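Why adjusting frequency and voltage together saves so much power can be estimated with the standard first-order model of CMOS switching power, in which power is proportional to capacitance times voltage squared times frequency. The sketch below applies that general model; it is not a description of LongRun's actual algorithm, and the two operating points are illustrative assumptions rather than figures from a datasheet.

```python
def relative_dynamic_power(f_old_mhz, v_old, f_new_mhz, v_new):
    """Ratio new/old of switching power under the P ~ C * V^2 * f model."""
    return (v_new / v_old) ** 2 * (f_new_mhz / f_old_mhz)


# Hypothetical operating points: 1000 MHz at 1.3 V versus 300 MHz at 0.8 V.
both = relative_dynamic_power(1000, 1.3, 300, 0.8)
# Same frequency reduction with the voltage left unchanged.
freq_only = relative_dynamic_power(1000, 1.3, 300, 1.3)

print(f"frequency and voltage scaled: {both:.2f} of the original power")
print(f"frequency scaled only:        {freq_only:.2f} of the original power")
```

Under these assumptions, lowering the voltage along with the frequency cuts switching power to roughly a ninth of the original, whereas lowering the frequency alone only cuts it to about three tenths.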

5.2 x86 CPU vs Crusoe

Fig. 5 x86 CPU vs Crusoe processor

From Fig. 5, in the Crusoe processor we can see that:

- Purple portions are implemented in hardware.
- Orange portions are implemented in software: x86 to native VLIW translation, branch prediction, and out-of-order execution (OOO) logic.
- The hardware is much smaller than traditional microprocessors.

5.3 Crusoe Processor performance factors

- Fabricated in 0.13μ process technology.
- High Performance 4 Issue 128-bit VLIW Engine with Code Morphing Software to provide x86 compatibility.
- Standard product speeds of 733, 800, 867, 933, and 1000 MHz.
- L1 Instruction Cache: 64KB.
- L1 Data Cache: 64KB.
- L2 Write Back Cache: 512KB.
- DDR Memory Support: DDR-SDRAM 100-133MHz.
- SDR Memory Support: SDR-SDRAM 66-133MHz.
- PCI bus controller (PCI 2.1 compliant) with 33 MHz, 3.3V interface.
- Power: 0.5-1.5 W @ 300-1000 MHz, 0.8-1.3V running typical multimedia applications; 150 mW typical in deep sleep.
- Processor Package: Compact 474-pin Ceramic BGA

6 Code Morphing Software

6.1 What is CMS?

- Dynamic binary translation of the x86 ISA into the Crusoe internal native VLIW format, using software at run time
- Translations are cached to avoid translator overhead on repeated execution
- Optimizes across x86 instruction boundaries to improve performance
- Selects a region, produces native code and stores the translation in the translation cache
- Completely invisible to the operating system; looks like an x86 hardware processor
- Runtime system: handles devices, interrupts and exceptions, power management, and garbage collection

Fig. 6 Code Morphing software
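One kind of optimization across instruction boundaries that a region-based translator can perform is dropping results that are overwritten before anyone reads them, for instance condition codes that several consecutive x86 instructions set but only the last one needs. The sketch below shows this with a backward liveness pass over a toy intermediate form. The `(dest, srcs)` tuples, the function name and the register choices are hypothetical, and the sketch assumes each operation has no effect other than writing its destination.

```python
def remove_dead_writes(region, live_out):
    """Drop ops whose result is overwritten or unused before being read.

    region:   list of (dest, srcs) tuples in program order
    live_out: registers whose values are needed after the region
    Assumption: an op has no side effect besides writing `dest`.
    """
    live = set(live_out)
    kept = []
    for dest, srcs in reversed(region):
        if dest in live:
            kept.append((dest, srcs))
            live.discard(dest)     # this op defines dest ...
            live.update(srcs)      # ... and needs its sources
        # otherwise the value is never read afterwards, so the op is skipped
    kept.reverse()
    return kept


region = [
    ("flags", ("eax", "ebx")),   # flags from a first arithmetic instruction
    ("eax", ("eax", "ecx")),     # eax = eax + ecx
    ("flags", ("eax",)),         # flags recomputed; the first value is dead
]
for dest, srcs in remove_dead_writes(region, live_out={"eax", "flags"}):
    print(dest, "<-", ", ".join(srcs))
```

A hardware decoder that handles one instruction at a time cannot easily see that the first flag computation is unnecessary; a translator looking at the whole region can.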

6.2 Code morphing software benefits

- Molecules explicitly encode the instruction-level parallelism, hence they can be executed by a simple VLIW engine.
- Hardware does not need to perform complex instruction reordering.
- Simplicity means a fast and low-power design.
- Software reordering can be done much better than in hardware, by looking at a bigger window of instructions and applying more complicated algorithms.
- Optimization is applied to remove unnecessary instructions.
- Timing of critical paths is improved.
- The software layer increases performance.
- The software layer means that software developers don't have to recompile programs.
- Processor upgrades are simplified: a new hardware architecture only needs new Code Morphing software.
- Code Morphing software can be upgraded independently in flash ROM.
- The software layer helps the debugging process: there are different ways to perform the same function, so the software can be changed during debugging.

7 Drawbacks

1. Code translation requires clock cycles which could otherwise be used in performing application computation.
2. Code optimization does not start until a block of code has been executed more than a few times.
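The first drawback can be made concrete with a simple break-even estimate. The symbols here are assumptions introduced for illustration: $T$ is the one-time cost in cycles of translating a block, $c_s$ is the cost of each execution of the block by a slower baseline path (for example, unoptimized code), and $c_t$ is the cost of each execution of the cached translation.

```latex
% Total cost after n executions of the block
C_{\text{translated}}(n) = T + n\,c_t, \qquad C_{\text{baseline}}(n) = n\,c_s

% Translation pays off once C_translated(n) < C_baseline(n), i.e.
n > \frac{T}{c_s - c_t} \qquad (c_s > c_t)

% Hypothetical numbers: T = 10000, c_s = 50, c_t = 10
n > \frac{10000}{50 - 10} = 250
```

With these made-up numbers the block has to run more than 250 times before the translation effort is recovered, which is why code that executes only a few times is not worth optimizing.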

8 Target Applications

- Suitable for portable and embedded systems.
- Used in Notebooks and Tablet PCs.
- UPC (ultra personal computers).
- Thin client for server based computing.
- Capable of running Internet applications.
- Streaming video.
- Web browsers.
- Email applications.
- Runs a mobile Linux kernel.
- No need for active cooling and external CPU fans.
- Offers advantages over standard hardware-only processors for making ultra light and thin Notebooks with less power consumption.
- Provides embedded devices with a performance per watt ratio that is unmatched by any other x86-based processor in its class.
- Only microprocessor that is able to provide software upgrades to the processor that offer additional performance and power savings.
- Low power and high density Crusoe based servers have a high profitability matrix value (Performance/Per Watt/Per Cubic Foot).

9. Conclusion

1. Transmeta has built an x86 processor based on VLIW (very long instruction word) technology.
2. Crusoe offers the power of a high performance Intel processor while consuming a fraction of the power.
3. Code Morphing offers a new approach to the implementation of an instruction set architecture.
