
1. INTRODUCTION:

Very Long Instruction Word (VLIW) architectures are distinct from
traditional architectures implemented in current mass-market microprocessors. VLIW
microprocessors and superscalar implementations of traditional instruction sets share
some characteristics—multiple execution units and the ability to execute multiple
operations simultaneously. The techniques used to achieve high performance,
however, are very different because the parallelism is explicit in VLIW instructions
but must be discovered by hardware at run time by superscalar processors.
VLIW implementations are simpler for very high performance. Just as
RISC architectures permit simpler, cheaper high-performance implementations than
do CISCs, VLIW architectures are simpler and cheaper than RISCs because of further
hardware simplifications. VLIW architectures, however, require more compiler
support.
Recent high performance processors have depended on Instruction Level
Parallelism (ILP) to achieve high execution speed. ILP processors achieve their high
performance by causing multiple operations to execute in parallel, using a
combination of compiler and hardware techniques. Very Long Instruction Word
(VLIW) is one particular style of processor design that tries to achieve high levels of
instruction level parallelism by executing long instruction words composed of
multiple operations. The long instruction word, called a MultiOp, consists of multiple
arithmetic, logic and control operations, each of which would probably be an
individual operation on a simple RISC processor. The VLIW processor concurrently
executes the set of operations within a MultiOp thereby achieving instruction level
parallelism.

2. EARLIER TECHNOLOGIES:

2.1 Scalar processors:


Scalar means one-dimensional, i.e., one instruction starts execution at a time. Scalar
execution is commonly implemented with pipelining.
Pipelining is a natural concept in everyday life, e.g. on an assembly line.
Consider the assembly of a car: assume that certain steps in the assembly line are to
install the engine, install the hood, and install the wheels (in that order, with arbitrary
interstitial steps). A car on the assembly line can have only one of the three steps done at
once. After the car has its engine installed, it moves on to having its hood installed,
leaving the engine installation facilities available for the next car. The first car then
moves on to wheel installation, the second car to hood installation, and a third car begins
to have its engine installed. If engine installation takes 20 minutes, hood installation takes
5 minutes, and wheel installation takes 10 minutes, then finishing all three cars when
only one car can be assembled at once would take 105 minutes. On the other hand, using
the assembly line, the total time to complete all three is 75 minutes. At this point,
additional cars will come off the assembly line at 20 minute increments.
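The timings in the assembly-line example can be checked with a small sketch (a plain simulation of the three stations described above; the code itself is illustrative, not from the source):

```python
# Sketch: compare sequential vs. pipelined assembly times for the car example.
# Stage durations (minutes): engine, hood, wheels, as given in the text.
STAGES = [20, 5, 10]

def sequential_time(n_cars):
    # One car at a time: every car pays the full sum of all stages.
    return n_cars * sum(STAGES)

def pipelined_time(n_cars):
    # Simple assembly-line simulation: each stage is a station that can
    # hold one car; a car advances only when the next station is free.
    finish = [0] * len(STAGES)          # time at which each station frees up
    done = 0
    for car in range(n_cars):
        t = finish[0]                   # car may enter when station 0 is free
        for s, dur in enumerate(STAGES):
            t = max(t, finish[s]) + dur # wait for the station, then work
            finish[s] = t
        done = t
    return done

print(sequential_time(3))  # 105 minutes, matching the text
print(pipelined_time(3))   # 75 minutes, matching the text
```

After the pipeline fills, each additional car finishes 20 minutes (the slowest stage) after the previous one, which is exactly the "20 minute increments" noted above.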

Sequential Execution Semantics:

Instructions appear as if they executed in the program-specified order,
one after the other.

Alternatively, at any given point in time we should be able to identify an
instruction such that:
1. All preceding instructions have executed.
2. None of the following instructions has executed.

[Figure: sequential execution of four instructions I1-I4]

From the above figure, it is clear that the 4 instructions need 12 clock cycles in total to
complete execution.

Scalar Execution:

[Figure: pipelined (scalar) execution of four instructions I1-I4]

But in pipelining, the 4 instructions need only 7 clock cycles, which is considerably fewer
than in sequential execution.
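The pipelined cycle count follows a simple formula: with a k-stage pipeline issuing one instruction per cycle, n instructions take k + (n - 1) cycles. A four-stage pipeline is assumed in the sketch below, which is consistent with the 7-cycle result quoted above:

```python
# Sketch: cycle count for an ideal k-stage pipeline issuing one
# instruction per cycle (4 stages assumed, e.g. fetch/decode/execute/
# write-back, consistent with the 7-cycle figure in the text).
def pipelined_cycles(n_instructions, stages=4):
    # The first instruction takes `stages` cycles to drain the pipeline;
    # each later instruction completes one cycle after its predecessor.
    return stages + (n_instructions - 1)

print(pipelined_cycles(4))  # 7 cycles for 4 instructions
```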

2.2 Superscalar Operation:


Superscalar (super: beyond; scalar: one-dimensional) means the ability to
fetch, issue to execution units, and complete more than one instruction at a time. The
simplest processors are scalar processors. Each instruction executed by a scalar
processor typically manipulates one or two data items at a time. By contrast, each
instruction executed by a vector processor operates simultaneously on many data
items. An analogy is the difference between scalar and vector arithmetic. A
superscalar processor is a mixture of the two: each instruction processes one data
item, but there are multiple redundant functional units within each CPU, so multiple
instructions can process separate data items concurrently.
Superscalar CPU design emphasizes improving the instruction dispatcher's
accuracy and allowing it to keep the multiple functional units in use at all times. This has
become increasingly important as the number of units has increased. While early
superscalar CPUs would have two ALUs and a single FPU, a modern design such as
the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher
is ineffective at keeping all of these units fed with instructions, the performance of the
system will suffer.

Superscalar Execution:

[Figure: superscalar execution of four instructions, each proceeding through fetch, decode, execute and write-back]

Here, from the figure, we can see that 4 instructions have started execution simultaneously.

Disadvantages of Superscalar Execution:

 Hardware complexity
A superscalar processor needs a powerful instruction dispatcher, which must
fetch instructions from the instruction cache in such a way that parallelism can be
maintained. A reorder buffer is also required to support speculative execution.

 Constrained window size, which limits the capacity to detect independent
instructions.

3. VLIW ARCHITECTURE:

3.1 History:

For various reasons which were appropriate at that time, early computers
were designed to have extremely complicated instructions. These instructions made
designing the control circuits for such computers difficult. A solution to this problem was
microprogramming, a technique proposed by Maurice Wilkes in 1951. In a micro
programmed CPU, each program instruction is considered a macroinstruction to be
executed by a simpler processor inside the CPU. Corresponding to each macroinstruction,
there will be a sequence of microinstructions stored in a microcode ROM in the
CPU. Horizontal microprogramming is a particular style of microprogramming where bits
in a wide microinstruction are directly used as control signals within the processor. In

contrast, vertical microprogramming uses a shorter microinstruction or series of
microinstructions in combination with some decoding logic to generate control signals.
Microprogramming became a popular technique for implementing the control unit of
processors after IBM adopted it for the System/360 series of computers.
Joseph Fisher, a pioneer of VLIW, while working on PUMA, a CDC-6600 emulator, was
frustrated by the difficulty of writing and maintaining 64-bit horizontal microcode for that
processor. He started investigating a technique for global microcode compaction--a
method to generate long horizontal microcode instructions from short sequential ones.
Fisher soon realized that the technique he developed in 1979, called trace scheduling,
could be used in a compiler to generate code for VLIW like architectures from a
sequential source since the style of parallelism available in VLIW is very similar to that
of horizontal microcode. His discovery led to the design of the ELI-512 processor and
the Bulldog trace-scheduling compiler.

Two companies were founded in 1984 to build VLIW based mini supercomputers. One
was Multiflow, started by Fisher and colleagues from Yale University. The other was
Cydrome founded by VLIW pioneer Bob Rau and his colleagues. In 1987, Cydrome
delivered its first machine, the 256 bit Cydra 5, which included hardware support for
software pipelining. In the same year, Multiflow delivered the Trace/200 machine, which
was followed by the Trace/300 in 1988 and Trace/500 in 1990. The 200 and 300 series
used a 256 bit instruction for 7 wide issue, 512 bits for 14 wide issue and 1024 bits for 28
wide issue. The 500 series only supported 14 and 28 wide issue. Unfortunately, the early
VLIW machines were commercial failures. Cydrome closed in 1988 and Multiflow in
1990.

Since then, VLIW processors have seen a revival and some degree of commercial
success. Some of the notable VLIW processors of recent years are the IA-64 or Itanium
from Intel, the Crusoe processor from Transmeta, the Trimedia media-processor from
Philips and the TMS320C62x series of DSPs from Texas Instruments. Some important
research machines designed during this time include the Playdoh from HP labs, Tinker
from North Carolina State University, and the Imagine stream and image processor
currently developed at Stanford University.

3.2 Execution mechanism:


VLIW stands for VERY LONG INSTRUCTION WORD. A VLIW
implementation has capabilities very similar to those of a superscalar processor—issuing
and completing more than one operation at a time—with one important exception: the
VLIW hardware is not responsible for discovering opportunities to execute multiple
operations concurrently. For the VLIW implementation, the long instruction word

already encodes the concurrent operations. This explicit encoding leads to dramatically
reduced hardware complexity compared to a high-degree superscalar implementation of a
RISC or CISC.

Sample instruction format:

[Figure: sample VLIW instruction word with one slot per function unit, e.g. Load/Store, ALU and branch slots]

VLIW execution:
[Figure: VLIW execution of MultiOps, with each slot proceeding through fetch, decode, execute and write-back]

The above figure represents VLIW execution of degree 4.

3.3 Comparison with CISC, RISC:


From the larger perspective, RISC, CISC, and VLIW architectures have more
similarities than differences. The differences that exist, however, have profound effects on
the implementations of these architectures.
Obviously these architectures all use the traditional state-machine model of
computation: each instruction effects an incremental change in the state (memory,
registers) of the computer, and the hardware fetches and executes instructions
sequentially until a branch instruction causes the flow of control to change.
The differences between RISC, CISC, and VLIW are in the formats and
semantics of the instructions.
CISC instructions vary in size, often specify a sequence of operations, and can
require serial (slow) decoding algorithms. CISCs tend to have few registers, and the
registers may be special-purpose, which restricts the
ways in which they can be used. Memory references are typically combined with other
operations (such as add memory to register). CISC instruction sets are designed to take
advantage of microcode.
RISC instructions specify simple operations, are fixed in size, and are easy
(quick) to decode. RISC architectures have a relatively large number of general-purpose
registers. Instructions can reference main memory only through simple load-register-
from-memory and store-register-to-memory operations. RISC instruction sets do not need
microcode and are designed to simplify pipelining.
VLIW instructions are like RISC instructions except that they are longer, to
allow them to specify multiple, independent simple operations. A VLIW instruction can
be thought of as several RISC instructions joined together. VLIW architectures tend to be
RISC-like in most attributes.

The figure above clarifies the differences among the three.


Example illustrating the difference:
Consider the C code:

function(long j)
{
    long i;
    j = j + i;
}

The assembly code is:


add 4[r1] <- r2

The CISC code consists of one instruction because the CISC architecture has an add
instruction that can encode a memory address for the destination. So, the CISC instruction
adds the local variable in register r2 to the memory-based parameter. The encoding of this
CISC instruction might take four bytes on some hypothetical machine.

The RISC code is artificially inefficient. Normally, a good compiler would pass the
parameter in a register, which would make the RISC code consist of only a single
register-to-register add instruction. For the sake of illustration, however,
the code will consist of three instructions as shown. These three instructions
load the parameter to a register, add it to the local variable already in a register,
and then store the result back to memory. Each RISC instruction requires four
bytes.

The VLIW code is similarly hampered by poor register allocation. The
example VLIW architecture shown has the ability to simultaneously issue three
operations. The first slot (group of four bytes) is for branch instructions, the
middle slot is for ALU instructions, and the last slot is for the load/store unit.
Since the three RISC operations needed to implement the code fragment are
dependent, it is not possible to pack the load and add in the same VLIW
instruction. Thus, three separate VLIW instructions are necessary.

A CISC machine is able to execute the code fragment in three cycles. A RISC
machine would be able to execute the fragment in three cycles as well, but the cycles
would likely be faster than on the CISC. The VLIW machine, assuming three fully-
packed instructions, would effectively execute the code for this fragment in one cycle. To
see this, observe that the fragment requires three out of nine slots, for one-third use of
resources. One-third of three cycles is one cycle.
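The "effective one cycle" accounting above can be checked with simple arithmetic (the assumption, as in the text, is that the remaining six slots would be filled with unrelated useful work):

```python
# Sketch: effective VLIW cycle count for the example fragment. Three
# dependent RISC-like operations occupy 3 of the 9 slots in three 3-wide
# VLIW instructions; the other 6 slots are assumed filled by other work.
slots_used = 3
total_slots = 3 * 3            # three instructions, three slots each
utilization = slots_used / total_slots
effective_cycles = 3 * utilization
print(effective_cycles)        # 1.0 cycle attributed to this fragment
```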

3.4 Different from superscalar:

Superscalar :
Superscalar (super: beyond; scalar: one dimensional) means the ability to
fetch, issue to execution units, and complete more than one instruction at a time. But the
instruction-issue decisions are taken by a complex hardware system.
The implementation consists of a collection of execution units (integer ALUs,
floating-point ALUs, load/store units, branch units, etc.) that are fed operations from an
instruction dispatcher and operands from a register file.
The execution units have reservation stations to buffer waiting operations that have been
issued but are not yet executed. The operations may be waiting on operands that are not
yet available. The instruction dispatcher examines a window of instructions contained in
a buffer. The dispatcher looks at the instructions in the window and decides which ones
can be dispatched to execution units. It tries to dispatch as many instructions at once as is
possible, i.e., it attempts to discover maximal amounts of instruction-level parallelism.
Higher degrees of superscalar execution, i.e., more execution units, require wider
windows and a more sophisticated
dispatcher. It is conceptually simple—though expensive—to build an implementation
with lots of execution units and an aggressive dispatcher, but it is not currently profitable
to do so. The reason has more to do with software than hardware. The compilers for
RISC and CISC processors produce code with certain goals in mind. These goals are
typically to minimize code size and run time. For scalar and very simple superscalar
processor implementations, these goals are mostly compatible. For high-performance
superscalar implementations, on the other hand, the goal of minimizing code size limits
the performance that the superscalar implementation can achieve. Performance is limited
because
minimizing code size results in frequent conditional branches, about every six
instructions. Conceptually, the processor must wait until the branch is resolved before it
can begin to look for parallelism at the target of the branch. To avoid waiting for
conditional branches to be resolved, high-performance superscalar implementations
implement branch prediction. With branch prediction, the processor makes an early guess
about the outcome of the branch and begins looking for parallelism along the predicted
path. The act of dispatching
and executing instructions from a predicted—but unconfirmed—path is called
speculative execution.
Unfortunately, branch prediction is not 100% accurate. Thus, with
speculative execution, it is necessary to be able to undo the effects of speculatively
executed instructions in the case of a mispredicted branch. Some implementations, such
as Intel’s superscalar Pentium, simply prevent instructions along the predicted path from
progressing far enough to modify any visible processor state, but to gain the most from
speculative execution, it is necessary to allow instructions along the predicted path to
execute fully. To be able to undo the effects of full, speculative execution, a hardware
structure called a reorder buffer can be employed. This structure is an adjunct to the
register file that keeps track of all the results produced
by instructions that have recently been executed or that have been dispatched to
execution units but have not yet completed. The reorder buffer provides a place for
results of speculatively executed instructions (and it solves other problems as well). When
a conditional branch is, in fact, resolved, the results of the speculatively executed
instructions can be either dropped from the reorder buffer (branch mispredicted) or
written from the buffer to the register file (branch predicted correctly).
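The commit-or-drop behavior described above can be sketched in a few lines (a minimal model, not any real microarchitecture; register names and the class interface are illustrative):

```python
# Minimal sketch of a reorder buffer: speculative results are parked
# until the branch resolves, then either written to the register file
# (correct prediction) or dropped (misprediction).
class ReorderBuffer:
    def __init__(self):
        self.pending = []                     # (register, value) in program order

    def record(self, reg, value):
        self.pending.append((reg, value))     # speculative result, not yet visible

    def resolve_branch(self, regfile, predicted_correctly):
        if predicted_correctly:
            for reg, value in self.pending:   # commit results in program order
                regfile[reg] = value
        self.pending.clear()                  # either way, the buffer empties

regs = {"r1": 0, "r2": 0}
rob = ReorderBuffer()
rob.record("r1", 42)          # executed speculatively along the predicted path
rob.record("r2", 7)
rob.resolve_branch(regs, predicted_correctly=False)
print(regs)                   # {'r1': 0, 'r2': 0}: speculative results discarded
```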

VLIW :
A VLIW implementation achieves the same effect as a superscalar RISC or
CISC implementation, but the VLIW design does so without the two most complex parts
of a high-performance superscalar design. Because VLIW instructions explicitly specify
several independent operations—that is, they explicitly specify
parallelism—it is not necessary to have decoding and dispatching hardware that tries to
reconstruct parallelism from a serial instruction stream. Instead of having hardware
attempt to discover parallelism, VLIW processors rely on the compiler that generates the
VLIW code to explicitly specify parallelism. Relying on the compiler has advantages.
First, the compiler has the ability to look at much larger windows of instructions
than the hardware. For a superscalar processor, a larger hardware window implies a
larger amount of logic and therefore chip area. At some point, there simply is not enough
of either, and window size is constrained. Worse, even before a
simple limit on the amount of hardware is reached, complexity may adversely affect the
speed of the logic; thus the window size is constrained to avoid reducing the clock speed
of the chip. Software windows can be arbitrarily large. Thus, looking for parallelism in a
software window is likely to yield better results.
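The window scan described above, whether done by dispatch hardware or by a compiler over a much larger software window, amounts to a dependence check. A toy sketch (the three-register operation format and the grouping policy are assumptions for illustration):

```python
# Toy sketch of a window scan: pick the operations in a window that are
# mutually independent and could issue together. Operations are
# (dest, src1, src2) register-name tuples.
def independent_group(window):
    group, written, read = [], set(), set()
    for dest, src1, src2 in window:
        srcs = {src1, src2}
        # Exclude an op if it reads a value produced inside this group (RAW)
        # or writes a register already read or written in it (WAR/WAW).
        if srcs & written or dest in written or dest in read:
            continue
        group.append((dest, src1, src2))
        written.add(dest)
        read |= srcs
    return group

window = [("r3", "r1", "r2"),   # r3 = r1 op r2
          ("r4", "r3", "r1"),   # reads r3 (RAW dependence): excluded
          ("r5", "r2", "r2")]   # independent: can issue with the first op
print(independent_group(window))
```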
Second, the compiler has knowledge of the source code of the program. Source
code typically contains important information about program behavior that can be used to
help express maximum parallelism at the instruction-set level. A powerful technique
called trace-driven compilation can be employed to
dramatically improve the quality of code output by the compiler. Trace-driven compilation
first produces a suboptimal, but correct, VLIW program. The program has embedded
routines that take note of program behavior. The recorded program behavior—which
branches are taken, how often, etc.—is then used by the
compiler during a second compilation to produce code that takes advantage of accurate
knowledge of program behavior. Thus, with trace-driven compilation, the compiler has
access to some of the dynamic information that would be apparent to the hardware
dispatch logic in a superscalar processor.
Third, with sufficient registers, it is possible to mimic the functions of the
superscalar implementation’s reorder buffer. The purpose of the reorder buffer is to allow
a superscalar processor to speculatively execute instructions and then be able to quickly
discard the speculative results if necessary. With sufficient registers, a VLIW machine

can place the results of speculatively executed instructions in temporary registers. The
compiler knows how many instructions will be speculatively executed, so it simply uses
the temporary registers along the speculated (predicted) path and ignores the values in
those registers along the path that will be taken if the branch turns out to have been
mispredicted.

[Figure: block diagram of a VLIW implementation with multiple execution units fed from a register file]

3.5 Software complexity instead of hardware:


While a VLIW architecture reduces hardware complexity over a superscalar
implementation, a much more complex compiler is required. Extracting maximum
performance from a superscalar RISC or CISC implementation does require sophisticated
compiler techniques, but the level of sophistication in a VLIW compiler is significantly
higher. VLIW simply moves complexity from hardware into software. Luckily, this
trade-off has a significant side benefit: the complexity is paid for only once, when the
compiler is written instead of every time a chip is
fabricated. Among the possible benefits is a smaller chip, which leads to increased profits
for the microprocessor vendor and/or cheaper prices for the customers that use the
microprocessors. Complexity is usually easier to deal with in a software design than in a
hardware design. Thus, the chip may cost less to
design, be quicker to design, and may require less debugging, all of which are factors that
can make the design cheaper. Also, improvements to the compiler can be made after
chips have been fabricated; improvements to superscalar dispatch hardware require
changes to the microprocessor, which naturally incurs all the expenses of a new chip
revision.

4. CASE STUDIES:

4.1 Defoe architecture:


Defoe is a processor that uses VLIW architecture and programming.
Defoe is a 64-bit architecture with the following function units:

 Two load/store units.
 Two simple ALUs that perform add, subtract, shift and logical operations on 64-
bit numbers and packed 32, 16 and 8-bit numbers. In addition, these units also
support multiplication of packed 16 and 8-bit numbers.
 One complex ALU that can perform multiply and divide on 64-bit integers and
packed 32, 16 and 8-bit integers.
 One branch unit that performs branch, call and comparison operations.

There is no support for floating point. Figure 1 shows a simplified diagram of the
Defoe architecture.

Registers and predication:

Defoe has a set of 64 programmer visible general purpose registers which are 64
bits wide. As in the MIPS architecture, register R0 always contains zero. Predicate
registers are special 1-bit registers that specify a true or false value. There are 16
programmer visible predicate registers in the Defoe named PR0 to PR15. All operations
in Defoe are predicated, i.e., each instruction contains a predicate register field (PR) that
contains a predicate register number. At run-time, if control reaches the instruction, the
instruction is always executed. However, at execution time, if the predicate is false, the
results are not written to the register file and side effects such as TLB miss exceptions are
suppressed. In practice, for reasons of efficiency, this may be implemented by computing
a result, but writing back the old value of the target register. Predicate register 0 always
contains the value 1 and cannot be altered. Specifying PR0 as the predicate performs

unconditional operations. Comparison operations use a predicate register as their target
register.
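The predicated write-back described above can be sketched as follows (register and PR-field names follow the text; the helper function itself is hypothetical). The operation computes its result unconditionally, but the result becomes visible only if the predicate is true:

```python
# Sketch of Defoe-style predicated execution: the add always executes,
# but the write-back is suppressed when the predicate register is false.
def predicated_add(regs, preds, pr, dest, src1, src2):
    result = regs[src1] + regs[src2]   # computed unconditionally
    if preds[pr]:                      # PR field selects a predicate register
        regs[dest] = result            # write back only if the predicate is true

preds = [1] + [0] * 15                 # 16 predicate regs; PR0 hardwired to 1
regs = {"r1": 5, "r2": 7, "r3": 0}
predicated_add(regs, preds, pr=0, dest="r3", src1="r1", src2="r2")
print(regs["r3"])                      # 12: PR0 makes the op unconditional
predicated_add(regs, preds, pr=1, dest="r3", src1="r1", src2="r1")
print(regs["r3"])                      # still 12: PR1 is false, write suppressed
```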

Instruction format:
Defoe is a 64-bit compressed VLIW architecture. In an uncompressed VLIW
system, MultiOps have a fixed length. When suitable operations are not available to fill
the issue slots within a MultiOp, NOPs are inserted into those slots. A compressed VLIW
architecture uses variable length MultiOps to get rid of those NOPs and achieve high
code density. In the Defoe, individual operations are encoded as 32-bit words. A special
stop bit in the 32-bit word indicates the end of a MultiOp. Common arithmetic operations
have an immediate mode, where a sign or zero extended 8-bit constant may be used as an
operand. For larger constants of 16, 32 or 64 bits, a special NOP code may be written into
opcode field of the next operation and the low order bits may be used to store the
constant. In that case, the predecoder concatenates the bits from 2 or more different
words to assemble a constant.
 

4.2 Intel Itanium processor:

The Itanium processor is Intel's first implementation of the IA-64 ISA. IA-64 is
an ISA for the Explicitly Parallel Instruction Computing (EPIC) style of VLIW
developed jointly by Intel and HP. It is a 64-bit, 6 issue VLIW processor with four
integer units, four multimedia units, two load/store units, two extended precision floating
point units and two single precision floating point units. This processor running at 800
MHz on a 0.18 micron process has a 10 stage pipeline.

Unlike the Defoe, the IA-64 architecture uses a fixed-width bundled instruction
format. Each MultiOp consists of one or more 128 bit bundles. Each 128 bit bundle
consists of three operations and a template. Unlike the Defoe where the opcode in each
operation specifies a type field, the template encodes commonly used combinations of
operation types. Since the template field is only five bits wide, bundles do not support all
possible combinations of instruction types. Much like the Defoe's stop bit, in the IA-64,
some template codes specify where in the bundle a MultiOp ends. In IA-64 terminology,

MultiOps are called instruction groups. Like Defoe, the IA-64 uses a decoupling buffer to
improve its issue rate. Though the IA-64 registers are nominally 64 bits wide, there is a
hidden 65th bit called NaT (Not a Thing). This is used to support speculation. There are
128 general purpose registers and another set of 128, 82-bit wide floating point registers.
Like the Defoe, all operations on the IA-64 are predicated. However, the IA-64 has 64
predicate registers.
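The bundle layout described above fixes the bit budget: a 5-bit template plus three operation slots must fill 128 bits, which leaves 41 bits per operation. A quick check:

```python
# Sketch: the IA-64 bundle bit budget. A 128-bit bundle holds a 5-bit
# template plus three equal operation slots.
TEMPLATE_BITS = 5
SLOTS = 3
BUNDLE_BITS = 128
slot_bits = (BUNDLE_BITS - TEMPLATE_BITS) // SLOTS
print(slot_bits)                          # 41 bits per operation
print(TEMPLATE_BITS + SLOTS * slot_bits)  # 128: the budget balances exactly
```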

The IA-64 register mechanism is more complex than the Defoe's because it
implements support for software pipelining using a method similar to the overlapped loop
execution support pioneered by Bob Rau and implemented in the Cydra 5. On the IA-64,
general purpose registers GPR0 to GPR31 are fixed. Registers 32-127 can be renamed
under program control to support a register stack or to do modulo scheduling for loops.
When used to support software pipelining, this feature is called register rotation.
Predicate registers 0 to 15 are fixed while predicate registers 16 to 63 can be made to
rotate in unison with the general purpose registers. The floating point registers also
support rotation.

Modulo scheduling is a software pipelining technique that can support overlapped
execution of loop bodies while reducing tail code. In a pipelined function unit, each stage
can hold a computation and successive items of data may be applied to the function unit
before previous data is completely processed. To take advantage of pipelined operation,
in a modulo scheduled loop, the loop body is unrolled and split into several stages. The
compiler can schedule multiple iterations of a loop in a pipelined manner as long as data
outputs of one stage flow into the inputs of the next stage in the software pipeline.
Traditionally, this required unrolling the loop and renaming the registers used in
successive iterations. IA-64 reduces the overhead of such a loop and avoids the need for
register renaming by rotating registers forward, i.e., the rotating register base is
incremented in the direction of increasing register index. After rotating by n registers, the
value that was in register X+n can be accessed from register X. When used in
conjunction with predication, this allows a natural expression of software pipelines
similar to their hardware counterparts.
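The rotation rule quoted above ("after rotating by n registers, the value that was in register X+n can be accessed from register X") can be modeled with a simple base-plus-offset mapping (a simplification: real IA-64 hardware rotates only r32-r127 relative to a rotating register base; the class interface is illustrative):

```python
# Sketch of register rotation: logical register X maps to physical
# register (base + X) mod size, and rotating increments the base.
class RotatingFile:
    def __init__(self, size):
        self.phys = [0] * size
        self.base = 0

    def _map(self, logical):
        return (self.base + logical) % len(self.phys)

    def read(self, logical):
        return self.phys[self._map(logical)]

    def write(self, logical, value):
        self.phys[self._map(logical)] = value

    def rotate(self, n=1):
        # Base increments in the direction of increasing register index,
        # as described in the text.
        self.base = (self.base + n) % len(self.phys)

rf = RotatingFile(8)
rf.write(3, 99)        # an iteration produces a value in logical register 3
rf.rotate(1)
print(rf.read(2))      # 99: what was at X+1 = 3 is now visible at X = 2
```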

The IA-64 supports software directed control and data speculation. To do control
speculation, the compiler moves a load before its controlling branch. The load is then
flagged as a speculative load. The processor does not signal exceptions on a speculative
load. If the controlling branch is taken, the compiler uses a special opcode named check.s
to determine if an exception occurred. If an exception occurred, the check operation
transfers control to exception handling code.

The IA-64 also includes SIMD instructions suitable for media processing. Special
multimedia instructions similar to the MMX and SSE extensions for 80x86 processors

treat the contents of a general purpose register as two 32-bit, four 16-bit or eight 8-bit
operands and operate on them in parallel.
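Lane-wise operation on a packed register can be modeled as below (plain Python modeling of a packed 16-bit add in the style of the multimedia instructions described, not the actual IA-64 encoding; `paddw` is a hypothetical helper name):

```python
# Sketch: treat a 64-bit value as four 16-bit lanes and add lane-wise.
# Each lane wraps independently; no carry propagates between lanes.
MASK16 = 0xFFFF

def paddw(a, b):
    out = 0
    for lane in range(4):
        x = (a >> (16 * lane)) & MASK16
        y = (b >> (16 * lane)) & MASK16
        out |= ((x + y) & MASK16) << (16 * lane)
    return out

a = 0x0001_0002_0003_FFFF
b = 0x0001_0001_0001_0001
print(hex(paddw(a, b)))   # 0x2000300040000: lanes add independently, lowest wraps
```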

4.3 Transmeta Crusoe processor:


The Crusoe processor from Transmeta corporation represents a very interesting
point in the development of VLIW processors. Traditionally, VLIW processors were
designed with the goal of maximizing ILP and performance. The designers of the Crusoe
on the other hand needed to build a processor with moderate performance compared to
the CPU of a desktop computer, but with the additional restriction that the Crusoe should
consume very little power since it was intended for mobile applications. Another design
goal was that it should be able to efficiently emulate the ISA of other processors,
particularly the 80x86 and the Java virtual machine.

The designers left out features like out of order issue and dynamic scheduling that require
significant power consumption. They set out to replace such complex mechanisms of
gaining ILP with simpler and more power efficient alternatives. The end result was a
simple VLIW architecture. Long instructions on the Crusoe are either 64 or 128 bits. A
128-bit instruction word called a molecule in Transmeta parlance encodes 4 operations
called atoms. The molecule format directly determines how operations get routed to
function units. The Crusoe has two integer units, a floating point unit, a load/store unit
and a branch unit. Like the Defoe, the Crusoe has 64 general purpose registers and
supports strictly in order issue. Unlike the Defoe which uses predication, the Crusoe uses
condition flags which are identical to those of the x86 for ease of emulation.

Binary x86 programs, firmware and operating systems are emulated with the help of a
run time binary translator called code morphing software. This makes the classical VLIW
software compatibility problem a non-issue. Only the native code morphing software
needs to be changed when the Crusoe architecture or ISA changes. As a power and
performance optimization, the hardware and software together maintain a cache of
translated code. The translations are instrumented to collect execution frequencies and
branch history and this information is fed back to the code morphing software to guide its
optimizations.

To correctly model the precise exception semantics of the x86 processor, the part of the
register file that holds x86 register state is duplicated. The duplicate is called a shadow
copy. Normal operations only affect the original registers. At the end of a translated
section of code, a special commit operation is used to copy the working register values to
the shadow registers. If an exception happens while executing a translated unit, the run
time software uses the shadow copy to recreate the precise exception state. Store

operations are implemented in a similar manner using a store buffer. As in the case of IA-
64, the Crusoe provides alias detection hardware and data speculation primitives.
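The shadow-copy mechanism described above amounts to commit-and-rollback on the register state. A minimal sketch (the class interface and register names are illustrative, not Transmeta's implementation):

```python
# Sketch of Crusoe-style shadow registers: working registers are updated
# freely during a translated section; `commit` snapshots them at the end
# of the section, and on an exception `rollback` restores the last
# committed (precise) x86 state.
class ShadowRegs:
    def __init__(self, names):
        self.working = {n: 0 for n in names}
        self.shadow = dict(self.working)   # last committed state

    def commit(self):                      # translated section finished cleanly
        self.shadow = dict(self.working)

    def rollback(self):                    # exception inside a section
        self.working = dict(self.shadow)

regs = ShadowRegs(["eax", "ebx"])
regs.working["eax"] = 1
regs.commit()                 # first section commits its results
regs.working["eax"] = 2       # next section starts modifying state...
regs.rollback()               # ...but faults; restore the precise state
print(regs.working["eax"])    # 1
```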

5. ADVANTAGES AND DISADVANTAGES:

Advantages:
 Less hardware complexity.
 Better compiler targets.
 Cheaper.
 Software has a larger window to look at.
 Potentially scalable: more execution units can be added, allowing more
operations to be packed into the VLIW instruction.

Disadvantages:
 Increased program size.
The compiler generally has to add explicit NOPs to obtain parallelism: efficiency
increases when all slots of a VLIW instruction are filled, so NOPs are placed in
the slots that cannot be filled with useful operations.
 Wasteful encoding with NOPs.
 New versions of the architecture can force major rewriting of the compiler.
 New kinds of programmer/compiler complexity.
The programmer (or code-generation tool) must keep track of instruction
scheduling. Deep pipelines and long latencies can be confusing and may make
peak performance elusive.

6. FUTURE SCOPE:
VLIW processors have enjoyed moderate commercial success in recent times as
exemplified by the Philips Trimedia, TI TMS320C62x DSPs, Intel Itanium and, to a lesser
extent, the Transmeta Crusoe. However, the role of VLIW processors has changed since
the days of Cydrome and Multiflow. Even though early VLIW processors were
developed to be scientific super computers, newer processors have been used mainly for
stream, image and digital signal processing, multimedia codec hardware, low power
mobile computers etc. VLIW compiler technology has made major advances during the
last decade. However, most of the compiler techniques developed for VLIW are equally
applicable to super scalar processors. Stream and media processing applications are
typically very regular with predictable branch behavior and large amounts of ILP. They
lend themselves easily to VLIW style execution. The ever increasing demand for
multimedia applications will continue to fuel development of VLIW technology.
However, in the short term, super scalar processors will probably dominate in the role of
general purpose processors. Increasing wire delays in deep sub micron processes will
ultimately force super scalar processors to use simpler and more scalable control
structures and seek more help from software. It is reasonable to assume that in the long
run, much of the VLIW technology and design philosophy will make its way into main
stream processors.
 Stream and image processing, e.g. Philips Trimedia.
 Digital signal processing, e.g. TMS320C62x from Texas Instruments.
 Mobile computing, e.g. Transmeta Crusoe.
 High-end server applications, e.g. Intel Itanium.
 Stream and media processing lend themselves to the VLIW style, with large
amounts of ILP.
 Superscalars will be forced to use simpler structures and seek help from
software.

7. CONCLUSION:
With the ever increasing demand for efficient, cheaper and faster processors, VLIW
architecture is surely going to dominate over superscalar operation.
As discussed, its key features are going to be vital for its future:

 Cheaper
 Less hardware complexity
 Better compiler target

