
CellVM: A Homogeneous Virtual Machine Runtime System for a Heterogeneous

Single-Chip Multiprocessor

Albert Noll Andreas Gal Michael Franz
ETH Zurich University of California, Irvine University of California, Irvine

Abstract

The Cell Broadband Engine Architecture (Cell) is a hardware platform for high-performance parallel computing. Due to its architectural features and programming model, efficiently programming the Cell processor requires detailed knowledge of the underlying hardware architecture, carefully designed communication protocols that minimize the synchronization and communication overhead between the individual processing units, as well as an elaborate layout of the vector processors' local stores.

In order to take the burden of dealing with such low-level, architecture-specific details off the programmer, the Cell compiler toolchain implements the OpenMP standard [27], which allows for seamless parallelization of applications by offloading workloads to the SPEs.

In this paper we report on the design and implementation of a middleware for the Cell processor that goes one step further: it completely abstracts the heterogeneity of the underlying hardware architecture and allows programmers to employ higher-level and more intuitive programming constructs than the OpenMP approach. In particular, we implemented a virtual machine (VM) that mimics the behavior of a homogeneous, shared-memory multiprocessor. Internally, our VM schedules individual threads for execution on the vector cores, which offers the same parallelism as the traditional programming models of the Cell processor.

1 Introduction

Unlike the traditional processors that are commonly used in the desktop and server market, the Cell [19, 16] implements an asymmetric architecture. It consists of one general-purpose processor core (PPE) and eight identical throughput-oriented vector cores (SPEs) that implement an entirely different instruction set geared towards computation-rich applications.

Integrating different cores onto a single chip in this manner makes sense from a hardware designer's perspective. For the intended application domain, having a number of throughput-oriented processor cores is far more efficient than trying to do the same work with general-purpose cores, which would also take up more space on the die. From a software designer's perspective, on the other hand, dealing with asymmetric architectures and multiple instruction sets makes programming considerably more difficult.

The goal of our work is to provide the best of both worlds, combining the benefits of a symmetric multiprocessor system (which is simpler for programmers) on top of an asymmetric multiprocessor system (which hardware architects can build more efficiently and cheaply). Our solution consists of a virtual machine layer that sits on top of the heterogeneous hardware and automatically distributes work to the different processing cores.

In particular, our virtual machine presents an interface to applications that mimics the behavior of a homogeneous, shared-memory multiprocessor. The homogeneous interface allows for a simplified programming model at a high level of abstraction, which has several benefits such as improved programmer productivity and (potentially) improved software reliability. The Java programming language [5], for example, allows software development at such a high level of abstraction, leaving lower-level concerns such as dynamic memory allocation and garbage collection to the runtime system. The disadvantage of more abstract software development, however, is that implementations tend to incur a runtime overhead compared to a lower-level implementation.

We have implemented a prototype of such a system that we call CellVM. To the application, CellVM presents the interface of a standard Java Virtual Machine (Java VM). Internally, CellVM executes Java VM instructions by co-execution between the different functional units contained on the Cell microprocessor. CellVM supports two modes of execution: a purely interpretative approach and a dynamic Java byte-code to native code compiler that translates Java code to native vector code on-the-fly.

The remainder of this paper is organized as follows. In Section 2 we discuss the design considerations on which we based our implementation, which is described in Section 3. In Section 4 we describe the software-driven caching mechanisms that we use on the throughput-oriented processor cores. To evaluate our approach, we measured a series of benchmarks that we discuss in Section 5, followed by related work in Section 6. We discuss the benefits and drawbacks of our approach in Section 7.

Of particular note is our solution to the data latency problem. The vector cores in the Cell architecture do not have a hardware-based cache hierarchy. Instead, each vector core provides a software-controlled scratch-pad memory in conjunction with DMA capabilities. Key to a streamlined division of labor in our cooperative execution environment is an efficient use of these scratch pads as a distributed data cache. Despite relying on a software-only cache approach, profiling shows surprisingly high hit rates above 90% for the instruction and data cache.

To summarize, the contributions of CellVM are:

• introduction of a hardware abstraction layer (HAL) that presents a homogeneous interface to the application and hence allows for software development at a high level of abstraction;

• incorporating the SPEs into the Java virtual machine to exploit the computational power of the Cell processor;

• an automatic memory management system for the SPE's local store.

[Figure 1. Execution model of CellVM: a Java application with threads 0..n runs on the CellVM abstraction layer (Shell-VM plus Core-VM 0..n), which sits on the OS above the PPE and SPE 0..SPE n.]

2 Design Considerations

Existing Java VMs for the Cell architecture are unable to incorporate the SPEs to process Java instructions, since they are not designed to operate in a heterogeneous, distributed-memory environment. As a consequence, all available SPEs, which provide the major computational capabilities, remain idle. In order to enable the processing of Java instructions on the SPEs, which current VM implementations for the Cell architecture are unable to do, the Java VM is divided into two cooperating VM implementations (see Figure 1): while the Shell-VM operates on the PPE and primarily maintains global system resources, each SPE is equipped with a Core-VM instance. Incorporating the SPEs into the Java VM and using them to process Java instructions requires a significant change in the Java VM implementation. Furthermore, having a centralized resource manager located at the PPE can reduce expensive synchronization overhead between SPEs significantly.

Our design was driven by three Cell architecture-specific features. First, accessing main memory from a SPE can only happen by using its DMA engine; accessing the Java heap (which is shared among all Java threads and hence maintained in main memory) from a SPE therefore requires a DMA transfer that reads from the specified address in main memory and puts the data in the local store. Second, each SPE maintains its own, private local store that can be regarded as a software-controlled cache; as the local store is used both for holding the SPE binary (the Core-VM instance) and the data that are processed, its size, limited to 256 KB, is a scarce resource. Finally, every access to non-thread-local data has to be implemented by a corresponding DMA command. Therefore, our approach implements an automated software-controlled memory management system for the SPE's local store, whose main purpose is to copy instructions and data to the software-managed caches and thereby mask the latency of the mandatory DMA commands.

In particular, our design restricts the SPEs to process only those Java opcodes which either require access to thread-local data structures (see Section 3.3) or require "simple" main memory interaction. Simple interactions with main memory are restricted to read or write accesses. "Complex" interactions include the resolving of references and dynamic memory management, which is used for allocating objects on the Java heap. A complex memory interaction is simulated by a switch of control from the executing SPE to the PPE (see Section 3.4). A further consequence of the restricted access capabilities to main memory is that a small subset of Java opcodes cannot be implemented efficiently on the vector cores; these, too, are released to be processed by the PPE.
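The routing decision described above can be sketched as a small classifier. This is an illustrative model, not CellVM's actual dispatch table: the opcode subset and the `Kind` labels are assumptions made for the example, and only the three-way split (thread-local, simple read/write, complex) comes from the text.

```java
// Hypothetical sketch: classify each opcode by the kind of memory interaction
// it needs, and route only "complex" ones to the Shell-VM on the PPE.
public class OpcodeRouter {
    public enum Kind { LOCAL, SIMPLE, COMPLEX }

    // A few representative JVM opcodes (standard opcode numbers).
    public static final int ILOAD = 0x15;      // thread-local: locals/operand stack
    public static final int GETFIELD = 0xB4;   // simple: plain read from the heap
    public static final int NEW = 0xBB;        // complex: allocates on the Java heap

    public static Kind classify(int opcode) {
        switch (opcode) {
            case NEW:
                return Kind.COMPLEX;   // reference resolution / allocation -> PPE
            case GETFIELD:
                return Kind.SIMPLE;    // read/write access -> DMA from the SPE
            default:
                return Kind.LOCAL;     // executes entirely in the local store
        }
    }

    public static String route(int opcode) {
        return classify(opcode) == Kind.COMPLEX ? "Shell-VM (PPE)" : "Core-VM (SPE)";
    }
}
```

In a real implementation this decision is baked into the opcode handlers themselves rather than computed per instruction; the table form merely makes the design rule visible.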

3 Implementation

Our work focuses on running Java code on SPEs; running Java code on the PPE is well understood and has been implemented by many existing Java VMs. CellVM conforms to the Java Virtual Machine Specification [23], which facilitates the execution of existing software without the need for modifications to existing Java applications. The following sections provide deeper insights into the architecture and implementation of CellVM.

CellVM effectively consists of two separate programs, compiled by two different compiler tool chains (Figure 2). The Shell-VM runs on the PPE and is primarily intended to manage global system resources. The Core-VM is loaded onto each SPE and manages local data structures like the Java stack or local caches. The Core-VMs are the virtual execution units and process the majority of Java VM instructions.

[Figure 2. Co-execution in CellVM: if a Core-VM faces an instruction with a "complex" memory interaction, a switch from the Core-VM (SPE) to the Shell-VM (PPE) is performed via a service request, answered by a "request done" notification; simple interactions are DMA reads/writes between main memory and the local store.]

Processing Java instructions on the SPEs can be implemented using different approaches. The first approach installs a byte-code interpreter on each SPE to process Java instructions. The second approach provides a runtime environment that supports the execution of native SPE instructions: the PPE dynamically compiles Java byte-codes to SPE code, which is then distributed to the SPEs on demand. Since the SPE binary subtracts from the available memory that can be used for caching Java instructions and data, a major design goal was to keep the SPE binary as small as possible. Our current prototype therefore implements two modes of execution, with either an interpreter or a native execution environment resident on a SPE. We considered implementing a mixed-mode execution (only frequently used methods are compiled to native code) as potentially less efficient with respect to available local store space for caching, since more SPE code would consume additional space. Migrating the state of an "interpreter-SPE" to a "native-SPE" could potentially prove beneficial for certain applications, which has not been explored so far.

In our context a native method is a function written in a language other than Java. Calling a native method requires its compiled code to be available to the Java VM, so the compiled code would have to reside within the local store of each SPE. Alternatively, the address could be resolved dynamically (which can be tricky without having transparent access to main memory) and the code copied to the local store on the fly. Attempting to execute native methods on the SPEs would also be pointless, because most native methods interact with the operating system, which SPEs are unable to do and would have to consult the PPE for anyway. Furthermore, there are native methods that manipulate the Java heap (e.g. array copying) which must be handled by the Shell-VM in any case. Since native method invocations are usually rare (see Section 5.2), we decided to let the PPE execute native methods.

While the offloading of functionality to the PPE seems to be an inherent performance-limiting factor of our approach (a transfer of control from a SPE to the PPE is an expensive operation), only an insignificant portion of the processed instructions is handled by the PPE (see Section 5.4). In particular, all opcodes that create new objects (< 1%) as well as native methods are exclusively handled by the PPE. Kazi et al. [21] show that the most frequent Java instructions are local variable loads (30%–47%), stores (11%–17%) and arithmetic operations (6%–23%); all of these instructions can be executed locally on the SPE. Consequently, we decided to offload only a subset of Java VM functionality (mainly complex interactions) from the SPE to the PPE. If a Core-VM faces an instruction with a "complex" memory interaction, a switch from the Core-VM (SPE) to the Shell-VM (PPE) is performed, and the Shell-VM then executes the instruction on behalf of the Core-VM. Since these instructions are rare, performance is not affected significantly. It would furthermore be conceivable to use the PPE in addition to the SPEs whenever no requests are pending from any SPE. Currently this is not implemented in our prototype; this is rather an implementation deficiency than a general limitation.
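The service-request protocol between a Core-VM and its service thread can be sketched with ordinary Java threads. This is a minimal sketch, assuming rendezvous queues as a stand-in for the SPE's interrupt and mailbox hardware; the class and method names are invented for the example, and `execOnPpe` is a placeholder for resolving a reference or allocating an object.

```java
import java.util.concurrent.SynchronousQueue;

// One service thread waits on the "PPE"; the "Core-VM" posts a request,
// stalls until the request is done, then continues execution.
public class CoExecution {
    private final SynchronousQueue<String> requests = new SynchronousQueue<>();
    private final SynchronousQueue<Object> done = new SynchronousQueue<>();

    // Shell-VM side: in CellVM each Core-VM has its own service thread.
    public void startServiceThread() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String op = requests.take();  // catch the service request
                    done.put(execOnPpe(op));      // execute it, notify the SPE
                }
            } catch (InterruptedException e) {
                // service thread shut down
            }
        });
        t.setDaemon(true);
        t.start();
    }

    private Object execOnPpe(String op) {
        // Stand-in for a complex interaction handled on the PPE.
        return "result-of-" + op;
    }

    // Core-VM side: a complex instruction triggers a switch; the SPE stalls here.
    public Object switchToShellVm(String op) {
        try {
            requests.put(op);     // update state, raise the "interrupt"
            return done.take();   // blocked until "request done"
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

The synchronous rendezvous mirrors the cost structure described in the paper: the requesting core is idle for the whole round trip, which is why keeping such switches rare matters.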

3.1 Byte-Code Interpretation

The standard technique for Java execution is interpretation. To implement our approach we extended JamVM [18], an open-source Java Virtual Machine. We chose JamVM amongst the available alternatives because of its compact size, which is crucial to good performance: the interpreter needs to reside in the local store of a SPE and subtracts from the local memory available for storing instructions and data.

An interpreting technique for the SPE can be implemented using different approaches. Due to the missing hardware branch prediction facilities of the SPE architecture, the implementation of a "switched interpreter" would incur a significant performance penalty, since at least three machine-level branch instructions are executed per byte-code instruction. Direct-threaded interpreting [7] reduces the number of mandatory branch instructions to one; this is important for the interpreter version. Our interpreter uses the direct-threaded interpreter variant of JamVM. In [34] we show how to build a customized direct-threaded interpreter targeted towards the Cell architecture.

Even in the case of direct-threaded interpretation we still incur at least one expensive branch operation per executed Java bytecode. To address this problem, many high-performance VMs compile byte-codes to directly executable machine instructions at runtime (dynamic just-in-time code generation). CellVM follows the same approach and includes a dynamic compiler that can translate Java bytecodes to directly executable SPE machine code.

3.2 Just-In-Time Compilation

CellVM's JIT compiler is a simple pattern-matching code generator. The dynamic compiler implements a straightforward instruction-by-instruction code generator that only performs a very limited set of optimizations; since the main goal of the compiler is to reduce the branching penalty (and allow us to benchmark the cost of the branch), our current prototype does not implement some of the optimizations described in Section 3.5. When compiling bytecode, each bytecode is translated into a series of SPE machine instructions. Due to the code size constraints of the SPE binary, code generation is done on the PPE, and the generated code is copied to the instruction cache of a Core-VM on demand.

Keeping the code size small has the advantage of achieving higher instruction cache hit rates. To keep the translated code size at a reasonable level, instructions that would consume too much space are implemented in C and compiled with GCC. If the code generator faces such an instruction, it emits a branch instruction that invokes the helper function instead of emitting machine code that implements the entire functionality of the instruction in place. E.g., implementing the opcode LDIV requires 5 SPE instructions if a branch to a C routine is issued, but 40 SPE instructions if directly compiled to machine code. This significantly reduces the size of the generated machine code. One challenge in this approach is identifying which instructions should be "exlined" (instead of "inlined" in place); in our prototype system this is not a significant limitation. It should also be noted that this branch to the helper function is much cheaper than the dispatch branch the interpreter incurs per instruction, because it is unconditional and direct and thus can be hinted using a special SPE instruction.

Profiling the current approach shows an increase of compiled code size by a factor of 12 as compared to the original byte-code size. This growth mainly originates in the branches to C-equivalent implementations that have to be performed. What furthermore increases native code size is the fixed instruction length of a SPE instruction (32 bits) and the fact that the SPE has a RISC ISA. For building an optimizing compiler, numerous existing optimizations [11] could be ported to CellVM. E.g., replacing the original stack architecture of the Java VM by a register architecture [30] is particularly interesting for the SPE, since it possesses 128 general-purpose registers.

3.3 CellVM Memory Layout

The memory layout of CellVM is illustrated in Figure 3. Global memory regions (e.g. the Java heap) are accessible by every thread and hence managed by the Shell-VM. Data areas that are private to each thread are maintained locally by each SPE. As in traditional Java VMs, the per-thread data areas comprise the Java stack. The Java stack is analogous to the stack of a conventional programming language such as C: it holds operands, return values, computational results and arguments for method invocation. For every method call, space for local variables and the Java frame is reserved. The operand stack is used to hold intermediate operands and to pass parameters to the callee (see Figure 3).

We adapted the Java frame by adding fields that are required for returning from a function: a method area pointer, program counter and previous frame pointer are used to restore the caller's state upon return. The previous frame pointer points to the frame of the caller, and the method area holds function properties like accessibility (private, public).
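The per-thread stack layout above can be sketched as follows. This is an illustrative model, assuming the field names from Figure 3 (local variables, operand stack, method area pointer, saved program counter, previous frame pointer); it is not CellVM's actual binary layout, and the integer-only slots are a simplification.

```java
// Sketch of a Java stack whose frames carry the extra return-support fields
// CellVM adds: method area pointer, caller's program counter, previous frame.
public class JavaStack {
    public static class Frame {
        public final int[] lvars;       // local variables of the method
        public final int[] ostack;      // operand stack (operands, arguments)
        public int tos = 0;             // top-of-stack index
        public final int methodArea;    // "method area pointer" (an index here)
        public final int savedPc;       // caller's next instruction
        public final Frame prevFrame;   // previous frame pointer

        Frame(int maxLocals, int maxStack, int methodArea, int savedPc, Frame prev) {
            this.lvars = new int[maxLocals];
            this.ostack = new int[maxStack];
            this.methodArea = methodArea;
            this.savedPc = savedPc;
            this.prevFrame = prev;
        }

        public void push(int v) { ostack[tos++] = v; }
        public int pop() { return ostack[--tos]; }
    }

    public Frame top;  // current frame

    /** Method call: reserve a new frame and pass arguments into its locals. */
    public void invoke(int maxLocals, int maxStack, int methodArea, int callerPc,
                       int... args) {
        Frame f = new Frame(maxLocals, maxStack, methodArea, callerPc, top);
        for (int i = 0; i < args.length; i++) f.lvars[i] = args[i];
        top = f;
    }

    /** Return: restore the caller's frame and push the return value. */
    public int doReturn(int returnValue) {
        int resumePc = top.savedPc;
        top = top.prevFrame;
        if (top != null) top.push(returnValue);
        return resumePc;  // caller resumes at this program counter
    }
}
```

Because every field lives inside the frame itself, a Core-VM can keep the whole structure in its local store and never needs a DMA transfer for frame management, which is exactly why CellVM classifies it as thread-local data.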

[Figure 3. Core-VM memory layout: registers, D-cache and I-cache in the local store; the Java stack with frames (local variables, operand stack, program counter, previous frame pointer) and the method area (PPE code pointer, SPE code pointer, code size); non-local data resides in main memory.]

In addition, a set of registers allows for fast access to local, frequently used data, and each Core-VM maintains a cache (instruction and data) that holds frequently used instructions and data in the local store.

3.4 Switching and Communication

A switch of execution from a Core-VM to the Shell-VM simulates the processing of a complex instruction on the SPE and is the most expensive communication facility between a SPE and the PPE. The switching Core-VM updates its internal state to the Shell-VM and raises an interrupt; as part of this state, the address of the caller's next instruction is calculated along with the program counter. The waiting service thread (each Core-VM has its own service thread on the PPE) catches the signal and executes the requested instructions. Meanwhile the SPE stalls. After completion of the request, the PPE notifies the SPE, which in turn adapts its internal state and continues with execution. We call this mechanism cooperative execution, or co-execution. Both versions of CellVM use the same approach for switching. As each Core-VM follows its execution path independently, the Shell-VM cannot track a Core-VM's current state. Too many service requests could overuse the PPE, which then would become the system's performance bottleneck; on the other hand, the application of more time-consuming optimizations in the JIT would be reasonable if admitted by the number of service requests.

Reading or writing data to and from main memory is implemented using DMA transfers, which can be issued in a blocking or non-blocking manner. Blocking DMAs suspend execution until the data to copy is in place; this mechanism is applied when transferring data from main memory to the local store. Writing data back to the PPE domain is implemented using non-blocking DMA commands. By installing a buffer that holds data to be written back, the resulting latency can be reduced to the overhead stemming from filling the buffer and programming the DMA engine.

3.5 Code Optimizations

Reasonable performance can only be achieved if the number of (blocking) DMA transfers and switches of execution is comparatively small. The latter is of major importance in that performance penalties do not only stem from switching latencies. To meet these conditions, we applied code preparation to reduce the number of blocking DMA transfers, and opcode rewriting (binary rewriting) to minimize the switching activity.

Code preparation is accomplished by processing the method's byte-code and gathering as much information as possible prior to method invocation. We added the fields "PPE Code Ptr", "SPE Code Ptr" and "code size" to the method area. The parameters "code size" and "PPE Code Ptr" are used to copy instructions from main memory to the I-cache (instruction cache) in case of an instruction cache miss; if the "SPE Code Ptr" is valid (instruction cache hit), the code can be executed directly from the local store.

The simple code preparation technique is limited to statically resolvable information. E.g., the opcode LDC pushes a constant value (located in the constant pool of the method) onto the operand stack. Per definition, the constant value must not change at any time and hence is known at compile time. Instead of accessing the constant pool at runtime, the value can be resolved in the preparation stage and coded into the operands of the LDC operation, which saves a blocking DMA transfer (or a local lookup, if we would maintain the constant pool in the local store, which we do not). As a consequence, the method's constant pool does not need to reside within the local store.

Code rewriting (of opcodes for the interpreter and of native code for the JIT) is a dynamic mechanism used by the SPE to replace Java code on the fly with so-called "quick opcodes". Quick opcodes include dynamically resolved information that can be re-used with every execution of the same instruction. E.g., the opcode GETSTATIC can be re-written as a quick opcode by resolving the reference to the constant pool (which requires a switch) and storing the attached operands. The next time this particular opcode is executed, the reference can be used to perform a cache lookup and gather the actual value.
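The rewriting idea can be sketched for LDC. This is a minimal sketch under simplifying assumptions: opcode numbers and the flat two-word instruction format are invented for the example, and the "expensive resolution" is just a counter standing in for a switch to the Shell-VM.

```java
// First execution of LDC pays for resolving the constant; the instruction is
// then rewritten in place to a "quick" form carrying the resolved value, so
// later executions touch no non-local state.
public class QuickOpcodes {
    public static final int LDC = 1, LDC_QUICK = 2;

    public static int resolveCount = 0;   // counts simulated Shell-VM switches

    // Simulated constant pool held in "main memory".
    static final int[] CONSTANT_POOL = { 0, 42, 7 };

    /** Executes code[pc] (opcode) with code[pc+1] (operand); returns the value. */
    public static int execute(int[] code, int pc) {
        switch (code[pc]) {
            case LDC: {
                resolveCount++;                       // expensive: resolve via PPE
                int value = CONSTANT_POOL[code[pc + 1]];
                code[pc] = LDC_QUICK;                 // rewrite on the fly
                code[pc + 1] = value;                 // operand now holds the value
                return value;
            }
            case LDC_QUICK:
                return code[pc + 1];                  // fast path: purely local
            default:
                throw new IllegalStateException("unknown opcode " + code[pc]);
        }
    }
}
```

The same shape applies to GETSTATIC: the quick form caches the resolved field reference instead of a literal value. Note that, as the paper points out next, this cached information is lost whenever the instruction cache is purged.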

To improve performance, opcode rewriting is done locally by each SPE. A high I-cache hit rate is therefore absolutely necessary to keep the number of DMA transfers and switches at a minimum, since dynamically resolved information is lost when the cache must be purged (see Section 4).

Another way of reducing mandatory switches is extending CellVM's instruction set by math opcodes. CellVM uses GNU Classpath version 0.91 [15], in which numerous calls to the math library are native methods. Instead of performing a switch each time such a method is called, the Shell-VM detects these calls and rewrites the native function call to the corresponding math opcode, which can then be executed locally by the SPE. Having such operations in the application's "hot spot" would otherwise reduce performance noticeably.

3.6 Limitations of the Current Prototype

Our current prototype does not strictly follow Java's memory model: CellVM does not offer a cache-coherent view of the Java heap. Threads have to synchronize explicitly using Java's locking mechanism and the synchronized keyword, but Java threads are not allowed to synchronize via main memory, since the PPE coordinates such synchronization. Several avenues exist to remedy this limitation. Multi-processor systems often implement cache coherency protocols [24] in hardware. The Cell architecture offers hardware features in the SPUs that are well suited to implement a software-only cache coherency protocol, since a software implementation can be done efficiently; we plan on implementing a partially cache-coherent Java heap in the future that guarantees coherent memory access for object fields declared as volatile. Another common approach is to introduce explicit memory barriers that provide certain cache coherency guarantees only when needed [13]. To avoid expensive synchronization overhead in multi-threaded applications, SPEs also offer hardware mechanisms for direct synchronization without involving the PPE; we plan on adding this feature in a future version of CellVM.

Also, garbage collection (GC) is not implemented in the current version of CellVM. We plan to add a GC implementation either in the manner of [10], which uses a SPE as a GC coprocessor, or by placing a GC thread on the PPE.

Our approach exploits the full computational capabilities of the Cell processor if the application is made up of 6 (or fewer) threads (see Section 5). An implementation shortcut of the current prototype is that it cannot execute applications with more than 6 threads, since each thread is physically bound to a SPE and each SPE can process a single thread only; a heuristic to route surplus threads to the PPE could be added easily. Also, "synchronized" statements that require a switch to the Shell-VM reduce the throughput.

4 Software Controlled Caches of the SPEs' Local Stores

The main performance-limiting factors of CellVM are DMA transfers and switches, as described in Section 3.4. The instruction cache is implemented as a fully associative cache with variable cache block length: the cache block length equals the code size of a method, and the method's address in main memory is used as the tag. An instruction cache miss triggers a DMA transfer that copies the entire method's instructions to the local store. The cache is filled with instructions until the pre-defined limit is reached, tags being added in sequence of their appearance. If insufficient space is available to cache the next method, all cached instructions are purged. With this replacement strategy we accept the latencies that come from re-loading purged methods, but we avoid conflict misses (as they can appear in direct-mapped caches) and can furthermore enforce that recently used instructions are kept inside the cache with a greater probability. We use static pool sizes, because dynamic cache pools would require dynamic memory allocation on the SPEs, which we found to be very slow: performance measurements on IBM's cycle-accurate and memory-timing-accurate Cell simulator indicate that both malloc and free of 1 KB of local store consume 1400 cycles.

The data cache configuration (number of cache lines and cache block size) can be configured at compile time. Providing a fast lookup for the data cache is more important than for the instruction cache, since it is accessed more frequently; we therefore implemented the data cache as a direct-mapped cache. The data field of a cache line is written back to main memory in case of a conflict miss. On the one hand, writing the SPE cache content back to main memory guarantees coherency for the PPE (read accesses); on the other hand, purging the local cache on the SPE assures that PPE-manipulated data is updated correctly on the SPE with the next DMA transfer. One more reason to write the data cache back to main memory and purge it is a switch of execution from a Core-VM to the Shell-VM.

5 Performance Evaluation

In this section we describe the experimental evaluation of our prototype implementations. We use the Java Grande benchmark suite [31] to benchmark the performance of both implementations. The local store of a SPE is roughly divided into three sections: the SPE binary covering 51%, the caches using 25%, and 24% reserved for the Java runtime stack.
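The direct-mapped, write-back data cache described in Section 4 can be sketched as follows. This is an illustrative model, assuming a `Map` standing in for main memory and a `misses` counter standing in for DMA transfers; the line size and set count match the configuration reported in Section 5, but the code is not CellVM's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Direct-mapped software data cache: the main-memory address yields tag and
// set index; a conflict miss writes the victim line back before the new line
// is "DMA'd" into the local store.
public class SoftwareDataCache {
    public static final int LINE_BYTES = 128, SETS = 128;

    public final Map<Integer, byte[]> mainMemory;    // line address -> line data
    public final long[] tags = new long[SETS];
    public final byte[][] lines = new byte[SETS][];
    public final boolean[] dirty = new boolean[SETS];
    public int misses = 0;                           // each miss = one DMA read

    public SoftwareDataCache(Map<Integer, byte[]> mem) { mainMemory = mem; }

    public byte[] load(int address) {
        int lineAddr = address & ~(LINE_BYTES - 1);
        int set = (lineAddr / LINE_BYTES) % SETS;
        if (lines[set] == null || tags[set] != lineAddr) {   // (conflict) miss
            misses++;
            if (lines[set] != null && dirty[set]) {
                mainMemory.put((int) tags[set], lines[set]); // write victim back
            }
            lines[set] = mainMemory.getOrDefault(lineAddr, new byte[LINE_BYTES]);
            tags[set] = lineAddr;                            // blocking "DMA" read
            dirty[set] = false;
        }
        return lines[set];
    }

    public void store(int address, int offset, byte value) {
        load(address)[offset] = value;                       // fetch line if absent
        dirty[(address / LINE_BYTES) % SETS] = true;         // mark for write-back
    }
}
```

Purging the whole cache, as CellVM does on a switch to the Shell-VM, would amount to writing back every dirty line and clearing the tags, which is what makes PPE-modified data visible on the next load.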

The benchmarks were conducted on a Playstation 3 running a 2.6.16 Linux kernel. The PPE as well as the SPEs run at 3.2 GHz. Although the Cell processor is populated with eight SPEs, only 6 SPEs are available: one SPE is reserved for internal uses and one is disabled to improve manufacturing yields. The baseline of our benchmarks is JamVM running on the PPE of the Cell; that baseline is compared to running 1 to 6 Java threads in parallel using the 6 available SPEs of the Cell processor. The cache configuration for both versions of CellVM is the same: 48 KB are reserved for the instruction cache and 16 KB for the data cache. The data cache was configured with a cache block length of 128 bytes and 128 sets.

5.1 Microbenchmarks

To analyze the performance characteristics and expose possible performance bottlenecks, we ran a series of micro benchmarks using Section 1 of the Java Grande benchmark suite. Section 1 is composed of small programs measuring low-level VM operations such as loops, casting or arithmetic operations. Performance was evaluated by pitting a single SPE against the PPE of the Cell processor; these benchmarks can run without intervention from the PPE.

The results of the Java Grande micro benchmarks are shown in Figure 4. Values in Figure 4 are normalized to JamVM running on the PPE (1 is JamVM's performance).

[Figure 4. Performance evaluation of low-level VM operations. Values are normalized to JamVM running on the PPE.]

The interpreter version is able to slightly outperform the PPE in the loop benchmark. For the arithmetic, cast and math benchmarks the performance varies slightly, from 0.77 to 0.6. Note that for these benchmarks DMA latencies only carry a minor penalty, since the data cache is large enough to hold the entire data operated on.

CellVM performs significantly better in the JIT version: executing native code on the SPE results in a speedup of up to 2.6 for the loop, cast and arith benchmarks. The greater speedup essentially results from the reduced number of branches when executing native instructions in comparison to interpreting code. The loop benchmark performs best, since the generated instructions do not include branches to C routines; the arith, cast and math benchmarks contain opcodes that are implemented in a C routine to keep the code size reasonably small (e.g. LDIV for the math benchmark). Performance-limiting factors for the assign and method benchmarks are identical to the interpreter version.

5.2 Application Performance

We explored CellVM's behavior in more complex applications using a modified version of the sequential Java Grande benchmark suite. Following the micro benchmarks, we used Section 2 of the suite to evaluate the performance for a series of real-world, computation-intensive kernels. Since the multi-threaded benchmarks use shared memory for thread synchronization, but CellVM currently does not offer a fully cache-coherent view of main memory, we modified the benchmarks so that worker threads only operate on non-shared data. Each benchmark was scaled up from running on a single SPE to running on all 6 available SPEs of the Cell in parallel, each executing its own instance of the benchmark concurrently. This benchmarking methodology does not take the synchronization overhead of truly multi-threaded applications into account; however, it allows us to evaluate our co-operative execution model, measure our caching strategy, and investigate whether the EIB (Element Interconnect Bus) could become a bottleneck. The benchmarks hence do not include CellVM's worst-case scenario: synchronized methods and data structures. Since the SPEs release synchronization issues to the PPE, this overhead could be reduced by exploring direct SPE-to-SPE synchronization instead of performing all synchronization exclusively by the PPE; however, a software-based cache coherence protocol would be required to ensure data integrity.

When running multi-threaded applications, three types of performance bottlenecks might occur. First, the SPE has to yield for object lock and lock release. Second, blocking DMA transfers can affect throughput significantly, since SPE execution must be suspended until the data is in place; furthermore, if the total number of DMA requests exceeds the DMA queue size (16 commands per DMA engine), the SPE must stall until a slot in the queue becomes available. Third, the PPE can become congested by too many requests if the application requires too many switches from a Core-VM to the Shell-VM.

however. DMA and performance characteristics. each results in a higher number of DMA transfers. LU and the HeapSort benchmark show simi- 99. nificantly depends on the number of DMA transfers. While the interpreter-version shows a slightly the application of the math opcodes. The Sparse Matrix Multiplication . It can be seen that the sparse matrix multiplica- tion benchmark has the lowest data cache hit rate. The number Data Encryption Algorithm) encryption benchmark. Speedup for the interpreter version Figure 7. Fig. Over The SOR. which lar cache. The low Figure 6 illustrates the speedup for the JIT-version. Speedup for the JIT version for Java Grande Section 2 benchmarks relative to the performance of JamVM when using 1 to 6 parallel SPE threads. which benchmark requires a comparatively high number of DMA means that if a requested method code resides within the transfers. Cache hit rates for the instruction of CellVM for Java Grande Section 2 bench. be achieved with 5 SPEs. capability of the SPE of moving data in parallel with pro- For both version. Figure 6. Number of performed DMA trans- fers for a benchmarks using a single SPE. The Series benchmarks is characterized by an extensive Figure 5 shows the results for the interpreter version and use of transcendental and trigonometric functions. executing its own instance of the benchmark concurrently. The FFT can be traced back to the high instruction cache hit rate and benchmark in the JIT-version. As a large fraction of them are non-blocking. with blocking DMAs causing a higher latency than non-blocking transfers. The second best per- higher I-Cache hit rate (since byte-codes are more compact formance results were achieved by the IDEA (International in size) the data cache numbers are identical. Figure 8 shows that performance sig. the number of switches from a SPE to gram execution reduces the latencies considerably. 
seems to saturate to our VM’s extraordinary efficiency in rewriting opcodes the DMA capabilities of the EIB. The highest speedup can to quick-opcodes. the cache we count a hit and a miss otherwise.99% of all opcodes are executed by Core-VM. The high throughput basi- and provide information about the the DMA transfers of the cally results from the low number of DMA transfers and JIT-version. which Our benchmarks were performed using 1–6 threads. and data cache for each benchmark marks relative to the performance of JamVM when using 1 to 6 parallel SPE threads. data cache hit rate barely affects performance since the data ure 7 and Figure 8 show the corresponding cache statistics cache is accessed infrequently. The of the I-Cache hit rate are based on method-level. Figure 8. the PPE is negligibly small for all tested applications. Figure 5.
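The software data cache described above — 16 KB of the SPE's local store arranged as 128 sets of 128-byte blocks — amounts to a direct-mapped cache. The following minimal C sketch shows how such a cache splits an effective address into block offset, set index and tag for the hit test; the identifiers (sw_cache, cache_lookup, and so on) are illustrative and not taken from the CellVM sources:

```c
/* Sketch of a direct-mapped software data cache with the configuration
 * described above: 16 KB of SPE local store = 128 sets x 128-byte blocks.
 * Identifiers are illustrative, not CellVM's. */
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 128u   /* cache block length in bytes */
#define NUM_SETS   128u   /* 128 sets x 128 B = 16 KB    */

typedef struct {
    uint32_t tag[NUM_SETS];              /* which main-memory block a set holds */
    int      valid[NUM_SETS];
    uint8_t  data[NUM_SETS][BLOCK_SIZE]; /* cached bytes in local store */
} sw_cache;

/* Split a 32-bit effective address into offset, set index and tag. */
static uint32_t block_offset(uint32_t ea) { return ea % BLOCK_SIZE; }
static uint32_t set_index(uint32_t ea)    { return (ea / BLOCK_SIZE) % NUM_SETS; }
static uint32_t tag_of(uint32_t ea)       { return ea / (BLOCK_SIZE * NUM_SETS); }

/* Hit test: on a miss the VM would issue a DMA transfer to fill the
 * set's block from main memory before retrying the access. */
static int cache_lookup(const sw_cache *c, uint32_t ea)
{
    uint32_t set = set_index(ea);
    return c->valid[set] && c->tag[set] == tag_of(ea);
}
```

In such a direct-mapped layout, two addresses that are 16 KB apart contend for the same set, which is one way a workload with scattered accesses (like sparse matrix multiplication) ends up with a low hit rate and extra DMA traffic.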

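The observation that non-blocking transfers let the SPE move data in parallel with program execution is what classic double buffering exploits: while one buffer is being consumed, the transfer filling the other is already in flight. The portable sketch below mimics that pattern with memcpy stand-ins for the real MFC primitives (mfc_get() and the tag-group wait from spu_mfcio.h); the chunk size, names and the summing kernel are all illustrative rather than CellVM's actual code:

```c
/* Double-buffered chunk processing: while chunk i is being consumed,
 * the transfer of chunk i+1 is already in flight. Portable sketch --
 * dma_get()/dma_wait() stand in for the SPE's MFC primitives
 * (mfc_get(), mfc_write_tag_mask(), mfc_read_tag_status_all()).
 * Real MFC transfers also require 16-byte alignment and are limited
 * to 16 KB each; those details are omitted here. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 256  /* ints per local-store buffer; illustrative */

static void dma_get(void *ls, const void *ea, size_t size, int tag)
{
    (void)tag;            /* a real transfer is tracked by its tag group */
    memcpy(ls, ea, size); /* the stub "completes" immediately */
}

static void dma_wait(int tag) { (void)tag; } /* block until tag group done */

/* Sum an array living in "main memory", fetched chunk by chunk.
 * n is assumed to be a multiple of CHUNK; remainder handling omitted. */
static long sum_with_double_buffering(const int *main_mem, size_t n)
{
    int buf[2][CHUNK];
    size_t chunks = n / CHUNK;
    long total = 0;

    if (chunks == 0)
        return 0;
    dma_get(buf[0], main_mem, sizeof buf[0], 0);        /* prime buffer 0 */
    for (size_t i = 0; i < chunks; i++) {
        int cur = (int)(i & 1);
        if (i + 1 < chunks)                             /* prefetch next chunk */
            dma_get(buf[1 - cur], main_mem + (i + 1) * CHUNK,
                    sizeof buf[0], 1 - cur);
        dma_wait(cur);                 /* chunk i now resides in buf[cur] */
        for (int j = 0; j < CHUNK; j++)
            total += buf[cur][j];
    }
    return total;
}
```

On a real SPE the dma_get() call returns immediately while the MFC moves the data in the background, so the inner loop over chunk i runs concurrently with the fetch of chunk i+1 — the overlap that, as noted above, a Core-VM cannot obtain transparently without analysis or runtime monitoring of access patterns.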
6 Related Work

The Java Programming Language offers features such as platform independence for portability and support for multi-threading or distributed programming. These features make Java particularly attractive for parallel and distributed program developers. The Java Programming Language has also gained ground in the domain of high performance computing, although Java offers a high level of hardware abstraction that entails performance penalties. The inherent performance limiting factors of the Java language [26] can be mitigated by several approaches: for example, dynamic just-in-time compilers [6] aim at reducing performance penalties originating in byte-code interpretation [1]. The design and implementation of such a just-in-time compilation framework is described in [32]. The application of compiler optimization techniques [2] (which are commonly used in static compilers) has proven to be efficient for dynamic compilation too; among others, these include common subexpression elimination (CSE) or (global) register allocation.

Accessing arrays in Java is expensive (as compared to C), since a compulsory reference check has to be performed and a corresponding exception to be thrown if the check fails. Another additional cost of Java arrays in comparison to unchecked C arrays are dynamic out-of-bounds checks that assure that the index lies within the specified array size with each access. Techniques that eliminate the expensive runtime check are described in [29]. Midkiff et al. [25] describe a collection of transformations that can significantly reduce this overhead while remaining fully compliant with the Java language semantics.

Al-Jaroodi et al. [3] survey current parallel Java projects. The authors break existing approaches down into three categories. The first category of systems replaces the standard Java VM by building a new system. Among others, this category includes Titanium [33], which compiles a Java dialect to C, and ParaWeb [8], which aims at utilizing the Internet as a computing resource. While this approach gives developers more freedom in applying parallel methodologies, its acceptance among users might be limited, since users are forced to apply a new dialect or port code to a new platform. The second category enhances the Java runtime environment by additional class libraries that provide explicit parallelization functions; examples are Ajents [17] and JPVM [12]. The third category aims at providing seamless parallelization for multi-threaded applications; Java parallelization approaches in this category include cJVM [4] or JavaParty [28]. Any Java multi-threaded application can run in parallel on a distributed system without any changes; however, these systems tend to be less efficient due to the overhead of implementing remote objects and message passing at the language level instead of implementing them as virtual machine primitives. CellVM closely resembles this category, however it differs in that it explicitly distributes workloads to execution units on a single die and thus reduces communication and synchronization overheads greatly. To the best of our knowledge we are the first to incorporate throughput-orientated processors into a Java Virtual Machine.

On the hardware side, the Cell processor is evolving into a serious alternative to traditional processors in the domain of high-performance computing. Williams et al. [35] present quantitative performance data for scientific kernels (dense matrix multiply, 1D and 2D FFTs) that compares Cell performance to leading superscalar (AMD Opteron), VLIW (Intel Itanium 2) and vector (Cray X1E) processors [14]. They conclude that the Cell processor shows a tremendous potential for scientific computations and demonstrate that it can outperform the processor cores mentioned above. These approaches provide good support for heterogeneity; however, exploiting the full capabilities of the Cell architecture requires reasonable load balancing over the available throughput-orientated processor cores and efficient software-based management of the scratch-pad memories [20] in conjunction with an appropriate DMA buffering scheme [9].

Due to the architecture and programming model of the Cell, existing virtual machines are limited to using Cell's main processor only. The CellVM project pursues a different approach: by implementing a new Java VM, parallel execution on multiple cores is managed by the VM, which matches the capabilities of the hardware. For the programmer, existing Java code can be executed without the need for modifications and without the need for additional class libraries. In contrast to these works, our implementation fully conforms to the Java VM specification, completely hides the heterogeneous multi-core architecture beneath a VM layer, and does not require programmers to address the parallelization process. The CellVM project thus aims at opening up the full computational capabilities of the heterogeneous Cell multiprocessor to Java technology.

7 Conclusion

We have presented CellVM, a Java virtual machine that executes Java programs cooperatively on the different cores
of the Cell multiprocessing platform. In our design, the most common Java byte-code instructions are executed directly on the throughput-orientated SPEs, while instructions that require complex main memory interaction are processed by the main PPE core. To mask expensive DMA transfers between main memory and the SPEs, we have devised a software-based caching strategy that maintains local copies of frequently accessed data structures, such as method byte-codes or arrays, in the local store of the SPE.

Experiments show that cooperative execution is able to distribute most of the activity to the SPEs, so that even six simultaneous Java threads running on six SPEs do not saturate the PPE core with service requests. Furthermore, only one benchmark program saturated the EIB when increasing the number of parallel threads to six, which shows that our software-based caching strategy is successful for 6 of 7 applications. Using all six SPEs in parallel, our system was able to achieve an average speedup of 3.5 for the JIT and an average speedup of 2.5 for the interpreter version for the tested applications.

We conclude that the current prototype implementation of CellVM performs well for computation intensive programs that only operate on a small set of data. Our approach relies on several properties an application has to satisfy to achieve good performance. First, performance significantly depends on the switching activity from a Core-VM to the Shell-VM: having a transfer of control from the SPE to the PPE (which also includes synchronization operations in the current implementation) makes CellVM intolerably slow. Second, our approach relies on reasonable data locality. While an application implemented in C allows for an efficient overlapping of data transfer with computation, this overlapping is not inherently transparent to a Core-VM, and the DMA transfers that are required to move instructions and data to and from the local store add an additional overhead. An efficient prefetching of data would either require an analysis stage in the preparation phase (which is performed at runtime) or require monitoring of e.g. array accesses at runtime; whether this analysis/monitoring can be performed effectively remains an open question. However, the current implementation of CellVM leaves much room for improvements, and we strongly believe that if data prefetching and the synchronization/communication between the SPEs can be implemented efficiently, a Java VM for the Cell processor can perform very well.

8 Acknowledgments

This research effort was partially funded by the State of California and Microsoft Research under the Microelectronics Innovation and Computer Research Opportunities (MICRO) program, the Bavaria California Technology Center, and the National Science Foundation (NSF) under grants TC-0209163 and ITR-0205712. The authors would like to acknowledge Christian Steger, Technical University of Graz, and Thomas Gross, ETH Zurich, for their support of this research project. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the National Science Foundation or any other agency of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

References

[1] A.-R. Adl-Tabatabai, M. Cierniak, G.-Y. Lueh, V. M. Parikh, and J. M. Stichnoth. Fast, effective code generation in a just-in-time Java compiler. SIGPLAN Not., 33(5):280–290, 1998.
[2] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. 1986.
[3] J. Al-Jaroodi, N. Mohamed, H. Jiang, and D. Swanson. A comparative study of parallel and distributed Java projects for heterogeneous systems. In Proceedings of the International Parallel and Distributed Processing Symposium, 2002.
[4] Y. Aridor, M. Factor, and A. Teperman. A high performance cluster JVM presenting a pure single system image. In JAVA '00: Proceedings of the ACM 2000 Conference on Java Grande, pages 168–177, 2000.
[5] K. Arnold and J. Gosling. The Java Programming Language (2nd ed.). ACM Press/Addison-Wesley Publishing Co., 1998.
[6] J. Aycock. A brief history of just-in-time. ACM Comput. Surv., 35(2):97–113, 2003.
[7] J. R. Bell. Threaded code. Commun. ACM, 16(6):370–372, 1973.
[8] T. Brecht, H. Sandhu, M. Shan, and J. Talbot. ParaWeb: Towards world-wide supercomputing. In EW 7: Proceedings of the 7th Workshop on ACM SIGOPS European Workshop, pages 181–188, 1996.
[9] T. Chen, Z. Sura, K. O'Brien, and J. K. O'Brien. Optimizing the use of static buffers for DMA on a CELL chip. In The 19th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2006), November 2006.
[10] C.-Y. Cher and M. Gschwind. Cell GC: Using the Cell synergistic processor as a garbage collection coprocessor. In VEE '08: Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pages 141–150, 2008.
[11] B. Davis and J. Waldron. A survey of optimizations for the Java virtual machine. In PPPJ '03: Proceedings of the 2nd International Conference on Principles and Practice of Programming in Java, pages 181–183, 2003.
[12] A. Ferrari. JPVM: Network parallel computing in Java. Technical report, Charlottesville, VA, USA, 1997.
[13] A. E. Eichenberger, K. O'Brien, K. O'Brien, P. Wu, T. Chen, P. Oden, D. Prener, J. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, and M. Gschwind. Optimizing compiler for the CELL processor. In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pages 161–172, 2005.
[14] A. Feldmann, T. Gross, D. O'Hallaron, and T. Stricker. Subset barrier synchronization on a private-memory parallel system. In SPAA '92: Proceedings of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 209–218, 1992.
[15] GNU Classpath. http://www.gnu.org/software/classpath/.
[16] M. Gschwind. Chip multiprocessing and the Cell Broadband Engine. In CF '06: Proceedings of the 3rd Conference on Computing Frontiers, pages 1–8, 2006.
[17] M. Izatt, P. Chan, and T. Brecht. Ajents: Towards an environment for parallel, distributed and mobile Java applications. In JAVA '99: Proceedings of the ACM 1999 Conference on Java Grande, pages 15–24, 1999.
[18] JamVM. http://jamvm.sourceforge.net.
[19] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research & Development, 49(4/5):589–604, July/September 2005.
[20] M. Kandemir, J. Ramanujam, M. J. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh. Dynamic management of scratch-pad memory space. In DAC '01: Proceedings of the 38th Conference on Design Automation, pages 690–695, 2001.
[21] I. H. Kazi, H. H. Chen, B. Stanley, and D. J. Lilja. Techniques for obtaining high performance in Java programs. ACM Comput. Surv., 32(3):213–240, 2000.
[22] P. Klint. Interpretation techniques. Software — Practice & Experience, 11(9):963–973, September 1981.
[23] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. Addison-Wesley, 1996.
[24] D. Marquardt and H. Alkhatib. C2MP: A cache-coherent, distributed memory multiprocessor system. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 466–475, 1989.
[25] S. P. Midkiff, J. E. Moreira, and M. Snir. Optimizing array reference checking in Java programs. IBM Syst. J., 37(3):409–453, 1998.
[26] J. E. Moreira, S. P. Midkiff, M. Gupta, P. V. Artigas, M. Snir, and R. D. Lawrence. Java programming for high-performance numerical computing. IBM Syst. J., 39(1):21–56, 2000.
[27] OpenMP. The OpenMP specification for parallel computing. http://www.openmp.org.
[28] M. Philippsen and M. Zenger. JavaParty: Transparent remote objects in Java. Concurrency: Practice and Experience, 9(11):1225–1242, 1997.
[29] F. Qian, L. Hendren, and C. Verbrugge. A comprehensive approach to array bounds check elimination for Java. In CC '02: Proceedings of the 11th International Conference on Compiler Construction, pages 325–342, London, UK, 2002. Springer-Verlag.
[30] Y. Shi, D. Gregg, A. Beatty, and M. A. Ertl. Virtual machine showdown: Stack versus registers. In VEE '05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, pages 153–163, 2005.
[31] L. A. Smith, J. M. Bull, and J. Obdrzálek. A parallel Java Grande benchmark suite. In Supercomputing '01: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (CDROM), 2001.
[32] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani. Overview of the IBM Java just-in-time compiler. IBM Syst. J., 39(1):175–193, 2000.
[33] Titanium. http://titanium.cs.berkeley.edu.
[34] A. Noll, A. Gal, and M. Franz. Optimization strategies for a Java virtual machine interpreter on the Cell Broadband Engine. In CF '08: Proceedings of the 5th International Conference on Computing Frontiers, 2008.
[35] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. The potential of the Cell processor for scientific computing. In CF '06: Proceedings of the 3rd Conference on Computing Frontiers, pages 9–20, 2006.