
An introduction to multimedia processors

Stephan Sossau

Vertiefungsseminar Architektur von Prozessoren, SS 2008 Institute of Computer Science, University of Innsbruck June 30, 2008
This seminar paper gives a short overview of media processors. After describing various media processing platforms and comparing their strengths and weaknesses, it presents the most important tasks of a media processor and one media processor family in detail: the TriMedia family from NXP, which is described in terms of architecture, implementation and field of application. Finally, a short introduction to the MPEG-2 algorithm shows some of the special multimedia operations of the TriMedia.

1 An introduction to media processing platforms

Multimedia has become an important technological innovation and is part of our daily lives, e.g. in set-top boxes for digital television, video conferencing systems or mobile phones. In 3G (the third generation of mobile phone standards) in particular, faster networks have stimulated the use of multimedia applications. Multimedia is defined as a combination of different forms of information: text, 2D and 3D graphics, video and audio. These formats were traditionally presented in analog form, but digital representation has advantages such as easier editing. The extreme computational requirements of new encoding techniques like MPEG-4 or H.264 and the enormous size of digital information confront the scientific community and the industry with a sizable problem.

1.1 Requirements of multimedia content

What are the demands on a media processing platform? To implement video telephony, for example, the platform has to perform both the video and the audio processing. Common multimedia tasks are:
- Video decoding and encoding
- Audio decoding and encoding
- 2D and 3D graphics
- Data communications

New mobile phones increase performance in every domain, but media processors do not have to support all application domains. NXP's TriMedia processor, for example, focuses only on the video processing part. NXP has created multimedia processors like the Nexperia, which uses the TriMedia for video processing but also includes additional hardware for other tasks like 3D graphics and imaging.

Encoded multimedia data is processed along a decoding chain. The incoming bitstream has to be demultiplexed, decoded and adapted in different stages. Decoding and encoding always involve a sizable amount of processing. Multimedia processing platforms have to decode modern video codec standards in real time to guarantee a constant frame rate.

Figure 1: Possible decoding chain.[1]

To illustrate the possible complexity of a decoding process and the computational demands on a multimedia processing platform, MPEG-4 decoding is briefly described as an example. MPEG-4 is a unified standard that allows the streaming of both graphics and video. It is a video encoding standard that allows universal, low bit-rate transfer of video by reusing most of the data from the first video frame; only the data that has changed from one frame to the next is transferred. The algorithm applies different compression techniques: dividing the image into 8x8 blocks or 16x16 macroblocks, motion estimation, transform coding with the discrete cosine transform (DCT), quantization, and run-length and Huffman coding for variable length codes (VLC). Encoding and decoding are computationally intensive. Even decoding video for a small screen with QVGA resolution (quarter VGA: 320x240) requires about 2 billion operations per second, which is a huge amount of processing.
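To put the QVGA figure into perspective, the following minimal sketch converts it into a rough per-pixel budget. The operation count comes from the text; the frame rate of 30 frames per second is an assumption made only for this illustration.

```python
# Rough per-pixel operation budget for QVGA decoding.
# OPS_PER_SECOND is the figure quoted in the text; ASSUMED_FPS is an
# assumption for illustration and is not stated in the text.
WIDTH, HEIGHT = 320, 240            # QVGA resolution
OPS_PER_SECOND = 2_000_000_000      # about 2 billion operations per second
ASSUMED_FPS = 30                    # hypothetical frame rate

pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * ASSUMED_FPS
ops_per_pixel = OPS_PER_SECOND / pixels_per_second

print(f"{pixels_per_frame} pixels per frame")
print(f"{pixels_per_second} pixels per second")
print(f"about {round(ops_per_pixel)} operations per pixel")
```

Under this assumption the decoder may spend several hundred operations on every displayed pixel, which indicates why plain scalar processing is not sufficient.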

1.2 Overview of media processing platforms

In the cost-driven embedded consumer market, audio and video processing were initially addressed with dedicated hardware. In this way the required performance could be achieved at lower cost than with programmable processors. But the increasing complexity of audio and video standards made programmability attractive. MPEG-2, for example, was initially performed on dedicated hardware, but today's standards like H.264 or MPEG-4 are performed by application-specific, more programmable processors. That is why modern devices are much more programmable than the mainframes of the 1960s. In this seminar paper, a short overview of different multimedia processing platforms is given. Possible approaches are illustrated in Figure 2.

General-Purpose Processors (GPPs) are not designed for a specific usage but for generic program execution. They can be extended with SIMD-style (Single Instruction, Multiple Data) instructions to exploit intra-word parallelism. For example, Intel's x86 architecture has been extended with MMX instructions. SIMD-style instructions are typically generic multimedia instructions, not application specific. However, a weak point of this solution is its poor support for media processing data movement: non-aligned memory access and the streaming nature of media data types are not well addressed.
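Intra-word parallelism can be illustrated with a small software emulation. The sketch below treats a 32-bit word as four unsigned 8-bit lanes and adds all four lanes with ordinary integer operations. The function names are hypothetical, and the code does not reproduce any particular MMX or TriMedia instruction.

```python
# Emulate a SIMD-style add of four unsigned 8-bit lanes packed in a 32-bit word.
LOW7 = 0x7F7F7F7F   # low 7 bits of every lane
HIGH = 0x80808080   # top bit of every lane


def pack(lanes):
    """Pack four 8-bit values into one 32-bit word (lane 0 is least significant)."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 32-bit word into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def packed_add_u8(a, b):
    """Add four 8-bit lanes with wrap-around; carries never cross lane borders."""
    low = (a & LOW7) + (b & LOW7)            # 7-bit sums fit inside each lane
    return (low ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF


result = packed_add_u8(pack([1, 2, 3, 4]), pack([10, 20, 30, 40]))
print(unpack(result))
```

A real SIMD unit performs the same lane-wise addition in a single operation, so one instruction does the work of four scalar additions.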

Figure 2: Media processing platforms.[1]

Examples of GPPs with streaming vector capabilities (GPP+vector) are Motorola's reconfigurable streaming vector processor (RSVP) and the matrix oriented multimedia (MOM) approach from UPC in Barcelona, which creates matrix-style operations by merging SIMD-style with vector-style extensions. GPP+vector approaches are efficient if a large amount of data level parallelism can be exploited. This works fine for most older video codec standards, but newer codecs like MPEG-4 or H.264 offer less data level parallelism and have an increasing control overhead. In the MPEG-2 codec, blocks of 16x16 pixels are used for motion vectors, but these blocks get smaller in newer codecs. For H.264, processing a 4x4 block may even require that the blocks to the left of and above it have already been processed. As a result, approaches that rely only on stream-based processing of large vectors of independent data elements become less efficient.

Design-time reconfigurable processors, like the Xtensa LX from Tensilica, allow the user to add the required application-specific instruction extensions to the base processor during the design process. The resulting functionality of the processor is limited.

Run-time reconfigurable processors are often extended with a Field Programmable Gate Array (FPGA). They can adapt their behaviour to the application at hand, but the cost of reconfigurability in terms of silicon area is high.

Media processors emerged in the 1990s. These processors typically combine a very long instruction word (VLIW) architecture with SIMD-style instructions. In this way they can exploit instruction level parallelism and intra-word parallelism. Media processors also typically have a unified register file for integer, floating point and SIMD-style operands, whereas GPPs typically have separate register files.
Media processor register files tend to be larger than GPP register files, so that a large data working set can be kept in registers. This prevents the load and store operations that spilling under register pressure would otherwise generate. Media processors typically support non-aligned memory access and the streaming nature of media data types through either data prefetching or direct memory access (DMA) techniques.

Fixed-function dedicated hardware platforms are application-specific solutions without any flexibility for programmability. With their low cost in terms of silicon area and power consumption, they may be attractive for a well-defined video or audio processing task that has no need for flexibility.

One might argue that this partitioning of media processing platforms into five distinct approaches is somewhat artificial, and indeed the best solution for a specific application (domain) may be a combination of approaches, especially when multiple domains are supported. We have to distinguish multimedia processors like NXP's Nexperia or ST's Nomadik from Digital Signal Processors (DSPs) like the TriMedia family, which are designed specifically for digital signal processing, generally in real-time computing. DSPs are self-contained processors with their own (often difficult to use) instruction set. They are often used as additional components on a multimedia processor that supports more application domains.

After describing all these different approaches we can compare them along multiple axes:
- Width of the application domain.
- Cost, in terms of development, silicon area and power consumption.
- Infrastructure: the availability of tools, operating systems and development environments.
- Performance level, measured in the context of the target application; for video decoding, for example, the performance level can be expressed in terms of image resolution.
- Time-to-market: the time it takes until the product reaches the marketplace.

Figure 3 shows the relative strengths and weaknesses of each approach and suggests that the best solution for a specific application (domain) may be a combination of approaches.

Figure 3: Relative strengths and weaknesses of media processing platforms.[1]

1.3 Manufacturers and (multi)media processors

After this introduction to the environment, tasks and requirements of the media processor, let us have a look at the market. Nearly every major semiconductor manufacturer produces media and application processors. To name a few: AMD, Intel, ATI, Texas Instruments, Infineon, Freescale, Nvidia, Broadcom, STMicroelectronics, Philips Semiconductors, Hitachi, Atmel and many more. They offer a wide variety of media and multimedia processor families, as shown in Table 1.

2 NXP's TriMedia media processor

The first example is the TriMedia family from NXP, founded by Philips Semiconductors. They published their first TriMedia, the TM-1 or TM1000, in 1998. Since then, over 4 architecture-level changes, they have created numerous successors. In 2006 they published the TM3270, which is described in more detail here. The TriMedia is a DSP (Digital Signal Processor) used as the video engine in NXP's Nexperia multimedia processor, which is also described.

Manufacturer   Media processor
Intel          Intel CE 2110
AMD            Imageon
ST             Nomadik
NXP            TriMedia, Nexperia
TI             VelociTI
Freescale      i.MX
...            ...

Table 1: Examples for manufacturers and media processors.

Figure 4: PNX4101 block diagram.[1]

2.1 Field of application

The TriMedia processors do not compute 3D graphics. Specific graphics processors have to deal with the increasing demands of this domain. TriMedia processors focus on video and audio processing. The TM3270 is a low-power processor, but its performance is good enough for both portable and connected devices. For example, the TM3270 is used as the video engine in NXP's mobile multimedia processor Nexperia PNX4101. The PNX4101 contains additional components, like an ARM926 CPU subsystem and an image processor, to compute other multimedia tasks. Figure 4 shows a PNX4101 block diagram with its multiple components, including a TriMedia TM3270 media processor CPU as one part of the whole processor. The PNX4101 is an architecture optimized for multimedia applications in mobile devices. An example of a connected device is the video back-end processor PNX5100, which contains 3 TM3271 (successor of the TM3270) TriMedia processors and is used in TVs.

2.2 TM3270
The TM3270 was published in 2005. The goal was to create a balanced processor design in terms of power consumption and silicon area that is attractive to both the connected and the portable markets. Additionally, the TriMedia application domain does not include 3D graphics processing, because of the increasing computational demands in this domain. TriMedia focuses on audio and video processing and, as mentioned above, is used in NXP's Nexperia multimedia processors.

"Architecture concerns the specification of the function that is provided to the programmer, such as addressing, addition, interruption, and input/output (I/O). Implementation concerns the method that is used to achieve this function, such as parallel datapath and microprogrammed control. Realization concerns the means used to materialize this method, such as electrical, magnetic, mechanical and optical devices and the powering and packaging for them." (G.A. Blaauw and F.P. Brooks Jr., Computer Architecture: Concepts and Evolution)

Three points will be described: architecture, implementation and realization. They share the following main targets:
- Application domain: Video processing requirements influence most design choices.
- Area: Low-cost realization in terms of silicon area.
- Power: Low power consumption for the battery-operated market.
- Synthesizable: Migration of a processor implementation between different CMOS technologies.

2.2.1 Architecture

- 64 Kbyte instruction cache (8-way set-associative, 128-byte line size, full LRU (Least Recently Used) replacement).
- 128 Kbyte data cache (4-way set-associative, 128-byte line size, LRU), pseudo dual-ported: a VLIW instruction may contain two load operations, two store operations, or one load and one store operation. With this size the data cache has the capacity to capture the data working set of most video algorithms on standard definition images (NTSC: 720x480, PAL: 720x576).
- Non-aligned memory access for load and store operations: no stall cycles for misaligned memory accesses, and fewer operations (SIMD: Single Instruction, Multiple Data). Fewer operations reduce register-file pressure.
- The functional units are distributed over 5 issue slots. A single VLIW instruction contains the operations for each issue slot. Each functional unit has read ports, write ports and an additional single-bit port for guarded execution, which avoids conditional jumps.
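The cache parameters listed above determine the number of sets and how an address splits into tag, set index and line offset. The following minimal sketch derives these values. It assumes conventional power-of-two indexing with the index bits directly above the offset bits; the text does not state how the TM3270 actually maps addresses, and the function names are hypothetical.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, line offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Parameters taken from the list above.
icache = cache_geometry(64 * 1024, ways=8, line_bytes=128)
dcache = cache_geometry(128 * 1024, ways=4, line_bytes=128)
print("instruction cache:", icache)
print("data cache:", dcache)

# Hypothetical address, decomposed for the data cache.
print(split_address(0x12345, dcache[1], dcache[2]))
```

With these parameters the instruction cache has 64 sets and the data cache has 256 sets, each line holding 128 bytes.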

Figure 5: PNX4101 Issue slots of the TriMedia.[7]

Although the TM3270 focuses on video processing, the ISA should not be application specific. The functional units are shown in Figure 6, and some features support multimedia processing:
- No operands with architectural state.
- Operations are limited to at most two issue slots. Operations that also use the neighboring issue slot are called super-operations; they may consume up to four 32-bit sources and produce up to two 32-bit results.
- Operations should support guarded execution to avoid conditional jumps.
- The ISA contains SIMD vector instructions whose operands are partitioned into sub-operands.
- Non-aligned memory access efficiently extends traditional SIMD computational processing.
- Collapsed load operations with interpolation (LS FRAC8).
- CABAC operations: Context-based Adaptive Binary Arithmetic Coding is a significant part of an H.264/AVC video decoder. These operations violate the goal of applicability in multiple domains, but they are available because of the significant performance enhancement, with a speedup of 1.5 to 1.7.
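Guarded execution, as listed above, replaces a conditional jump by an operation whose result is only committed when a one-bit guard is set. The sketch below emulates this idea in software by clipping a value to the range 0..255. The helper names are hypothetical and the code does not model actual TM3270 operations.

```python
MASK32 = 0xFFFFFFFF


def guarded_write(guard, new_value, old_value):
    """Commit new_value only if the least significant guard bit is 1."""
    mask = -(guard & 1) & MASK32          # 0xFFFFFFFF if guard set, else 0
    return (new_value & mask) | (old_value & ~mask & MASK32)


def clip_to_byte(x):
    """Clip a signed value to 0..255 using guards instead of jumps."""
    result = x & MASK32
    below = int(x < 0)                    # guard bit: value is too small
    above = int(x > 255)                  # guard bit: value is too large
    result = guarded_write(below, 0, result)
    result = guarded_write(above, 255, result)
    return result


print([clip_to_byte(v) for v in (-5, 77, 300)])
```

Both guarded writes are always issued, so the instruction stream is the same for every input and no jump delay slots have to be filled.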

Figure 6: Functional units of the TM3270.[1]

2.2.2 Implementation

The processor pipeline: The TM3270 has deep pipelining (long operation latency), which has a negative impact on the cycles per VLIW instruction ratio but allows a higher frequency design. The pipeline has 7 to 12 stages; the depth depends on the operation latency. Single-cycle latency operations have 7 stages, and multiple-cycle operations have additional execute stages. The stages are I1, I2, I3, P, D, X1, (X2 ... X6) and W. The first stages I1, I2, I3 and P are called the front-end of the pipeline; X1, X2, ..., W are the back-end. Figure 7 shows the pipeline.
- Stage I1 to I3: Implement the sequential instruction cache. Every cycle a 32-byte chunk of instruction information can be retrieved from the storage and stored in a 4-entry instruction buffer in stage P.
- Stage P: A VLIW instruction is retrieved from the instruction buffer, the individual operations are extracted, and the guard and source registers are identified.
- Stage D: Decodes the operations and determines the operands. There are 5 operations per VLIW instruction, five guard register read ports (1 bit wide, least significant bit) and 10 source register read ports.
- Stage X1 to X6: The number of execute stages is determined by the operation's latency. This is where the load/store unit, which is connected to issue slots 4 and 5, is implemented. Jump operations are performed in stage X1. Control flow changes without stall cycles are possible without jump or branch prediction support, because the cache tags (stage I1) and the jump execution (stage X1) are separated by 5 pipeline stages, which matches the number of architecturally visible jump delay slots. However, 25 issue slots (5 jump delay slot VLIW instructions) have to be filled with useful operations, which complicates the task of the compiler/scheduler.
- Stage W: Gathers operation results and allows up to 5 writes to the register file.
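The stage counts and the number of delay slot operations quoted above follow directly from the stage list. The following minimal sketch reproduces this arithmetic; the function names are hypothetical.

```python
FRONT_END = ["I1", "I2", "I3", "P"]
ISSUE_SLOTS = 5


def pipeline_stages(latency):
    """Return the stages an operation with the given latency (1..6) passes."""
    if not 1 <= latency <= 6:
        raise ValueError("latency must be between 1 and 6 cycles")
    execute = [f"X{i}" for i in range(1, latency + 1)]
    return FRONT_END + ["D"] + execute + ["W"]


def jump_delay_slots():
    """Distance in stages between the cache tag access (I1) and the jump (X1)."""
    stages = pipeline_stages(1)
    return stages.index("X1") - stages.index("I1")


print(len(pipeline_stages(1)), "stages for single-cycle operations")
print(len(pipeline_stages(6)), "stages for the longest operations")
print(jump_delay_slots() * ISSUE_SLOTS, "operations in the jump delay slots")
```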

Figure 7: TriMedia pipeline.[1]

Figure 8: TM3271 floorplan.[1]

2.2.3 Realization

Realization means materializing the implementation. Here the TM3270 is realized in a low-power CMOS process with a 90 nm feature size. It is fully synthesizable and uses off-the-shelf single-ported SRAMs. It runs at a frequency of 450 MHz (350 MHz worst case), has 6 metal layers and an area of 8.08 mm² (47% SRAM), as illustrated in Table 2 and Figure 8.

Module         Description                     Area (mm²)
IFU            Instruction Fetch Unit          1.46
Decode         VLIW instruction decoding       0.05
Register-file  128-entry register file         0.97
Execute        All functional units            1.53
LS             Load/Store unit                 3.60
BIU            Bus Interface Unit              0.24
MMIO           Memory Mapped IO peripherals    0.23
Total                                          8.08

Table 2: TM3270 area breakdown.[1]
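As a quick consistency check of Table 2, the sketch below sums the module areas and reports each module's share of the total. All numbers are taken from the table; only the variable names are introduced here.

```python
# Module areas in mm², copied from Table 2.
AREA_MM2 = {
    "IFU": 1.46,
    "Decode": 0.05,
    "Register-file": 0.97,
    "Execute": 1.53,
    "LS": 3.60,
    "BIU": 0.24,
    "MMIO": 0.23,
}

total = round(sum(AREA_MM2.values()), 2)
shares = {name: 100 * area / total for name, area in AREA_MM2.items()}

print("total area:", total, "mm²")
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:14s}{share:5.1f} %")
```

The load/store unit, which contains the data cache, is by far the largest module.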

The potential weakness of all mobile processors is their power consumption. Low power for battery-operated devices is a main requirement of the TM3270 processor. We distinguish static and dynamic power consumption. Static power consumption is mainly determined by transistor leakage current. Dynamic power consumption is caused by the capacitances inherent in CMOS circuits being switched between low and high voltage values. Power consumption is roughly 0.9 mW/MHz. The processor provides enough performance for the most demanding video applications.
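The quoted figure can be turned into a rough power estimate at the two clock frequencies given for the TM3270. The sketch assumes that power scales linearly with frequency, which is a simplification introduced here and is not claimed by the source.

```python
MW_PER_MHZ = 0.9    # figure quoted in the text


def estimated_power_mw(frequency_mhz):
    """Estimate power in mW, assuming linear scaling with clock frequency."""
    return MW_PER_MHZ * frequency_mhz


for mhz in (350, 450):
    print(f"{mhz} MHz: about {estimated_power_mw(mhz):.0f} mW")

# 1 mW/MHz corresponds to 1 nJ per clock cycle (1e-3 W / 1e6 Hz).
energy_per_cycle_nj = MW_PER_MHZ
print(f"about {energy_per_cycle_nj} nJ per cycle")
```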

2.3 Summary
The TriMedia media processor is a VLIW CPU with efficient extensions to the ISA. By combining SIMD-style and CABAC operations with techniques like non-aligned memory access for load and store operations, this platform is efficient at video processing. It is used as the video engine of the multimedia processor Nexperia, an application processor for multiple multimedia domains.

3 MPEG-2
The MPEG-2 algorithm is a mechanism to reduce the bitrate of video and audio. Encoders have access to a large number of techniques and algorithms and have the freedom to apply them in a certain order. The resulting bitstream has to be decoded by the decoder, which has to recognise and know the techniques used. MPEG-2 is certainly not the latest video compression method, and it has been extended many times to lower the bitrate and to guarantee better quality. Nevertheless, it is a good example to demonstrate some of the media processing functional units used in the TM3270 multimedia processor.

Figure 9: The MPEG-2 encoder.[1]

The video frames are encoded into so-called I (intra-coded) frames, P (predictive-coded) frames and B (bidirectionally-predictive-coded) frames. I frames are calculated about every 15 frames. They are compressed independently of other frames, as single images. They have the worst compression but are important reference frames. P frames cannot be compressed independently: they depend on a previous I or P reference frame and have better compression. B frames depend on both previous and following reference frames. They have the best compression but are not themselves available as reference frames.

To illustrate some of the advantages of media processors, we take a closer look at how a frame is encoded. First, the image is divided into 8x8 pixel blocks. For predicted frames, a block of differences is calculated with the help of reference images, motion vectors and the current block of pixels; for I frames the pixel block itself is used. The data of the resulting blocks is transformed with a discrete cosine transformation (DCT). The resulting coefficients are much easier to work with and allow better compression. After the DCT a lossy quantization step compresses the data, followed by a simple run-length encoding and finally a Huffman variable length encoding. At last the bitstream can be generated. This chain is illustrated in Figure 9. The discrete cosine transformation relies heavily on butterfly and rotate operations, which are supported in the instruction set architecture of multimedia processors.
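The transform, quantization and run-length steps of this chain can be sketched in a few lines. The example below is a simplified illustration, not the MPEG-2 algorithm itself: it assumes a single uniform quantizer step instead of a quantization matrix, scans the coefficients row by row instead of in zig-zag order, and omits the Huffman stage. All function names are hypothetical.

```python
import math

N = 8  # block size


def dct_1d(v):
    """Orthonormal 8-point DCT-II of a list of samples."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        total = sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                    for n in range(N))
        out.append(scale * total)
    return out


def dct_2d(block):
    """2D DCT of an 8x8 block: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    return [[cols[c][r] for c in range(N)] for r in range(N)]


def quantize(coeffs, step=16):
    """Lossy step: divide by an assumed uniform step size and round."""
    return [[int(round(c / step)) for c in row] for row in coeffs]


def run_length(values):
    """Encode non-zero values as (preceding zero run, value) pairs."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs  # trailing zeros are dropped


flat_block = [[100] * N for _ in range(N)]      # uniform grey block
quantized = quantize(dct_2d(flat_block))
scan = [c for row in quantized for c in row]    # simplified row-major scan
print(run_length(scan))
```

For a uniform block only the DC coefficient survives quantization, so the 64 input samples shrink to a single (run, value) pair.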


In the TM3270 the butterfly operations are realized with DSPIDUALADD and DSPIDUALSUB. The more complex rotate function is realized with the super-operation SUPER_DUALISCALEMIX. Performance improves considerably when these operations are used, as the following table shows.

This short example shows how effectively multimedia processors can compute application-specific operations.
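The butterfly and rotate operations used in fast DCT algorithms can be sketched as follows. The dual 16-bit helpers only illustrate the idea of processing two halfwords with one operation; that they saturate, and that this resembles the behaviour of DSPIDUALADD and DSPIDUALSUB, is an assumption made for this sketch. All function names are hypothetical.

```python
import math


def butterfly(a, b):
    """Basic butterfly: sum and difference of two values."""
    return a + b, a - b


def rotate(x, y, angle):
    """Plane rotation of (x, y), the more complex DCT building block."""
    c, s = math.cos(angle), math.sin(angle)
    return x * c + y * s, -x * s + y * c


def sat16(v):
    """Saturate to the signed 16-bit range (assumed clipping behaviour)."""
    return max(-32768, min(32767, v))


def dual_add16(a_pair, b_pair):
    """Add two pairs of 16-bit values lane by lane with saturation."""
    return tuple(sat16(a + b) for a, b in zip(a_pair, b_pair))


def dual_sub16(a_pair, b_pair):
    """Subtract two pairs of 16-bit values lane by lane with saturation."""
    return tuple(sat16(a - b) for a, b in zip(a_pair, b_pair))


# Two butterflies computed with one dual add and one dual subtract.
a, b = (5, 30000), (3, 10000)
print(dual_add16(a, b), dual_sub16(a, b))
print(butterfly(5, 3))
print(rotate(1.0, 0.0, math.pi / 2))
```

One dual add and one dual subtract together produce the outputs of two butterflies, which halves the operation count compared with scalar code.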

4 Conclusion
In the first chapter we learned about the different approaches to multimedia processing platforms and the demands on these approaches. The high computational effort of complex video codec standards and the need for low power consumption in mobile, unconnected devices have led to the development of multimedia application processors that can satisfy both with the help of sophisticated new technology. NXP's VLIW TriMedia processor showed a number of ways to optimize the performance of multimedia computations. Multimedia applications have already found their way into numerous aspects of our lives. The growing importance of the mobile market, together with new video codec algorithms for higher resolution and more efficient compression, will drive the development of new multimedia processors.

[1] Jan-Willem van de Waerdt. The TM3270 Media processor. 2006
[2] Atmel Corporation. MCU Architectures for Compute-Intensive Embedded Applications. December 2005
[3] STMicroelectronics. Nomadik - Open multimedia platform for next generation mobile devices.
[4] NXP. NXP Nexperia mobile multimedia processor PNX4101. 2007
[5] STMicroelectronics. STn8820 Mobile multimedia application processor - Data Brief. February 2008
[6]
[7] G.J. Hekstra, G.D. La Hei, P. Bingley, F.W. Sijstermans. TriMedia CPU64 Design Space Exploration.