
A performance evaluation of MPEG-21 BSDL in the context of H.264/AVC

Wesley De Neve+, Sam Lerouge+, Peter Lambert+, and Rik Van de Walle*
* Ghent University, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
+ Ghent University - IMEC, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium

ABSTRACT

H.264/AVC is a new specification for digital video coding that aims at deployment in many multimedia applications, such as video conferencing, digital television broadcasting, and internet streaming. This is reflected by the design goals of the standard, which are the provision of an efficient compression scheme and a network-friendly representation of the compressed data. Those requirements have resulted in a very flexible syntax and architecture that is fundamentally different from previous standards for video compression. In this paper, a detailed discussion is provided on how to apply an extended version of the MPEG-21 Bitstream Syntax Description Language (MPEG-21 BSDL) to the Annex B syntax of the H.264/AVC specification. This XML based language facilitates the high-level manipulation of an H.264/AVC bitstream in order to take into account the constraints and requirements of a particular usage environment. Our performance measurements and optimizations show that it is possible to make use of MPEG-21 BSDL in the context of the current H.264/AVC standard with a feasible computational complexity when exploiting temporal scalability.
Keywords: AVC, BSDL, Content Adaptation, Content Description, H.264, MPEG, Scalability

1. INTRODUCTION
H.264/AVC is a new specification for digital video coding [1], characterized by a design that targets efficiency, robustness, and usability [2]. Because of its support for a wide range of bit rates [3], H.264/AVC can even be considered a universal standard for digital video coding. This implies that the specification will be used under the hood of many multimedia applications in the very near future. Those video-enabled applications will most probably be deployed on a wide variety of terminals, exchanging information with each other over several types of networks. This is not an attractive situation for content providers, because they are obliged to provide several versions of the same multimedia presentation in order to reach a target audience that is as large as possible. It would be much more efficient if they only had to provide one presentation that could be reused under all circumstances.
A solution for this diversity is the usage of scalable video coding, together with a complementary content adaptation system. The current H.264/AVC specification contains no explicit provisions for enabling scalability, although some efforts are emerging [4]. Scalability is currently a hot topic in the video coding and content adaptation community [5], because scalable coding should make it possible to deal with the growing variety of networks and terminals in an efficient way.
To be more specific, consider the scenario of a user who has a large collection of music video clips at his or her disposal. One may assume that all video streams are encoded at a very high quality, for instance by making use of an efficient implementation of the Main Profile as available in the H.264/AVC specification. As such, the media files in question are suited for playback on a digital home entertainment system. But what if the user wants to enjoy the same video clips on a mobile device when traveling to work by train?
Then the need arises for a content adaptation system that should make it possible to
Further author information: (Send correspondence to Wesley De Neve)
Wesley De Neve: E-mail: wesley.deneve@ugent.be, Telephone: +32 (0)9 264 89 29
Sam Lerouge: E-mail: sam.lerouge@ugent.be, Telephone: +32 (0)9 264 89 17
Peter Lambert: E-mail: peter.lambert@ugent.be, Telephone: +32 (0)9 264 89 29
Rik Van de Walle: E-mail: rik.vandewalle@ugent.be, Telephone: +32 (0)9 264 33 68

Applications of Digital Image Processing XXVII, edited by Andrew G. Tescher, Proceedings of SPIE Vol. 5558 (SPIE, Bellingham, WA, 2004) 0277-786X/04/$15 doi: 10.1117/12.564822


Downloaded from SPIE Digital Library on 08 Jul 2010 to 143.248.227.93. Terms of Use: http://spiedl.org/terms

realize an efficient transfer of the video clips from the full-featured PC to a mobile device, taking into account the constraints of the new usage environment (such as a limited battery life, a reduced screen resolution, ...). Although H.264/AVC is a specification for single-layered video compression, we will show how an extended version of MPEG-21 BSDL can be used in combination with the Annex B syntax in order to enable some high-level manipulations of an H.264/AVC bitstream. In particular, we will discuss performance results obtained when using BSDL to exploit a trivial form of temporal scalability in H.264/AVC. The outline of the paper is as follows: after an in-depth overview of the involved technologies in section 2, the methodology applied for performing the measurements is described in section 3. Section 4 discusses the obtained results and section 5 concludes.

2. MPEG-21 BSDL IN THE CONTEXT OF H.264/AVC


This section describes the different technologies that were involved in our research. First of all, an overview is given of the design characteristics and syntaxes of the H.264/AVC specification. Second, a discussion is provided of the MPEG-21 Bitstream Syntax Description Language (MPEG-21 BSDL), together with a detailed description of the extensions that were needed in order to describe a large part of the Annex B syntax.

2.1. Overview of the Design Characteristics and Syntaxes of H.264/AVC


In order to cope with the diversity of current and future network protocols, H.264/AVC relies on a two-tier architecture. As illustrated by Figure 1(a), this architecture consists of a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). While the VCL is responsible for the efficient compression of the video data, the NAL is responsible for transforming the compressed video data into a generic stream of logical data units. The latter are called Network Abstraction Layer Units (NALUs), and these syntax structures have the property that their mapping to a transport protocol (RTP, MPEG-2 Systems, ...) or storage format (the ISO Media File Format, ...) is straightforward. In fact, the NALUs are the fundamental units of processing in the context of H.264/AVC. As illustrated by the NALU layer in Figure 1(b), the units in question consist of a one-byte header and a payload. The structure of the payload is determined by the value of the nal_unit_type syntax element, as available in the NALU header. As such, the NALUs are responsible for the delivery of several types of data. For instance, the parameter set NALUs carry parameters necessary for the decoding process, while the coded slice NALUs contain the actual compressed video data. The allowed NALU type codes are listed in the table of Figure 1(c) (a similar table can be found in the standards document), while Figure 1(d) provides more detail about the dependencies between the several types of NALUs and the location of some of the most relevant decoding parameters.
The bitstream containing the coded representation of the header information and the video data can be described by making use of two syntaxes: the byte stream format or the NAL unit stream format. The byte stream syntax is characterized by the fact that the NALUs are separated from each other by zero or more zero-valued bytes and a start code prefix (see Figure 1(b)). This syntax is also known as the Annex B syntax.
Otherwise, without the presence of zero-valued bytes and start code prefixes, one is dealing with the NALU syntax. The latter is for instance useful in systems that provide their own kind of framing. The byte stream format can be constructed from the NAL unit stream format by ordering the NAL units in decoding order and prefixing each NAL unit with zero or more zero-valued bytes and a start code prefix.
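The relation between the two syntaxes can be illustrated with a minimal Python sketch. It is not part of the reference software: the function and class names are hypothetical, and the sketch assumes that the last byte of a NALU is never zero, so that zero-valued bytes in front of a start code prefix can be treated as padding. It splits a byte stream into NALUs and decodes the one-byte header into forbidden_zero_bit (1 bit), nal_ref_idc (2 bits), and nal_unit_type (5 bits).

```python
from typing import Iterator, NamedTuple


class NaluHeader(NamedTuple):
    forbidden_zero_bit: int
    nal_ref_idc: int
    nal_unit_type: int


def parse_nalu_header(first_byte: int) -> NaluHeader:
    """Split the one-byte NALU header into its three fields (1 + 2 + 5 bits)."""
    return NaluHeader(
        forbidden_zero_bit=(first_byte >> 7) & 0x01,
        nal_ref_idc=(first_byte >> 5) & 0x03,
        nal_unit_type=first_byte & 0x1F,
    )


def split_annex_b(stream: bytes) -> Iterator[bytes]:
    """Yield the NALUs of a byte stream, without start codes or zero padding."""
    start_code = b"\x00\x00\x01"
    pos = stream.find(start_code)
    while pos != -1:
        begin = pos + len(start_code)
        nxt = stream.find(start_code, begin)
        end = len(stream) if nxt == -1 else nxt
        # Assumption: zero-valued bytes at the end of this span are padding
        # that precedes the next start code prefix, not NALU data.
        nalu = stream[begin:end].rstrip(b"\x00")
        if nalu:
            yield nalu
        pos = nxt


# Hypothetical three-NALU stream: the header bytes 0x67, 0x68 and 0x01
# decode to nal_unit_type 7 (SPS), 8 (PPS) and 1 (non-IDR slice).
example = (b"\x00\x00\x00\x01\x67\x42"
           b"\x00\x00\x01\x68\xce"
           b"\x00\x00\x00\x01\x01\x9a")
headers = [parse_nalu_header(nalu[0]) for nalu in split_annex_b(example)]
```

In this example the third NALU has nal_ref_idc equal to zero, which is the property used later on to identify slices that can be dropped.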

2.2. Overview of the MPEG-21 Bitstream Syntax Description Language


MPEG-21 BSDL is an XML based language for the description of the syntax of (scalable) bitstreams. In order to avoid a large overhead and unnecessary computations, the language in question will most often only be used for the description of the high-level structure of a bitstream [6]. BSDL was developed in the context of the MPEG-21 Multimedia Framework which aims to enable the transparent and augmented use of multimedia
Sometimes other languages are used for the description of the syntax of media related bitstreams. For instance, the MPEG-4 Systems standard makes use of the Syntactic Description Language (SDL). This language makes it possible to document the syntax of object-oriented structures in a C++-like way.


Proc. of SPIE Vol. 5558


[Figure 1, panels (a) to (d). Only the text of the figure survives; it is summarized per panel.]

(a) An H.264/AVC encoder in the context of BSDL.
Labels: Control Data (Encoding Parameters); Video Coding Layer (coded macroblock, data partitioning, coded slice/partition); Network Abstraction Layer (generic stream of NAL units); Content adaptation involving MPEG-21 BSDL; Systems layer (synchronization, ...) with RTP/IP, ISO Media File Format, and MPEG-2 Systems.

(b) Structure of a NAL unit carrying slice data.
NAL layer: zero_byte (0x00), start_code_prefix_one_3bytes (0x000001), NAL unit (NALU). Annex B syntax (with start codes) versus NALU syntax (without start codes).
NALU layer: NAL header (size of 1 byte: forbidden_zero_bit, nal_ref_idc, nal_unit_type), raw byte sequence payload (RBSP).
Slice layer: slice header, slice data; string of data bits (SODB), stuffing.
Macroblock layer: MB MB MB ...

(c) NALU type codes.
nal_unit_type   NALU content (RBSP structure)
0               Unspecified
1               Coded slice of a non-IDR picture
2               Coded slice data partition A
3               Coded slice data partition B
4               Coded slice data partition C
5               Coded slice of an IDR picture
6               Supplemental Enhancement Information (SEI)
7               Sequence Parameter Set (SPS)
8               Picture Parameter Set (PPS)
9               Access unit delimiter
10              End of sequence
11              End of stream
12              Filler data
13..23          Reserved
24..31          Unspecified (24-29 allocated for RTP)

(d) NALU dependencies and relevant syntax elements.
NAL header + (active) sequence parameter set (SPS): parameters valid for an entire sequence (profile@level information, resolution, number of reference pictures).
NAL header + (active) picture parameter set (PPS): parameters valid for at least one picture (type of entropy coding, number of slice groups (FMO), initial values for the quantisation parameter, parameters for the deblocking filter).
NAL header + slice header + slice data: frequently varying parameters (slice type, address of first macroblock in slice).
The NALU type is available in the NAL header.

Figure 1. Schematic overview of the design characteristics and syntaxes of H.264/AVC.

resources across a wide range of networks and devices [7]. It is actually embedded in part 7 of MPEG-21, better known as MPEG-21 Digital Item Adaptation (MPEG-21 DIA) [8]. The motivation behind the development of MPEG-21 BSDL is the fact that having a scalable format alone is not sufficient. One also needs a program for the analysis and the actual adaptation of (scalable) bitstreams. Because every coding format has its own structure, one would expect at first sight that a separate program is required for every specific coding format. However, a more generic solution can be devised. To be more specific, it is possible to create a universal program for the analysis and adaptation of (scalable) bitstreams by relying on a common language for the description of the syntax of a specific coding format. Such a language was developed in the context of MPEG-21 and is known as BSDL. The language in question is based on some extensions to W3C XML Schema on the one hand (bitwise datatypes, ...) and on some restrictions to W3C XML Schema on the other hand. For instance, the occurrence of attributes is prohibited in the resulting XML description of the structure of a bitstream, because XML Schema allows attributes to occur in an arbitrary order, which is naturally not the case for syntax elements. Making use of XML and XML
XML Schema is a recommendation of the World Wide Web Consortium (W3C), making it possible to specify some rules with respect to the structure of an XML document, the nomenclature of XML elements and attributes, ... [9-11]


Schema has several advantages: one can reuse many existing tools for XML related operations, and it also allows a more straightforward integration with other XML based standards in the long term.
To make this concrete, Figure 2(a) provides a simplified example that illustrates how MPEG-21 BSDL can be combined with H.264/AVC. An excerpt of the developed scheme in BSDL for the Annex B syntax of the H.264/AVC specification is available in Annex A. On the right side of Figure 2(a), one can notice a video stream, i.e. a sequence of slices. On the left side of the picture an XML based BSD is provided, describing the high-level structure of the H.264/AVC bitstream. This XML description contains several elements. As illustrated by the arrows, most of the elements are linked to a corresponding slice and contain some information about that slice, such as its type and the position of its first and last byte in the compressed stream. In a next step, it is possible to apply some changes to this XML description. For instance, one can decide to drop the XML elements that are linked to the B slices. The altered XML description can then be provided to a content adaptation engine that is smart enough to recognize the changes that were made in the XML domain (which is a more abstract or high-level approach to content manipulation). As such, the content adaptation engine can apply those changes in the compressed domain, resulting in a bitstream without B slices. This temporally downsampled bitstream is, for instance, more appropriate for playback on a mobile device. Since the H.264/AVC specification is time unaware, one may also assume that the synchronisation of the remaining H.264/AVC samples can be taken care of by a file format or a network protocol, as illustrated by the Systems layer in Figure 1(a).
The content adaptation engine itself is available in the MPEG-21 reference software package. The process discussed before is summarized in a formal way in Figure 2(b). In the first step, one starts from a bitstream typically encoded at a high quality, such that it is useful to derive other versions of this particular bitstream. This parent bitstream is given as input to the BinToBSD tool, which is part of the MPEG-21 reference software, together with a description of the Annex B syntax at a certain granularity (for instance a description up to the level of the NALU header or up to the level of the slice_header()). The latter syntax description is written in BSDL. The BinToBSD tool is now capable of generating an XML description of the structure of the H.264/AVC bitstream in question. In a next step, one can apply a set of filters to the XML based bitstream syntax description (BSD) of the H.264/AVC bitstream. For instance, in a first stage one can apply a filter in order to simplify the XML description in question or in order to add some metadata such that smarter adaptations are possible [12]. For example, based on MPEG-7 metadata, one can highlight the part of the video stream that contains a sports scene. After this preprocessing step, one can apply zero or more filters in order to realize the actual manipulation of the XML description, such as dropping the XML elements describing the B slices or, for instance, selecting the scenes that contain sports content. Which filter to apply can be made dependent on a negotiation process making use of multi-criteria optimization [13]. This will finally result in an appropriate XML description. The filters can be implemented by relying on several technologies, such as Extensible Stylesheet Language (XSL) documents, an XML API, ... In a final step, the adapted BSD can be provided to the BSDToBin tool.
Together with the original bitstream and the document describing (a part of) the H.264/AVC syntax, this will result in an adapted bitstream that is suited for a particular usage environment. Note that the BSD, as generated by the BinToBSD tool, only has to be created once in a production environment. This observation also applies to the preprocessing step. When the bitstream syntax description is available at a sufficiently detailed level, it should also be possible to derive several versions of the original H.264/AVC bitstream in order to meet the requirements of a particular usage environment. It is also important to know that MPEG-21 BSDL often allows data manipulations without requiring the media data in question to be recoded, although it is possible that some side effects have to be solved. The latter will be discussed in a next section. It should also be clear that MPEG-21 BSDL allows manipulations of multimedia content at a more abstract level once a BSD of a particular bitstream is available, thus making it possible to enter the semantic domain (i.e. not having to deal any longer with the pure bits and bytes).
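The adaptation chain described above can be mimicked with a minimal Python sketch. It is only an illustration under simplifying assumptions and does not reproduce the BinToBSD and BSDToBin tools or real BSDL: the description format follows the simplified example of Figure 2(a), where every element holds an inclusive "first-last" byte range, and the function names and the 20-byte bitstream are hypothetical.

```python
import xml.etree.ElementTree as ET


def drop_b_slices(bsd_xml: str) -> str:
    """Filter step: return the description without its B_slice elements."""
    root = ET.fromstring(bsd_xml)
    for element in list(root):  # copy the child list, since we remove while iterating
        if element.tag == "B_slice":
            root.remove(element)
    return ET.tostring(root, encoding="unicode")


def bsd_to_bin(bsd_xml: str, original: bytes) -> bytes:
    """Generation step: copy the byte ranges still listed in the description."""
    root = ET.fromstring(bsd_xml)
    adapted = bytearray()
    for element in root:
        first, last = (int(value) for value in element.text.split("-"))
        adapted += original[first:last + 1]  # ranges are inclusive
    return bytes(adapted)


# Hypothetical 20-byte "bitstream" and its simplified description.
original = bytes(range(20))
bsd = (
    "<bitstream>"
    "<header>0-3</header>"
    "<I_slice>4-9</I_slice>"
    "<B_slice>10-12</B_slice>"
    "<P_slice>13-17</P_slice>"
    "<B_slice>18-19</B_slice>"
    "</bitstream>"
)
adapted_bsd = drop_b_slices(bsd)
adapted = bsd_to_bin(adapted_bsd, original)
```

The filter only touches the XML description; the compressed data is copied unmodified, which is why no recoding is needed as long as the remaining slices do not depend on the dropped ones.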

We assume that every remaining slice can be reconstructed without having to rely on a B slice.


[Figure 2, panels (a) and (b). Only the text of the figure survives; it is summarized per panel.]

(a) An XML description of an H.264/AVC bitstream.
The bitstream is drawn as a sequence of slices along the time axis t (I slice, B slice, B slice, P slice, B slice, B slice), each linked to an element of the following description:

  <bitstream xml:base="myPrecious_30hz.264">
    <header>0-24</header>
    <I_slice>25-2637</I_slice>
    <B_slice>2638-2746</B_slice>
    <B_slice>2747-2903</B_slice>
    <P_slice>2903-3857</P_slice>
    <B_slice>3857-3972</B_slice>
    <B_slice>3973-4103</B_slice>
  </bitstream>

(b) Content adaptation by making use of BSDL.
Labels, in processing order: 1 Original Bitstream [myPrecious_30hz.264]; BinToBSD + h264_avc.bsd; 2 XML Description [myPrecious_30hz.xml]; Pre-processing; Filters: 4 XSLT Stylesheet [drop_BSlices.xsl]; Post-processing; 6 Adapted XML Description [myPrecious_10hz.xml]; BSDToBin + h264_avc.bsd; 7 Scaled Bitstream [myPrecious_10hz.264].
Universal Adaptation Engine = BinToBSD + XSL processor + BSDToBin.

Figure 2. Schematic overview of MPEG-21 BSDL in the context of H.264/AVC.

2.3. Combining H.264/AVC and MPEG-21 BSDL: Implementation Aspects


So far, a Bitstream Syntax Description (BSD) scheme for the Byte Stream NALU syntax was implemented. This scheme makes it possible to describe every syntax element up to the level of the slice_header() structure. For many applications it will not be necessary to have access to all those syntax elements in order to realize the desired functionality. For instance, in order to exploit temporal scalability by dropping non-reference B slices, one only needs access to the nal_ref_idc and the slice_type parameter, respectively available in the NALU header and slice_header() as illustrated by Figure 1(d). In order to exploit quality scalability realized by making use of the data partitioning feature of the H.264/AVC specification, one only needs access to the nal_unit_type parameter. For the actual implementation of the BSD scheme for the Annex B syntax, we had to rely on some extensions to the current BSDL specification. These non-normative extensions have already been touched upon in a previous paper from a more high-level point of view [14]. In the next paragraphs, the extensions in question will be covered in more detail.
First of all, we had to make use of the fillByte datatype. This construction makes it possible to force the BinToBSD parser to search for the next byte aligned position. As such, fillByte maps to the syntax function byte_aligned(), which can be found in the H.264/AVC specification (although the semantics are not entirely the same). Moreover, the type fillByte can be used for debugging purposes but also for limiting the amount of XML data produced when describing the structure of a bitstream. This is important because the function that skips all data up to the next start code prefix can only be called on a byte aligned position. When dealing with bitstreams that are synthesized by making use of syntax elements encoded by a variable length code (VLC), the fillByte type also proves to be very useful.
In case the fillByte datatype is not available, one is often forced to parse the bitstream up to a position that is known to be byte aligned, despite the fact that one is not interested in all the information that is being parsed. It is also important to realize that the current informative implementation of the fillByte datatype, as provided by the developers of the reference software, has a lossy character. This can result in some unexpected side effects when doing editing operations on syntax elements represented by a VLC. Such a scenario will be illustrated by an example in one of the following paragraphs.
Second, we also had to make use of the implementation construction. This extension makes it possible to rely on procedural objects in order to perform complex computations or in order to deal with complex datatypes. Complex means that it is not trivial or just impossible to do the computations by making use of BSDL, or that


[Figure 3, panels (a) and (b). Only the text of the figure survives; it is summarized per panel.]

(a) Manipulation of the first_mb_in_slice parameter.
Original order: slice 1, slice 2, slice 3 with first_mb_in_slice 0, 33, 66.
After the manipulation: slice 1, slice 2, slice 3 with first_mb_in_slice 33, 0, 66.
With Arbitrary Slice Order (ASO): slice 2, slice 1, slice 3 at positions 0, 33, 66.

(b) Byte alignment problem.
           bitstream (binary)               scheme (decimal)
before     ... bbbbbb11                     <fillByte>1</fillByte>
after      bbbbbb00 00010001 0? ? ?         <fillByte>1</fillByte>
after      bbbbbb00 00010001 00000001

Figure 3. Generation of a corrupt bitstream due to the usage of the fillByte construction. Note that 1 is the binary exponential Golomb representation of the decimal zero, and that 00000100010 is the binary exponential Golomb representation of 33.

it is not possible to describe a datatype by making use of the just mentioned language in an efficient way. To be more specific, the implementation attribute allows Java classes to be called from the BSD scheme written in BSDL. The implementation construction was used, among others, for parsing the syntax elements that are encoded by the signed or unsigned exponential Golomb entropy coding scheme (i.e. this is one of the cases in which one has to deal with a complex datatype). It is possible to describe those entropy coding schemes in BSDL, but this results in a tremendous overhead: every single bit of an exponential Golomb coded syntax element has to be put in one XML element. On top of that, it is not straightforward to interpret or decode the resulting XML description of the syntax element in question by making use of XPath. The latter technology makes it possible to perform queries against an XML document, for instance for retrieving the value of a particular XML element [15]. This functionality is required when one wants to apply changes to a BSD. For instance, the decoding of elements that are encoded by the entropy coding schemes in question is necessary for realizing temporal scalability, since the slice_type parameter is represented by an exponential Golomb codeword.
The implementation construction is also used for the parsing of the slice_group_change_cycle syntax element. This parameter occurs as the last syntax element in the slice_header() syntax structure when Flexible Macroblock Ordering (FMO) types 2, 3, or 4 are used, the latter supporting evolving slice groups. As such, the parameter in question determines the number of macroblocks in slice group 0. The main reason for using the implementation construction lies in the fact that the number of bits for the representation of the slice_group_change_cycle syntax element has to be computed by evaluating the logarithmic function with base two, a common operation when parsing bitstreams.
However, the latter is not available in the XPath specification (i.e. this is one of the cases in which complex computations are necessary). When the set of input values is limited, this limitation can be circumvented by making use of the union element and precalculated values (thus no longer requiring the evaluation of the logarithmic function). However, this is not the case for the syntax element in question because the number of input values is dependent on the resolution. Getting access to the value of this parameter allowed us to drop the background of a video sequence encoded with FMO types 2, 3, and 4. The procedure for the elimination of the background itself was implemented by making use of a cascade of two XSL stylesheets due to the complexity of the XPath expressions. This complexity is a consequence of the pointer based relationship between a slice_header() and the picture and sequence parameter sets, and the fact that the latter can occur more than once in an H.264/AVC compliant bitstream. The encoding and decoding were done by making use of a modified version of the reference encoder and decoder.
Finally, the implementation approach was also applied to the cabac_alignment_one_bit syntax element, which is part of the slice_data() structure. The reason for this approach can be explained by an example in which the slices in an H.264/AVC bitstream are shuffled per picture. Although this is a purely academic problem, it is a good illustration of the side effects that may occur when performing editing operations in the compressed

The functionality of this element can be compared to a switch statement in a programming language.


domain. The latter phenomenon can for instance occur when transcoding an H.264/AVC bitstream from the Main Profile to the Baseline Profile. The shuffling of the slices is illustrated by Figure 3(a), in which a sequence of pictures at QCIF resolution is encoded by dividing each picture into three slices of equal size (33 macroblocks). The shuffling consists of switching every first and second slice of a picture by manipulating the value of the first_mb_in_slice parameter in the corresponding slice_header() syntax structures. This manipulation will be detected by the Arbitrary Slice Ordering (ASO) feature of a decoder, resulting in a distorted video sequence. Due to the fact that the first_mb_in_slice syntax element is represented by an exponential Golomb code, the change of zero to 33 and vice versa (33 is the number of the first macroblock in the second slice) will result in a change of the byte alignment of the slice_header() structure. As illustrated by Figure 3(b), the fillByte construction does not deal with that change in a correct way, resulting in a corrupt bitstream at the transition of the slice_header() and the slice_data() syntax structures, since the question marks should all have been replaced by ones. The H.264/AVC specification requires that byte alignment is achieved at the transition of the slice_header() and the slice_data() syntax structures by adding an appropriate number of one bits. For simplicity, all syntax elements between the first_mb_in_slice parameter and the cabac_alignment_one_bit parameter are omitted.
Although we could develop a BSD description up to the level of the slice_header() syntax structure, we are currently not able to parse bitstreams in which NALU emulation prevention bytes occur at the level of the syntax structure in question. The presence of emulation prevention bytes ensures that no sequence of consecutive byte-aligned bytes in the NAL unit contains a start code prefix.
The reason for not being able to deal with those special bytes is the lack of an appropriate look-ahead mechanism for the detection of the bytes in question in the current version of MPEG-21 BSDL. In theory, it would be possible to locate those bytes by making use of the ifNext construction in MPEG-21 BSDL because the latter allows looking ahead. However, such an approach would actually require an ifNext operation that can be executed on every 32-bit aligned position. The latter is not achievable in practice (for instance, due to the usage of VLCs). Another challenge is the appropriate insertion of NALU emulation prevention bytes in manipulated bitstreams. For instance, our implemented procedural objects do not take into account the occurrence of and the need for NALU emulation prevention bytes. Note that this problem does not emerge in the case of MPEG-4 Visual bitstreams due to a totally different organization of the header information, such that the usage of emulation bytes is not necessary.
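The byte alignment problem of Figure 3(b), discussed earlier in this section, can be reproduced with a minimal Python sketch. The function names are hypothetical and the code is not part of the BSDL reference software or of our procedural objects; bit strings are used instead of real bit buffers for readability, and the six bits that precede first_mb_in_slice are an arbitrary placeholder for the bits marked b in the figure.

```python
from typing import Tuple


def ue_encode(value: int) -> str:
    """Unsigned exponential Golomb codeword of a non-negative integer."""
    binary = bin(value + 1)[2:]  # value + 1 in binary, without the "0b" prefix
    return "0" * (len(binary) - 1) + binary


def ue_decode(bits: str, pos: int = 0) -> Tuple[int, int]:
    """Decode one codeword starting at bits[pos]; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end


def align_with_ones(bits: str) -> str:
    """Append one-valued bits until the length is a multiple of eight."""
    return bits + "1" * (-len(bits) % 8)


prefix = "101010"  # placeholder for the six preceding bits "bbbbbb"

# first_mb_in_slice = 0: a single alignment bit is needed.
before = align_with_ones(prefix + ue_encode(0))

# first_mb_in_slice = 33: the codeword grows from 1 to 11 bits,
# so seven alignment bits are needed instead of one.
after = align_with_ones(prefix + ue_encode(33))
```

Because the number of alignment bits depends on the length of the preceding codewords, it has to be recomputed after an edit; writing back the value that was stored for the original bitstream cannot give the correct result in general.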

3. APPLIED METHODOLOGY
This section discusses the way the compressed bitstreams and their corresponding XML based syntax descriptions were generated. Some information is also provided about the tools used for profiling the reference software for MPEG-21 BSDL (i.e. the BinToBSD and BSDToBin tools).

3.1. The Encoding Process and BSD Generation


The purpose of the performance measurements is to gain insight into the processing time required by the BinToBSD and BSDToBin tools on the one hand, and the XSL engine on the other hand, when exploiting temporal scalability in the current H.264/AVC specification. The latter is realized by dropping non-reference B slices. Those results are for instance relevant in case one wants to know whether it is possible to use the tools in question in real time (for instance, in a streaming scenario). Some attention will also be paid to the overhead caused by the usage of procedural objects for the parsing of syntax elements having a complex representation. It is also important to note that all MPEG-21 related tools are written in Java.
For the creation of the H.264/AVC bitstreams, we have relied on the H.264/AVC reference software, version JM 7.6 [16]. As input, the progressive Foreman test sequence was used in the planar YUV 4:2:0 pixel format, having a resolution of 176x144 and a length of 300 pictures. For the encoding of the test sequence in question, nine different lengths were used. The lengths are 21, 49, 73, 99, 199, 299, 399, 499, and 599 pictures. For each length, the encoding was done at a bit rate of 1000 kbit/s and at a fixed frame rate of 30 Hz. Only one I picture was used, alternately followed by a P picture and a non-reference B picture. The encoding process as just mentioned was done for one slice per picture, two slices per picture, and three slices per picture, resulting

A reference picture is a picture with nal_ref_idc not equal to zero.


in a set of 27 bitstreams. All slices of a picture belong to the same type (which can be derived from the value of the slice_type syntax element). Emulation prevention bytes did not occur in the syntax structures parsed by the BSDL reference software. The actual performance analysis was done for several schemes written in BSDL: a full scheme describing all syntax elements up to the level of the slice_header() data structure, and a simplified scheme only describing those parameters that are really necessary for exploiting temporal scalability. The latter implies parsing everything up to the level of the slice_type parameter in the slice_header() data structure for the slices containing coded picture data. The SPS and PPS are not analyzed in case of the simplified scheme. For the generation of the XML descriptions, the BSDL reference software was used, version 1.1.3. Timing was done by relying on the timers made available in the two BSDL tools, taking into account the overhead related to input and output.

3.2. Performance Measurements


The performance measurements for the tools of interest were done using HPjmeter [17], a program that helps to detect performance bottlenecks in Java-based software by graphically displaying profiling data. The tool was used on a PC with an Intel Pentium IV CPU clocked at 2.61 GHz and 512 MB of RAM. The operating system was Windows XP Pro (Service Pack 1), running Sun Microsystems' Java 2 Runtime Environment (Standard Edition). The profiling option chosen was -Xrunhprof:cpu=times. This option makes it possible to measure the time taken by the individual methods, and it also generates a sorted list of methods ranked by their percentage of the total CPU time taken by the application.

4. EXPERIMENTAL RESULTS
This section covers some of the performance results that were obtained during our research. Figure 4(a) indicates that, for the simplified BSD scheme, the processing time required by the BinToBSD tool shows an exponential behavior in terms of the number of slices (note that the Y-axis has a logarithmic scale and that the points on the X-axis are not equidistant). For instance, 145 seconds are needed to generate a BSD for a bitstream containing 599 pictures with one slice per picture. One can also notice that the processing cost is determined by the number of slices: a stream of 300 pictures with one slice per picture results in the same behavior of the BinToBSD tool as a stream of 100 pictures with three slices per picture. The exponential behavior of the BinToBSD tool is even more pronounced when the full scheme is used for generating a BSD, mainly because of the evaluation of many control statements that are necessary for guiding the parsing process and that are often implemented as complex XPath expressions.

A first attempt to boost the performance consisted of making the simplified BSD scheme deterministic. The two previous schemes, the full one and the simplified one, are generic in the sense that they can be applied to any H.264/AVC compliant bitstream, regardless of the profile implemented or the GOP structure used.* When the latter information is taken into account, together with the fact that the SPS and the PPS are always the first two NALUs, one can create a BSD scheme that is much simpler, because many complex control statements can be dropped. However, this scheme still resulted in an exponential behavior of the BinToBSD tool, as can be seen in Figure 4(b). Extensive profiling with the HPjmeter tool revealed that the performance problem of the BinToBSD program, when using the simplified deterministic scheme, could still be traced back to the use of XPath expressions.

* This genericity is also one of the major reasons for the complexity of the XPath expressions.

[Figure 4. Overview of the experimental results. Panels (a) to (d) plot processing time against the number of pictures (21 to 599) for 1, 2, and 3 slices/picture. Panels (e) and (f) show the accumulated exclusive method time (CPU, percentage) per Java package for the configuration of 200 pictures with 1 slice/picture.
(a) Simplified scheme (DOM): BinToBSD processing time [s], logarithmic scale.
(b) Simplified deterministic scheme (DOM): BinToBSD processing time [s], logarithmic scale.
(c) Simplified scheme (Xalan implementation): XSL processing time [ms].
(d) Simplified scheme (DOM): BSDToBin processing time [s].
(e) Simplified deterministic scheme (DOM).
(f) Optimized simplified deterministic scheme (DOM).]

To be more specific, the performance problem is related to the use of XPath expressions in combination with the nOccurs attribute. This BSDL attribute specifies, by means of an XPath expression, how many times a particular syntax element can occur in a bitstream. When the attribute does not occur in the BSD scheme, the BinToBSD tool falls back to a default value of one (a constant XPath expression), since most syntax elements occur only once at a particular position in a bitstream. However, when the attribute does occur in the BSD scheme, the BinToBSD tool duplicates the internal data structure containing the XML description of the structure of the H.264/AVC bitstream, anticipating the possible execution of an XPath expression. Because the nOccurs attribute was used in the declaration of every syntax element for clarity (even when the syntax element could only occur once), its presence resulted in a duplication of the XML data structure for every syntax element. This behavior is reflected by the execution times of the functions that are responsible for the duplication of the XML structure. As can be deduced from the chart in Figure 4(e), a lot of processing time is spent in the Document Table Model (DTM) package (org.apache.xml.dtm). DTM is an interface designed specifically to optimize performance and minimize storage when using the Apache XPath and XSLT implementations [18]. Note that the exclusive method time is the time spent in a method itself, not counting the time spent in the functions called by that method.

With this knowledge, a much more efficient version of the simplified deterministic scheme in BSDL could be created. This finally resulted in BSD generation that is faster than real time, thanks to the absence of the nOccurs attribute and of XPath expressions in the scheme. This is also illustrated by the shift of the accumulated exclusive method time to other packages in Figure 4(f), especially to the ones that are responsible for input and output operations. For example, about 4 seconds are needed to generate a BSD for a bitstream containing 599 pictures with three slices per picture. The same bitstream took about 992 seconds with the simplified deterministic BSDL scheme, about 1096 seconds with the original simplified scheme, and about 68041 seconds with the full scheme. The average speed-up of the BinToBSD tool when using the optimized simplified deterministic scheme is 90.95% compared to the execution time needed by the original simplified scheme (standard deviation: 13.35%). Note that the BSD resulting from the optimized scheme is still equivalent to the one generated by the simplified scheme and by the first deterministic scheme, and thus still enables the exploitation of temporal scalability.
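The cost of duplicating the description for every syntax element can be illustrated with a toy cost model. The sketch below is an assumption-laden simplification and not the actual BinToBSD implementation: a Python list stands in for the XML description built so far, and one list entry stands in for one described syntax element. The function name and parameters are hypothetical.

```python
def copied_nodes(num_elements: int, duplicate_each_step: bool) -> int:
    """Count how many nodes are copied while a description is built.

    Toy model: before each new element is appended, the description built
    so far is duplicated when duplicate_each_step is True (the case in
    which every declaration carries nOccurs), and left alone otherwise.
    """
    description = []
    copied = 0
    for index in range(num_elements):
        if duplicate_each_step:
            snapshot = list(description)  # shallow copy of everything so far
            copied += len(snapshot)
        description.append(index)
    return copied


with_duplication = [copied_nodes(n, True) for n in (100, 200, 400)]
without_duplication = [copied_nodes(n, False) for n in (100, 200, 400)]
```

In this model the copy work is n(n-1)/2 for n elements, so doubling the number of described elements roughly quadruples the work, whereas it stays at zero when no duplication takes place. The measured processing times depend on many other factors, so the model only indicates why the processing time grows much faster than the number of slices.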
With respect to the processing time needed by the Xalan XSL engine, one can observe that execution is fast: transforming an XML document that originally describes 599 pictures (one slice per picture) requires 625 milliseconds. The resulting XML document only contains the descriptions of NALUs carrying an SPS, a PPS, or compressed data related to I and P slices (and no longer B slices). The same observation applies to the BSDToBin tool, regardless of whether the Document Object Model (DOM) or the Simple API for XML (SAX) is used for the internal representation and processing of the XML description. The fast behavior of the BSDToBin tool can be explained by the fact that it no longer needs to evaluate XPath expressions. This points to a potentially asymmetrical behavior between the BSD encoder (BinToBSD) and the BSD decoder (BSDToBin) when the encoder has to deal with many XPath expressions, which is very similar to the behavior of MPEG-x and H.26x encoders and decoders. It is also interesting to see that the speed obtained with the optimized simplified deterministic scheme shows that Java procedural objects can be used efficiently for achieving byte alignment and for the decoding and encoding of Exp-Golomb coded syntax elements. Some of the quantitative results can be found in Appendix B.
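The transformation step described above, which removes the descriptions of B slices from the BSD, can be sketched as follows. The experiments in this paper use an XSLT style sheet executed by Xalan; the sketch swaps in Python's standard xml.etree.ElementTree module instead. The element names bitstream, nal_unit, and slice_header, as well as the example description, are hypothetical and do not reproduce the actual BSD layout. The slice_type values 1 and 6 are the ones that denote B slices in H.264/AVC.

```python
import xml.etree.ElementTree as ET

B_SLICE_TYPES = {"1", "6"}  # slice_type values denoting B slices in H.264/AVC

# Hypothetical, heavily simplified description of five NAL units.
EXAMPLE_BSD = """
<bitstream>
  <nal_unit><seq_parameter_set_rbsp/></nal_unit>
  <nal_unit><pic_parameter_set_rbsp/></nal_unit>
  <nal_unit><slice_header><slice_type>7</slice_type></slice_header></nal_unit>
  <nal_unit><slice_header><slice_type>6</slice_type></slice_header></nal_unit>
  <nal_unit><slice_header><slice_type>5</slice_type></slice_header></nal_unit>
</bitstream>
"""


def drop_b_slices(bsd_xml: str) -> str:
    """Return the description without the NAL units that carry B slices."""
    root = ET.fromstring(bsd_xml)
    for nalu in list(root.findall("nal_unit")):
        slice_type = nalu.findtext("slice_header/slice_type")
        if slice_type is not None and slice_type.strip() in B_SLICE_TYPES:
            root.remove(nalu)
    return ET.tostring(root, encoding="unicode")


adapted = ET.fromstring(drop_b_slices(EXAMPLE_BSD))
remaining_types = [element.text for element in adapted.iter("slice_type")]
```

The adapted description keeps the SPS, the PPS, and the I and P slices, so a tool such as BSDToBin only has to copy the byte ranges that are still described. The filter selects on slice_type only, which is sufficient when all B slices are non-reference slices, as in the test bitstreams used here.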

5. CONCLUSIONS AND FUTURE WORK


After giving an overview of the design characteristics and syntaxes of the H.264/AVC specification on the one hand, and of MPEG-21 BSDL on the other hand, we provided a detailed discussion of the extensions needed to combine BSDL with the Annex B syntax. Some of those extensions can be mapped to the elementary syntax functions defined in the H.264/AVC specification. It would be useful if MPEG-21 BSDL could incorporate a relevant subset of their functionality. With respect to the performance measurements, we have shown that it is possible to use an extended version of MPEG-21 BSDL for the efficient exploitation of temporal scalability in the current H.264/AVC specification, subject to certain restrictions. Our measurements have also illustrated the need to be careful with XPath expressions in a BSDL scheme, because they can have a serious impact on performance.

Further research is needed to extend BSDL such that it can take into account the occurrence and appropriate insertion of NALU emulation prevention bytes, together with a study of the possible side effects that may occur when editing H.264/AVC bitstreams in the compressed domain. A scalable coding scheme should take these problems into account. It would also be very interesting if such a scheme allowed the full exploitation of scalability while only requiring knowledge of the high-level structure of the corresponding bitstream. This would make it much easier to bridge the gap to the power of the MPEG-21 tools. Other points of interest are a memory complexity analysis and the use of metadata to realize smart adaptations.


APPENDIX A. AN H.264/AVC BITSTREAM SYNTAX DESCRIPTION IN BSDL


<xsd:element name="seq_parameter_set_rbsp">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="profile_idc" type="xsd:unsignedByte" bs2:nOccurs="1"/>
      <xsd:element name="constraint_set0_flag" type="bt:b1" bs2:nOccurs="1"/>
      <xsd:element name="constraint_set1_flag" type="bt:b1" bs2:nOccurs="1"/>
      <xsd:element name="constraint_set2_flag" type="bt:b1" bs2:nOccurs="1"/>
      <xsd:element name="reserved_zero_5bits" type="bt:b5" fixed="0" bs2:nOccurs="1"/>
      <xsd:element name="level_idc" type="xsd:unsignedByte" bs2:nOccurs="1"/>
      <xsd:element name="seq_parameter_set_id" type="jvt:UnsignedExpGolomb" bs2:nOccurs="1"/>
      <!-- ... -->
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

Table 1. Description of the first seven syntax elements of an SPS in BSDL.


<seq_parameter_set_rbsp>
  <profile_idc>88</profile_idc>
  <constraint_set0_flag>0</constraint_set0_flag>
  <constraint_set1_flag>0</constraint_set1_flag>
  <constraint_set2_flag>0</constraint_set2_flag>
  <reserved_zero_5bits>0</reserved_zero_5bits>
  <level_idc>21</level_idc>
  <seq_parameter_set_id>0</seq_parameter_set_id>
  <!-- ... -->
</seq_parameter_set_rbsp>

Table 2. Resulting output of the BinToBSD tool for the first seven syntax elements of an SPS.
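The relation between the declarations in Table 1 and the output in Table 2 can be sketched with a small bit reader. This is a minimal illustration and not the BinToBSD tool: the class and function names are hypothetical, the field widths are taken from the types in Table 1 (8 bits for xsd:unsignedByte, 1 bit for bt:b1, 5 bits for bt:b5), and the four payload bytes are constructed by hand so that they decode to the values shown in Table 2. The unsigned Exp-Golomb code is read by counting leading zero bits and then reading that many bits plus one.

```python
from typing import Dict, List, Optional, Tuple

# (element name, width in bits); None marks an unsigned Exp-Golomb code.
SPS_FIELDS: List[Tuple[str, Optional[int]]] = [
    ("profile_idc", 8),
    ("constraint_set0_flag", 1),
    ("constraint_set1_flag", 1),
    ("constraint_set2_flag", 1),
    ("reserved_zero_5bits", 5),
    ("level_idc", 8),
    ("seq_parameter_set_id", None),
]


class BitReader:
    """Reads fixed-length and Exp-Golomb coded values from a byte string."""

    def __init__(self, data: bytes) -> None:
        self.bits = "".join(f"{byte:08b}" for byte in data)
        self.pos = 0

    def read_bits(self, count: int) -> int:
        value = int(self.bits[self.pos:self.pos + count], 2)
        self.pos += count
        return value

    def read_ue(self) -> int:
        zeros = 0
        while self.bits[self.pos + zeros] == "0":
            zeros += 1
        self.pos += zeros
        # The code word is (zeros + 1) bits long and encodes value + 1.
        return self.read_bits(zeros + 1) - 1


def parse_sps_prefix(payload: bytes) -> Dict[str, int]:
    """Parse the first seven SPS syntax elements from an RBSP payload."""
    reader = BitReader(payload)
    return {
        name: reader.read_ue() if width is None else reader.read_bits(width)
        for name, width in SPS_FIELDS
    }


def to_xml(values: Dict[str, int]) -> str:
    """Render the parsed values in the style of Table 2."""
    lines = ["<seq_parameter_set_rbsp>"]
    lines += [f"  <{name}>{value}</{name}>" for name, value in values.items()]
    lines.append("</seq_parameter_set_rbsp>")
    return "\n".join(lines)


# Hand-made payload (NAL unit header byte not included).
values = parse_sps_prefix(bytes([88, 0x00, 21, 0x80]))
description = to_xml(values)
```

The sketch hard-codes the order of the syntax elements, which corresponds to the deterministic case in which no control statements or XPath expressions are needed to guide the parsing.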

APPENDIX B. QUANTITATIVE RESULTS


sli.  pic.   Full BSD                               Simplified BSD                         BinToBSD   BinToBSD
             BinToBSD  XSL    BSDToBin  BSDToBin    BinToBSD  XSL    BSDToBin  BSDToBin    D - det    D - opt
             D (s)     (ms)   D (s)     S (s)       D (s)     (ms)   D (s)     S (s)       (s)        (s)
1     21     8.7       296    0.7       0.9         1.1       234    0.5       0.5         1.1        0.5
1     49     28.9      375    1.0       1.0         2.4       313    0.6       0.7         2.2        0.6
1     73     55.2      406    1.2       1.2         4.1       359    0.8       0.8         3.6        0.7
1     99     92.4      468    1.4       1.5         6.2       375    1.0       1.0         5.5        0.9
1     199    344.4     579    2.3       2.2         19.3      469    1.7       1.7         17.7       1.3
1     299    770.8     609    3.0       2.9         41.9      500    2.3       2.3         37.7       1.7
1     399    1371.8    672    3.3       3.8         70.2      500    2.8       3.0         62.9       2.2
1     499    2214.3    734    4.6       4.4         104.6     562    3.5       3.5         94.6       2.5
1     599    3408.9    859    5.3       5.1         145.2     625    4.1       4.1         131.8      2.8
2     21     22.8      344    0.8       0.8         2.0       281    0.5       0.5         1.8        0.5
2     49     91.1      484    1.2       1.2         6.0       360    0.7       0.8         5.2        0.7
2     73     191.7     500    1.4       1.4         11.4      422    0.9       0.9         10.1       0.8
2     99     346.7     547    1.7       1.6         18.8      438    1.1       1.1         16.9       1.0
2     199    1366.6    688    2.7       2.6         68.8      515    1.8       1.8         62.0       1.5
2     299    3378.9    859    3.7       3.5         144.1     625    2.5       2.5         129.8      2.0
2     399    6953.0    1016   4.6       4.4         241.9     688    3.3       3.2         218.6      2.5
2     499    12851.1   1125   5.5       5.2         363.4     750    3.9       3.8         329.9      2.8
2     599    21173.7   1265   6.4       6.0         512.0     796    4.5       4.5         463.5      3.3
3     21     43.7      391    0.9       0.9         3.2       328    0.5       0.5         2.8        0.6
3     49     194.2     484    1.3       1.3         11.2      406    0.8       0.8         10.0       0.8
3     73     417.0     594    1.6       1.5         21.5      484    1.0       1.0         19.7       0.9
3     99     768.2     657    1.9       1.8         39.7      469    1.2       1.2         35.8       1.1
3     199    3366.6    860    3.1       2.9         142.3     609    2.0       1.9         128.7      1.7
3     299    9498.1    1047   4.2       3.9         298.3     719    2.8       2.8         270.6      2.2
3     399    21072.4   1282   5.3       5.0         508.2     813    3.5       3.4         461.6      2.7
3     499    40769.6   1454   6.4       6.1         772.1     922    4.3       4.2         704.9      3.2
3     599    68041.2   1641   7.2       6.9         1095.9    985    5.1       4.9         991.8      3.8

Table 3. Overview of the performance measurements: sli. denotes the number of slices per picture, pic. the number of pictures, D denotes DOM, S denotes SAX, det denotes the deterministic simplified scheme, and opt stands for the optimized version of the latter. The temporal downsampling resulted in a 48.86% reduction of the bitstream size on average (standard deviation: 0.89%), which depends on the rate-distortion model used.
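The statement that BSD generation with the optimized scheme is faster than real time can be checked against Table 3 with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it compares the BinToBSD processing time alone (not the XSL or BSDToBin steps) with the playback duration of the bitstream at the fixed frame rate of 30 Hz. The numbers are the rows of Table 3 for three slices per picture.

```python
FRAME_RATE_HZ = 30.0  # fixed frame rate used for all test bitstreams

# Table 3, three slices per picture: number of pictures and BinToBSD
# processing time in seconds for the optimized deterministic scheme.
PICTURES = [21, 49, 73, 99, 199, 299, 399, 499, 599]
OPTIMIZED_S = [0.6, 0.8, 0.9, 1.1, 1.7, 2.2, 2.7, 3.2, 3.8]


def playback_duration(num_pictures: int, frame_rate: float = FRAME_RATE_HZ) -> float:
    """Playback duration of the bitstream in seconds."""
    return num_pictures / frame_rate


def real_time_factor(num_pictures: int, processing_s: float) -> float:
    """Processing time divided by playback duration (below 1: faster than real time)."""
    return processing_s / playback_duration(num_pictures)


optimized_factors = [
    real_time_factor(pictures, seconds)
    for pictures, seconds in zip(PICTURES, OPTIMIZED_S)
]
# Same bitstream (599 pictures, three slices) with the original simplified scheme.
simplified_factor = real_time_factor(599, 1095.9)
```

A factor below one means that the description is generated in less time than it takes to play the bitstream. For the rows used here, the optimized scheme stays below one for every length, while the original simplified scheme needs well over fifty times the playback duration for the longest bitstream.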


ACKNOWLEDGMENTS
The authors would like to thank the developers of the MPEG-21 BSDL reference software for making available the required extensions. We would also like to thank Davy De Schrijver for the clarifying discussions about the usage of the HPjmeter profiling tool. The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.

REFERENCES
1. T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol. 13, pp. 560-576, July 2003.
2. I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, John Wiley & Sons, Ltd, 2003.
3. "Requirements for AVC Codec," MPEG-document ISO/IEC JTC1/SC29/WG11/N4672, Joint Video Team of ISO/IEC JTC1/SC29/WG11 and ITU-T SG16/Q.6, Jeju, Korea, Mar. 2002. Available on http://www.chiariglione.org/mpeg/working_documents.
4. H. Schwarz, D. Marpe, and T. Wiegand, "Subband Extension of H.264/AVC," JVT-document JVT-K023, Munich, Germany, Joint Video Team of ISO/IEC JTC1/SC29/WG11 and ITU-T SG16/Q.6, Mar. 2004.
5. "Requirements and Applications for Scalable Video Coding," MPEG-document ISO/IEC JTC1/SC29/WG11 N6025, Moving Picture Experts Group (MPEG), Gold Coast, Australia, Mar. 2003. Available on http://www.chiariglione.org/mpeg/working_documents.
6. M. Amielh and S. Devillers, "Bitstream Syntax Description Language: Application of XML-Schema to Multimedia Content Adaptation," in WWW2002: The Eleventh International World Wide Web Conference, (Honolulu, Hawaii), May 2002. Available on http://www2002.org/CDROM/alternate/334/.
7. I. Burnett, R. Van de Walle, K. Hill, J. Bormans, and F. Pereira, "MPEG-21: Goals and achievements," IEEE Multimedia 10, pp. 60-70, October-December 2003.
8. "Multimedia Framework Part 7: Digital Item Adaptation, Final Draft International Standard," MPEG-document ISO/IEC JTC1/SC29/WG11/N6167, Moving Picture Experts Group (MPEG), Waikoloa, USA, Dec. 2003.
9. D. C. Fallside, "XML Schema Part 0: Primer," recommendation, World Wide Web Consortium (W3C), http://www.w3c.org/TR/xmlschema-0/, May 2001.
10. H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn, "XML Schema Part 1: Structures," recommendation, World Wide Web Consortium (W3C), http://www.w3c.org/TR/xmlschema-1/, May 2001.
11. P. V. Biron and A. Malhotra, "XML Schema Part 2: Datatypes," recommendation, World Wide Web Consortium (W3C), http://www.w3c.org/TR/xmlschema-2/, May 2001.
12. J. Magalhães and F. Pereira, "Using MPEG standards for multimedia customization," IEEE Trans. Circuits Syst. Video Technol. 19, pp. 437-456, May 2004.
13. S. Lerouge, P. Lambert, and R. Van de Walle, "Multi-criteria optimization for scalable bitstreams," in Proceedings of the 8th International Workshop on Visual Content Processing and Representation, pp. 122-130, Springer, (Madrid), September 2003.
14. W. De Neve, F. De Keukelaere, K. De Wolf, and R. Van de Walle, "Applying MPEG-21 BSDL to the JVT H.264/AVC Specification in MPEG-21 Session Mobility Scenarios," in Proceedings of the 5th International Workshop on Image Analysis for Multimedia Interactive Services, p. 4 pp, (Lisboa), April 2004.
15. J. Clark and S. DeRose, "XML Path Language 1.0," recommendation, World Wide Web Consortium (W3C), http://www.w3c.org/TR/xpath, Nov. 1999.
16. "JVT H.264/AVC Reference Software." http://bs.hhi.de/suehring/tml/download/.
17. "HPjmeter." http://www.hp.com/products1/unix/java/hpjmeter/.
18. "The Document Table Model." http://xml.apache.org/xalan-j/dtm.html.
