J. Vis. Commun. Image R. 18 (2007) 217–239 www.elsevier.com/locate/jvci

An optimized MPEG-21 BSDL framework for the adaptation of scalable bitstreams
Davy De Schrijver *, Wesley De Neve, Koen De Wolf, Robbie De Sutter, Rik Van de Walle
Department of Electronics and Information Systems, Multimedia Lab, Ghent University - IBBT, Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium Received 16 April 2006; accepted 9 February 2007 Available online 22 February 2007

Abstract

A format-agnostic framework for content adaptation allows reaching a maximum number of users in heterogeneous multimedia environments. Such a framework typically relies on the use of scalable bitstreams. In this paper, we investigate the use of bitstreams compliant with the scalable extension of the H.264/MPEG-4 AVC standard in a format-independent framework for content adaptation. These bitstreams are scalable along the temporal, spatial, and SNR axis. To adapt these bitstreams, a format-independent adaptation engine is employed, driven by the MPEG-21 Bitstream Syntax Description Language (BSDL). MPEG-21 BSDL is a specification that allows generating high-level XML descriptions of the structure of a scalable bitstream. As such, the complexity of the adaptation of scalable bitstreams can be moved to the XML domain. Unfortunately, the current version of MPEG-21 BSDL cannot be used to describe the structure of large video bitstreams because the bitstream parsing process is characterized by an increasing memory consumption and a decreasing description generation speed. Therefore, in this paper, we describe a number of extensions to the MPEG-21 BSDL specification that make it possible to optimize the processing of bitstreams. Moreover, we also introduce a number of additional extensions necessary to describe the structure of scalable H.264/AVC bitstreams. Our performance analysis demonstrates that our extensions enable the bitstream parsing process to translate the structure of the scalable bitstreams into an XML document multiple times faster. Further, a constant and low memory consumption is obtained during the bitstream parsing process. © 2007 Elsevier Inc. All rights reserved.
Keywords: Bitstream syntax descriptions; Content adaptation; Context-related attributes; H.264/MPEG-4 AVC; MPEG-21 BSDL; Scalable video coding

1. Introduction

Nowadays, the quantity of multimedia content consumed through a plethora of networks and terminals increases every day. Consequently, multimedia frameworks have to make the available content accessible to different users from a wide variety of devices and networks. In the near future, this information revolution will converge to an efficient Universal Multimedia Access (UMA [1]) environment in which every content provider wants to create the multimedia data once and then publish

*

Corresponding author. Fax: +32 9 331 48 96. E-mail address: davy.deschrijver@ugent.be (D. De Schrijver).

it on every device (from low-end cellphones to high-end personal computers) connected by heterogeneous networks [e.g., Digital Subscriber Lines (xDSL), Universal Mobile Telecommunications Systems (UMTS), etc.]. Hence, two technologies are needed to control this huge diversity of multimedia content and resource constraints. On the one hand, scalable bitstreams are needed to deliver the most suitable bitstream to the user in an elegant manner. On the other hand, a content adaptation system is required to make a decision about the optimal resource adaptation. Such a system takes into account metadata information about the constraints of the targeted user, device, and network [2]. The goal of scalable video coding is to encode a video sequence once, after which the scalable bitstream can be

1047-3203/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.jvcir.2007.02.003


D. De Schrijver et al. / J. Vis. Commun. Image R. 18 (2007) 217–239

tailored by using truncation operations [3]. In this paper, we make use of the Joint Scalable Video Model (JSVM, the scalable extension of H.264/AVC [4]) specification to generate scalable bitstreams. This specification makes it possible to extract partial bitstreams from the original coded stream, characterized by a lower frame rate, spatial resolution, and/or visual quality. In order to customize the bitstreams along the different scalability axes in a format-agnostic adaptation framework, we describe their high-level structure in the Extensible Markup Language (XML). By using XML descriptions, the focus of the adaptation process is shifted from the compressed domain to the XML domain [5]. Such an XML-based description can be obtained in different ways. First, it is possible to generate the descriptions during the encoding process of the scalable bitstream. However, this requires modifying the encoder for each targeted format. A second approach relies on a proprietary (i.e., format-specific) parser to obtain the description. This means that a dedicated software module has to be created for each format, which makes it impossible to obtain generic behavior. To achieve a fully generic framework, it is necessary to use a format-agnostic parser to generate the descriptions of the scalable bitstreams. These XML-based descriptions, generated by a format-agnostic parser, can be obtained by using a bitstream structure description language for defining the high-level structure of the coding format. In this paper, the standardized MPEG-21 Bitstream Syntax Description Language (MPEG-21 BSDL) is used to describe the structure of the bitstreams, together with the generic BintoBSD Parser to generate the XML descriptions. Regrettably, the format-agnostic BintoBSD Parser, whose behavior is specified in the MPEG-21 Digital Item Adaptation standard, has some shortcomings when it comes to describing scalable video bitstreams.
More precisely, the parser is characterized by an increasing memory consumption and a decreasing generation speed as a function of the bitstream length. This is, of course, unacceptable in the world of digital video. In this paper, we prove that these shortcomings cannot be solved by software optimizations. Therefore, we propose a solution for this fundamental problem of the BintoBSD Parser, by extending the BSDL specification with a number of new attributes. These attributes, further referred to as context-related attributes, steer the parser in order to keep the memory usage and execution speed constant during the complete parsing process. Furthermore, our implementation also uses an optimized underlying XML document model to obtain a maximal return of the extensions in terms of memory usage and execution speed. The performance of our modified BintoBSD Parser, used within an MPEG-21 BSDL-driven adaptation framework, is evaluated by relying on JSVM-compliant scalable bitstreams. In order to describe the structure of these bitstreams in XML, we have again introduced a number of language extensions on top of the MPEG-21 BSDL specification. The need for these extensions is explained in this paper. Finally, the adaptation is executed in the XML domain by transforming the XML descriptions using Streaming Transformations for XML (STX, [6]) stylesheets.

The outline of the paper is as follows. Section 2 contains an overview of related work regarding bitstream structure description languages. In Section 3, an explanation is given of the MPEG-21 BSDL framework. The shortcomings of this framework with regard to the BintoBSD Parser are mathematically proven. Also, a number of requirements are formulated. The BSDL framework has to meet these requirements in order to be usable in the context of scalable video bitstreams. Section 4 subsequently discusses our approach for eliminating the shortcomings of the MPEG-21 BSDL specification such that the requirements, as proposed in Section 3, are satisfied. In order to assess the expressivity of our modifications, we generate and transform descriptions of bitstreams encoded by the JSVM reference software. The structure of a JSVM-coded bitstream is discussed in Section 5. An explanation with respect to the customization along the different scalability axes is given as well. A complete performance analysis is provided in Section 6. Finally, Section 7 concludes this paper.

2. Related work

A number of XML-based bitstream structure description languages have been developed in recent years. The Formal Language for Audio-Visual Object Representation (Flavor, [7]) was initially designed as a declarative language with a C++-like syntax to describe the syntax of a coding format on a bit-per-bit basis. Its aim was to simplify and to speed up the development of software that processes audiovisual bitstreams by automatically generating the required C++ or Java code to parse the data. As such, the language allows developers to concentrate on the processing part of the software.
This implies that Flavor only generates an internal memory representation of the structure of the parsed bitstream in the form of a collection of C++ or Java class objects. Therefore, Flavor was extended with tools for generating an XML description of the bitstream syntax and for regenerating a bitstream from the adapted XML description (these extensions are known as XFlavor, [8]). Another bitstream structure description language is the MPEG video Markup Language (MPML, [9]), used to describe the syntax of bitstreams compliant with the MPEG-4 Visual specification [10]. This language is format-specific and generates an XML description based on a bit-per-bit parsing process, implying that all bitstream content is stored in the description. Since XFlavor and MPML do not possess the possibility to describe bitstreams on a high level, these languages are not suited for use in our adaptation system, in which descriptions act as an additional layer on top of the original binary media resource. Moreover, as discussed in [11], embedding the complete bitstream data in an XML document results in


verbose descriptions, which are practically unusable in an adaptation framework.

The MPEG-21 Multimedia Framework contains two languages to describe the high-level structure of media resources in XML, namely MPEG-21 BSDL and the generic Bitstream Syntax Schema (gBS Schema) [12]. gBS Schema enables describing the structure of a bitstream in a format-agnostic manner [13]. As such, only dedicated parsers can be developed to generate the generic descriptions (the format-specific characteristics of the bitstreams are encapsulated in the parser and not in the descriptions). From the XML descriptions, it is possible to generate the corresponding bitstream using a generic parser (i.e., gBSDtoBin). As already mentioned in the introduction, a fully generic framework is envisioned. Therefore, we prefer to use MPEG-21 BSDL technology in this paper because it specifies the functioning of a generic BintoBSD Parser. A more extended explanation of MPEG-21 BSDL is given in Section 3.

The last bitstream structure description language that will be discussed is BFlavor [11,14]. BFlavor was developed to combine the strengths of MPEG-21 BSDL and XFlavor and to eliminate their weaknesses. The structure of a media format is established using the BFlavor language. The corresponding high-level XML description is generated in the same manner as in XFlavor, in particular by an automatically generated parser (which is format-specific). Together with the format-specific parser, the modified Flavorc translator also generates a corresponding MPEG-21 BS Schema, which describes the structure of the bitstream format. This schema is used to regenerate a bitstream from the description by using MPEG-21 BSDL's technology. BFlavor can be considered an acceptable alternative to BSDL. However, we have chosen to use standardized technologies to obtain maximal flexibility.

3. The MPEG-21 bitstream syntax description language

3.1. Overview of the description language

The MPEG-21 BSDL specification is part of the Digital Item Adaptation (DIA, [15]) standard, which is in its turn embedded in the MPEG-21 Multimedia Framework [16]. The DIA specification focuses on the distribution and adaptation of multimedia content in heterogeneous environments, in which MPEG-21 BSDL specifies the format-independent conversion step between a bitstream and a corresponding XML description, and vice versa. As a result, the software modules used do not have to be updated in order to support newly developed (scalable) bitstream formats. MPEG-21 BSDL is a language built on top of the World Wide Web Consortium's (W3C) XML Schema language [17]. The language defines a number of restrictions and extensions to XML Schema, which are fixed in the DIA specification. The entire dataflow of the MPEG-21 BSDL framework is shown in Fig. 1. A certain video sequence

[Figure 1 omitted: block diagram with the components Sequence, Encoding Parameters, Encoder, Bitstream, BS Schema (with the Schemas for BSDL-1 and BSDL-2), BintoBSD Parser, Bitstream Syntax Description, Usage Environment Description (UED), Transformer, Transformed Bitstream Syntax Description, BSDtoBin Parser, Adapted Bitstream, and Decoder. The legend distinguishes normative MPEG-21 BSDL processes from non-normative processes.]

Fig. 1. MPEG-21 BSDL framework for XML-based video content adaptation [12].

is encoded, taking into account a number of encoding parameters. The generated bitstream typically conforms to a certain syntax (mostly fixed in an international standard such as MPEG-4 Visual [10], JPEG2000 [18], etc.). The high-level syntax is described in a Bitstream Syntax Schema (BS Schema) by using the standardized MPEG-21 BSDL language. Such a BS Schema is format-specific and can import the Schema for BSDL-1 and BSDL-2 extensions. MPEG-21 BSDL is established by standardizing these two XML schemata, together with the semantic meaning of these schemata. Once a bitstream is created and the BS Schema is designed, the generic BintoBSD Parser can be used to generate the corresponding Bitstream Syntax Description (BSD). The behavior of this parser is defined in the MPEG-21 DIA specification. This implies that every implementation of the parser generates the same description, given the same bitstream and BS Schema. The adaptation process is realized by transforming the XML description. This metadata adaptation process is guided by the properties of the targeted usage environment, such as the available bandwidth, screen resolution, and CPU power [19]. How to transform the description is not specified in the DIA standard. This is left to the implementer of the adaptation engine. For example, one can use Extensible Stylesheet Language Transformations (XSLT, [20]), Streaming Transformations for XML (STX, [21]), or an implementation based on an XML API (such as the Simple API for XML, SAX,¹ or the Document Object Model, DOM).² The final step in the dataflow is the generation of the actual adapted bitstream. To this end, a generic BSDtoBin Parser is used. Similar to the BintoBSD Parser, the behavior of the BSDtoBin Parser is standardized. The parser uses the (transformed) BSD and the corresponding BS Schema,
¹ The SAX specification can be found on http://www.saxproject.org/.
² The DOM specification can be found on http://www.w3.org/DOM/.


optionally taking the original bitstream as additional input (denoted in Fig. 1 by the dashed arrow). This parser has sufficient knowledge about the semantics of the Schema for BSDL-1 extensions in order to generate the desired bitstream. After decoding the created bitstream, the adapted video sequence can be displayed on the desired device. An in-depth explanation of the MPEG-21 BSDL restrictions and extensions on top of W3C XML Schema, in particular the Schema for BSDL-1 and BSDL-2, is given in [16].

3.2. Shortcomings and complexity analysis of MPEG-21 BSDL

In the literature, several BS Schemata for bitstreams generated by different codecs can be found. In [12], a BS Schema is described for MPEG-4 Visual elementary streams and JPEG2000, [22] discusses a BS Schema for the experimental scalable MC-EZBC codec, in [23] the same is done for the recently standardized H.264/AVC specification, and [24] discusses the performance of a BSD-based adaptation of a scalable wavelet-based codec. All these papers reach the same conclusion, namely that high execution times are needed to obtain a BSD for a certain bitstream. A second conclusion from these papers, and probably the most important one, is that the execution speed of the BintoBSD Parser depends on the length of the bitstream.³ This is in contrast with the BSDtoBin Parser, which is characterized by a constant and fast generation speed independent of the length of the generated bitstream. The performance results of the BSDL Parsers in these papers were obtained by using the MPEG-21 reference software [25]. These software packages are never optimized with respect to performance. The software is only necessary to illustrate the functioning of a certain specification, and, in our case, the functioning of the BintoBSD Parser and BSDtoBin Parser. The exact measurements are therefore less important.
However, the characteristics of the performance results can be standard-specific and can be applicable to all possible implementations. The length-dependent characteristic of the BintoBSD Parser makes the description generation process of the MPEG-21 BSDL framework unusable for real-world applications. In this paper, we propose a solution for this problem. To understand the origin of this shortcoming, we give a theoretical complexity analysis of the functioning of a BintoBSD Parser. In this analysis, we prove that the shortcomings are an inherent problem of the specification. Consequently, these limitations cannot be solved by software optimizations and are thus implementation-independent. As a number of BSDL-2 attributes and facets take an XPath expression as value, it is necessary to guarantee that
³ In this paper, the length of a bitstream is not necessarily expressed in seconds but in the number of parse units such as frames, slices, subbands, and so on.

this expression can be evaluated. Therefore, the already generated description has to be kept in the memory by the BintoBSD Parser. This parser is characterized by a repetitive execution, e.g., a description is generated in a frame-per-frame fashion. The in-memory structure of the already generated BSD reflects this repetitive characteristic. In Fig. 2, a schematic representation of such an internal structure is given. This figure represents a fictive and simple video coding format containing a header with global information, followed by a number of frame structures. Note that most contemporary video coding formats contain a similar structure. The XPath expressions belonging to the conditional elements are bracketed. From this diagram, it is clear that the memory consumption grows linearly with the length of the sequence. Although this behavior is unacceptable, it is necessary to evaluate the XPath expressions existing in the BS Schema. The repetitive character of a bitstream structure results in a repetitive occurrence of the particles. This means that the same XPath expressions have to be evaluated against a growing internal structure. In order to evaluate the XPath expression that is assigned to the ifColorIs2 element in Fig. 2, only 2 branches need to be checked (the header branch and the first frame branch) during the first iteration. In the second step, 3 branches have to be evaluated; in the third step 4 branches; and so on. The cumulated number of verifications caused by the evaluation of the XPath expression after n iterations (i.e., after n parsed frames) is given by:

$$2 + 3 + 4 + \cdots + n + (n + 1) = \frac{(n + 1)(n + 2)}{2} - 1$$

The asymptotical analysis of the number of needed verifications for a sufficiently big number of parsed frames is given by:

$$\frac{(n + 1)(n + 2)}{2} - 1 \approx \frac{n^2}{2} = \Theta(n^2)$$

The Θ-notation is used to indicate the order of increase of the corresponding function. From this derivation, we can estimate the execution time needed by the BintoBSD Parser to generate a BSD under the condition that the BS Schema contains at least 1 XPath expression. Suppose that the time
[Figure 2 omitted: tree diagram. Recoverable labels: bitstream; header (height, width, color); repeated frame structures with type, ifTypeIs1 (./type=1), ifColorIs2 (/bitstream/header/color=2), stream, length, payload, and stopByte. The legend distinguishes conditional elements from normal elements.]

Fig. 2. Internal structure of a generated BSD for a fictive coding format.


to verify one branch (during the evaluation of an XPath expression) is denoted $c_1$ and the time to generate the BSD fragment for 1 frame during 1 iteration is denoted $c_2$. The execution time of the parser to generate the complete BSD after n frames is then given by:

$$T(n) = c_1 \cdot \Theta(n^2) + n \cdot c_2 = \Theta(n^2)$$

From this derivation, we can conclude that the BintoBSD Parser has a quadratic execution time as a function of the number of parsed frames or iteration steps. This quadratic behavior results in a decreasing evaluation speed of the repetitively occurring XPath expressions during the parsing process. This also results in a decreasing generation speed of the BintoBSD Parser as a function of the sequence length.

3.3. Requirements to produce descriptions for scalable bitstreams

Based on the complexity analysis in Section 3.2, two important shortcomings of the MPEG-21 BSDL specification could be identified. The increasing memory consumption and decreasing generation speed of a BSD during the execution of the BintoBSD process lead to a practically unusable technology. These issues are related to the BintoBSD Parser and are caused by the use of XPath expressions in the BSDL specification. Because the XPath expressions are part of BSDL-2, the BSDtoBin Parser encounters no drawback of this mechanism. As proven in the previous subsection, these shortcomings are fundamental problems of the standard that cannot merely be solved by software optimizations. In order to be able to incorporate MPEG-21 BSDL in future multimedia frameworks, we have formulated a number of requirements for the BintoBSD Parser. In [5], Devillers et al. mention that the critical issue for using the BSDL framework in a constrained or streaming environment is the internal memory consumption of the BintoBSD Parser, the BSD transformation engine, and the BSDtoBin Parser. They propose a solution based on SAX events for the BSD transformation engine and the BSDtoBin Parser.
A solution for the BintoBSD Parser is not provided. Our extensions in this paper solve this issue. In order to describe large bitstreams and in order to make use of offline generated BSDs in streaming environments, the following requirements have to be satisfied:

• During the generation of a BSD, the memory consumption of the parser has to be constant and independent of the length of the bitstream.
• The execution time of the BintoBSD Parser must be linear as a function of the length of the corresponding input sequence. In other words, after the start-up period of the parser (transition period), the generation speed must be constant over the complete sequence, i.e., the generation speed has to be independent of the length of the bitstream. The exact speed depends on the BS Schema and the bitstream used, but the value of the speed is less important as long as the algorithm can guarantee a constant behavior.
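The gap between the quadratic behavior derived in Section 3.2 and the linear behavior required above can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, and `bounded_verifications` assumes an idealized parser whose in-memory tree never holds more than a fixed number of branches (2 is an arbitrary choice). It counts branch verifications for the single repeated XPath expression of the fictive format and compares the count with the closed form.

```python
def cumulative_verifications(n: int) -> int:
    """Branches checked over n frames when the full description stays in memory.

    Frame k (1-based) is evaluated against the header branch plus k frame
    branches, i.e., k + 1 branches, as in the analysis of Section 3.2.
    """
    return sum(k + 1 for k in range(1, n + 1))


def closed_form(n: int) -> int:
    """Closed form of the same sum: (n + 1)(n + 2) / 2 - 1."""
    return (n + 1) * (n + 2) // 2 - 1


def bounded_verifications(n: int, branches_kept: int = 2) -> int:
    """Assumption: a parser that keeps at most `branches_kept` branches in memory."""
    return n * branches_kept


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # The explicit sum and the closed form must agree.
        assert cumulative_verifications(n) == closed_form(n)
        print(n, closed_form(n), bounded_verifications(n))
```

For 1000 parsed frames, the unbounded tree requires 501500 verifications, whereas the idealized bounded tree requires 2000; only the second count grows linearly with the number of parse units.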

4. Context-related attributes

In this section, we propose a number of extensions to the MPEG-21 BSDL specification in order to satisfy the requirements of Section 3.3. These extensions result in algorithmic and fundamental modifications to the functioning of the BintoBSD Parser. Furthermore, the extensions have been adopted in the second amendment of the DIA standard [26]. The XPath expressions in a BS Schema are important components to realize conditional operators and loop constructions. Every expression is evaluated against a partial internally stored description. Because the growing internal BSD leads to a quadratic execution time, it is important to keep this description tree as compact as possible. Consequently, parts of the tree have to be removed whenever these parts are no longer necessary for future evaluations of XPath expressions. To achieve this goal, we have developed a number of attributes on top of MPEG-21 BSDL. These new attributes, which are present in the BS Schema, are related to the context tree against which the XPath expressions have to be evaluated. The context tree is the in-memory representation of a part of the already generated BSD. As such, in this paper, our attributes are further referred to as context-related attributes. The context-related attributes steer the BintoBSD Parser to keep the memory consumption low and constant during the BSD generation. This results in a faster evaluation of the XPath expressions, as well as in a BSD generation time that is linear in terms of the number of parse units (e.g., frames or packets). Our attributes extend the Schema for BSDL-2 Extensions such that these new attributes do not have an impact on existing BSDtoBin Parsers (see Fig. 1) [27]. The proposed extensions are backwards and forwards compatible with the MPEG-21 BSDL specification. To achieve the goal as described above, we propose five context-related attributes.
Some of them are used to keep certain elements temporarily in the memory; others are developed to swap unnecessary elements out of the system memory. In Fig. 3, a BS Schema is given for the fictive coding format as presented in Fig. 2. This BS Schema contains our context-related attributes. As such, this BS Schema will be used in the following paragraphs to explain the functioning and semantic meaning of our attributes. The newly introduced attributes belong to the BSDL-2 namespace (prefix bs2) and are underlined in Fig. 3. Note the distinction between the attributes belonging to the Schema for BSDL-1 (prefix bs1) and to the Schema for BSDL-2 (prefix bs2).

• startContext: the value of this attribute is an XPath expression. The evaluation of the expression must result in a string data type representing a marker for the corresponding element. The attribute indicates that the


Fig. 3. BS Schema containing context-related attributes that represents the fictive coding format given in Fig. 2.

element, to which the attribute belongs, has to be stored in the current in-memory description tree at the active context node. The attribute points to the start of a subtree or context of the complete BSD. This attribute is typically used if the element is needed to evaluate a forthcoming XPath expression. If the element in question is conditional (i.e., if it contains a bs2:if or a bs2:ifNext attribute), the context-related bs2:startContext attribute is interpreted after a positive evaluation of the condition. Consequently, the element is kept in the memory and is added to the partially generated BSD. As one can see in line 7 of Fig. 3, by using this attribute, the element bitstream will be stored in the memory as it is necessary to evaluate the XPath expression in line 26. The marker is used to refer to the element in question in a later phase of the parsing process, offering the possibility to remove the element from the internal tree. This situation is shown in line 31 of Fig. 3: the element stream is temporarily kept and the corresponding marker is used in line 39 to remove the element from the internal tree (together with its children).
• partContext: this attribute has the same functionality as the bs2:startContext attribute, in particular, it assists in building up an internal description tree. The type of this attribute is a Boolean data type (instead of an XPath expression as is the case for the bs2:startContext attribute). A value of true implies that the corresponding element has to be added to the internal description tree (again, only after a positive evaluation of a condition, if present). In case the value is false, the element is never stored in the memory. This means that the corresponding element is not required for the evaluation of a forthcoming XPath expression. Of course, the element and accompanying value are still added to the partially generated BSD (only after a positive evaluation of a condition, if present). Hence, the in-memory description tree, against which the XPath expressions are evaluated, is a subset of the partially generated BSD. The default value of the bs2:partContext attribute is false. This implies that an element is not stored in the in-memory description tree if it does not contain a bs2:startContext or bs2:partContext attribute. Because this attribute contains no marker, the corresponding element can only be removed from the memory if an ancestor of the element is swapped out of the internal representation. A BS Schema developer will typically use this attribute if the element is part of the location step of an XPath


expression. This attribute is useful to keep the context manageable without an abundant and meaningless usage of markers. This attribute is for instance used in line 14 of Fig. 3. In this case, the header element has to be stored in the internal tree because it is part of the location step of the XPath expression that can be found in line 26.
• stopContext: the previous two attributes force the context manager to add elements to the internal description tree. The bs2:stopContext attribute is used to remove elements from the internal representation so that the context manager can keep the memory consumption under control. Its value is a list of XPath expressions in which each expression should be resolved in a string data type representing a marker. The context manager removes from the tree the element to which the corresponding marker, reported by the bs2:startContext attribute, points, together with its children. In contrast to the previous attributes, this attribute is not conditional and it is evaluated every time it appears. This is necessary to avoid a situation in which the attribute can only appear in non-conditional elements. The markers indicate which elements are allowed to be removed. The usage of this attribute is shown in line 39 of Fig. 3: the stream element is removed from the internal tree, as well as its children.
• redefineMarker: this attribute is used to rename an existing marker. It contains a list of couples of XPath expressions in which the first component represents an already existing marker (after evaluation) and the second one represents the new marker. The XPath expressions are resolved in the in-memory description tree and are cast to a string data type. This attribute gives the author of a BS Schema advanced control pertaining to the management of the context. This attribute is only processed if the condition, associated with the element this attribute belongs to, is positively evaluated. A situation in which this attribute is useful is given in line 24 of Fig. 3. In line 22, the current parsed frame element is stored in the memory (referred to by the frame marker) and at the same time, the previous frame is removed from the tree (using the oldFrame marker). The frame marker is renamed to oldFrame in line 24 by using the bs2:redefineMarker attribute. This will result in the removal of the current frame element during the next iteration (when processing the following frame).

• defaultTreeInMemory: this attribute is necessary for achieving backwards compatibility with the current MPEG-21 BSDL specification. In contrast to the other context-related attributes, this attribute can only appear in the xsd:schema tag (similar to the bs2:rootElement attribute). Consequently, it is one of the first attributes that is interpreted by the BintoBSD Parser. The type of the attribute in question is a Boolean data type. A value of true indicates that the memory manager of the parser has to keep all elements in the memory. In this case, all other context-related attributes must be ignored. A value of false signals that the elements of the BS Schema can contain context-related attributes and that the parser stores nothing in the system memory by default. The default value of the bs2:defaultTreeInMemory attribute is true, resulting in backwards compatibility. A BS Schema designed by using the current version of the BSDL specification (i.e., the version not containing context-related attributes) does not contain the bs2:defaultTreeInMemory attribute. Since the default value is true, the complete BSD will be stored in the memory.

In addition to the semantic meaning of the context-related attributes, a number of constraints have to be taken into account in order to ensure that every modified BintoBSD Parser generates the same BSD. A first constraint is that the bs2:startContext and bs2:partContext attributes cannot be used in the same xsd:element tag. This is necessary to eliminate ambiguities, as the parser has to decide unambiguously whether to link a marker to a stored element or not. A second constraint is the order in which the attributes have to be interpreted if the element contains multiple context-related attributes. The first attribute that has to be evaluated, if present, is the bs2:startContext or bs2:partContext attribute, then the bs2:redefineMarker attribute, and finally the bs2:stopContext attribute. The bs2:stopContext attribute has to be evaluated after the value of the corresponding element is parsed (meaning that this value can be used in the XPath expression). As one can observe, most context-related attributes contain one or multiple XPath expressions that have to be evaluated to a string data type in order to obtain the value of the corresponding marker. This gives the author of a BS Schema a high flexibility to keep the memory usage

[Figure 4: four panels (a)–(d) showing the internal description tree (bitstream, header, frame, type, color, stream, and length elements) together with the frame and oldFrame markers.]

Fig. 4. Evolution of the internal description tree for the BS Schema of Fig. 3.


D. De Schrijver et al. / J. Vis. Commun. Image R. 18 (2007) 217–239

manageable. It is possible to create the markers dynamically by using earlier parsed element values. For example, one can use an ID syntax element to obtain an ordering in successive markers (see Section 5 for such an example during the discussion of a BS Schema for JSVM6).

In Fig. 4, we show the evolution of the internal description tree for the coding format given in Fig. 2 and represented by the BS Schema of Fig. 3 during processing by our modified BintoBSD Parser. The dashed arrows in this figure represent the markers belonging to the corresponding elements. Explanatory notes for this figure are provided below:
– (a) is the context tree after parsing the complete header;
– (b) represents the context tree in which the type element is partially parsed (i.e., bs2:redefineMarker is not yet interpreted);
– (c) represents the context tree just before the payload element is parsed (note the renaming of the frame marker to oldFrame);
– and finally, (d) is the memory representation after parsing a complete frame.
Note the small internal representation in comparison to the fully generated BSD as given in Fig. 2 (which is the same when our context-related attributes are not taken into account).

Finally, we have optimized the most time-consuming part of the BintoBSD Parser. More precisely, the internal description tree is stored in an XML document model and this model has a significant impact on the performance of the parser (as we will show in Section 6). The most obvious data structure is DOM. However, this is certainly not the best solution to efficiently evaluate an XPath expression. Such an XPath expression is typically evaluated by using a library such as Saxon [28] or Xalan [29]. For instance, the MPEG-21 DIA reference software uses the Xalan library to evaluate XPath expressions. For every XPath evaluation, this library converts the internal DOM tree to another internal XML document model, in particular to the Document Table Model (DTM).
The XPath evaluation is executed on this alternative document model. This conversion is executed very often, resulting in a waste of time, especially for large or growing description trees. In our optimized parser, we have eliminated the conversion step for every XPath evaluation. To this end, the in-memory description is no longer stored in a DOM tree, but directly in a low-level DTM tree, such that the XPath expressions are evaluated directly with little overhead. This performance issue of the Xalan library, caused by the conversion step from a DOM to a DTM tree, could also be eliminated by using another XPath engine, e.g., the Saxon library. However, each engine contains its own internal representation of an XML document. By using two different XML document models in this paper (i.e., a DOM-based and a DTM-based approach), we take into account that the XPath engine used, and in particular its internal representation of an XML document, is an important component of a software implementation of the BintoBSD Parser.

5. Adaptation of joint scalable video model coded bitstreams

The scalable video coding specification used in this paper is the scalable extension of the single-layered H.264/MPEG-4 AVC standard [30]. This specification is still under development and is not expected to be finalized before 2007. Consequently, in the remainder of this paper, we make use of the latest available version, in particular Joint Scalable Video Model 6 (JSVM6, [31]). The structure of bitstreams compliant with JSVM6 is described in XML (using MPEG-21 BSDL), after which adaptations are executed along one or multiple embedded scalability axes.

5.1. Overview of JSVM6

5.1.1. Generation of an embedded scalable bitstream

JSVM6 is a block-based coding scheme. Similar to H.264/AVC, the encoding of the input sequence is done on a macroblock basis. A non-scalable single-layer encoder typically takes the targeted bit rate as an input parameter and compresses the stream as well as possible. Conversely, a scalable video encoder contains a number of embedded scalability axes and the properties of these axes are given as parameters to the encoder, e.g., the number of temporal or spatial resolutions, the quantization of the different quality layers, etc. Based on these parameters, a scalable embedded bitstream is generated. The overall structure of such a JSVM6 encoder is discussed in [31]. Such an encoder generates scalable bitstreams with the following scalability properties.
• Spatial scalability is obtained by down-sampling the original video sequence to different resolutions.
• Temporal scalability can be obtained by using hierarchical B pictures [32].
• SNR scalability is allowed in terms of Coarse and Fine Grain Scalability (CGS and FGS). In order to obtain quality scalability, quality enhancement layers are additionally coded on top of a quality base layer. In case of CGS, only complete quality enhancement layers can be removed. This is in contrast to FGS, where each enhancement layer can be truncated at any arbitrary point [33].

5.1.2. Structure of a fully embedded scalable bitstream

Every coded macroblock has to be encapsulated into a bitstream in such a way that every compliant decoder can regenerate the desired sequence. The syntactical structure


[Figure 5, upper part: hierarchical coding structure in output order for one GOP (I0, B1, B2, B3, P4), showing temporal and spatial dependencies. The 1st spatial layer (e.g., 176x144) is the H.264/AVC-compliant base layer; the 2nd spatial layer (e.g., 352x288) additionally carries FGS data.]

[Figure 5, lower part: bitstream structure in decoding order.
scalability_info SEI message; sequence parameter sets and picture parameter sets belonging to the 1st and 2nd spatial layer;
I0: sub_seq_info SEI message, b.0.0.0.0, s.0.0.1.0, s.0.0.1.1;
P4: sub_seq_info SEI message, b.4.0.0.0, s.4.0.1.0, s.4.0.1.1;
B2: sub_seq_info SEI message, b.2.1.0.0, s.2.1.1.0, s.2.1.1.1;
B1: sub_seq_info SEI message, b.1.2.0.0, s.1.2.1.0, s.1.2.1.1;
B3: sub_seq_info SEI message, b.3.2.0.0, s.3.2.1.0, s.3.2.1.1.
Structure of a nal unit belonging to the base layer: zeroByte, startCode, nal unit, forbiddenZeroBit, nalRefIdc, naluType, slicePayload, sliceHeader, sliceData.
Structure of a nal unit belonging to an enhancement layer: zeroByte, startCode, nal unit, forbiddenZeroBit, nalRefIdc, naluType, temporalLevel, dependencyId, qualityLevel, sliceScalableExtPayload, sliceHeaderScalableExt, sliceDataScalableExt.
b.W.X.Y.Z = NALU part of the H.264/AVC-compliant base layer; s.W.X.Y.Z = NALU part of a spatial enhancement layer. In both cases, W represents a frame number, while X, Y, and Z respectively denote a temporal, spatial, and quality layer.]

Fig. 5. Structure of a bitstream containing a GOP size of 4 (resulting in 3 temporal levels), 2 spatial layers, and 2 quality layers in the 2nd spatial layer.

of such a bitstream is established in a standard. It is this high-level structure that has to be described in a BS Schema by using MPEG-21 BSDL. The hierarchical structure expressing the dependencies between the different frames and layers of a scalable video sequence is given in Fig. 5. The corresponding bitstream structure is provided as well. In this figure, one can see a bitstream that is characterized by a Group Of Pictures (GOP)⁴ size of 4. Every original frame in the bitstream is divided into three dependent frames, in particular, a low-resolution frame (part of the base layer and indicated by the labels b.w.x.0.0), a low-quality full-resolution frame (indicated by s.w.x.1.0), and a high-quality full-resolution frame (coded by using FGS and indicated by s.w.x.1.1). Finally, as one can see in Fig. 5, the temporal scalability has an impact on the decoding order of the various frames. A JSVM6-compliant bitstream is built (similar to an H.264/AVC bitstream) as a succession of nal_units (further abbreviated as NALUs), each preceded by a startcode. From the figure, one can observe that three main categories of building blocks exist.

⁴ In JSVM6, a GOP is built by taking a key picture and all pictures that are temporally located between the key picture and the previous key picture. A picture is called a key picture when all previously coded pictures precede this picture in display order [31].


(1) The first category is indispensable for a decoder and contains two types, in particular the Sequence Parameter Set (SPS) and the Picture Parameter Set (PPS). An SPS contains information applicable to a complete sequence, such as the frame resolution, the profile, cropping information, etc. In Fig. 5, the first NALUs after the Supplemental Enhancement Information (SEI) message NALU are two SPSs, one for each spatial layer. It is clear that a decoder cannot start without the presence of these SPSs. After the SPSs, at least as many PPSs as SPSs are expected, since every PPS must refer to an SPS (without such a reference, an SPS cannot be activated). A PPS applies to a number of pictures of a sequence or spatial layer. This parameter set contains information such as the presence of a deblocking filter, the type of the entropy encoder, the number of slice groups, and the corresponding structure of these slice groups. Every NALU containing coded luma or chroma data refers to a certain PPS (and hence, also to an SPS).

(2) The second category contains NALUs needed by a decoder to reproduce the luma and chroma samples. These NALUs typically contain a slice (with coded macroblocks) and can be divided into two main types, both of which are shown in Fig. 5: a NALU containing a slice belonging to the H.264/AVC-compliant base layer (indicated by the label b.w.x.y.z, such as b.0.0.0.0) or a slice belonging to an enhancement layer (indicated by the label s.w.x.y.z, such as s.0.0.1.0). The H.264/AVC-compliant slices do not contain information about the scalability layer to which the coded slice belongs. This is in contrast to the slices belonging to a picture of an enhancement layer. These slices contain syntactical elements indicating the temporal, spatial, and quality layer of the picture in question (i.e., the temporalLevel, dependencyId, and qualityLevel syntax elements). This information is needed by an adaptation engine to generate a desired bitstream from the original scalable bitstream.

(3) The third category contains the SEI messages. These messages are used by an adaptation engine to obtain an efficient bitstream adaptation process. The first NALU in the bitstream is a scalability_info SEI message (see Fig. 5); the others are sub_seq_info SEI messages. The scalability_info SEI message is used by an adaptation engine to obtain an in-depth view of the structure of the scalable bitstream and of the scalability axes that are embedded in the bitstream. Based on this information, an adaptation engine can build the structure of the customized scalable bitstream. The sub_seq_info SEI message⁵ is needed to pin a picture of the H.264/AVC-compliant (spatial) base layer to the corresponding temporal level, which is necessary to obtain an efficient adaptation process. As one can see in the bitstream structure of Fig. 5, a NALU containing a slice that belongs to the base layer does not contain any information about the temporal level. Indeed, the syntax element temporalLevel is missing in the NALU syntax for the base layer. Therefore, a sub_seq_info SEI message is used in the base layer as an alternative for the temporalLevel syntax element.

⁵ The sub_seq_info SEI message is already specified in the H.264/AVC standard and, in this context, it should be interpreted as follows: "The sub-sequence information SEI message maps a coded picture to a certain sub-sequence and sub-sequence layer" [32].
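The relation between frame number, temporal level, and decoding order visible in the labels of Fig. 5 can be illustrated with a small sketch. It assumes a dyadic hierarchical B structure with complete GOPs, and it emits the pictures of a GOP level by level, which is one valid order in which references precede the pictures that depend on them; the function names are hypothetical and not part of JSVM6.

```python
def temporal_level(frame, gop_size):
    """Temporal level of a frame in a dyadic hierarchical-B GOP (assumed layout)."""
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    pos = frame % gop_size
    if pos == 0:
        return 0  # key picture (I or P)
    num_levels = gop_size.bit_length() - 1          # log2(gop_size)
    trailing_zeros = (pos & -pos).bit_length() - 1  # position of lowest set bit
    return num_levels - trailing_zeros


def decoding_order(num_frames, gop_size):
    """Frames 0..num_frames-1, key picture first, then lower levels before higher.

    Assumes num_frames - 1 is a multiple of gop_size (complete GOPs only).
    """
    order = [0]
    for start in range(1, num_frames, gop_size):
        gop = range(start, min(start + gop_size, num_frames))
        order += sorted(gop, key=lambda f: (temporal_level(f, gop_size), f))
    return order


if __name__ == "__main__":
    order = decoding_order(5, 4)
    print(order)                                  # frame numbers (W in b.W.X.Y.Z)
    print([temporal_level(f, 4) for f in order])  # temporal levels (X)
```

For a GOP size of 4 this reproduces the frame numbers and temporal levels of the base layer labels b.0.0.0.0, b.4.0.0.0, b.2.1.0.0, b.1.2.0.0, and b.3.2.0.0 in Fig. 5.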

5.2. Design of a BS schema for JSVM6

Once the structure of a coded bitstream is known, a BS Schema can be designed by using MPEG-21 BSDL and our context-related attributes. In Fig. 6, a fragment of our BS Schema for a JSVM6-coded bitstream is given. This fragment contains the most important MPEG-21 BSDL attributes and data types, as well as our context-related attributes (which are underlined) and some extra extensions that are further explained in Section 5.3 (which are curly underlined). The bs2:rootElement attribute (of the xsd:schema tag, line 5) signals to the BintoBSD Parser which element of the schema has to be used to start the parsing process. Since this will be the root element (i.e., the bitstream element in line 8) of the generated BSD, we also add a bs2:startContext attribute to allow the usage of absolute XPath expressions further in the BS Schema. Next, the consecutive NALUs are present, each starting with a startcode possibly preceded by one or more zero bytes. These zero bytes are defined in lines 22, 24, and 27. Line 28 specifies the startcode of an actual NALU. The NALU itself is referred to in line 29 and is defined from line 37 onwards. Every NALU contains a nal_unit_type element (line 41). Depending on this type, scalability information will be present in the NALU header. The nal_unit_type syntax element is kept in memory (as indicated by the bs2:partContext attribute, line 41). Its value is used in the XPath expression in line 42 (a type of 20 or 21 means that the NALU contains slice data that is part of an enhancement layer). After this header, the payload of the NALU is parsed depending on the type of the NALU. Therefore, we have used a combination of the xsd:choice group element (line 47) of W3C XML Schema and the bs2:if attribute (lines 49, 50, and 51) of MPEG-21 BSDL. The element whose condition evaluates to true is used to continue the parsing process. For example, in Fig. 6, a part of the schema representing a NALU containing SPS data is shown (namely NALU type 7, as represented by the XPath expression of the bs2:if attribute in line 50).


Fig. 6. Fragment of a BS Schema for Joint Scalable Video Model 6.
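The marker bookkeeping behind the context-related attributes, as described for the frame and oldFrame markers of Fig. 3, can be mimicked with a toy model. This is only a sketch under simplifying assumptions: markers are plain strings rather than XPath results, each marker refers to one stored element, bs2:partContext is not modeled, and the class and function names are hypothetical rather than taken from the reference software.

```python
class ContextTree:
    """Toy model of the parser's in-memory context, keyed by marker name."""

    def __init__(self):
        self.marked = {}  # marker name -> stored element

    def start_context(self, marker, element):
        # Mimics bs2:startContext: keep the element and link a marker to it.
        self.marked[marker] = element

    def redefine_marker(self, old, new):
        # Mimics bs2:redefineMarker: rename the marker, the element stays stored.
        if old in self.marked:
            self.marked[new] = self.marked.pop(old)

    def stop_context(self, marker):
        # Mimics bs2:stopContext: drop the element the marker refers to.
        self.marked.pop(marker, None)


def parse_frames(frames):
    """Replay the frame/oldFrame pattern and report the peak number of stored frames."""
    tree = ContextTree()
    peak = 0
    for frame in frames:
        tree.start_context("frame", frame)         # store the current frame
        peak = max(peak, len(tree.marked))
        tree.stop_context("oldFrame")              # remove the previous frame
        tree.redefine_marker("frame", "oldFrame")  # current frame becomes "old"
    return tree, peak


if __name__ == "__main__":
    tree, peak = parse_frames(["f0", "f1", "f2", "f3"])
    print(tree.marked, peak)
```

However many frames are replayed, at most two frame elements are stored at any time in this model, which is the behavior the markers are meant to achieve.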


The schema fragment for an SPS (from line 56) is a good example of how our context-related attributes, steered from the schema, keep the memory consumption of the parser low. In line 56, the bs2:redefineMarker attribute has been used to rename the byte_stream_nal_unit marker to seq_parameter_set_rbsp. This is the first place in the schema where the NALU type is known, so that the marker (linked to the outer NALU tag) can be assigned a more precise name. Further in the BS Schema fragment (line 65), we observe the seq_parameter_set_id element representing the ID of the corresponding SPS. Because this ID is needed to activate the SPS from a PPS, this element has to be kept in memory. Furthermore, this element also contains a bs2:stopContext attribute that takes an XPath expression as value. The XPath expression constructs a marker based on the last parsed SPS ID, by concatenating a constant string with the ID of the SPS. As a result, the previous SPS with the same ID can be removed from memory, resulting in a manageable context. Finally, in line 67, a bs2:redefineMarker attribute is used, taking an XPath expression as value. This attribute couples the SPS to a marker containing the ID value in its marker name, so that the SPS can be removed later based on this ID value (as can be seen in line 65). In the same manner, the other NALUs are described by using MPEG-21 BSDL and the context-related attributes. Note, in line 5, the use of the bs2:defaultTreeInMemory attribute that signals to a BintoBSD Parser that the BS Schema contains context-related attributes.

5.3. JSVM-specific MPEG-21 BSDL extensions

So far, we have explained our BS Schema based on the MPEG-21 DIA specification, as well as the usage of our context-related attributes in this schema. Nevertheless, it is still impossible to generate a BSD for a JSVM6-coded bitstream when using such a schema as input for the BintoBSD Parser. Therefore, we have extended the BSDL specification with some additional language extensions and data types such that correct BSDs can be generated. These extensions are curly underlined in Fig. 6 and have been adopted by the MPEG consortium in the second amendment of the DIA specification [26].

5.3.1. Exponential Golomb data type

Using MPEG-21 BSDL, it is impossible to parse variable-length coded syntax elements directly. In JSVM6 (as well as in H.264/AVC), signed and unsigned exponential Golomb entropy coded [34] syntax elements appear frequently in the specification. The number of bits used to represent such a syntax element depends on the value being coded. It is possible to describe those coded elements in MPEG-21 BSDL bit by bit, but this results in a tremendous overhead since every single bit of the Golomb code corresponds to one XML element (e.g., the unsigned exponential Golomb code 00111 is parsed as <b_0>0</b_0> <b_0>0</b_0> <b_1>1</b_1> <b_1>1</b_1> <b_1>1</b_1> instead of <id>6</id>). Moreover, the resulting BSD cannot be interpreted in an elegant manner. This issue could be solved by using the bs1:byteRange data type containing the start byte and the length of the syntax element. The disadvantage of this approach is that MPEG-21 BSDL does not allow specifying a range up to the bit level and that the corresponding XML element does not contain the value of the syntax element. This means that the syntax element cannot be used by the transformation engine in order to adapt the bitstream. Therefore, we have made use of the non-normative⁶ bs0:implementation attribute. The usage of this attribute can be seen as an extension mechanism for MPEG-21 BSDL, so that complex and new data types, which are not present in the standard, can be added to the framework. The bs0:implementation attribute allows making calls to Java classes from the BS Schema. These Java classes implement a certain interface so that the MPEG-21 BSDL Parsers can integrate them into the parsing process [23]. The interface is non-normative and is only a part of the reference software. Our signed and unsigned exponential Golomb data type extensions implement this interface. In lines 65 and 67 of Fig. 6, one can see two syntax elements of the SPS encoded by an unsigned exponential Golomb code. The type of these elements refers to a self-defined data type in our BS Schema (in particular, jsvm:UnsignedExpGolomb). From within this data type, a Java class can be called by using the bs0:implementation attribute as shown in the following fragment:

    <xsd:simpleType name="UnsignedExpGolomb"
        bs0:implementation="org.iso.mpeg.mpeg21.dia.bsdl.XSD.datatypes.UnsignedExpGolomb">
      <xsd:restriction base="xsd:string"/>
    </xsd:simpleType>
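To make the variable-length nature of these codes concrete, the following sketch decodes unsigned and signed exponential Golomb codes from a string of '0' and '1' characters. It is an illustration in Python, not the Java data type used in our BS Schema; the function names read_ue and read_se are hypothetical, and a real implementation would read from a bit buffer instead of a string.

```python
def read_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb code starting at bits[pos].

    Returns (value, next_pos). The code consists of n leading zeros,
    a one, and n further bits; its value is the (n + 1)-bit suffix minus 1.
    """
    zeros = 0
    while pos + zeros < len(bits) and bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    if end > len(bits):
        raise ValueError("truncated Exp-Golomb code")
    return int(bits[pos + zeros:end], 2) - 1, end


def read_se(bits, pos=0):
    """Decode one signed Exp-Golomb code: code numbers 1, 2, 3, 4 map to +1, -1, +2, -2."""
    code_num, end = read_ue(bits, pos)
    magnitude = (code_num + 1) // 2
    return (magnitude if code_num % 2 else -magnitude), end


if __name__ == "__main__":
    print(read_ue("00111"))  # the example code from the text
    print(read_se("011"))
```

Decoding the code 00111 from the text yields the value 6 and consumes five bits, so the whole syntax element can be written as a single XML element.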

5.3.2. Emulation prevention bytes

Another essential shortcoming of MPEG-21 BSDL is the impossibility to manage emulation prevention bytes encapsulated in the bitstream. The JSVM6 specification defines startcodes to split a bitstream into successive NALUs (line 28 in Fig. 6). A bitstream structure based on startcodes implies that the specification must contain a mechanism ensuring that only the parse units (NALUs in case of JSVM6) start with a startcode. Consequently, startcodes may not be present in the payload of the parse unit. Therefore, a specification should define emulation prevention bytes (sometimes also called escape codes) so that a startcode cannot appear in the payload of a parse unit. The current version of the MPEG-21 BSDL specification [15] does not contain a mechanism to detect these emulation prevention bytes. As such, a bitstream with emulation prevention bytes may be parsed incorrectly (and hence, the values of the corresponding syntax elements in the BSD will be wrong). The same phenomenon occurs when a bitstream is generated from a BSD by the BSDtoBin Parser. In particular, the BSDtoBin Parser is currently not able to insert the necessary emulation prevention bytes in order to obtain a bitstream that can be correctly decoded.

⁶ Non-normative means that these features are not part of the standard; they are informative or a part of the reference software.

Fig. 7. Coding format containing emulation prevention bytes.

In Fig. 7, a simple fictitious coding format is given in order to explain the need for a mechanism in BSDL to anticipate the occurrence of emulation prevention bytes. This coding format describes the high-level structure of a frame starting with a fixed startcode (in particular, 0x0001) followed by a number of syntax elements. For reasons of simplicity, the actual payload of a frame is not provided. The specification containing this format also tells us that the byte sequence 0x0001 has to be emulated as 0x000301 in case this sequence appears in the payload of a frame (the sequence 0x0001 is a startcode and is used to subdivide the bitstream into successive frames). Furthermore, a bitstream compliant with this specification is also given in the figure. This bitstream can be described in a BS Schema using MPEG-21 BSDL and the corresponding BSDs are shown in Fig. 7.
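The escaping rule of the fictitious format of Fig. 7 can be sketched as a pair of byte-level functions. This is a minimal illustration that only covers the single rule stated above (0x0001 is written as 0x000301 inside a payload); the function names are hypothetical, and the sketch assumes that an unescaped payload never contains the literal sequence 0x000301, since the format description does not say how that case is handled.

```python
START_CODE = b"\x00\x01"   # startcode of the fictitious format
ESCAPED = b"\x00\x03\x01"  # how that sequence is written inside a payload


def add_emulation_bytes(payload):
    """Writer side: make sure the startcode cannot appear in the payload."""
    return payload.replace(START_CODE, ESCAPED)


def remove_emulation_bytes(data):
    """Parser side: recover the original payload bytes."""
    return data.replace(ESCAPED, START_CODE)


def split_frames(stream):
    """Split a stream on startcodes and return the unescaped frame payloads."""
    parts = stream.split(START_CODE)[1:]  # drop whatever precedes the first startcode
    return [remove_emulation_bytes(p) for p in parts]


if __name__ == "__main__":
    payloads = [b"\x12\x00\x01\x34", b"\xAA\xBB"]
    stream = b"".join(START_CODE + add_emulation_bytes(p) for p in payloads)
    print(stream.hex())
    print(split_frames(stream) == payloads)
```

A parser that skips remove_emulation_bytes would read the inserted 0x03 byte as part of a syntax element value, which corresponds to the incorrect BSD situation described for Fig. 7.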


In order to generate correct BSDs for a (scalable) bitstream, we have extended the BSDL-2 Schema with the bs2:emulationBytes attribute. This attribute contains a list of couples. Every couple indicates how a byte sequence containing emulation prevention bytes has to be interpreted without the emulation bytes. Line 6 of Fig. 6 shows this attribute for JSVM6: it tells a BintoBSD Parser that 0x03 is an emulation prevention byte in the byte sequence 0x000003 and that it has to be ignored during the parsing of the value of a syntax element. Only when the read bits or bytes have to be tested by the bs2:ifNext, bs2:startCode, and bs2:endCode attributes do they still have to contain the emulation prevention bytes (since these attributes will typically be used to check for the presence of a startcode). This last observation also applies to the value of an element that is an instance of the bs1:byteRange data type. The offset and the length in these elements have to refer to the original bitstream (and, of course, not to a bitstream without emulation prevention bytes). In Fig. 7, a BSD is given for the case in which the bs2:emulationBytes attribute is not present in the corresponding BS Schema (as such, no emulation prevention bytes are detected and removed), together with a BSD generated by a parser that has interpreted the emulation prevention bytes correctly (steered by the bs2:emulationBytes attribute). From the example, it is clear that incorrect BSDs may be generated when the emulation prevention bytes are not taken into consideration, resulting in possibly incorrect transformations in the subsequent step.

The BSDtoBin Parser cannot interpret the attributes belonging to the BSDL-2 Schema. Therefore, we have defined two new attributes on top of the BSDL-1 Schema such that the emulation prevention bytes are added to the bitstream at the correct places. The first new attribute, in particular bs1:emulateBytes (line 7 in Fig. 6), belongs to the xsd:schema tag of the BS Schema. This attribute is the counterpart of bs2:emulationBytes: it is used to add emulation prevention bytes at the correct places in the bitstream. Of course, it is possible that certain byte sequences do not have to be emulated, for example, when a startcode has to be written to the bitstream. To handle this situation, we have introduced a second attribute, in particular bs1:doNotEmulate. This attribute tells the parser that the value corresponding to this element should not be emulated and must be written to the bitstream without modifications (lines 27 and 28 in Fig. 6).

5.4. XML-based adaptation

To realize a desired adaptation of a scalable bitstream, the structural metadata (i.e., the generated BSDs) have to be transformed. Fragments of the generated BSDs, using the BS Schema of Fig. 6, are provided in Figs. 8 and 9. In this section, we discuss such an XML-driven adaptation process along the temporal, spatial, and SNR scalability axes. Similar to the BintoBSD and BSDtoBin Parser, the adaptation process in our framework is executed by a format-independent engine. Consequently, the BSD transformation is steered by a stylesheet, often implemented in an XML-based transformation language (e.g., XSLT or STX). Such a stylesheet describes the transformation of a BSD. Furthermore, these stylesheets also contain a number of transformation parameters describing the properties of the different scalability axes of the desired bitstream. These parameters are filled in based on the usage environment characteristics.

5.4.1. Processing of the scalability_info SEI message

Before starting the transformation of a BSD, the structure of the scalable bitstream that is subject to the adaptation needs to be investigated. Information about the scalability properties of the bitstream is available in the scalability information SEI message. This SEI message is typically conveyed by the first NALU in the bitstream (indicated by scalability_info SEI message in Fig. 5). A part of the BSD representing the SEI message in question is given in Fig. 8. This excerpt shows information about the characteristics of the third layer of a scalable bitstream (layer_id is 2, line 23). The following information can be deduced in a straightforward manner from this fragment.

(1) Every SEI message is encapsulated in a NALU preceded by a startcode (line 3). The type of data embedded in a NALU is dependent on the value of the nal_unit_type syntax element (line 7). A value of 6 represents an SEI message.

(2) The JSVM standard specifies 27 different SEI messages. The scalability_info SEI message is the most important one in the context of an efficient bitstream adaptation process. As shown in line 12, this SEI message is identified with type 22.
Based on this number, the transformation stylesheet can activate the corresponding part in the stylesheet, in particular, the instructions that are responsible for interpreting this SEI message and for filtering out the necessary information.

(3) The scalability_info SEI message is composed of a number of layers indicated by num_layers_minus1 (line 21). Each layer contains information about the characteristics of possibly adapted bitstreams. From line 26 to line 29 in Fig. 8, the properties of the scalability axes for a certain layer are provided, in particular, the temporal, dependency, and quality level. The remaining information for this layer can be summarized as follows:
• profile and level information for this layer is not present in the SEI message (indicated by the value 0 in lines 32 and 40);
• an adapted bitstream that only contains this layer (as well as the lower layers on which it depends) has a bit rate of 96,000 bps (line 43);
• the pictures in this layer have to be displayed at a frame rate of 7.5 Hz (line 50);⁷
• the frames in this layer have a resolution of 176 × 144 pixels (i.e., 11 × 9 macroblocks, lines 53 and 54).

⁷ The value of avg_frm_rate is expressed in units of frames per 256 s.

Fig. 8. BSD fragment representing a part of the scalability information SEI message.

Fig. 9. BSD fragment containing an SPS, a PPS, a slice NALU of the base layer, and a slice NALU of the enhancement layer.
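The unit conversions behind the layer properties listed above can be written out as a short sketch. The macroblock size of 16 × 16 luma samples is that of H.264/AVC; the function names are hypothetical, and the raw value 1920 used in the example is derived here from the 7.5 Hz stated in the text (7.5 × 256), not read from Fig. 8.

```python
def frame_rate_hz(avg_frm_rate):
    """Convert avg_frm_rate (frames per 256 s) to frames per second."""
    return avg_frm_rate / 256


def resolution_pixels(width_in_mbs, height_in_mbs, mb_size=16):
    """Convert a frame size in macroblocks to luma samples."""
    return width_in_mbs * mb_size, height_in_mbs * mb_size


if __name__ == "__main__":
    print(frame_rate_hz(1920))       # derived example value, see lead-in
    print(resolution_pixels(11, 9))  # 11 x 9 macroblocks from the text
```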

Once the different layers described in the scalability_info SEI message are investigated, the properties of the scalability axes of the desired adapted bitstream are determined. These values are indicated by the following three syntax elements: temporal_level, dependency_id, and quality_level. These three values are kept by the stylesheet and are used to guide the BSD transformation. Furthermore, in case the highest layer in the adapted bitstream is an FGS layer, the truncation ratio for the slices belonging to this FGS layer can be calculated. This calculation is based on the desired bit rate and the bit rate given in the SEI message for the corresponding layer (line 43). The calculated ratio is also kept by the stylesheet. Finally, the stylesheet will generate a new scalability_info SEI message containing only the scalability properties (or layers) of the adapted bitstream (typically, the highest layers are removed and the payload_size_detection and align_sei_payload syntax elements have to be adjusted to remain compliant with the JSVM specification). This new scalability_info SEI message is stored at the beginning of the transformed BSD.

5.4.2. Processing of the parameter sets

After having parsed and interpreted the scalability_info SEI message, the remainder of the BSD can be processed. Fig. 9 provides a number of BSD fragments for NALUs that do not convey SEI messages. The header information of these NALUs is the same as for the NALU that contains an SEI message (lines 2–7 in Fig. 8). Hence, this information has been omitted in the different BSD fragments. The BSD transformation process continues by processing the parameter sets, in particular the SPSs and PPSs. An example of an SPS is described from line 1 to 20 in Fig. 9, while lines 21–36 describe a PPS. Depending on the NALU type (line 6 or 26), the following actions are taken during the BSD transformation.
• In case of an SPS (nal_unit_type equal to 7), the resolution of the sequence that belongs to this SPS is investigated. Each SPS typically describes one spatial layer. The resolution is fixed in the SPS by the pic_width_in_mbs_minus1 (line 14) and pic_height_in_map_units_minus1 (line 15) syntax elements. The resolution of the adapted bitstream, suited for a particular usage environment, is determined during the first step, in particular during the analysis of the scalability_info SEI message. Based on this information, an SPS is copied to the transformed BSD if it represents a lower or equal resolution; the SPSs representing a higher resolution are removed from the BSD.
• In case of a PPS (nal_unit_type equal to 8), the reference to the SPS is crucial in the transformation (i.e., the value of the seq_parameter_set_id in line 30). All PPSs referring to SPS IDs that are no longer present (see previous bullet) are removed from the BSD.
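The two rules above can be sketched as a filter over a list of NALU descriptions. This is a Python illustration rather than an XSLT or STX stylesheet; representing each NALU as a dictionary keyed by syntax element names is an assumption made for the example, as is the function name. The sketch further assumes progressive content (so that pic_height_in_map_units_minus1 + 1 is the frame height in macroblocks) and that every SPS precedes the PPSs that refer to it.

```python
MB_SIZE = 16  # luma samples per macroblock side in H.264/AVC


def filter_parameter_sets(nalus, max_width, max_height):
    """Keep SPSs up to the target resolution and only the PPSs that refer to them."""
    kept = []
    kept_sps_ids = set()
    for nalu in nalus:
        nal_type = nalu["nal_unit_type"]
        if nal_type == 7:  # SPS
            width = (nalu["pic_width_in_mbs_minus1"] + 1) * MB_SIZE
            height = (nalu["pic_height_in_map_units_minus1"] + 1) * MB_SIZE
            if width <= max_width and height <= max_height:
                kept_sps_ids.add(nalu["seq_parameter_set_id"])
                kept.append(nalu)
        elif nal_type == 8:  # PPS
            if nalu["seq_parameter_set_id"] in kept_sps_ids:
                kept.append(nalu)
        else:
            kept.append(nalu)  # other NALUs are handled in later steps
    return kept


if __name__ == "__main__":
    nalus = [
        {"nal_unit_type": 7, "seq_parameter_set_id": 0,
         "pic_width_in_mbs_minus1": 10, "pic_height_in_map_units_minus1": 8},
        {"nal_unit_type": 7, "seq_parameter_set_id": 1,
         "pic_width_in_mbs_minus1": 21, "pic_height_in_map_units_minus1": 17},
        {"nal_unit_type": 8, "pic_parameter_set_id": 0, "seq_parameter_set_id": 0},
        {"nal_unit_type": 8, "pic_parameter_set_id": 1, "seq_parameter_set_id": 1},
    ]
    for nalu in filter_parameter_sets(nalus, 176, 144):
        print(nalu)
```

With a target resolution of 176 × 144, only the SPS with ID 0 and the PPS referring to it remain in this example.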

5.4.3. Processing of the NALUs containing slices

The last step in the transformation process consists of the processing of NALUs containing slices. The syntax of the slice NALUs can be split up into two categories. The first category represents slices belonging to the H.264/AVC-compliant (spatial) base layer (nal_unit_type equal to 1 or 5, lines 37–51), while the second category contains slices that are part of an enhancement layer (nal_unit_type equal to 20 or 21, lines 52–74). As discussed below, the slice NALUs are processed in a different way depending on the scalability axis along which an adaptation is executed.

(1) Adaptations along the dependency axis will result in spatial or CGS rescaled bitstreams (which is also the reason why we are talking about the dependency axis instead of the spatial axis). In order to execute an adaptation along this axis, the value of the dependency_id syntax element has to be checked. This element is provided in line 63 of Fig. 9. Its value has to be compared with the corresponding value obtained during the analysis of the scalability_info SEI message (the reference value). If the value of dependency_id is higher than the reference value, then the complete NALU has to be removed from the BSD. Otherwise, the NALU has to be kept. NALUs belonging to the H.264/AVC-compliant base layer do not contain this syntax element and have to be kept as well. Indeed, without the spatial or quality base layer, the higher layers cannot be decoded.

(2) Adaptations along the temporal axis can be executed in a similar way as for the dependency axis. For this axis, we make a distinction between the NALUs belonging to the H.264/AVC-compliant base layer and the NALUs that are part of the enhancement layer, because the base layer is also divided into different temporal levels.
• A NALU that belongs to an enhancement layer contains a NALU header with scalability information about the layer to which it belongs. For the temporal scalability, the value of the temporal_level syntax element is investigated. In case this value indicates that the NALU belongs to a higher temporal level than desired, the complete NALU has to be removed from the BSD. Here, the value is again compared with the corresponding value obtained from the scalability_info SEI message.
• A NALU that belongs to the H.264/AVC-compliant layer does not contain scalability information in the NALU header. Consequently, such a NALU does not contain information about the temporal levels. Therefore, these NALUs are preceded by a sub_seq_info SEI message as explained in Section 5.1.2. This message gives information about the temporal level to which the following NALUs belong and this information is used by the transformation stylesheet in order to decide whether to remove or to keep a certain NALU. This decision is again based on the value of the temporal_level syntax element obtained from the scalability_info SEI message. An example and extensive explanation of such a sub_seq_info SEI message is given in [22].


It is clear that an adaptation along this axis results in a bitstream with a lower temporal resolution or frame rate.

(3) The last scalability axis along which adaptations can be executed is the quality axis. This axis is used to determine the number of progressive refinement layers. In other words, the axis is needed to obtain Fine Grain Scalability (FGS). Because NALUs belonging to the H.264/AVC-compliant base layer cannot be coded as a progressive refinement layer, all these NALUs are kept in the transformed BSD. The NALUs of an enhancement layer contain the quality_level syntax element (line 64). The NALUs with a lower value for this element than the value obtained during the analysis of the scalability_info SEI message are kept in the adapted BSD. The NALUs with a higher value are removed from the BSD. If a NALU has the same value for the quality_level syntax element as obtained from the scalability_info SEI message and this value is strictly positive, then the NALU belongs to a progressive refinement enhancement layer. This means that it can be truncated at any byte position. By using the truncation ratio, as calculated during the analysis of the scalability_info SEI message (see Section 5.4.1), it is possible to determine the byte position at which the slice has to be truncated. Therefore, the transformation will modify the value of the slice_payload element (line 69). The value of this element identifies a byte range and consists of two numbers: the first one is the start byte of the slice (in the original bitstream) and the second one is the length of the slice in bytes. By decreasing the length in this element (e.g., by changing 3862 to 2500 in line 69), the payload of the slice will be truncated, resulting in the desired bit rate and also in a lower visual quality.
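The decision rules for the three scalability axes can be summarized in a small Python sketch. This is an illustration under stated assumptions, not the stylesheet used in our framework: the SliceNalu class and its fields are hypothetical; the three axes are checked together in one function although the text treats them one at a time; the temporal level of a base-layer NALU is assumed to have been copied from the preceding sub_seq_info SEI message; and the truncation ratio is assumed to be the fraction of payload bytes that is kept.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SliceNalu:
    nal_unit_type: int                    # 1 or 5: base layer; 20 or 21: enhancement layer
    payload_start: int                    # first number of slice_payload
    payload_length: int                   # second number of slice_payload
    temporal_level: Optional[int] = None  # base layer: taken from sub_seq_info (assumption)
    dependency_id: Optional[int] = None   # enhancement-layer NALU header only
    quality_level: Optional[int] = None   # enhancement-layer NALU header only


def adapt_slice(nalu, ref_dependency_id, ref_temporal_level,
                ref_quality_level, truncation_ratio=1.0):
    """Return the NALU to keep (possibly truncated), or None if it is removed."""
    if nalu.nal_unit_type in (1, 5):
        # Base layer: only the temporal axis can remove such a NALU.
        if nalu.temporal_level is not None and nalu.temporal_level > ref_temporal_level:
            return None
        return nalu
    if nalu.dependency_id > ref_dependency_id:
        return None
    if nalu.temporal_level > ref_temporal_level:
        return None
    if nalu.quality_level > ref_quality_level:
        return None
    if nalu.quality_level == ref_quality_level and nalu.quality_level > 0:
        # Progressive refinement layer: shorten the byte range, keep the start byte.
        new_length = int(nalu.payload_length * truncation_ratio)
        return replace(nalu, payload_length=new_length)
    return nalu


# Illustrative values; the start byte 1000 is made up, 3862 is the length from line 69.
fgs = SliceNalu(20, 1000, 3862, temporal_level=2, dependency_id=1, quality_level=1)
truncated = adapt_slice(fgs, 1, 4, 1, truncation_ratio=0.5)
print(truncated.payload_start, truncated.payload_length)
```

Note that only the length of the byte range changes; the start byte still refers to the original bitstream.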

6. Performance analysis

In this section, we discuss the performance of our extensions in a fully XML-driven adaptation framework based on MPEG-21 BSDL.

6.1. Experimental methodology

The performance analysis requires a number of scalable bitstreams with different characteristics. Based on the bitstream properties, we investigate the impact of a certain characteristic on the performance of the different parsers. We have chosen two video sequences with different content: the well-known ice sequence and a movie trailer. The sequences are encoded with varying parameters for the different scalability axes. Table 1 contains the most important coding parameters and characteristics of the resulting scalable bitstreams. The two sequences have a different (maximal) spatial resolution, resulting in a different number of spatial layers. As mentioned in Section 3.3, the relation between the length of the sequence and the execution time is crucial for the practicability of our adaptation framework. Therefore, we have generated scalable bitstreams containing various numbers of frames for each sequence (from very short sequences to longer sequences for the trailer). The size of the GOPs is fixed at 16 frames, resulting in 5 temporal levels. The number of NALUs is also given in the table because NALUs are the fundamental building blocks of a JSVM6 bitstream and are the units described in XML. Furthermore, the number of SPSs and PPSs embedded in the bitstream is given, together with the number of FGS layers for each spatial layer.

In order to obtain BSDs for the scalable bitstreams, we have developed two BS Schemata with a different granularity. The first BS Schema describes the syntax of the JSVM6 bitstreams up to and including the NALU header for the slices, and the complete SPSs and PPSs (Fig. 9 contains an example of a BSD generated by using this BS Schema). The generated BSDs contain enough information to exploit temporal, spatial, and SNR scalability.

The second BS Schema describes the syntax of the slices up to and including the slice header, in order to give the adaptation engine more syntactical information about the scalable bitstream, such as the frame number, slice type, and quantization parameter. This information is necessary to allow more complex adaptations such as ROI scalability
Table 1
Characteristics of the scalable bitstreams used in the performance analysis

Name          #Frames  Resolution  #NALU   #SPS/#PPS  #Temp levels  #Spat levels  #FGS levels  Size (bytes)
ice_32        32       704 × 576   210     3/6        5             3             2/2/3        1,656,240
ice_75        75       704 × 576   481     3/6        5             3             2/2/3        3,653,277
ice_150       150      704 × 576   949     3/6        5             3             2/2/3        6,480,466
ice_300       300      704 × 576   1885    3/6        5             3             2/2/3        12,449,844
ice_480       480      704 × 576   3010    3/6        5             3             2/2/3        21,634,940
trailer_50    50       1280 × 512  663     4/8        5             4             2/2/2/3      838,641
trailer_100   100      1280 × 512  1313    4/8        5             4             2/2/2/3      1,478,034
trailer_250   250      1280 × 512  3263    4/8        5             4             2/2/2/3      8,669,816
trailer_500   500      1280 × 512  6513    4/8        5             4             2/2/2/3      39,593,903
trailer_1000  1000     1280 × 512  13,013  4/8        5             4             2/2/2/3      83,549,861
trailer_2000  2000     1280 × 512  26,013  4/8        5             4             2/2/2/3      175,511,839


and key frame selection. More details regarding these two use cases are provided below.
• Exploiting Region of Interest (ROI) scalability in JSVM6-coded bitstreams. An ROI will typically be specified by an encoder by using the Flexible Macroblock Ordering (FMO) tool. FMO type 2 makes it possible to define a rectangular area in the video pane that is coded independently of the rest of the pane. The generated bitstreams containing the ROI are described in XML (using BSDL) and the transformation engine selects only the slices that belong to the ROI. The selection process, in the XML domain, has to investigate the value of the first_mb_in_slice syntax element; based on this value, the transformation engine can remove or keep the corresponding NALU (or slice). This syntax element is part of the slice header and, therefore, this header also has to be described in XML [35].
• Selecting the intra-coded frames in order to obtain a slide show of the original sequence. Intra-coded frames in a JSVM6-coded bitstream can only be selected based on the value of the slice_type syntax element. An intra-coded slice has 2 or 7 as value for slice_type, and a frame is intra coded if all slices of this frame are intra coded. The syntax element slice_type also belongs to the slice header. Consequently, it also has to be described in XML.

The influence of our extensions on the performance of the MPEG-21 BSDL Parsers is determined by modifying version 1.2.1 of the MPEG-21 BSDL reference software [25]. In order to execute our experiments, we have created three software packages based on version 1.2.1. A description of the available features in the three packages is given in Table 2. For the three packages, the performance of the BintoBSD Parsers is investigated in terms of execution time and memory consumption. For every sequence of Table 1, a BSD is generated for each of the two BS Schemata. The generated BSDs are the same for the three packages.
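The key frame selection rule stated above (slice_type equal to 2 or 7, and every slice of the frame intra coded) can be sketched as follows. The input format is an assumption made for illustration: each slice is given as a pair of a hypothetical frame identifier and its slice_type, both of which would be read from the slice header in the extended BSD.

```python
INTRA_SLICE_TYPES = {2, 7}  # slice_type values of intra-coded slices


def intra_coded_frames(slices):
    """Return the identifiers of frames in which every slice is intra coded.

    `slices` is an iterable of (frame_id, slice_type) pairs; frame_id is a
    hypothetical key that groups the slices of one frame.
    """
    all_intra = {}
    for frame_id, slice_type in slices:
        is_intra = slice_type in INTRA_SLICE_TYPES
        # A single non-intra slice disqualifies the whole frame.
        all_intra[frame_id] = all_intra.get(frame_id, True) and is_intra
    return sorted(frame for frame, ok in all_intra.items() if ok)


# Frame 0: two intra slices; frame 1: mixed; frame 2: one intra slice.
example = [(0, 7), (0, 2), (1, 0), (1, 2), (2, 2)]
key_frames = intra_coded_frames(example)
print(key_frames)
```

Frame 1 is rejected even though one of its slices is intra coded, which is exactly why the slice type of every slice has to be available in the description.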
The transformation of the generated BSDs can be executed in different manners. Since we want to obtain a fully XML-driven adaptation framework, the different transformations have to be implemented by using an XML-based transformation language. The two most obvious languages are XSLT and STX [6]. XSLT is only applicable if the original BSD is segmented into successive, smaller process units. This is due to the fact that XSLT requires loading the
complete XML fragments prior to the start of the transformation. Such an approach is explained in more detail in [5]. However, we prefer using STX because of its intrinsic streaming capabilities and because it does not require a segmentation of the original BSD. The transformed BSDs are fed into the modified BSDtoBin Parser to obtain the adapted bitstreams. Because our context-related attributes and the DTM data structure are not used by a BSDtoBin Parser, this parser is the same in the three packages.

The measurements were done on a PC with an Intel Pentium IV CPU, clocked at 2.8 GHz with Hyper-Threading, and 1 GB of RAM. The operating system used was Windows XP Pro (service pack 2). Sun Microsystems' Java 2 Runtime Environment (Standard Edition version 1.5.0_02-b09) was running as Java Virtual Machine. The memory consumption of the different Java-based programs involved was registered by relying on version 4.1.1 of the JProfiler software package (see footnote 8). All time measurements were executed five times and the average was calculated over the five runs. For every set of five runs, the standard deviation was calculated in order to verify that five runs are sufficient to obtain reliable results.

Table 2
Characteristics of the BintoBSD Parsers in the different software packages used

Name       Knowledge of emulation prevention bytes  Interpretation of context-related attributes  DTM as XML document model
Package 1  yes                                      no                                            no
Package 2  yes                                      yes                                           no
Package 3  yes                                      yes                                           yes

6.2. Performance of the BintoBSD Parser

The results of our experiments, conducted to investigate the performance of the BintoBSD Parser, are provided in Table 3. For the two developed BS Schemata, one can find the performance of the three packages in terms of execution time, speed, and memory consumption. Furthermore, the number of emulation prevention bytes is given, together with the sizes of the plain-text generated BSDs and the compressed ones. The high number of emulation prevention bytes present in a JSVM6-compliant bitstream explains the need for our extension to the MPEG-21 BSDL specification to detect and to generate the emulation prevention bytes in the bitstream.

In the table, one can observe that the memory usage increases in direct proportion to the length of the sequence in case of package 1 (as shown in Section 3.2). By using our context-related attributes (packages 2 and 3), it is possible to keep the memory usage constant and independent of the length of the sequence during the generation of the BSDs. Packages 2 and 3 show the same constant behavior; they only differ in the amount of memory used. This means that the underlying XML document model has a significant impact on the memory consumption of the BintoBSD Parser. It is clear that without our proposed context-related attributes, the BintoBSD Parser cannot be used for parsing long video sequences such as movies. We can conclude that the first requirement of Section 3.3 is satisfied if our context-related attributes are used.
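The streaming argument made for STX, and the constant memory usage that comes from not retaining what is no longer needed, can be illustrated with a Python sketch that filters a description one NALU element at a time. It is an analogy only: it is neither STX nor part of the BSDL reference software, and the element names (bitstream, NALU, temporal_level) form a hypothetical, namespace-free BSD layout in which all NALU elements are direct children of the root.

```python
import io
import xml.etree.ElementTree as ET


def stream_filter(source, sink, max_temporal_level):
    """Copy NALU elements up to max_temporal_level from source to sink.

    Each NALU element is detached from the tree as soon as it has been
    handled, so the in-memory tree does not grow with the description.
    Attributes of the root element are not copied in this sketch.
    """
    events = ET.iterparse(source, events=("start", "end"))
    _, root = next(events)  # the first event is the start of the root element
    sink.write(f"<{root.tag}>")
    for event, elem in events:
        if event == "end" and elem.tag == "NALU":
            level = int(elem.findtext("temporal_level", default="0"))
            if level <= max_temporal_level:
                elem.tail = None  # do not copy whitespace that follows the element
                sink.write(ET.tostring(elem, encoding="unicode"))
            root.remove(elem)  # discard the processed unit
    sink.write(f"</{root.tag}>")


bsd = (b"<bitstream>"
       b"<NALU><temporal_level>0</temporal_level></NALU>"
       b"<NALU><temporal_level>3</temporal_level></NALU>"
       b"<NALU><temporal_level>1</temporal_level></NALU>"
       b"</bitstream>")
out = io.StringIO()
stream_filter(io.BytesIO(bsd), out, max_temporal_level=1)
print(out.getvalue())
```

A tree-based transformation would instead load all NALU elements before writing the first one, which is the behavior that forces a segmentation of the BSD when XSLT is used.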

8 The Java profiler JProfiler can be found on http://www.ej-technologies.com/.


Table 3
Performance results of the BintoBSD Parser

                        | Package 1 (no context; no DTM)   | Package 2 (context; no DTM) | Package 3 (context; DTM)    |
Sequence      BS Schema | Time       STD     Speed  MC     | Time     STD   Speed  MC    | Time     STD   Speed   MC   | #EPB       BSDo    BSDc
ice_32        Basic     | 59.68      0.48    3.35   34.12  | 12.18    0.04  16.42  26.66 | 2.77     0.01  72.25   2.45 | 11,110     282     7
ice_32        Extended  | 293.02     0.66    0.68   57.92  | 23.40    0.07  8.55   42.03 | 5.98     0.01  33.47   2.67 | 11,110     476     11
ice_75        Basic     | 217.90     1.98    2.16   38.07  | 23.04    0.05  20.44  26.96 | 4.57     0.01  103.06  2.44 | 27,357     557     11
ice_75        Extended  | 2437.70    3.88    0.19   59.66  | 49.67    0.58  9.48   42.66 | 11.83    0.01  39.82   2.67 | 27,357     1033    20
ice_150       Basic     | 729.58     3.72    1.29   43.70  | 41.44    0.05  22.66  27.50 | 7.16     0.02  131.07  2.50 | 55,143     1032    18
ice_150       Extended  | 18415.95   90.61   0.05   78.25  | 93.72    0.42  10.02  42.36 | 21.96    0.01  42.76   2.77 | 55,143     1991    35
ice_300       Basic     | 2646.20    3.17    0.71   56.27  | 78.40    0.19  23.92  27.25 | 12.62    0.04  148.57  2.78 | 112,663    1982    31
ice_300       Extended  | 160542.50  432.04  0.01   100.57 | 181.81   0.87  10.31  43.91 | 41.75    0.02  44.91   2.72 | 112,663    3903    66
ice_480       Basic     | 6753.87    72.05   0.44   70.37  | 123.64   0.17  24.26  27.52 | 19.98    0.04  150.15  2.72 | 175,919    3125    48
ice_480       Extended  | >200,000   n/a     n/a    n/a    | 289.43   1.52  10.37  43.96 | 66.75    0.08  44.94   2.77 | 175,919    6207    103
trailer_50    Basic     | 390.32     0.89    1.67   41.45  | 30.68    0.08  21.19  27.34 | 3.73     0.01  174.17  2.56 | 61,704     761     22
trailer_50    Extended  | 13076.00   28.99   0.05   72.72  | 88.71    0.09  7.33   43.26 | 28.57    0.03  22.75   2.85 | 61,704     1432    22
trailer_100   Basic     | 1320.67    0.71    0.98   49.44  | 56.40    0.10  23.05  26.67 | 5.94     0.01  218.71  2.79 | 121,953    1408    20
trailer_100   Extended  | 116139.40  221.12  0.01   89.60  | 172.94   0.81  7.52   43.83 | 55.41    0.11  23.46   2.87 | 121,953    2773    39
trailer_250   Basic     | 7748.57    27.02   0.42   73.05  | 135.56   0.15  23.97  27.42 | 15.09    0.08  215.32  2.57 | 240,709    3352    45
trailer_250   Extended  | >200,000   n/a     n/a    n/a    | 428.03   2.97  7.59   44.43 | 138.68   0.49  23.44   2.87 | 240,709    6812    93
trailer_500   Basic     | 41322.33   157.58  0.16   117.61 | 277.10   1.04  23.46  27.53 | 39.48    0.43  164.64  2.63 | 354,927    6597    90
trailer_500   Extended  | >200,000   n/a     n/a    n/a    | 856.63   2.00  7.59   44.50 | 285.32   0.07  22.78   3.63 | 354,927    13,528  187
trailer_1000  Basic     | >200,000   n/a     n/a    n/a    | 549.88   0.83  23.64  27.52 | 76.22    0.05  170.55  2.66 | 627,953    13,086  178
trailer_1000  Extended  | >200,000   n/a     n/a    n/a    | 1720.25  4.07  7.56   44.89 | 573.92   1.20  22.65   3.64 | 627,953    27,078  376
trailer_2000  Basic     | >200,000   n/a     n/a    n/a    | 1100.79  2.75  23.62  27.63 | 156.60   0.63  166.03  2.67 | 1,219,208  26,073  352
trailer_2000  Extended  | >200,000   n/a     n/a    n/a    | 3432.24  7.61  7.58   44.84 | 1154.71  2.34  22.52   3.67 | 1,219,208  53,859  745

Time is expressed in seconds (s); STD, Standard Deviation; Speed is expressed in NALUs per second (NALUs/s); MC, Memory Consumption expressed in MegaBytes (MB); EPB, Emulation Prevention Bytes; BSDo, size of plain-text BSDs expressed in KiloBytes (KB); BSDc, size of compressed BSDs expressed in KB.
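As a worked reading of Table 3, the following Python sketch computes the speed-up factors between the packages for a few rows. The execution times are copied from the table; the helper function and the choice of rows are ours and serve only as an illustration.

```python
# (sequence, BS Schema) -> execution time in seconds for packages 1, 2, and 3,
# copied from Table 3.
TIMES = {
    ("ice_300", "Basic"): (2646.20, 78.40, 12.62),
    ("ice_300", "Extended"): (160542.50, 181.81, 41.75),
    ("trailer_500", "Basic"): (41322.33, 277.10, 39.48),
}


def speedups(times):
    """Return the time ratios between the three packages for one table row."""
    p1, p2, p3 = times
    return {
        "package 1 / package 2": p1 / p2,  # effect of the context-related attributes
        "package 2 / package 3": p2 / p3,  # effect of the DTM document model
        "package 1 / package 3": p1 / p3,  # combined effect
    }


for row, times in TIMES.items():
    ratios = speedups(times)
    summary = ", ".join(f"{name}: {value:.1f}x" for name, value in ratios.items())
    print(row, summary)
```

For ice_300 with the basic BS Schema, the combined factor is on the order of 200, and the factors grow with the length of the sequence because the execution time of package 1 grows faster than linearly.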

Fig. 10. Execution speed of the BintoBSD Parser with the basic BS Schema. [Plot: x-axis, number of NALUs (0 to 14,000); y-axes, execution speed for packages 2 and 3 (NALU/s, 0 to 250) and execution speed for package 1 (NALU/s, 0 to 4); one curve per sequence (Ice, Trailer) and per package (1, 2, 3).]

The execution time, and the resulting generation speed, is the other quantity measured in our experiments. The results for package 1 make it immediately clear that MPEG-21 BSDL, without our extensions, is unusable in a practical framework as discussed in [5]: the execution times are extremely high and the generation speed decreases with the length of the sequence. The reliability of our measurements is substantiated by the small standard deviations (STDs). The performance of packages 2 and 3 can be analyzed better when the results of Table 3 are represented graphically. The generation speed for the three packages is visualized in Fig. 10 for the basic BS Schema and in Fig. 11 for the extended schema. Again, a first conspicuous phenomenon in both charts is the decreasing speed for package 1. For packages 2 and 3, the speed first increases, after which it converges to a constant value. The low speed for very short sequences is caused by the start-up of the parser. During this start-up phase, many required Java classes have to be loaded dynamically and the BS Schema has to be parsed and interpreted. The total start-up time, which is the time prior to the parsing of the first bit of the bitstream, is approximately 1.2 seconds. The influence of the start-up time is particularly noticeable for package 3 because of its low execution times for short sequences. For long bitstreams, such as those of the trailer sequence, the start-up time is spread out and can be ignored, resulting in a linear execution time and a constant generation speed. From both figures, we can conclude that the second requirement in Section 3.3 can only be satisfied if our context-related attributes are taken into account, independently of the complexity of the BS Schema. The complexity of a BS Schema only has an impact on the exact execution time and not on the linearity. The complexity can be expressed in terms of the number of XPath expressions in the schema and the complexity of each individual XPath expression. The length of a BS Schema only has an impact on the start-up time and not on the global execution time. This becomes very clear from the table, because our extended schema contains many complex XPath expressions, namely expressions referring from within the slice header to syntax elements present in the PPSs and SPSs.

The generation speed in case of the basic schema (Fig. 10) converges to approximately the same value, independently of the video sequence: 170 NALU/s for package 3 and 24 NALU/s for package 2. The small speed differences between the sequences are caused by I/O operations (as we will show further in this paragraph). This fixed attractor point is no longer present in Fig. 11 for our extended BS Schema. Instead, every sequence converges to its own attractor. The reason for the divergence of the attractor points is the complexity of the XPath expressions present in the slice header syntax and the increasing number of PPSs and SPSs (see Table 1). Almost all XPath expressions of the slice header refer to an SPS or PPS present in memory. Consequently, the evaluation time of an XPath expression depends on the number of available SPSs and PPSs, which differs between the sequences. This makes the generation speed dependent on the sequence. Our basic BS Schema does not contain XPath expressions referring to an SPS or PPS. Consequently, the number of SPSs and PPSs does not have an influence on the generation speed (as one can see in Fig. 10). A last remarkable phenomenon is the bulge in the curve for the trailer sequence and package 3 in Fig. 10. This bulge is also present for package 2, but it is less pronounced. The bulge is caused by the content of the video sequence. The first

Fig. 11. Execution speed of the BintoBSD Parser with the extended BS Schema. [Plot: x-axis, number of NALUs (0 to 14,000); y-axes, execution speed for packages 2 and 3 (NALU/s, 0 to 50) and execution speed for package 1 (NALU/s, 0 to 0.8); one curve per sequence (Ice, Trailer) and per package (1, 2, 3).]

100 frames of the trailer are almost identical: they show information about the movie on a static green background. The encoder can compress these frames very well, resulting in small bitstream sizes, as one can see in Table 1. Because the bitstream does not contain many bytes for representing the first 100 frames, the BintoBSD Parser spends less time on I/O operations, resulting in a higher generation speed. In order to confirm this statement, we have profiled package 3 by using JProfiler. Most of the time (up to 70% of the total time) is spent on I/O operations. This observation immediately gives a good indication of how the parser can be optimized further in order to be used in commercial products.

The transformation of a BSD is done using STX, and a performance analysis of the stylesheets is given in [36]. The BSDtoBin Parser has already been evaluated multiple times in the literature. We refer to [12], [22], or [24] for performance results that are in line with our measured results. From these results, one can conclude that the transformation (using STX), together with the generation of the adapted bitstream, can be realized in real time, meaning that the application scenarios discussed in [5] are applicable to JSVM6-coded bitstreams.

7. Conclusions

In this paper, the MPEG-21 BSDL specification is discussed in order to obtain a format-agnostic framework for video content adaptation. This specification makes it possible to describe the structure of a bitstream in XML, after which the adaptation can be executed on the XML

description instead of the bitstream itself. Regrettably, the first version of the MPEG-21 BSDL specification is not usable in a practical adaptation architecture because of an increasing memory consumption, an unacceptably high execution time, and a decreasing generation speed of the BintoBSD Parser during the generation of the XML description. Therefore, we have formulated requirements in order to obtain a feasible framework for format-independent media content adaptation. A number of extensions to MPEG-21 BSDL are discussed in this paper to satisfy these requirements. Bitstreams compliant with the scalable extension of H.264/AVC are used to measure the performance of our optimized framework. In order to describe the structure of these scalable bitstreams in XML, it was necessary to define a second set of fundamental extensions to the MPEG-21 BSDL specification. From our performance measurements, one can conclude that our extensions make it possible to obtain a constant memory consumption during the BSD generation, in contrast to the increasing memory usage observed when our extensions are not used by the BintoBSD Parser. Furthermore, by using our extensions, the BintoBSD Parser describes the scalable bitstreams in XML multiple times faster, thereby achieving a constant generation speed. Finally, it is noteworthy that both sets of extensions are adopted in the second amendment of the DIA standard.

Acknowledgments

The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.

References
[1] A. Perkis, Y. Abdeljaoued, C. Christopoulos, T. Ebrahimi, J.F. Chicharo, Universal multimedia access from wired and wireless systems, Circuits, Systems and Signal Processing - Special Issue on Multimedia Communications 20 (3–4) (2001) 387–402.
[2] S.-F. Chang, A. Vetro, Video adaptation: concepts, technology, and open issues, Proceedings of the IEEE 93 (1) (2005) 148–158.
[3] J.-R. Ohm, Advances in scalable video coding, Proceedings of the IEEE 93 (1) (2005) 42–56.
[4] T. Wiegand, G. Sullivan, J. Reichel, H. Schwarz, M. Wien, Scalable Video Coding - Joint Draft 6, Doc. JVT-S201.
[5] S. Devillers, C. Timmerer, J. Heuer, H. Hellwagner, Bitstream syntax description-based adaptation in streaming and constrained environments, IEEE Transactions on Multimedia 7 (3) (2005) 463–470.
[6] P. Cimprich et al., Streaming transformations for XML, version 1.0 working draft. Available from: <http://stx.sourceforge.net/documents/spec-stx-20040701.html>.
[7] A. Eleftheriadis, Flavor: a language for media representation, in: Proceedings of the fifth ACM international conference on Multimedia, Seattle, Washington, 1997, pp. 1–9. Available from: <http://flavor.sourceforge.net>.
[8] D. Hong, A. Eleftheriadis, XFlavor: Bridging bits and objects in media representation, in: Proceedings of IEEE International Conference on Multimedia and Expo (ICME), Lausanne, Switzerland, 2002, pp. 773–776. Available from: <http://flavor.sourceforge.net>.
[9] X. Sun, C.-S. Kim, C.-C. Jay Kuo, MPEG video markup language and its applications to robust video transmission, Journal of Visual Communication and Image Representation 16 (4–5) (2005) 589–620.
[10] ISO/IEC 14496-2:2004 Information technology - Coding of audiovisual objects - Part 2: Visual, 2004.
[11] D. Van Deursen, W. De Neve, D. De Schrijver, R. Van de Walle, BFlavor: an optimized XML-based framework for multimedia content customization, in: Proceedings of the 25th Picture Coding Symposium, Beijing, China, 2006, p. 6 (CD-ROM).
[12] G. Panis, A. Hutter, J. Heuer, H. Hellwagner, H. Kosch, C. Timmerer, S. Devillers, M. Amielh, Bitstream syntax description: a tool for multimedia resource adaptation within MPEG-21, Signal Processing: Image Communication 18 (8) (2003) 721–747.
[13] C. Timmerer, G. Panis, H. Kosch, J. Heuer, H. Hellwagner, A. Hutter, Coding format independent multimedia content adaptation using XML, in: Proceedings of SPIE International Symposium ITCom 2003 on Internet Multimedia Management Systems IV, vol. 5242, Orlando, FL, 2003, pp. 92–103.
[14] W. De Neve, D. Van Deursen, D. De Schrijver, S. Lerouge, K. De Wolf, R. Van de Walle, BFlavor: A harmonized approach to media resource adaptation, inspired by MPEG-21 BSDL and XFlavor, Signal Processing: Image Communication 21 (10) (2006) 862–889.
[15] ISO/IEC 21000-7:2004 Information technology - Multimedia framework (MPEG-21) - Part 7: Digital Item Adaptation, 2004.
[16] I.S. Burnett, F. Pereira, R. Van de Walle, R. Koenen, The MPEG-21 Book, Wiley, John & Sons, Inc, 2006.
[17] XML Schema part 0: Primer. Available from: <http://www.w3c.org/TR/xmlschema-0> (May 2001).
[18] ISO/IEC 15444-1:2004 Information technology - JPEG 2000 image coding system: Core coding system, 2004.
[19] D. Mukherjee, E. Delfosse, J.-G. Kim, Y. Wang, Optimal adaptation decision-taking for terminal and network quality-of-service, IEEE Transactions on Multimedia 7 (3) (2005) 454–462.
[20] M. Kay, XSLT Programmer’s Reference, second ed., Wrox Press Ltd, Birmingham, UK, 2001.
[21] O. Becker, Transforming XML on the fly, in: Proceedings of XML Europe, 2003. Available from: <http://www.idealliance.org/papers/dx_xmle03/index.html>.
[22] D. De Schrijver, C. Poppe, S. Lerouge, W. De Neve, R. Van de Walle, MPEG-21 bitstream syntax descriptions for scalable video codecs, Multimedia Systems 11 (5) (2006) 403–421.
[23] W. De Neve, S. Lerouge, P. Lambert, R. Van de Walle, A performance evaluation of MPEG-21 BSDL in the context of H.264/AVC, in: Proceedings of SPIE annual meeting 2004: Signal and Image Processing and Sensors, vol. 5558, Denver, USA, 2004, pp. 555–566.
[24] T. Zgaljic, N. Sprljan, E. Izquierdo, Bitstream syntax description based adaptation of scalable video, in: Proceedings of European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies (EWIMT), London, UK, 2005, pp. 173–178.
[25] ISO/IEC 21000-8:2006 Information technology - Multimedia framework (MPEG-21) - Part 8: Reference Software, 2006.
[26] ISO/IEC 21000-7:2004/FPDAM 2 Information technology - Multimedia framework (MPEG-21) - Part 7: Digital Item Adaptation, amendment 2: Dynamic and distributed adaptation, 2006.
[27] D. De Schrijver, W. De Neve, K. De Wolf, R. Van de Walle, Generating MPEG-21 BSDL descriptions using context-related attributes, in: Proceedings of the 7th IEEE International Symposium on Multimedia, Irvine, CA, 2005, pp. 79–86.
[28] Saxonica: XSLT and XQUERY Processing. Available from: <http://www.saxonica.com/>.
[29] The Apache Xalan Project. Available from: <http://xalan.apache.org/index.html>.
[30] ISO/IEC 14496-10:2005 Information technology - Coding of audiovisual objects - Part 10: Advanced Video Coding, 2005.
[31] J. Reichel, H. Schwarz, M. Wien, Joint Scalable Video Model JSVM6, Doc. JVT-S202.
[32] D. Tian, M.M. Hannuksela, M. Gabbouj, Sub-sequence video coding for improved temporal scalability, in: IEEE International Symposium on Circuits and Systems, 2005, vol. 6, Kobe, Japan, 2005, pp. 6074–6077.
[33] J. Ridge, Y. Bao, M. Karczewicz, X. Wang, Fine-grained scalability for H.264/AVC, in: Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, vol. 1, 2005, pp. 247–250.
[34] S.W. Golomb, Run-length encodings, IEEE Transactions on Information Theory 12 (3) (1966) 399–401.
[35] P. Lambert, D. De Schrijver, D. Van Deursen, W. De Neve, Y. Dhondt, R. Van de Walle, A real-time content adaptation framework for exploiting ROI scalability in H.264/AVC, in: Lecture Notes in Computer Science (8th international conference on Advanced Concepts for Intelligent Vision Systems), vol. 4179, Antwerp, Belgium, 2006, pp. 442–453.
[36] D. De Schrijver, W. De Neve, D. Van Deursen, J. De Cock, R. Van de Walle, On an evaluation of transformation languages in a fully XML-driven framework for video content adaptation, in: Proceedings of 2006 IEEE International Conference on Innovative Computing, Information and Control, vol. 3, Beijing, China, 2006, pp. 213–216.