
Analysis of Prediction Mode Decision in Spatial Enhancement Layers in H.264/AVC SVC
Koen De Wolf, Davy De Schrijver, Wesley De Neve, Saar De Zutter, Peter Lambert, and Rik Van de Walle
Ghent University - IBBT
Department of Electronics and Information Systems - Multimedia Lab
Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium
{koen.dewolf, davy.deschrijver, wesley.deneve, saar.dezutter, peter.lambert, rik.vandewalle}@ugent.be
http://multimedialab.elis.ugent.be/

Abstract. On top of the prediction modes defined in the H.264/AVC standard, Scalable Video Coding (SVC) defines prediction modes for inter-layer prediction. These inter-layer prediction modes allow the re-use of coded data from the base layer, at the cost of a larger search space at the encoder and, as a result, a longer encoding time. In this paper, we investigate the relation between the coding decisions taken in the base layer and those taken in the enhancement layer. Our tests have shown that a number of relations can be clearly identified. We have observed that the enhancement-layer macroblock co-located with a base layer macroblock coded in P_8x8 mode has approximately a 40 % chance of being coded in P_8x8 mode as well. Further, we have observed that the P_Skip mode is only used when the quantization parameter in the enhancement layer is high. For macroblocks coded in B_Skip mode, the co-located macroblock in the enhancement layer will be coded in B_Skip mode when the enhancement layer is highly quantized (probability of 63 % to 92 % for an enhancement-layer quantization parameter of 30). These observations can be used to construct a model for fast mode decision in SVC.

1 Introduction

H.264/AVC Scalable Video Coding (SVC), as proposed by the Joint Video Team (JVT), can be classified as a layered video specification based on the single-layer H.264/AVC standard [1], [2]. Additional Enhancement Layers (ELs) contain information pertaining to the embedded spatial and SNR enhancements. Single-layer prediction techniques similar to those of H.264/AVC are applied, in particular intra prediction and motion-compensated prediction. In addition, inter-layer prediction mechanisms have been developed to minimize the redundant information between the different layers. In the case of Rate-Distortion-Optimized (RDO) coding, the encoder complexity is significantly increased due to the large search space constructed by the numerous prediction modes incorporated in SVC. These prediction modes can be divided into two groups. The first group contains the prediction modes that originate from the single-layer H.264/AVC specification, i.e., the various intra and motion-compensated prediction modes. The second group contains the inter-layer prediction modes.

In this paper, we analyze and discuss the mode decisions of an SVC encoder. This analysis can be used to construct a model for fast mode decision. The use of such a model in an SVC encoder can significantly reduce the encoding time at the cost of a lower coding efficiency (i.e., a possibly higher bit rate together with a lower visual quality). A similar analysis was already performed by Li et al. [3] with version 2 of the reference software (dated 2005). However, several new coding tools have been added to the SVC specification in the meantime. Furthermore, in our analysis, we also take the Quantization Parameters (QPs) into account.

This paper is organized as follows. In Sect. 2, we provide a detailed outline of the prediction modes that are available in SVC in the context of spatial scalability. In Sect. 3, we analyze and discuss the relation between the coding decisions in the base layer and in the enhancement layer. Finally, conclusions are provided in Sect. 4.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 848-855, 2007. © Springer-Verlag Berlin Heidelberg 2007

2 SVC Prediction Modes

As mentioned in the introduction, SVC is a layered extension of the H.264/AVC video coding specification. The structure of a possible SVC encoder is shown in Fig. 1. In this figure, the original input video sequence is down-scaled in order to obtain the pictures for all the different spatial layers (resulting in spatial scalability). In each spatial layer, a motion-compensated pyramidal decomposition is performed, taking into account the characteristics of each layer (e.g., GOP structure, bit rate, ...). This temporal decomposition results in a motion vector field on the one hand and residual texture data on the other hand. This information is coded using techniques similar to those of H.264/AVC, extended with progressive SNR refinement features. Several methods that allow the reuse of coded information among different spatial resolution layers are under investigation by the JVT. In particular, the layered structure of SVC allows the reuse of motion vectors, residual data, and intra-texture information of lower spatial and SNR layers for the prediction of higher-layer pictures in order to reduce inter-layer redundancy. These inter-layer prediction methods are discussed in more detail in Sect. 2.2.

2.1 Intra-layer Prediction

Depending on the slice type, a macroblock (MB) in a slice can be coded using one of the following three prediction modes: intra prediction (I-macroblock), uni-directional prediction (P-macroblock), or bi-directional prediction (B-macroblock). Whereas the first prediction mode only uses the values of spatially neighboring samples of the block (intra-slice prediction), the latter two modes rely on Motion-Compensated (MC) prediction signals based on samples originating from preceding or subsequent pictures. The prediction error is obtained by subtracting the prediction signal from the original signal. Usually, this prediction error (or residual) is then transform-coded. In a next step, the transform coefficients are quantized and entropy coded.

[Figure: block diagram with a Base Layer Encoder and an Enhancement Layer Encoder, with scaling by a factor of 2 between the layers. Each encoder comprises ME, MC, intra prediction, transform (T), inverse quantization and inverse transform (Q-1, T-1), reference and reconstructed picture buffers, and Reorder & Entropy Encoding; both feed the SVC Bit Stream Multiplex. Dashed arrows A, B (MVR), C, and D mark the inter-layer prediction paths.]

Fig. 1. Possible SVC encoder structure that supports two spatial layers

Intra-Slice Prediction Modes. Three types of intra prediction are defined in the H.264/AVC specification: I_4x4, I_16x16, and I_PCM. In the first two types, the prediction signal (of 4x4 and 16x16 samples, respectively) is constructed by copying or interpolating the values of previously coded neighboring samples. Nine and four modes are defined for the I_4x4 and I_16x16 types, respectively. These modes stipulate the direction of the interpolation or of the copy of the signal¹. The I_PCM mode (Pulse Code Modulation) allows an encoder to bypass the prediction and transform coding processes and to transmit the sample values directly in the bitstream.

¹ Except for mode 2 (DC), in which the prediction signal is constructed by averaging the adjacent samples.

Motion-Compensated Prediction Modes. The concepts of MC prediction, as defined in H.264/AVC, are used in the SVC ELs as well. Hence, variable block-size motion compensation, quarter-sample motion vector accuracy, multiple reference pictures, and weighted prediction are supported by SVC. For an introduction to these tools, we refer to [2]. In the case of uni-directional coding, only one MC prediction block is used per partition. The translation of this block, commonly referred to as the motion vector, and the index of the picture in the reference list (list 0, L0) used for this prediction need to be coded. This reference list may contain pictures both before and after the current picture in display order. H.264/AVC allows variable block sizes of 16x16, 16x8, 8x16, and 8x8 (P_L0_16x16, P_L0_L0_16x8, P_L0_L0_8x16, and P_8x8, respectively). An 8x8 partition can be further divided into 8x4, 4x8, and 4x4 sub-partitions. For each of these partitions, both a motion vector and a picture reference index need to be coded, with the exception that all sub-partitions of an 8x8 partition must use the same reference picture, so that only one reference index is coded per 8x8 partition. An additional uni-directional prediction type is called P_Skip. For this type, no motion vector, reference index, or residual is coded. The prediction signal (16x16 samples) is constructed from the picture with index 0 in the reference list L0, and the motion vector is equal to the motion vector predictor² of that MB. In the case of bi-directional coding, at most two motion-compensated prediction signals can be used. These signals originate from pictures that are stored in two separate reference lists: list 0 (L0) and list 1 (L1). For B-macroblocks, four modes of MC prediction are defined in H.264/AVC: list 0, list 1, bi-predictive, and direct. In the case of the list 0 and list 1 modes, only one motion vector per partition is transmitted, together with the index of the picture in the appropriate reference list. For the bi-predictive mode, a weighted average of the MC prediction blocks is used to obtain the prediction signal.
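The P_Skip motion vector derivation mentioned above can be sketched as follows. This is a simplified illustration, assuming the component-wise median motion vector predictor of H.264/AVC for a 16x16 partition; the reference-index matching rules, the 16x8/8x16 special cases, and the above-left fallback of the standard are omitted, and the function names are our own.

```python
from typing import Optional, Tuple

MV = Tuple[int, int]  # (horizontal, vertical) in quarter-sample units


def median3(a: int, b: int, c: int) -> int:
    """Return the median of three integers."""
    return max(min(a, b), min(max(a, b), c))


def predict_mv(left: Optional[MV], above: Optional[MV], above_right: Optional[MV]) -> MV:
    """Simplified motion vector predictor for a 16x16 partition.

    None marks an unavailable neighbour. If only the left neighbour is
    available, its vector is used directly; otherwise unavailable
    neighbours count as the zero vector in the component-wise median.
    """
    if above is None and above_right is None and left is not None:
        return left
    a, b, c = (v if v is not None else (0, 0) for v in (left, above, above_right))
    return (median3(a[0], b[0], c[0]), median3(a[1], b[1], c[1]))


def p_skip_mv(left: Optional[MV], left_ref: int, above: Optional[MV], above_ref: int,
              above_right: Optional[MV]) -> MV:
    """Motion vector of a P_Skip macroblock (sketch).

    The zero vector is used when the left or upper neighbour is unavailable,
    or when either of them references picture 0 of L0 with a zero vector;
    otherwise the median predictor is used.
    """
    if left is None or above is None:
        return (0, 0)
    if (left_ref == 0 and left == (0, 0)) or (above_ref == 0 and above == (0, 0)):
        return (0, 0)
    return predict_mv(left, above, above_right)
```

For example, with neighbours (4, 0), (8, -4), and (2, 6), the predictor is (4, 0): the median is taken separately for each vector component.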
In the direct mode, no motion vectors or reference indices are transmitted; they are inferred either from neighboring blocks (spatial direct) or from the co-located block in a reference picture (temporal direct), so that the resulting prediction can be list 0, list 1, or bi-predictive. When an MB is coded in direct mode and no prediction error signal is coded, this mode is referred to as B_Skip mode. Together with the possible MB partitions, this results in 24 prediction modes.

2.2 Inter-layer Prediction

On top of the MC prediction and intra prediction modes defined in H.264/AVC, four prediction methods are defined that re-use coding information from a Base Layer (BL). This is called Inter-Layer Prediction (ILP). The BL of a particular EL is the layer that can be used for ILP; it is not necessarily the layer with the lowest quality or the lowest spatial resolution. In the first ILP method, BL motion information is re-used for efficient coding of the EL motion data. In this mode, commonly referred to as base layer mode, no additional motion vectors are transmitted for the EL. When the base layer is a down-scaled version of the current layer, the motion vectors and the MB partitioning mode are up-sampled accordingly. This is illustrated in Fig. 1 by the dashed arrow A. Also, the current MB uses the same reference indices as the corresponding 8x8 partitions of the base layer [4].
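For the dyadic case used later in this paper (QCIF base layer, CIF enhancement layer), the inference performed in base layer mode can be sketched as below. Each EL MB covers one 8x8 block of a BL MB. The sketch assumes that the BL motion of that 8x8 block is described by one reference index and one motion vector per (sub-)partition; it doubles the vectors and maps the partition shape to the twice-as-large EL partition. The class and function names are hypothetical, and cases such as intra-coded BL blocks are not handled.

```python
from dataclasses import dataclass
from typing import List, Tuple

MV = Tuple[int, int]  # quarter-sample units

# 8x8 base layer (sub-)partition shape -> EL macroblock partition after 2x up-sampling.
PARTITION_UPSAMPLING = {"8x8": "16x16", "8x4": "16x8", "4x8": "8x16", "4x4": "8x8"}


@dataclass
class BaseLayer8x8:
    """Motion data of one 8x8 block of a base layer MB (hypothetical structure)."""
    shape: str      # "8x8", "8x4", "4x8" or "4x4"
    ref_idx: int    # one reference index per 8x8 block
    mvs: List[MV]   # one motion vector per (sub-)partition


@dataclass
class InferredMB:
    """Motion data inferred for the enhancement layer MB in base layer mode."""
    partition: str
    ref_idx: int
    mvs: List[MV]


def colocated_8x8(el_mb_x: int, el_mb_y: int) -> Tuple[int, int, int]:
    """Return (BL MB x, BL MB y, 8x8 index in raster order) covered by an EL MB."""
    return el_mb_x // 2, el_mb_y // 2, (el_mb_x % 2) + 2 * (el_mb_y % 2)


def infer_base_layer_mode(bl: BaseLayer8x8) -> InferredMB:
    """Up-sample partitioning and motion vectors by 2; reuse the reference index."""
    return InferredMB(
        partition=PARTITION_UPSAMPLING[bl.shape],
        ref_idx=bl.ref_idx,
        mvs=[(2 * x, 2 * y) for x, y in bl.mvs],
    )
```

For instance, an 8x4-partitioned BL block with vectors (3, -1) and (0, 2) yields a 16x8-partitioned EL MB with vectors (6, -2) and (0, 4) and the same reference index.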
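² In H.264/AVC, the motion vector predictor is formed as the component-wise median of the motion vectors of the neighboring blocks to the left, above, and above-right (above-left when the above-right block is unavailable); for P_Skip, the zero vector is used instead in a few special cases.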


The second ILP mode is called quarter pel refinement mode and is an extension of the base layer mode. The motion vectors, the partitioning mode, and the reference indices of the current MB are derived from the corresponding sub-macroblock in the base layer. Additionally, a quarter pel Motion Vector Refinement (MVR) is transmitted for that particular block (dashed arrow B in Fig. 1), which allows the MC prediction to be improved.

The third ILP method is called residual prediction mode. Here, the residual information of MC-coded MBs in the base layer can be used for the prediction of the residual of the current layer. A flag indicating whether residual prediction is used is transmitted for each MB of the current EL. When residual prediction is applied, the base layer residuals of the corresponding MBs are block-wise up-sampled using a bi-linear filter with constant border extension. As a result, only the difference between the residual of the current layer, obtained after motion compensation (MC), and the up-sampled residual of the base layer is coded (dashed arrow C in Fig. 1).

The fourth ILP mode is inter-layer intra prediction (dashed arrow D in Fig. 1). In this mode, an intra-predicted MB of a slice in the base layer can be used for the prediction of the co-located block in the current EL. To this end, the up-sampled version of the decoded block is used. In SVC, the 6-tap filter used for half-sample interpolation, as defined in the H.264/AVC specification, is used for the interpolation of these decoded samples. The drawback of this approach is that a decoder needs to decode all the referenced intra-coded MBs in the base layer.

Note that the residual prediction mode can be combined with the base layer mode, the quarter pel refinement mode, and the inter-layer intra prediction mode. The combination of the base layer mode and the residual prediction mode is also called BL_Skip mode. A detailed analysis of the coding performance and time complexity of the ILP modes is given in [5].
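The two up-sampling operations described above can be illustrated in one dimension; in two dimensions such filters are typically applied separably, first horizontally and then vertically. The sketch below is a simplification under stated assumptions: the residual filter assumes the centred sample grid of dyadic down-sampling, which gives weights of 3/4 and 1/4, and the intra filter places the up-sampled samples on the integer and half-sample positions of the base layer grid. The exact sample phases and rounding are defined by the SVC specification and may differ; the function names are our own.

```python
from typing import List

# Half-sample interpolation filter of H.264/AVC.
H264_HALF_PEL_TAPS = (1, -5, 20, 20, -5, 1)


def _clamp_index(i: int, n: int) -> int:
    """Constant border extension: out-of-range indices reuse the edge sample."""
    return min(max(i, 0), n - 1)


def upsample_residual_bilinear(block: List[int]) -> List[int]:
    """2x bi-linear up-sampling of one row of a base layer residual block.

    Assumes output samples at offsets -1/4 and +1/4 around each input sample
    (weights 3/4 and 1/4). Only samples of the same block are used, which is
    what makes the up-sampling block-wise.
    """
    n = len(block)
    out: List[int] = []
    for k in range(n):
        prev_s = block[_clamp_index(k - 1, n)]
        next_s = block[_clamp_index(k + 1, n)]
        out.append((3 * block[k] + prev_s + 2) >> 2)
        out.append((3 * block[k] + next_s + 2) >> 2)
    return out


def upsample_intra_6tap(samples: List[int]) -> List[int]:
    """2x up-sampling of decoded 8-bit intra samples with the 6-tap filter.

    Even outputs copy the integer-position samples; odd outputs are the
    half-sample values between samples k and k+1, rounded and clipped.
    """
    n = len(samples)
    out: List[int] = []
    for k in range(n):
        out.append(samples[k])
        acc = sum(w * samples[_clamp_index(k - 2 + t, n)]
                  for t, w in enumerate(H264_HALF_PEL_TAPS))
        out.append(min(max((acc + 16) >> 5, 0), 255))
    return out
```

In residual prediction, the EL then codes, sample by sample, the difference between its own MC residual and the output of the residual up-sampling; clipping is applied only to the intra samples, because residuals may be negative.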

3 Prediction Mode Decision

The prediction mode decision is taken by minimizing the Lagrangian cost functional $J = D + \lambda \cdot R$ over all possible MB coding modes. Here, $D$ represents the distortion between the original and the reconstructed signal, $R$ is the bit rate needed for coding the motion vectors and residual data, and $\lambda$ is the Lagrangian multiplier, which depends on the chosen QP setting [6]. The value of this Lagrangian multiplier is layer-specific. The minimization of $J$ is a very time-consuming operation, as this cost function needs to be evaluated for all possible modes, motion vectors (within the search window), and reference pictures. Fast Mode Decision and Fast Motion Estimation algorithms have been developed in order to reduce the complexity of this decision process [7], [8], [9]. A side effect of such algorithms is a coding efficiency penalty, as the mode decision will be sub-optimal.
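A minimal sketch of this decision rule is given below. The relation between $\lambda$ and QP is an assumption on our part: it uses $\lambda = 0.85 \cdot 2^{(QP-12)/3}$, the relation commonly used for mode decision in the H.264/AVC reference encoder; the JSVM software may apply different, layer-specific factors. The distortion and rate values in the example are hypothetical.

```python
from typing import Dict, Tuple


def lagrangian_multiplier(qp: int) -> float:
    """Assumed lambda(QP) relation in the style of the H.264/AVC reference encoder."""
    return 0.85 * 2.0 ** ((qp - 12) / 3.0)


def choose_mode(candidates: Dict[str, Tuple[float, float]], qp: int) -> Tuple[str, float]:
    """Return the mode with minimal J = D + lambda * R, together with its cost.

    `candidates` maps a mode name to its (distortion D, rate R in bits).
    """
    lam = lagrangian_multiplier(qp)
    costs = {mode: d + lam * r for mode, (d, r) in candidates.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


# Hypothetical (D, R) values for one macroblock, for illustration only.
example = {"P_8x8": (1000.0, 120.0), "BL_Skip": (1400.0, 10.0), "I_4x4": (900.0, 300.0)}
print(choose_mode(example, 12))  # P_8x8 has the lowest cost at QP 12
print(choose_mode(example, 30))  # BL_Skip has the lowest cost at QP 30
```

With these made-up values kept fixed, the larger $\lambda$ at QP 30 penalizes rate so strongly that the low-rate BL_Skip mode wins, which illustrates why cheap modes become more attractive at high QPs.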


[Figure: four bar-chart panels for the Foreman sequence, plotting the selection probability (%) of each EL prediction mode (P_L0_16x16, P_L0_L0_16x8, P_L0_L0_8x16, P_8x8, BL_Skip, P_Skip, I_16x16, I_4x4) for the QP combinations (QP_BL, QP_EL) with QP_EL in {12, 18, 24, 30}.]

Fig. 2. Prediction mode decision in the spatial EL for the Foreman sequence when the co-located MB in the BL is coded in P_8x8 mode. Base layer QP = 12 (top-left), 18 (top-right), 24 (bottom-left), and 30 (bottom-right).

3.1 Test Configuration

The statistical analysis of the selected prediction modes is performed on five sequences: Crew, Foreman, Mobile & Calendar, Mother & Daughter, and Stefan. Two spatial layers are used: QCIF and CIF, both at 30 Hz. The GOP size is set to 16; intra-coded slices are inserted every 32 frames. Full-search motion estimation (ME) and all ILP modes are enabled. We coded all sequences with version 8 of the reference software [10] using 16 different QP combinations $(QP_{BL}, QP_{EL})$ with $QP_{BL}, QP_{EL} \in \{12, 18, 24, 30\}$.

3.2 Results and Discussion

Due to space limitations, we are unable to publish all results in detail. Therefore, we discuss two of the most frequently selected MB prediction modes in the BL, i.e., the P_8x8 mode for P-pictures and the B_Skip mode for B-pictures.

Co-located MB is Coded in P_8x8 Mode. This mode is used in the BL for about 30 % of the MBs in P-pictures. The graphs in Fig. 2 show that approximately 40 % of the co-located MBs in the EL are also coded in P_8x8 mode, except when the EL is highly quantized. In that case, the BL_Skip mode is selected most often (24-40 %). When the QP of the EL is low, the I_4x4 mode is the second most selected mode (22-25 %). For the tested configurations, the P_Skip mode is only used for higher QPs in the EL.

[Figure: four bar-chart panels for the Stefan sequence, plotting the selection probability (%) of each EL prediction mode (BL_Skip, B_Skip, B_Direct, B_L0_16x16, B_L1_16x16, B_Bi_16x16, B_Bi_Bi_16x8, B_Bi_Bi_8x16, B_8x8) for the QP combinations (QP_BL, QP_EL) with QP_EL in {12, 18, 24, 30}.]

Fig. 3. Prediction mode decision in the spatial EL for the Stefan sequence when the co-located MB in the BL is coded in B_Skip mode. Base layer QP = 12 (top-left), 18 (top-right), 24 (bottom-left), and 30 (bottom-right).

Co-located MB is Coded in B_Skip Mode. In Fig. 3, the prediction mode decisions (expressed in %) in the spatial EL for the Stefan sequence are shown for the case in which the co-located MB in the BL is coded in B_Skip mode. Regardless of the QP of the BL, for high QPs in the EL, the B_Skip mode is frequently selected (selection probability ranging from 39 % to 86 % for $QP_{EL} = 24$ and from 63 % to 92 % for $QP_{EL} = 30$). For low QPs in the EL, the B_Skip mode is seldom used. Instead, the selected prediction modes are more or less equally divided over BL_Skip, B_Direct, B_Bi_16x16, and B_8x8. We also observe that for less quantized MBs in the BL, the BL_Skip and B_Skip modes are selected more often in the EL. The same behavior is observed for the other sequences.
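As an illustration of how such observations could drive a fast mode decision, the sketch below estimates the conditional selection probabilities of the EL modes for each context (BL mode, $QP_{BL}$, $QP_{EL}$) from training statistics, and restricts the RDO search to the most probable modes until a chosen probability coverage is reached. This is our own hypothetical construction, not a model proposed or evaluated in this paper; the names, the coverage threshold, and the toy counts are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Context = Tuple[str, int, int]  # (BL mode, QP_BL, QP_EL)


def estimate_conditional_probs(
    observations: Iterable[Tuple[Context, str]],
) -> Dict[Context, Dict[str, float]]:
    """Estimate P(EL mode | context) from (context, chosen EL mode) pairs."""
    counts: Dict[Context, Counter] = defaultdict(Counter)
    for context, el_mode in observations:
        counts[context][el_mode] += 1
    probs: Dict[Context, Dict[str, float]] = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        probs[context] = {mode: n / total for mode, n in counter.items()}
    return probs


def candidate_modes(
    probs: Dict[Context, Dict[str, float]],
    context: Context,
    coverage: float = 0.9,
    fallback: Sequence[str] = (),
) -> List[str]:
    """Most probable EL modes whose cumulative probability reaches `coverage`.

    Contexts without statistics fall back to the given full candidate list.
    """
    table = probs.get(context)
    if not table:
        return list(fallback)
    ranked = sorted(table.items(), key=lambda item: item[1], reverse=True)
    selected: List[str] = []
    cumulative = 0.0
    for mode, p in ranked:
        selected.append(mode)
        cumulative += p
        if cumulative >= coverage - 1e-9:  # tolerance for floating-point sums
            break
    return selected


# Toy counts for illustration only (not measured data).
ctx = ("P_8x8", 12, 12)
toy = [(ctx, "P_8x8")] * 4 + [(ctx, "I_4x4")] * 3 + [(ctx, "BL_Skip")] * 2 + [(ctx, "P_Skip")]
probs = estimate_conditional_probs(toy)
print(candidate_modes(probs, ctx, coverage=0.9))  # ['P_8x8', 'I_4x4', 'BL_Skip']
```

Only the returned modes would then be passed to the Lagrangian cost evaluation; a lower coverage trades more encoding-time savings for a larger risk of missing the RD-optimal mode.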

4 Conclusion

SVC is an extension of H.264/AVC that provides spatial, temporal, and SNR scalability with a high compression efficiency. This compression efficiency is achieved in part by the large number of available coding modes. Exhaustive search techniques are used to select the best coding mode for each MB. These techniques achieve the highest possible coding efficiency, but at the cost of a higher computational complexity. We have analyzed the relation between the coding mode decisions made in the BL and in the EL. Our tests have shown that a number of relations can be clearly identified. We have observed that the EL MB co-located with a BL MB coded in P_8x8 mode has approximately a 40 % chance of being coded in P_8x8 mode as well. Moreover, we have noticed that the P_Skip mode is only used for high QPs in the EL. For MBs coded in B_Skip mode, the co-located MB in the EL is likely to be coded in B_Skip mode when the EL is highly quantized (63 % to 92 % for $QP_{EL} = 30$). The observations in this paper can be used to construct a model for fast mode decision. Such a model can be used to guide the mode decision algorithm of an SVC encoder, thereby reducing the overall encoding time.

Acknowledgements. The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.

References
1. ITU-T, ISO/IEC JTC 1: Advanced video coding for generic audiovisual services, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC (2003)
2. Wiegand, T., Sullivan, G., Bjøntegaard, G., Luthra, A.: Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13, 560-576 (2003)
3. Li, H., Li, Z., Wen, C., Chau, L.P.: Fast mode decision for spatial scalable video coding. In: IEEE International Symposium on Circuits and Systems (ISCAS) (2006)
4. Wiegand, T., Sullivan, G., Reichel, J., Schwarz, H., Wien, M. (eds.): Joint Scalable Video Model 8: Joint Draft 8 with proposed changes, Doc. JVT-U202. JVT (2006)
5. De Wolf, K., De Schrijver, D., De Zutter, S., Van de Walle, R.: Scalable Video Coding: Analysis and coding performance of inter-layer prediction. In: Proceedings of the 9th International Symposium on Signal Processing and its Applications, Dubai (U.A.E.), SuviSoft Oy Ltd, 4 (2007)
6. Sullivan, G., Baker, R.: Rate-distortion optimized motion compensation for video compression using fixed or variable size blocks. In: Proceedings of the IEEE Global Telecommunications Conference, Phoenix, AZ, vol. 3, pp. 85-90 (1991)
7. Pan, F., Lin, X., Rahardja, S., Lim, K., Li, Z., Wu, D., Wu, S.: Fast mode decision algorithm for intraprediction in H.264/AVC video coding. IEEE Transactions on Circuits and Systems for Video Technology 15, 813-822 (2005)
8. Dai, Q., Zhu, D., Ding, R.: Fast mode decision for inter prediction in H.264. In: Proceedings of International Conference on Image Processing (ICIP) (2004)
9. Lin, Z., Yu, H., Pan, F.: A scalable fast mode decision algorithm for H.264. In: IEEE International Symposium on Circuits and Systems (ISCAS) (2006)
10. Vieron, J., Wien, M., Schwarz, H. (eds.): Joint Scalable Video Model (JSVM) 8 software, Doc. JVT-Q203. JVT (2006)
