Comparison of Prediction Schemes With Motion Information Reuse For Low Complexity Spatial Scalability
Koen De Wolf, Robbie De Sutter, Wesley De Neve, and Rik Van de Walle
Ghent University - IBBT, Department of Electronics and Information Systems - Multimedia Lab, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
ABSTRACT
Three low complexity algorithms that allow spatial scalability in the context of video coding are presented in this paper. We discuss the feasibility of reusing motion and residual texture information of the base layer in the enhancement layer. The prediction errors that arise from the discussed filters and schemes are evaluated in terms of the Mean of Absolute Differences (MAD). For the interpolation of the decoded pictures of the base layer, the presented 6-tap and bicubic filters perform significantly better than the bilinear and nearest neighbor filters. In contrast, when reusing the motion vector field and the error pictures of the base layer, the bilinear filter performs best for the interpolation of residual texture information. In general, reusing the motion vector field and the error pictures of the base layer gives the lowest prediction errors. However, our tests showed that for some sequences with regions of complex motion activity, interpolating the decoded picture of the base layer gives the best result. This means that an encoder should compare all possible prediction schemes combined with all interpolation filters in order to achieve optimal prediction. Obviously, this would not be possible for real-time content creation.

Keywords: Scalable video coding, multi-layer motion prediction, spatial scalability, interpolation filters
1. INTRODUCTION
The Moving Picture Experts Group (MPEG, ISO/IEC JTC 1/SC 29/WG 11) and the Video Coding Experts Group (VCEG, ITU-T SG16) recently started exploring the field of Scalable Video Coding (SVC).1 The proposed technologies mainly focus on temporal, spatial, and quality scalability. At the moment, several indications exist that the high complexity of the algorithms might be a problem for (real-time) content creation.2-4 Motion estimation and compensation is considered the main temporal decorrelation technique for interpicture prediction in conventional video compression schemes. This technique is typically one of the most complex and time-consuming parts of a video encoder. In scenarios where live content is streamed in a scalable manner, it may not be possible to determine the ideal motion vectors for all spatial resolution levels due to the time complexity of the motion estimation process. In this paper, we present three different low complexity prediction schemes for which we discuss the possibility of reusing motion vectors and residual texture information. The latter are gathered at the lowest spatial resolution (generally called the base layer) for the prediction of higher resolution versions (enhancement layers). In these schemes, interpolation filters will be used for the prediction of high resolution pictures, both for the upsampling of the decoded pictures and for the upsampling of the prediction errors of the base layer. Therefore, an important issue will be the choice of interpolation filters. In particular, the question is whether an increase of the number of filter taps - and therefore also an increase of the complexity of the filter
Further author information: (Send correspondence to Koen De Wolf) Koen De Wolf: E-mail: koen.dewolf@ugent.be, Telephone: +32 9 331 49 57 Robbie De Sutter: E-mail: robbie.desutter@ugent.be, Telephone: +32 9 331 49 59 Wesley De Neve: E-mail: wesley.deneve@ugent.be, Telephone: +32 9 331 49 57 Rik Van de Walle: E-mail: rik.vandewalle@ugent.be, Telephone:+32 9 331 49 12
Visual Communications and Image Processing 2005, edited by Shipeng Li, Fernando Pereira, Heung-Yeung Shum, Andrew G. Tescher, Proc. of SPIE Vol. 5960 (SPIE, Bellingham, WA, 2005) 0277-786X/05/$15 doi: 10.1117/12.633358 Proc. of SPIE Vol. 5960 59605L-1
- will lead to a better prediction. In this paper, four interpolation filters with a different number of filter taps will be used, in combination with one decimation filter. The mentioned schemes and filters are all evaluated in the context of H.264/AVC. This state-of-the-art video coding standard achieves up to 50% bit rate savings for equivalent perceptual quality compared to prior standards.5 This video coding specification now also serves as a base for the development of several scalable video coding compression algorithms.6 This paper is organized as follows. Section 2 discusses some important issues concerning spatial scalability and describes the prediction schemes and the interpolation filters used for their evaluation. Section 3 discusses the conducted experiments and results. Final conclusions and some remarks are made in section 4.
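To make the tap-count trade-off concrete, the following one-dimensional 2x upsampling sketch contrasts a 1-tap (nearest neighbor), 2-tap (bilinear), and 6-tap interpolation. The helper names are our own; the 6-tap kernel (1, -5, 20, 20, -5, 1)/32 is the half-pel filter defined in H.264/AVC.

```python
def upsample2x_nearest(row):
    # Each low-resolution sample is simply repeated.
    out = []
    for s in row:
        out += [s, s]
    return out

def upsample2x_bilinear(row):
    # Even output positions copy the input; odd positions average
    # the two neighbouring input samples (edge sample clamped).
    out = []
    for i, s in enumerate(row):
        out.append(s)
        right = row[min(i + 1, len(row) - 1)]
        out.append((s + right) / 2)
    return out

def upsample2x_sixtap(row):
    # H.264/AVC half-pel kernel (1, -5, 20, 20, -5, 1) / 32,
    # with border samples clamped (edge extension).
    taps = (1, -5, 20, 20, -5, 1)
    n = len(row)
    out = []
    for i in range(n):
        out.append(row[i])
        acc = sum(t * row[max(0, min(n - 1, i - 2 + k))]
                  for k, t in enumerate(taps))
        out.append(acc / 32)
    return out
```

More taps approximate an ideal low-pass filter more closely, but every interpolated sample then costs proportionally more multiply-accumulate operations.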
[Figure: block diagram of the base layer encoder - decimation by 2, motion estimation (MV), memory, entropy coding, and interpolation towards the enhancement layer.]
In case a block of a picture in the base layer is intra predicted and therefore no motion information is available, the corresponding block in the enhancement layer can also be predicted using techniques such as intra macroblock prediction.5,9 The intra prediction mode of the block from the base layer may serve as an estimate of the best intra prediction mode for the associated block in the enhancement layer. In that case, no extra parameters regarding the applied intra prediction mode of that block need to be encoded for the enhancement layer.

The motion vectors of the base layer can be further refined in order to obtain better coding performance.6,10 However, in our schemes no further refinement is present. This means that the precision of the motion vectors remains the same for both spatial layers; i.e., for the luminance component, motion vectors have 1/4-pixel accuracy at the base layer resolution, which corresponds to 1/2-pixel accuracy at the enhancement layer resolution. Motion compensation in the enhancement layer is done on macroblock partitions that are doubled in size compared to the corresponding macroblock partitions of the base layer.

The difference between the original high resolution pictures and the predicted pictures (using one of the discussed schemes) can be coded using the same concepts, techniques, and partition modes as defined in the H.264/AVC specification:5,9 in particular, integer-valued transformations (DCT and Hadamard), quantization, and context adaptive entropy coding. Doing so allows the reuse of already available components at the encoder. To allow quality scalability, the obtained residual pictures can be coded using a Fine-Granular Scalability (FGS) based coding mechanism; this is not discussed in this paper. The effect on temporal scalability is tackled by Schwarz et al.11
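The motion information reuse described above can be sketched as follows; the function names are hypothetical and assume dyadic (factor 2) spatial scalability with vectors stored in quarter-pel units.

```python
def derive_enh_mv(base_mv):
    # base_mv is (mvx, mvy) in quarter-pel units at base resolution.
    # Doubling the picture size doubles the displacement in pixels,
    # so the reused vector is scaled by 2; the results are always
    # even quarter-pel values, i.e. half-pel accuracy at the
    # enhancement resolution when no refinement is performed.
    mvx, mvy = base_mv
    return (2 * mvx, 2 * mvy)

def derive_enh_partition(base_part):
    # A macroblock partition (width, height) in the base layer maps
    # to a partition doubled in each dimension in the enhancement layer.
    w, h = base_part
    return (2 * w, 2 * h)
```

Because no refinement step follows, deriving the enhancement-layer motion field is a constant-time lookup per partition rather than a new motion search.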
[Figure: block diagram of schemes 2 & 3 - enhancement layer: input video, transformation & quantization, inverse quantization & inverse transformation, entropy coding; base layer: decimation by 2, motion estimation, entropy coding.]
3.2. Evaluation
The Mean of Absolute Differences (MAD) between the obtained predicted pictures and the original spatial high resolution pictures is determined. These values are used for the evaluation of the prediction schemes. This gives us the average deviation per pixel between the predicted picture and the picture to be encoded. Furthermore, in Fig. 4 and 5, the differences between the measured MAD values are plotted in terms of percentages. The relative MAD deviation of MAD_A compared to MAD_B is given by
We define an I-picture in the context of H.264/AVC as a picture consisting entirely of intra predicted slices.
Table 1. Test sequences.
Stefan: Tennis player; camera is following the player; high motion from camera and subject; complex textures.
Foreman: Man presenting a construction yard; moderate movement from both the subject and the camera.
Mother & Daughter: Mother and daughter sitting in front of a camera; no camera movement.
Bus: Riding bus; camera is following the bus; homogeneous motion of camera (panning) and subject.
Crew: Astronauts followed by the camera; lots of flash lights.
Table 2. Test conditions.
Sequences: Bus, Crew, Foreman, Mother & Daughter, Stefan
Intra period: 32, 300
QP base layer: 16, 32
QP enhancement layer: 0, 12, 24, 36, 48
relative MAD deviation = (MAD_A - MAD_B) / MAD_B * 100.

This allows us to conclude how certain configurations perform relative to one another. The most interesting observations are given in the next section.
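As a minimal sketch of this evaluation metric (our own helper names, operating on flat lists of pixel values):

```python
def mad(pred, orig):
    # Mean of Absolute Differences: average per-pixel deviation
    # between the predicted picture and the picture to be encoded.
    assert len(pred) == len(orig)
    return sum(abs(p - o) for p, o in zip(pred, orig)) / len(pred)

def relative_mad_deviation(mad_a, mad_b):
    # Deviation of MAD_A relative to MAD_B, expressed in percent.
    return (mad_a - mad_b) / mad_b * 100
```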
3.3. Observations
3.3.1. Scheme 1
In Fig. 3, we see that there is only a small difference between the bilinear, the bicubic, and the 6-tap filter when using scheme 1. As could be expected, the nearest neighbor filter performs worst for all sequences, with MAD values up to 25% higher than those of the other three interpolation filters. Also note that the bicubic filter performs best on average for all tested sequences. This may be because 6 filter taps are too many for the interpolation of detailed textures, as can clearly be seen for the Mother & Daughter sequence (hair) and the Foreman sequence (construction yard).

3.3.2. Scheme 2
In Fig. 4, the deviation of the MAD compared with the bicubic filter is plotted for both the Stefan and Bus sequences. We have chosen the bicubic filter as a reference since it performs best of the four, as seen in the previous section. The impact of the interpolation filters is only visible in the first pictures following an I-picture; the deviation of the MAD stabilizes for the subsequent pictures. Although the bilinear filter performs second worst for the interpolation of I-pictures (see Fig. 3), its interpolation result combined with motion-compensated coding produces the best prediction for all other pictures (see Fig. 4). The fact that motion compensation is done on the decoded pictures of the enhancement layer results in rather small prediction errors. This is especially true for sequences with low to moderate motion activity.
3.3.3. Scheme 3
Adding the interpolated residual of the base layer - as described in scheme 3 - gives lower prediction errors for sequences with substantial motion activity. In Fig. 5, we plotted the results of the different interpolation filters for the error pictures relative to the MAD values of the bilinear filter. For all plotted configurations, I-pictures are interpolated using the bicubic filter for the same reason as explained in 3.3.2; the fixed quantization parameter of the base layer and the enhancement layer is 16 and 24 respectively. As can be seen in Fig. 5, the bilinear interpolation of the error picture gives the best result. The bicubic and 6-tap filters perform similarly. For some sequences (e.g., Mother & Daughter) the nearest neighbor filter performs second best, whilst for others (e.g., Bus and Crew) it performs worst. This is due to the fact that for the former sequence there is no camera movement, while for the latter two sequences the camera pans. This finding is confirmed by the results of the Foreman sequence. In the first half of the sequence there is hardly any camera movement; for this part the nearest neighbor filter performs well. In the second half of the sequence, when the camera starts to pan, the performance of the nearest neighbor filter drops. Overall, scheme 3 is particularly beneficial for sequences with high motion activity or abrupt changes in luminance, such as the flashing lights in the Crew sequence, as can be seen in Fig. 6. In Fig. 8, the predicted pictures using scheme 2 and scheme 3 are plotted next to the original picture during a flashlight. A distinct prediction improvement can be seen on the wall in the right picture. Adding the interpolated residual also softens block artifacts created by the motion compensation process.

3.3.4. Group of Pictures length
For scheme 1, the Group of Pictures (GOP) length has no influence on the quality of the prediction, as no motion compensated prediction is applied.
The occurrence of an I-picture for schemes 2 and 3 has an adverse impact on the first few predicted pictures immediately following that I-picture. This is due to the fact that I-pictures of the enhancement layer are predicted by interpolating the corresponding I-picture of the base layer. This effect disappears for the subsequent pictures because of the coding loop in the enhancement layer.

3.3.5. Quantization
The quantization parameter of the base layer has an obvious impact on the quality of the interpolated picture used as prediction in the enhancement layer. In the top left graph of Fig. 7 we see that for the Foreman sequence, using 32 as fixed quantization parameter instead of 16 increases the MAD by about 33%. Lower quantization for the coding of the error picture in the enhancement layer results in more accurate motion compensated prediction. Although PSNR values of the decoded pictures in the enhancement layer may be higher for lower quantization parameters, MAD values do not decrease proportionally, as can clearly be seen in Fig. 7. For instance, the difference between QP=0 and QP=12 can hardly be seen, and the difference between QP=12 and QP=24 is less than 1. On the other hand, for higher quantization the MAD seems to accumulate (QP=36, QP=48). This may be caused by the fact that for such high quantization values, the MAD of the decoded high resolution picture is higher than the MAD of the predicted picture.

3.3.6. Context adaptive interpolation
If we look at the left graph in Fig. 6, we see that for a sequence with low motion activity (e.g., Mother & Daughter), schemes 2 and 3 outperform scheme 1. As can be seen in the graph on the right, for some sequences (e.g., Foreman), scheme 1 performs better in regions with complex motion activity.
For this graph we used the bicubic filter for the interpolation of the decoded pictures and the bilinear filter for the interpolation of the error pictures of the base layer as these filters tend to give best results (see sect. 3.3.1 and 3.3.3).
A GOP groups I-, B- and P-pictures into a specified sequence in order to reduce the temporal redundancy. A GOP typically starts or ends with an I-picture. A group can be made of different lengths to suit the type of video being encoded. Note that in our experiments no B-pictures were used.
Ideally, an encoder should compare all possible prediction schemes combined with all interpolation filters in order to achieve optimal prediction. This, of course, has a major impact on the complexity of the encoder: using the discussed filters and schemes, 36 configurations would already have to be evaluated for every macroblock. Obviously, this would not be possible for real-time content creation.

3.3.7. Joint Scalable Video Model
As already mentioned above, MPEG and VCEG are developing a scalable extension on top of the state-of-the-art H.264/AVC standard. In this specification, motion vectors can be refined for prediction at higher spatial resolutions. This allows better coding performance, but at the cost of a higher complexity, and results in more modes to be evaluated.
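The exhaustive per-macroblock comparison discussed above can be sketched as a brute-force search; `predict` and `mad_fn` are hypothetical callables standing in for the actual prediction process and error measurement.

```python
def best_configuration(configs, predict, mad_fn):
    # Brute-force mode decision sketch: evaluate every
    # (prediction scheme, interpolation filter) configuration and
    # keep the one whose prediction has the lowest MAD.
    best_cfg, best_mad = None, float("inf")
    for cfg in configs:
        prediction = predict(cfg)   # hypothetical prediction for this config
        m = mad_fn(prediction)      # MAD against the original macroblock
        if m < best_mad:
            best_cfg, best_mad = cfg, m
    return best_cfg, best_mad
```

The cost is linear in the number of configurations per macroblock, which is exactly why evaluating all of them is impractical for real-time content creation.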
Figure 3. Scheme 1; QP base layer=16; Intra period=300; Sequences: Mother & Daughter (top left), Bus (top right), Foreman (bottom left) and Crew (bottom right).
Figure 4. Scheme 2; QP base layer=32; QP enhancement layer=36; Intra period=32; Sequences: Stefan (left) and Bus (right).
Figure 5. Scheme 3; QP base layer=16; QP enhancement layer=24; Intra period=300; Sequences: Mother & Daughter (top left), Bus (top right), Foreman (bottom left) and Crew (bottom right).
Figure 6. Comparison of schemes; QP base layer=32; QP enhancement layer=24; Intra period=300; Interpolation filters: bicubic (decoded picture) & bilinear (error picture); Sequences: Mother & Daughter (left) and Stefan (right).
Figure 7. Influence of the quantization parameter on the accuracy of the prediction. Intra period=300; Sequence: Foreman; interpolation decoded pictures: bicubic; interpolation error pictures: bilinear; Scheme 1 (top left), Scheme 2 (top right), Scheme 3 (bottom).
Figure 8. Crew sequence picture 2. Original (left), predicted by scheme 2 (center) and by scheme 3 (right).
ACKNOWLEDGMENTS
The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.
REFERENCES
1. ISO/IEC JTC1/SC29/WG11, "Applications and requirements for scalable video coding." ISO/IEC JTC1/SC29/WG11 N6880, January 2005.
2. F. Wu, S. Li, R. Yan, X. Sun, and Y.-Q. Zhang, "Efficient and universal scalable video coding," in IEEE International Conference on Image Processing (ICIP), 2, pp. 37-40, 2002.
3. G. Landge, M. van der Schaar, and V. Akella, "Complexity analysis of scalable motion-compensated wavelet video decoders," in Applications of Digital Image Processing XXVII, A. G. Tescher, ed., Proc. SPIE 5558, pp. 444-453, 2004.
4. S. Saponara, C. Blanch, K. Denolf, and J. Bormans, "The JVT advanced video coding standard: Complexity and performance analysis on a tool-by-tool basis," in IEEE Workshop Packet Video (PV'03), 2003.
5. T. Wiegand, G. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits and Systems for Video Technology 13, pp. 560-576, 2003.
6. J. Reichel, M. Wien, and H. Schwarz, eds., Joint Scalable Video Model JSVM 1, Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, 2005.
7. G. Van der Auwera, A. Munteanu, P. Schelkens, and J. Cornelis, "Bottom-up motion compensated prediction in the wavelet domain for spatially scalable video coding," IEE Electronics Letters 38(21), pp. 1251-1253, 2002.
8. M. Mrak, G. Abhayaratne, and E. Izquierdo, "Scalable generation and coding of motion vectors for highly scalable video coding," in Picture Coding Symposium 2004 (PCS-04), 2004.
9. JVT, "Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)." JVT-G050r1, May 2003.
10. M. Mrak, G. Abhayaratne, and E. Izquierdo, "On the influence of motion vector precision limiting in scalable video coding," in International Conference on Signal Processing (ICSP'04), Proc. ICSP 2, pp. 1143-1146, 2004.
11. H. Schwarz, D. Marpe, and T. Wiegand, "MCTF and scalability extension of H.264/AVC," in Picture Coding Symposium 2004 (PCS-04), 2004.
12. G. Bjøntegaard, "Motion compensation with 1/4 pixel accuracy." ITU-T SG16/Q15, February 2000.
13. K. De Wolf, Y. Dhondt, J. De Cock, and R. Van de Walle, "Complexity analysis of interpolation filters for scalable video coding," in To appear in: Proceedings of Euromedia 2005, 2005.
14. H. Schwarz, D. Marpe, and T. Wiegand, "Scalable extension of H.264/AVC," ISO/IEC JTC 1/SC 29/WG11 MPEG2004/M10569/S03, 2004.