
High Efficiency Video Coding (HEVC): Challenges & Benefits


SANJEEV VERMA
Principal Systems Engineer, Aricent

www.aricent.com

Display technologies play a crucial role in defining the user experience across a whole
range of devices, from big-screen TVs to small handheld devices such as mobile phones. To
satisfy the ever-increasing demand for a better visual experience, display technologies are
continually evolving. In a relatively short span of time we have gone from High Definition
(HD) to full HD, and now most devices offer Ultra HD (UHD) as well. Delivering heavy UHD
content over legacy carriers is a huge challenge and demands a much more efficient video
compression technology. The High Efficiency Video Coding (HEVC) standard addresses this
problem and makes it practical to deliver high-resolution content smoothly even over a
low-bandwidth connection.

This whitepaper discusses how HEVC adoption helps save bandwidth and enables the
distribution of UHD content. It discusses how an end user benefits from HEVC adoption
through higher resolution, smoother playback and higher bit-depth video. It also provides
insights into HEVC industry trends and the challenges involved in migrating to HEVC on
currently available hardware platforms.

The paper also details the additional complexity introduced by the H.265 standard
and the challenges involved in implementing the associated toolsets. It proposes a
GPU-accelerated HEVC decoder for improved battery life and discusses a hybrid
multithreading approach for better load balancing between CPU cores. Finally, it
touches upon profiling techniques for identifying hot spots in the code and the cache
memory considerations to keep in mind while architecting video software for
improved performance.

HEVC Definition and Differentiator

The Joint Collaborative Team on Video Coding (JCT-VC) released the final publication of
the High Efficiency Video Coding (HEVC) standard worldwide in Q4 2013. HEVC is a video
coding standard that provides much better quality (at the same bit-rate) than its
predecessor H.264 and enables a multimedia experience that goes beyond High Definition:
Ultra-High Definition!

The native parallel tools introduced in the standard, Tiles and Wavefront Parallel
Processing (WPP), make it a multi-core friendly codec. More exhaustive prediction modes,
a hierarchical block partitioning strategy, and improved post processing are a few of the
key enhancements that enable HEVC to deliver the quality required by the UHDTV revolution.
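
To see why WPP makes the codec multi-core friendly, consider the following minimal Python
sketch. It assumes the usual wavefront dependency, in which a block (a Coding Tree Unit, or
CTU) can start once its left neighbor and the block above and to its right are finished,
so that each row trails the row above it by two blocks. The function name
wavefront_schedule and the lag parameter are illustrative only; they are not taken from
the standard text or from any Aricent implementation.

```python
def wavefront_schedule(rows, cols, lag=2):
    """Group CTUs into time steps for a wavefront-style schedule.

    Assumption (illustrative): CTU (r, c) may start once CTU (r, c - 1) and
    CTU (r - 1, c + 1) are done, which gives the earliest step c + lag * r.
    """
    steps = {}
    for r in range(rows):
        for c in range(cols):
            t = c + lag * r  # earliest time step at which this CTU can run
            steps.setdefault(t, []).append((r, c))
    # Return the steps in time order; CTUs within a step can run in parallel.
    return [steps[t] for t in sorted(steps)]


if __name__ == "__main__":
    for t, ctus in enumerate(wavefront_schedule(3, 4)):
        print(t, ctus)
```

For a picture of 3 rows by 4 columns of CTUs, the 12 CTUs finish in 8 steps instead of 12,
and the gain grows with the number of rows available to keep additional cores busy.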


Benefits of HEVC

The higher compression offered by HEVC has opened the door to seamlessly streaming full HD
content at 60/120 frames per second (fps) over channels that were originally provisioned
for full HD at 30 fps. HEVC is a boon for online media hubs, IPTV companies, broadcasters
and other network operators, as it enables them to deliver a compelling user experience
even over low-speed broadband connections.

Enables Adaptive Streaming

Internet speed fluctuations, variations in the content bitrate and momentary increases in
the (computational) complexity of the video can cause undesirable frame drops or
re-buffering during streaming. Adaptive streaming is a technology that lets the player
switch between versions of the content encoded at various bit-rates, in accordance with
the available bandwidth or CPU speed. MPEG-DASH, Microsoft Smooth Streaming (MSS), and
Apple's HTTP Live Streaming (HLS) are a few of the leading technologies that address
frame-drop issues and provide smooth playback on the user's device by switching to the
right version of the content.

A solution incorporating both MPEG-DASH and HEVC can leverage HEVC to encode the content
with a very high compression ratio (even at low bitrates) and utilize MPEG-DASH for
adaptive streaming, thus delivering unprecedented quality of experience to end users.

Enables UHDTV Broadcasting

Online streaming is not the only beneficiary; satellite television will also benefit
greatly from HEVC adoption. Leading DTH (direct-to-home) service providers are planning to
upgrade their content, which will then be delivered using HEVC. The DTH ecosystem is
laying the foundation for UHD content delivery so that UHDTV broadcasting can become
mainstream by 2016. Ultra-HD enabled televisions are already being manufactured by Sony,
Samsung, LG and other consumer electronics leaders. NHK (Nippon Hōsō Kyōkai), a Japanese
public broadcaster, is preparing to broadcast UHD content in Japan in the near future. In
fact, NHK recently announced an 8K sensor that is capable of shooting at 120 fps.

Better Quality

A video standard is said to be more efficient if it achieves a better Peak Signal to Noise
Ratio (PSNR), that is, loses less quality over the encode-decode cycle, for a given
bit-rate. Fig. 1 plots the average PSNR obtained for the HEVC and H.264 codecs across
bitrates of 2 to 6 Mbps. HEVC consistently leads H.264 and delivers better PSNR at all of
these bitrates. Experiments reveal that HEVC is able to save almost 40 to 50 percent of
the bit-rate for most standard content scenarios and hence opens the door to 4K video
streaming on current networks.
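
As an illustration of the metric, PSNR is commonly computed as 10 * log10(MAX^2 / MSE),
where MSE is the mean squared error between the original and decoded samples and MAX is
the largest possible sample value (255 for 8-bit video). The sketch below is a minimal
Python version that works on flat lists of sample values. The function name and the
sample data are hypothetical; this is not the measurement code used to produce Fig. 1.

```python
import math


def psnr(reference, decoded, max_value=255):
    """Return the PSNR in dB between two equally sized sample sequences."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((a - b) ** 2 for a, b in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: no distortion at all
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [100, 110, 120, 130]   # hypothetical 8-bit luma samples
    decoded = [101, 109, 121, 129]    # each sample is off by one
    print(round(psnr(original, decoded), 2))
```

A higher value means the decoded picture is closer to the original, so at equal bitrate
the codec with the higher PSNR is the more efficient one.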

Fig. 1: PSNR comparison, H.264 vs. HEVC (for Aricent-generated high-motion content)

Smoother Playback

Frequent re-buffering and jerky playback caused by insufficient bandwidth are very
annoying and reduce the quality of the user experience. As a result, a huge number of
people still prefer to watch downloaded content rather than watching it online. HEVC can
change this scenario by reducing the channel traffic by up to 50 percent. The bandwidth
freed up in this way can be used to avoid re-buffering and give the user a smooth,
uninterrupted playback experience.

HEVC Adoption Trend

Online streaming is fast becoming the preferred medium for watching video. In fact, more
and more people these days watch movies, TV programs, etc. on YouTube rather than on their
TV sets.


The paid viewership is also increasing by the day, leading to a steep increase in
consumers' average spending on online video streaming.

According to statistics published by YouTube, "Over 6 billion hours of video are watched
each month on YouTube, that's almost an hour for every person on Earth, and 50% more than
last year." Around 100 hours of video content is uploaded to YouTube every minute. Given
this scenario, content providers and aggregators must deliver content at a lower cost
while improving the quality of the video.

HEVC would play a significant role in further bringing down the cost of online streaming.
With the current infrastructure, whatever a user spends on video streaming can be cut by
about 50% by using HEVC, because HEVC provides about 50% more compression than legacy
technologies. Alternatively, by deploying HEVC, the quality of the content can be upgraded
without any extra load on the channels, and users can enjoy enhanced quality at the same
cost. Using HEVC on 3G/4G networks is certainly going to reduce the cost for mobile users
and would encourage more video viewing over mobile networks. In fact, Vodafone is already
marketing itself as "a network for 24x7 streaming" with regard to e-learning and online
video viewership.

Challenges with High Efficiency Video Coding (HEVC)

Computational needs in video coding have increased drastically since the Joint
Collaborative Team on Video Coding (JCT-VC) announced the HEVC standard for video
compression. While the higher compression offered by HEVC provides better quality, it also
creates the need for equally efficient platforms and implementations that can handle the
increased complexity brought by the standard. The sections below discuss the added
complexity of various HEVC modules compared to the H.264 standard.

Increased Complexity in Intra-prediction

Intra (or IDR) frames act as key frames in the video coding process, and hence the
prediction accuracy of intra frames plays a vital role in deciding the overall quality of
the video. An intra frame acts as the initial reference frame for the other P or B
predicted frames within a Group of Pictures (GOP). If there is a significant loss of
quality in the intra-prediction process of a key frame, it can propagate in a massive way
to the rest of the non-I frames until the next I frame arrives. Keeping this in mind, the
HEVC standard defines a total of 35 different modes, while H.264 used a maximum of 9 modes
for block-based intra-prediction. Searching in additional directions provides better
quality, but at the same time the computational complexity increases multifold. Intra
smoothing is another feature that brings further complexity to key-frame processing.

Flexible Block Partitioning

H.264 divides the frame uniformly into processing units of size 16x16 called macroblocks.
Macroblocks can be further divided into smaller blocks of size 8x8 or 4x4 for prediction
purposes. H.265 has a much more complex image partitioning method and replaces macroblocks
with a concept called the Coding Tree Unit (CTU), which allows quad-tree recursion based
block partitioning. A frame is divided into CTUs, which can be of size 64x64, 32x32 or
16x16. A CTU may be split recursively into four parts called Coding Units (CUs), all the
way down to 8x8. Fig. 2 depicts the quad-tree recursion based partitioning of a CTU
pictorially. Each CU can be further divided into Prediction Units (PUs) in a symmetrical
or asymmetrical way, as shown in Fig. 3.

Fig. 2: Quad Tree based recursion within a Coding Tree Unit (CTU) (CU sizes shown: 32x32,
16x16 and 8x8)

Fig. 3: Coding Unit (CU) Splits - Symmetrical and Asymmetrical (partition modes shown:
2Nx2N, 2NxN, Nx2N, NxN, 2NxnU, 2NxnD, nLx2N and nRx2N)
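
The quad-tree recursion described above can be sketched in a few lines of Python. This is
a minimal illustration, not an encoder: the should_split callback is a hypothetical
stand-in for the rate-distortion decision a real encoder would make, and the 64x64 CTU and
8x8 minimum CU sizes are the ones quoted in the text.

```python
def split_ctu(x, y, size, should_split, min_size=8):
    """Recursively partition a square block and return the leaf CUs.

    Each leaf is a tuple (x, y, size). should_split(x, y, size) is a
    hypothetical decision function supplied by the caller.
    """
    if size > min_size and should_split(x, y, size):
        half = size // 2
        leaves = []
        # Visit the four quadrants in raster order: top-left, top-right,
        # bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(
                    split_ctu(x + dx, y + dy, half, should_split, min_size)
                )
        return leaves
    return [(x, y, size)]


if __name__ == "__main__":
    # Example policy: keep splitting only the block anchored at the origin.
    cus = split_ctu(0, 0, 64, lambda x, y, size: x == 0 and y == 0)
    for cu in cus:
        print(cu)
```

With this example policy the 64x64 CTU ends up as four 8x8, three 16x16 and three 32x32
CUs, which together still cover the full 64x64 area. The many data-dependent branches in
such a recursion are the block-level condition checks that make the scheme costly on
deeply pipelined processors.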

More versatile block sizes mean more complex motion estimation search algorithms in HEVC,
which require more computational power. The dynamically changing CU split structure
introduces many condition checks at the block level, which may not be straightforward to
implement efficiently on deeply pipelined architectures such as ARMv7/v8.

Inter-prediction complexity has been increased in HEVC by using 8-tap interpolation
filters, while H.264 used a maximum of 6 taps. Chroma interpolation uses a 4-tap filter,
compared to the bilinear filter in H.264. Additionally, the motion vector prediction
module becomes more computationally intensive with the introduction of merge and skip
modes, as explained in [8].

Variable Size Block Transform

The HEVC standard supports 4x4, 8x8, 16x16 and 32x32 sizes for the block transform, while
H.264 supports a uniform transform block size of 4x4 for the main profile. Having a
versatile transform size methodology provides better compression, but at the same time
performing the transform on bigger blocks becomes more complex from a Single Instruction,
Multiple Data (SIMD) instruction and data cache perspective. Increased precision for the
coefficients in the transform matrix further adds to the complexity of the overall
transformation process. Fig. 4 captures how the transform unit (TU) size varies across an
HEVC frame.

Fig. 4: TU Split variation in HEVC

Additional Post Processing (Sample Adaptive Offset)

Sample Adaptive Offset (SAO) is a toolset that has been added in HEVC after the
de-blocking stage. It improves the PSNR by reducing ringing-related distortions and also
enhances the visual quality of the video. In SAO, an offset is added to a pixel sample
based on the SAO parameters signaled for a particular CTU. The offset also depends on
neighboring pixel values and the direction indicated in the SAO parameters. While this
brings additional computational complexity to the codec implementation, it also introduces
neighboring dependencies, making it challenging to implement on a parallel architecture
like a GPU.

Addressing HEVC Challenges through Aricent's Offerings

Leading processor makers such as ARM, Intel and AMD have been continuously striving to
deliver faster yet low-power platforms to meet the computational needs of the ever-growing
multimedia market. Single Instruction, Multiple Data (SIMD) Neon technology combined with
a load-store architecture, present in ARMv7-based processors (ARM Cortex-A8, A9, A15
etc.), enables parallel processing at the instruction level, where 128-bit wide vectors
can be operated upon in a single instruction. This means the Neon co-processor can operate
on either sixteen 8-bit elements or eight 16-bit elements in parallel for an arithmetic,
logical or memory load/store operation. Similarly, Intel's SIMD instruction set extensions
such as SSE4, AVX and AVX2 offer varied forms of parallel processing capabilities that
leverage the SIMD architecture and deliver the performance needed by HEVC.

With current silicon technology it may not be possible to increase the CPU clock beyond a
certain point due to thermal issues. However, chip makers have recently launched
heterogeneous Systems on Chip (SoCs) with multiple processing units that can deliver the
compute performance demanded by video algorithms. The Samsung Exynos, NVIDIA Tegra and
Qualcomm Snapdragon chipset series, to name just a few, are powered by the ARMv7
architecture and incorporate multiple CPU cores (running as high as 2.5 GHz) along with
GPU compute capability. No doubt these platforms provide greater computational power to
video software makers, but at the same time programmers need to design and architect their
software in a parallel way to extract the maximum performance out of multi-core systems.

Leveraging GPU Compute

The ARM Mali Graphics Processing Unit (GPU) T6xx, equipped with 128-bit SIMD capabilities
and parallel computing technology, is now being leveraged by video algorithm developers at
Aricent to develop codec solutions with low power consumption and improved performance
targeting Ultra HD resolution. OpenCL APIs exposed by the Mali GPU facilitate quicker
implementation of video algorithms, which saves time-to-market for new

products. By offloading certain modules of the HEVC video decoder to the GPU, not only is
decoding made faster, but a lot of power is also saved that would otherwise have been
consumed by the CPU, as GPUs are highly power efficient compared to CPUs.

Effective CPU loading with Hybrid Multithreading

Parallel computing is becoming commonplace and most performance-critical software is being
ported to take advantage of multi-core architectures. Load balancing can become a
bottleneck if the software has not been suitably architected. Aricent proposed [10] a
hybrid design approach that combines functional and spatial techniques of multithreading
and effectively leverages a multi-core architecture to develop highly efficient video
software in various content scenarios. By using the hybrid multithreading approach,
Aricent is able to develop an HEVC decoder that is capable of delivering up to 90 frames
per second at full HD (1920x1080) resolution on a quad-core A15-based ARM platform. The
hybrid approach also gave better results when optimizing the HEVC decoder software on the
Intel Core i5 architecture, showing improved numbers for most of the content when compared
to conventional multithreading techniques.

Identifying Hot Spots and Software Profiling

Identifying performance-critical functions in software is an important step in the
optimization cycle. Typically, 20% of the software runs 80% of the time and needs to be
optimized for performance. This is done by using profiling tools such as the GNU profiler
gprof, DS-5 by ARM and CodeXL by AMD, to name a few. Profiling and optimization is an
iterative process that is followed until the desired performance is achieved. Once
performance-critical functions are identified, they are coded in assembly language to get
the best performance. When used in conjunction with SIMD instructions, manually coded
assembly functions perform 4 to 5 times faster than compiler-optimized functions on most
platforms.

Cache Friendly Memory Access

Rearranging the data structures and modifying memory access patterns to suit the cache
architecture is yet another important step in the optimization process. The code flow
needs to be worked out based on the available cache memory and the number of cache levels;
for example, in HEVC a block-based decode pipeline is more cache friendly than frame-based
decoding. If the data cache is relatively big, one can choose to process a few blocks or a
row at a time to gain additional performance from the code cache. In all scenarios, memory
access patterns that allow consecutive memory access are recommended for CPU-based
platforms. However, for an architecture like the AMD Radeon GPU, memory bank conflicts
[11] need to be taken care of when deciding the memory access pattern. One may need to
study the cache allocation and eviction policy to plan the data flow of the software.

Aricent HEVC Software Enabler

Aricent offers highly optimized HEVC software codecs that are deployed on various
operating systems such as Android, iOS, Linux and Windows Phone, on both ARM and Intel
based devices. The codecs are fully compliant with the HEVC standard and support full HD
(1920x1080) and UHD resolutions including 2K and 4K. The software solutions have been
highly optimized to achieve peak performance on various SoCs like Qualcomm Snapdragon,
Samsung Exynos, Apple A6 and other next-generation chipsets, and support GPGPU offloading
for better battery life. The HEVC decoder solution also enables multi-screen support for
the varying resolutions of various consumer devices. [...] ideal for early adoption. The
platform agnostic codecs can be [...]

Conclusion

UHDTV broadcasting will become mainstream very soon, and HEVC will play a vital role in
delivering the compression required to complement the technology. VP9 is emerging as a
competing technology to HEVC and has the advantage of being a license-free codec.
Nevertheless, due to its better compression efficiency, wider color space/format coverage,
and origin in a more reliable standards body, HEVC will remain a leading technology for
video compression in this decade.

References

1. Bingbing Xia, Fei Qiao, Huazhong Yang and Hui Wang, "An efficient methodology for
transaction-level design of multi-core H.264 video decoder", Consumer Electronics (ICCE),
2011 IEEE International Conference, Jan. 2011
2. Kue-Hwan Sihn, Hyunki Baik, Jong-Tae Kim, Sehyun Bae and Hyo Jung Song, "Novel
approaches to parallel H.264 decoder on symmetric multicore systems", Acoustics, Speech
and Signal Processing, 2009. ICASSP 2009. IEEE International Conference, Apr. 2009


3. Nishihara, K., Hatabu, A. and Moriyoshi, T., "Parallelization of H.264 video decoder
for embedded multicore processor", Multimedia and Expo, 2008 IEEE International
Conference, Apr. 2008
4. Falcao, G., Sousa, L., and Silva, V., "Massively LDPC Decoding on Multicore
Architectures", Parallel and Distributed Systems, IEEE Transactions, Feb. 2011
5. Ngai-Man Cheung, Xiaopeng Fan, Au, O.C. and Man-Cheung Kung, "Video Coding on Multicore
Graphics Processors", Signal Processing Magazine, IEEE, Issue 2, Mar. 2010
6. Yun-il Kim, Jong-Tae Kim, Sehyun Bae, Hyunki Baik and Hyo Jung Song, "H.264/AVC decoder
parallelization and optimization on asymmetric multicore platform using dynamic load
balancing", Multimedia and Expo, 2008 IEEE International Conference, June 23 2008-April 26
2008
7. ARM Limited, "Cortex-A15 Revision: r2p0, Technical Reference Manual",
http://infocenter.arm.com, Sept 2011
8. ITU-T, "Recommendation ITU-T H.265", www.itu.int, Apr. 2013
9. Sanjeev Verma, "Enabling GPU Compute on an ARM Mali-T600 GPU creates a power efficient
HEVC decode solution", http://goo.gl/PxmuWS, Feb 2014
10. Sanjeev Verma, "Parallel Computing: Architecting video software for multi-core
heterogeneous platforms", http://goo.gl/nTWj3B, Jul 2014
11. AMD, "AMD Accelerated Parallel Processing OpenCL Programming Guide",
http://goo.gl/te0mB8, Jul 2014


Engineering excellence. Sourced.
Aricent is the world's #1 pure-play product engineering services and software firm. The
company has 20-plus years of experience co-creating ambitious products with the leading
networking, telecom, software, semiconductor, Internet and industrial companies. The
firm's 10,000-plus engineers focus exclusively on software-powered innovation for the
connected world.
frog, the global leader in innovation and design, based in San Francisco, is part of Aricent.
The company's key investors are Kohlberg Kravis Roberts & Co. and Sequoia Capital.
info@aricent.com

© 2014 Aricent. All rights reserved.


All Aricent brand and product names are service marks, trademarks, or registered marks of Aricent in the United States and other countries.
