Aricent HEVC Whitepaper

High Efficiency Video Coding
(HEVC): Challenges & Benefits

SANJEEV VERMA
Principal Systems Engineer, Aricent
www.aricent.com
Display technologies play a crucial role in defining the user experience for a whole range of devices, from big-screen TVs to small handheld devices such as mobile phones. To satisfy the ever-increasing demand for a better visual experience, display technologies are continually evolving. In a relatively short span of time we have gone from High Definition (HD) to full HD, and now many devices offer Ultra HD (UHD) as well. Delivering heavy UHD content over legacy carriers is a huge challenge and demands a much more efficient video compression technology. The High Efficiency Video Coding (HEVC) standard addresses this problem and helps deliver high-resolution content smoothly even over a low-bandwidth connection.
This whitepaper discusses how HEVC adoption helps save bandwidth and enables distribution of UHD content. It describes how an end user benefits from HEVC adoption through higher resolution, smoother playback and higher bit-depth video quality. It also provides insights into HEVC industry trends and the challenges involved in migrating to HEVC using the currently available hardware platforms.
The paper then details the additional complexity introduced by the H.265 standard and the challenges involved in implementing the associated toolsets. It proposes a GPU-accelerated HEVC decoder for improved battery life and discusses a hybrid multithreading approach for better load balancing between CPU cores. Finally, it touches upon profiling techniques for identifying hot spots in the code and the cache memory considerations to follow while architecting video software for improved performance.
HEVC Definition and Differentiator

The Joint Collaborative Team on Video Coding (JCT-VC) released the final publication of the High Efficiency Video Coding (HEVC) standard worldwide in Q4 2013. HEVC is a video coding standard that provides much better quality (at the same bit-rate) than its predecessor H.264 and enables a multimedia experience that goes beyond High Definition: Ultra-High Definition!
Native parallel tools (Tiles and Wavefront Parallel Processing, WPP) introduced in the standard make it a multi-core friendly codec. More exhaustive prediction modes, a hierarchical block partitioning strategy, and improved post processing are a few of the key enhancements that enable HEVC to deliver the quality required by the UHDTV revolution.
Benefits of HEVC

The higher compression offered by HEVC technology has opened up doors for seamlessly streaming full HD content at 60/120 frames per second (fps) on channels that were originally made for streaming full HD 30 fps media. HEVC is a boon for online media hubs, IPTV companies, broadcasters and other network operators, as it would enable them to deliver a compelling user experience even over low-speed broadband connections.

Better Quality

A video standard is said to be more efficient if it achieves a better Peak Signal to Noise Ratio (PSNR), that is, loses less quality, for a given bit-rate during the encode-decode cycle. Fig. 1 compares the PSNR data obtained for the HEVC and H.264 codecs. HEVC consistently leads H.264 and delivers better PSNR at all the bitrates tested. Experiments reveal that HEVC is able to save almost 40 to 50 percent of the bit-rate for most standard content scenarios and hence opens up doors for 4K video streaming on current networks.
Fig. 1: PSNR Comparison: H.264 vs HEVC (for Aricent generated high motion content). Average PSNR (axis range 30 to 40) plotted against bitrates of 2, 4 and 6 Mbps.

Smoother Playback

Frequent re-buffering and jerky playback due to lack of speed (bandwidth) are very annoying and reduce the quality of the user experience. As a result, a huge number of people still prefer to watch downloaded content rather than watching it online. HEVC can change this scenario by reducing the channel traffic by up to 50 percent. The bandwidth freed up in this way can be used to avoid re-buffering and give the user a smooth playback experience without interruptions.

Enables Adaptive Streaming

Internet speed fluctuations, variations in the content bitrate and instantaneous increases in the (computational) complexity of the video can cause undesirable frame drops or re-buffering during streaming. Adaptive streaming is a technology that lets playback switch between versions of the content encoded at various bit-rates in accordance with the available bandwidth or CPU speed. MPEG-DASH, Microsoft Smooth Streaming (MSS), and Apple's HTTP Live Streaming (HLS) are a few of the leading technologies that address frame drop issues and provide smooth playback on the user's device by switching to the most suitable version of the content.
A solution incorporating both MPEG-DASH and HEVC can leverage HEVC to encode the content with a very high compression ratio (even at low bitrates) and utilize MPEG-DASH for adaptive streaming, thus delivering unprecedented quality of experience to end users.

Enables UHDTV Broadcasting

Not just online streaming: satellite television will also benefit greatly from HEVC adoption. Leading direct-to-home (DTH) service providers are planning to upgrade their content, which will then be delivered over HEVC technology. The DTH ecosystem is laying the foundation for UHD content delivery so that UHDTV broadcasting can become mainstream by 2016. Ultra-HD enabled televisions are already being manufactured by Sony, Samsung, LG and other consumer electronics leaders. NHK (Nippon Hōsō Kyōkai), a Japanese public broadcaster, is preparing to broadcast UHD content in Japan in the near future. In fact, NHK recently announced an 8K sensor that is capable of shooting at 120 fps (frames per second).

HEVC Adoption Trend

Online streaming is fast becoming the most preferred medium for watching video. In fact, more and more people these days watch movies, TV programs, etc. on YouTube rather than on their TV sets.
Paid viewership is also increasing by the day, leading to a steep increase in consumers' average spending on online video streaming.
According to statistics published by YouTube, "Over 6 billion hours of video are watched each month on YouTube, that's almost an hour for every person on Earth, and 50% more than last year." Around 100 hours of video content is uploaded to YouTube every minute. Given this scenario, content providers and aggregators must deliver content at a lower cost while improving the quality of the video.
HEVC would play a significant role in further bringing down the cost of online streaming. With the current infrastructure, whatever a user spends on video streaming can be cut by up to 50% by using HEVC technology, because HEVC needs about half the bit-rate of legacy technologies for comparable quality. Alternatively, by deploying HEVC, the quality of the content can be upgraded without any extra load on the channels, and users can enjoy enhanced quality at the same cost. Using HEVC on 3G/4G networks is likely to reduce the cost for mobile users and would encourage more video viewing over mobile networks. In fact, Vodafone is already marketing itself as "A network for 24x7 streaming" with regard to e-learning and online video viewership.

Challenges with High Efficiency Video Coding (HEVC)

Computational needs in video coding have increased drastically since the Joint Collaborative Team on Video Coding (JCT-VC) announced the HEVC standard for video compression. While the higher compression offered by HEVC provides better quality, it also calls for equally efficient platforms and implementations that can handle the increased complexity brought by the standard. The sections below discuss the complexity of various HEVC modules compared to the H.264 standard.

Increased Complexity in Intra-prediction

Intra (or IDR) frames act as key frames in the video coding process, and hence the prediction accuracy of intra frames plays a vital role in deciding the overall quality of the video. An intra frame acts as the initial reference frame for the other P or B predicted frames within a Group of Pictures (GOP). If there is a significant loss of quality in the intra-prediction process of a key frame, it can propagate to the rest of the non-I frames until the next I frame arrives. Keeping this in mind, the HEVC standard proposes a total of 35 different modes, while H.264 used a maximum of 9 modes for block-based intra-prediction. Searching in additional directions provides better quality, but at the same time the computational complexity increases multifold. Intra smoothing is another feature that adds further complexity to key frame processing.

Flexible Block Partitioning

H.264 divides the frame uniformly into processing units of size 16x16 called macroblocks. Macroblocks can be further divided into smaller blocks of size 8x8 or 4x4 for prediction purposes. H.265 has a much more complex image partitioning method and replaces macroblocks with a concept called the Coding Tree Unit (CTU), which allows quad tree recursion based block partitioning. A frame is divided into CTUs, which can be of size 64x64, 32x32 or 16x16. A CTU may be split recursively into four parts called Coding Units (CUs), all the way down to 8x8. Fig. 2 depicts the quad tree recursion based partitioning system for a CTU pictorially. Each CU can be further divided into Prediction Units (PUs) in a symmetrical or asymmetrical way, as shown in Fig. 3.
Fig. 2: Quad Tree based recursion within a Coding Tree Unit (CTU), showing CUs of size 32x32, 16x16 and 8x8
Fig. 3: Coding Unit (CU) Splits - Symmetrical and Asymmetrical (2Nx2N, 2NxN, Nx2N, NxN, 2NxnU, 2NxnD, nLx2N, nRx2N)
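As a rough illustration of the quad tree recursion described in this section, the Python sketch below splits a 64x64 CTU into leaf CUs. The names `split_ctu` and `demo_rule`, and the split rule itself, are hypothetical and chosen only for illustration. They are not taken from the standard or from any Aricent codec. A real encoder would typically pick each split by comparing coding costs, and a decoder reads the split decisions from the bitstream.

```python
from typing import Callable, List, Tuple

# A leaf coding unit, described as (x, y, size) of a square block.
CU = Tuple[int, int, int]


def split_ctu(x: int, y: int, size: int,
              should_split: Callable[[int, int, int], bool],
              min_cu: int = 8) -> List[CU]:
    """Recursively partition a square block into leaf CUs (quad tree).

    should_split is a caller-supplied decision function (an assumption of
    this sketch). Blocks are never split below min_cu.
    """
    if size > min_cu and should_split(x, y, size):
        half = size // 2
        leaves: List[CU] = []
        # Visit the four quadrants: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(
                    split_ctu(x + dx, y + dy, half, should_split, min_cu))
        return leaves
    return [(x, y, size)]


def demo_rule(x: int, y: int, size: int) -> bool:
    # Hypothetical rule: keep splitting only the block in the top-left corner.
    return x == 0 and y == 0


cus = split_ctu(0, 0, 64, demo_rule)
for cu in cus:
    print(cu)
```

With `demo_rule`, the 64x64 CTU ends up as three 32x32 CUs, three 16x16 CUs and four 8x8 CUs, which together still cover the full 64x64 area.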
More versatile block sizes mean more complex motion estimation search algorithms in HEVC, which require more computational power. The dynamically changing CU split architecture introduces many condition checks at the block level, which may not be straightforward to implement on deep-pipeline architectures such as ARMv7/v8.
Inter-prediction complexity has been increased in HEVC by using 8-tap interpolation filters, while H.264 used a maximum of 6 taps. Chroma interpolation uses 4-tap interpolation, compared to the bilinear filter in H.264. Additionally, the motion vector prediction module becomes more computationally intensive through the introduction of merge and skip modes, as explained in [8].

Variable Size Block Transform

The HEVC standard supports 4x4, 8x8, 16x16 and 32x32 sizes for block transformation, while H.264 supports a uniform transform block size of 4x4 for main profile. Having a versatile transform size methodology provides better compression, but at the same time performing transforms on bigger blocks becomes more complex from a Single Instruction, Multiple Data (SIMD) instruction and data cache perspective. Increased precision for the coefficients in the transform matrix further adds to the complexity of the overall transformation process. Fig. 4 captures how the transform unit (TU) size varies across an HEVC frame.
Fig. 4: TU Split variation in HEVC

Additional Post Processing (Sample Adaptive Offset)

Sample Adaptive Offset (SAO) is a toolset that has been added in HEVC after the de-blocking stage. It improves the PSNR by reducing ringing-related distortions and also enhances the visual quality of the video. In SAO, an offset is added to a pixel sample based on the SAO parameters signaled for a particular CTU. The offset also depends on neighboring pixel values and the direction indicated in the SAO parameters. While this brings additional computational complexity to the codec implementation, it also induces neighboring dependencies, making it challenging to implement on a parallel architecture like a GPU.

Addressing HEVC Challenges through Aricent's Offerings

Leading processor makers such as ARM, Intel and AMD have been continuously striving to deliver faster yet low-power platforms to meet the computational needs of the ever-growing multimedia market. Single Instruction, Multiple Data (SIMD) Neon technology, combined with the load-store architecture present in ARMv7 based processors (ARM Cortex-A8, A9, A15 etc.), enables parallel processing at the instruction level, where 128-bit wide vectors can be operated upon in a single instruction. This means the Neon co-processor can operate on either sixteen 8-bit elements or eight 16-bit elements in parallel for an arithmetic, logical or memory load/store operation. Similarly, Intel's recent SIMD instruction set extensions such as SSE4, AVX and AVX2 offer varied forms of parallel processing and can deliver the performance needed by HEVC.
With current silicon technology it may not be possible to increase the CPU clock beyond a certain extent due to thermal issues. However, chip makers have recently launched heterogeneous Systems on Chip (SoCs) with multiple processing units, which can deliver the compute performance needed to meet the increasing demands of video algorithms. The Samsung Exynos, NVIDIA Tegra and Qualcomm Snapdragon chipset series, to name just a few, are powered by the ARMv7 architecture and incorporate multiple CPU cores (running as high as 2.5 GHz) along with GPU Compute capability. These platforms provide greater computational power to video software makers, but programmers need to design and architect their software in a parallel way to extract the maximum performance out of multi-core systems.

Leveraging GPU Compute

The ARM Mali T6xx Graphics Processing Unit (GPU), with 128-bit SIMD capabilities and parallel computing technology, is now being leveraged by video algorithm developers at Aricent to develop codec solutions with low power consumption and improved performance targeting Ultra HD resolution. OpenCL APIs exposed by the Mali GPU facilitate quicker implementation of video algorithms, which reduces time-to-market for new products. By offloading certain modules of the HEVC video decoder to the GPU, not only is decoding made faster, but significant power is also saved that would otherwise have been consumed by the CPU, as GPUs are highly power efficient compared to CPUs.

Effective CPU loading with Hybrid Multithreading

Parallel computing is becoming commonplace, and most performance-critical software is being ported to take advantage of multi-core architectures. Load balancing can become a bottleneck if the software has not been suitably architected. Aricent proposed [10] a hybrid design approach that combines functional and spatial multithreading techniques and effectively leverages a multi-core architecture to develop highly efficient video software in various content scenarios. By using the hybrid multithreading approach, Aricent is able to develop an HEVC decoder that is capable of delivering up to 90 frames per second at full HD (1920x1080 resolution) on a quad-core A15 based ARM platform. The hybrid approach also showed better results in optimizing HEVC decoder software on the Intel Core i5 architecture, with improved numbers for most of the content when compared to conventional multithreading techniques.

Identifying Hot Spots and Software Profiling

Identifying performance-critical functions in software is an important step in the optimization cycle. Typically, 20% of the code accounts for 80% of the run time and needs to be optimized for performance. These functions are found by using profiling tools such as the GNU profiler gprof, DS-5 by ARM and CodeXL by AMD, to name a few. Profiling and optimization is an iterative process that is followed until the desired performance is achieved. Once performance-critical functions are identified, they are coded in assembly language to get the best performance. When used in conjunction with SIMD instructions, manually coded assembly functions can perform 4 to 5 times faster than compiler-optimized functions on most platforms.

Cache Friendly Memory Access

Rearranging data structures and modifying memory access patterns to suit the cache architecture is yet another important step in the optimization process. The code flow needs to be worked out based on the available cache memory and the levels of cache. For example, in HEVC a block-based decode pipeline is more cache friendly than frame-based decoding. If the data cache is relatively big, one can choose to process a few blocks or a row at a time to gain additional performance for the code cache. In all scenarios, memory access patterns that allow consecutive memory access are recommended for CPU-based platforms. However, for architectures like the AMD Radeon GPU, memory bank conflicts [11] need to be taken care of while deciding the memory access pattern. One may need to study the cache allocation and eviction policy to plan the data flow for the software.

Aricent HEVC Software Enabler

Aricent offers highly optimized HEVC software codecs that are deployed on various operating systems such as Android, iOS, Linux and Windows Phone on both ARM and Intel based devices. The codecs are fully compliant with the HEVC standard and support full HD (1920x1080) and UHD resolutions including 2K and 4K. The software solutions have been highly optimized to achieve peak performance on various SoCs like Qualcomm Snapdragon, Samsung Exynos, Apple A6 and other next-generation chipsets, and support GPGPU offloading for better battery life. The HEVC decoder solution also enables multi-screen support for the varying resolutions of various consumer devices. [...] ideal for early adoption. The platform agnostic codecs can be [...]

Conclusion

UHDTV broadcasting will become mainstream very soon, and HEVC will play a vital role in delivering the compression required to complement the technology. VP9 is emerging as a competing technology to HEVC and has the advantage of being a license-free codec. Nevertheless, due to better compression efficiency, wider color space/format coverage, and having originated from a more reliable standards body, HEVC will remain a leading technology for video compression in this decade.

References

1. Bingbing Xia, Fei Qiao, Huazhong Yang and Hui Wang, "An efficient methodology for transaction-level design of multi-core h.264 video decoder", Consumer Electronics (ICCE), 2011 IEEE International Conference, Jan. 2011
2. Kue-Hwan Sihn, Hyunki Baik, Jong-Tae Kim, Sehyun Bae and Hyo Jung Song, "Novel approaches to parallel H.264 decoder on symmetric multicore systems", Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference, Apr. 2009

3. Nishihara, K., Hatabu, A. and Moriyoshi, T., "Parallelization of H.264 video decoder for embedded multicore processor", Multimedia and Expo, 2008 IEEE International Conference, Apr. 2008
4. Falcao, G., Sousa, L. and Silva, V., "Massively LDPC Decoding on Multicore Architectures", Parallel and Distributed Systems, IEEE Transactions, Feb. 2011
5. Ngai-Man Cheung, Xiaopeng Fan, Au, O.C. and Man-Cheung Kung, "Video Coding on Multicore Graphics Processors", Signal Processing Magazine, IEEE, Issue 2, Mar. 2010
6. Yun-il Kim, Jong-Tae Kim, Sehyun Bae, Hyunki Baik and Hyo Jung Song, "H.264/AVC decoder parallelization and optimization on asymmetric multicore platform using dynamic load balancing", Multimedia and Expo, 2008 IEEE International Conference, June 23-26, 2008
7. ARM Limited, "Cortex-A15 Revision: r2p0, Technical Reference Manual", http://infocenter.arm.com, Sept 2011
8. ITU-T, "Recommendation ITU-T H.265", www.itu.int, Apr. 2013
9. Sanjeev Verma, "Enabling GPU Compute on an ARM Mali-T600 GPU creates a power efficient HEVC decode solution", http://goo.gl/PxmuWS, Feb 2014
10. Sanjeev Verma, "Parallel Computing: Architecting video software for multi-core heterogeneous platforms", http://goo.gl/nTWj3B, Jul 2014
11. AMD, "AMD Accelerated Parallel Processing OpenCL Programming Guide", http://goo.gl/te0mB8, Jul 2014
Engineering excellence. Sourced
Aricent is the world's #1 pure-play product engineering services and software firm. The company has 20-plus years of experience co-creating ambitious products with the leading networking, telecom, software, semiconductor, Internet and industrial companies. The firm's 10,000-plus engineers focus exclusively on software-powered innovation for the connected world.
frog, the global leader in innovation and design, based in San Francisco, is part of Aricent. The company's key investors are Kohlberg Kravis Roberts & Co. and Sequoia Capital.
info@aricent.com
© 2014 Aricent. All rights reserved.

All Aricent brand and product names are service marks, trademarks, or registered marks of Aricent in the United States and other countries.
