Fig. 2. Various complexity in spatial and color spaces. [Panels shown: (c) Color = 0.00, (d) Color = 1.07.]

Fig. 3. Various complexity in temporal and chunk variation spaces, where the ith column is the row sum of the ith frame. [Panels shown: (c) Chunk Variation = 0.05, (d) Chunk Variation = 22.26.]
• Color: We define the color complexity metric as the ratio of the average mean Sum of Squared Error (SSE) in the U and V channels to the mean SSE in the Y channel (obtained from PSNR measurements). A high score means complex color variations in the frame (Fig. 2 (d)), while a low score usually means a gray image with only luminance changes (Fig. 2 (c)).
• Temporal: The number of bits used to encode a P frame is proportional to the temporal complexity of a sequence. However, visual material with high spatial complexity (large I frames) tends to have large P frames, because small motion compensation errors lead to large residuals. To decouple this effect, we normalize the P frame bits by taking their ratio to the I frame bits, as a fair indicator of temporal complexity. Fig. 3 (a) and Fig. 3 (b) show the row sum maps of videos with low and high temporal complexity, where the ith column is the row sum of the ith frame. Frequent column color changes in Fig. 3 (b) imply fast motion among frames, while small changes in the temporal domain lead to homogeneous regions in Fig. 3 (a).
• Chunk Variation: To explore quality variation within the video, we measure the standard deviation of compressed bitrates among all 1 sec chunks (also normalized by the frame size). If the scene is static or has smooth motion, the chunk variation score should be close to 0, and the row sum map has no sudden changes (Fig. 3 (c)). Multiple scene changes within a video (common in Gaming videos) lead to multiple distinct regions on the row sum map (Fig. 3 (d)). (A sketch of how these features could be computed follows this list.)
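As an illustration, the three features above could be computed roughly as follows. This is a minimal sketch, not the paper's code: it assumes per-frame encode statistics (per-channel SSE from PSNR measurements, P/I frame bit counts, per-chunk bitrates) are already available, and all function names are hypothetical.

    import numpy as np

    def color_complexity(sse_y, sse_u, sse_v):
        # Ratio of the average mean SSE in the U and V channels
        # to the mean SSE in the Y channel.
        chroma = (np.mean(sse_u) + np.mean(sse_v)) / 2.0
        return chroma / np.mean(sse_y)

    def temporal_complexity(p_frame_bits, i_frame_bits):
        # P frame bits normalized by I frame bits, decoupling
        # spatial complexity (large I frames) from motion.
        return np.mean(p_frame_bits) / np.mean(i_frame_bits)

    def chunk_variation(chunk_bitrates, frame_size):
        # Standard deviation of compressed bitrate over 1 sec
        # chunks, normalized by the frame size.
        return np.std(chunk_bitrates) / frame_size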
Each original video and time offset combination forms a candidate clip for the dataset. We sample from this sparse 4-D feature space as described below (a sketch of the whole procedure follows the list):
1. Normalize the feature space F by subtracting min(F) and dividing by (percentile(F, 99) − min(F)).

2. For each feature, divide the normalized range [0, 1] uniformly into N (= 3) bins, where the last bin can go over 1 and extend up to the maximum value.

3. Permute all non-empty bins uniformly at random.

4. For the current bin, randomly select one candidate clip and add it to the dataset if and only if:

   • the Euclidean distance in feature scores between this clip and each already selected clip is greater than a certain threshold (0.3);

   • the original video that this clip belongs to does not already have another clip in the dataset.

5. Move to the next bin and repeat step 4 until the desired number of samples has been added.
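A minimal sketch of this sampling procedure, under the assumption that each candidate is a (clip_id, video_id, 4-D feature vector) tuple; the names and structure here are illustrative, not the paper's actual implementation:

    import random
    import numpy as np

    def sample_clips(candidates, n_bins=3, min_dist=0.3, n_samples=50):
        feats = np.array([f for _, _, f in candidates], dtype=float)
        # Step 1: normalize each feature by (x - min) / (p99 - min).
        lo = feats.min(axis=0)
        hi = np.percentile(feats, 99, axis=0)
        feats = (feats - lo) / (hi - lo)

        # Step 2: assign each clip to a 4-D bin; the last bin is
        # open-ended, so scores above 1 still land in bin n_bins - 1.
        bin_ids = np.minimum((feats * n_bins).astype(int), n_bins - 1)
        bins = {}
        for i, b in enumerate(map(tuple, bin_ids)):
            bins.setdefault(b, []).append(i)

        # Step 3: visit non-empty bins in uniformly random order.
        order = list(bins)
        random.shuffle(order)

        selected, used_videos = [], set()
        while len(selected) < n_samples:
            progress = False
            for b in order:  # Steps 4-5: try one clip per bin per pass.
                if len(selected) >= n_samples:
                    break
                i = random.choice(bins[b])
                clip_id, video_id, _ = candidates[i]
                far_enough = all(
                    np.linalg.norm(feats[i] - feats[j]) > min_dist
                    for j in selected)
                if far_enough and video_id not in used_videos:
                    selected.append(i)
                    used_videos.add(video_id)
                    progress = True
            if not progress:  # sparse pool exhausted; stop early
                break
        return [candidates[i][0] for i in selected]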
We manually remove mislabeled clips from the 50 selected clips, leaving 15 to 25 selected clips from each (category, resolution) subgroup. We design the sampling scheme to sample comprehensively along all four feature spaces. As the normalized complexity distributions in Fig. 4 show, the feature scores in our sampled set are less spiky than those of the initial set. The sample distribution in color space is left-skewed, because many videos with high complexity in the other feature spaces are uncolorful. To evaluate the coverage of the sampled set, we divide each pairwise feature space into a 10 × 10 grid. The percentage of grid cells covered by the sampled set is shown in Table 1, where the average coverage rate is 89% (a sketch of this computation appears after Table 1).

[Fig. 4: Normalized distributions (Probability vs. score) of the Spatial, Temporal, Color, and Chunk Variation features, comparing all candidate clips (All) with the sampled set (Sampled).]

Table 1. Coverage rates for pairwise feature spaces.

              Temporal   Color   Chunk Variation
    Spatial      93%      90%          88%
    Temporal              92%          88%
    Color                              83%
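For concreteness, the pairwise coverage in Table 1 could be computed along these lines. This is a sketch; `sampled` is assumed to be an (n, 4) array of normalized feature scores, and the function name is hypothetical:

    import numpy as np

    def pairwise_coverage(sampled, fi, fj, grid=10):
        # Map each clip's (feature fi, feature fj) pair to one of
        # the grid x grid cells; scores >= 1 go to the last cell.
        xi = np.minimum((sampled[:, fi] * grid).astype(int), grid - 1)
        xj = np.minimum((sampled[:, fj] * grid).astype(int), grid - 1)
        covered = set(zip(xi.tolist(), xj.tolist()))
        # Fraction of cells hit by at least one sampled clip.
        return len(covered) / (grid * grid)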
Fig. 7. Distribution of no-reference quality metrics. [Histograms of Noise, Banding, and SLEEQ scores over the dataset.]

[Fig. 8: Distributions of Noise, Banding, and SLEEQ scores within individual categories (Animation, Cover, LiveMusic, UgcVert).]

4. NO-REFERENCE METRICS FOR UGC

As mentioned in Section 1, most UGC videos are non-pristine, which may confuse traditional reference-based quality metrics. Fig. 5 and Fig. 6 show two clips (ids: Vlog 720P-343d and Vlog 720P-670d) from our UGC dataset, as well as their compressed versions (encoded with H.264 at CRF 32).
Fig. 5. Misleading reference quality scores, where PSNR, SSIM, and VMAF are 29.02, 0.86, and 58.15.

We can see that the corresponding PSNR, SSIM, and VMAF scores are poor; however, the original and compressed versions are visually similar. Both the original and compressed versions in Fig. 5 contain significant artifacts, while the compressed version in Fig. 6 actually has fewer noise artifacts than the original. The low reference quality scores are mainly caused by not correctly accounting for the artifacts already present in the originals.
Although no existing quality metric can perfectly evaluate the quality degradation between original and compressed versions, a possible approach is to identify quality issues separately. For example, since the major quality issue for the videos in Fig. 5 is distortion in a natural scene, we can compute SLEEQ [6] scores on the original and compressed versions independently, and use their difference to evaluate the quality degradation. In this case, the SLEEQ scores for the original and compressed versions are 0.21 and 0.18, respectively, which implies that compression didn't introduce noticeable quality degradation. We can reach the same conclusion for the videos in Fig. 6 by applying a noise detector [4].
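The side-by-side idea could be sketched as follows; `sleeq_score` stands in for any no-reference metric and is a hypothetical name, not an API from the paper:

    def compression_degradation(original_path, compressed_path, sleeq_score):
        # Score both versions independently with a no-reference
        # metric (lower is better), then compare the difference.
        s_orig = sleeq_score(original_path)    # e.g. 0.21 for Fig. 5
        s_comp = sleeq_score(compressed_path)  # e.g. 0.18 for Fig. 5
        # A difference near (or below) zero implies compression added
        # no noticeable degradation beyond the original's own
        # artifacts, even if PSNR/SSIM/VMAF look poor.
        return s_comp - s_orig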
Our UGC dataset provides three no-reference metrics: Noise, Banding [5], and SLEEQ. The first two are artifact-oriented, and can be interpreted as the extent of a specific compression issue; the third (SLEEQ) is designed to measure compression artifacts in natural scenes. All three metrics correlate well with human ratings and outperform other related no-reference metrics. Fig. 7 shows the distributions of the three metric scores (normalized to [0, 1], where a lower score is better) on our UGC dataset. Heavy artifacts are not detected in most videos, which suggests that the overall quality of videos uploaded to YouTube is good. Fig. 8 shows some quality issues within individual categories: for example, Animation videos seem to have more banding artifacts than others, and nature scenes in Vlog tend to contain more artifacts than other categories.

5. CONCLUSION
[3] Zhi Li, Anne Aaron, Ioannis Katsavounidis, Anush Moorthy, and Megha Manohara, "Toward a practical perceptual video quality metric," Netflix Technology Blog, 2016.

[4] Chao Chen, Mohammad Izadi, and Anil Kokaram, "A no-reference perceptual quality metric for videos distorted by spatially correlated noise," ACM Multimedia, 2016.

[5] Yilin Wang, Sang-Uok Kum, Chao Chen, and Anil Kokaram, "A perceptual visibility metric for banding artifacts," IEEE International Conference on Image Processing, 2016.

[15] Stefan Winkler, "Analysis of public image and video databases for quality assessment," IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, 2012.

[16] Abhishek Verma, Luis Pedrosa, Madhukar R. Korupolu, David Oppenheimer, Eric Tune, and John Wilkes, "Large-scale cluster management at Google with Borg," in Proceedings of the European Conference on Computer Systems (EuroSys), Bordeaux, France, 2015.