This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2019.2914950, IEEE Transactions on Image Processing.
Study of Subjective Quality and Objective Blind Quality Prediction of Stereoscopic Videos

Balasubramanyam Appina1, Student Member, IEEE, Sathya Veera Reddy Dendi1, K. Manasa1, Sumohana S. Channappayya1, Member, IEEE, and Alan C. Bovik2, Fellow, IEEE

Abstract—We present a new subjective and objective study of full high-definition (HD) stereoscopic (3D or S3D) video quality. In the subjective study, we constructed an S3D video dataset with 12 pristine and 288 test videos, where the test videos were generated by applying H.264 and H.265 compression, blur, and frame freeze artifacts. We also propose a no-reference (NR) objective video quality assessment (QA) algorithm that relies on measurements of the statistical dependencies between the motion and disparity subband coefficients of S3D videos. Inspired by the Generalized Gaussian Distribution (GGD) approach in [1], we model the joint statistical dependencies between the motion and disparity components as following a Bivariate Generalized Gaussian Distribution (BGGD). We estimate the BGGD model parameters (α, β) and the coherence measure (Ψ) from the eigenvalues of the sample covariance matrix (M) of the BGGD. In turn, we model the BGGD parameters of pristine S3D videos using a Multivariate Gaussian (MVG) distribution. The likelihood of a test video's MVG model parameters coming from the pristine MVG model is computed and shown to play a key role in the overall quality estimation. We also estimate the global motion content of each video by averaging the SSIM scores between pairs of successive video frames. To estimate a test S3D video's spatial quality, we apply the popular 2D NR unsupervised NIQE image QA model on a frame-by-frame basis on both views. The overall quality of a test S3D video is finally computed by pooling the test S3D video's likelihood estimates, global motion strength, and spatial quality scores. The proposed algorithm, which is 'completely blind' (requiring no reference videos or training on subjective scores), is called the Motion and Disparity based 3D video quality evaluator (MoDi3D). We show that MoDi3D delivers competitive performance over a wide variety of datasets, including the IRCCYN dataset, the WaterlooIVC Phase I dataset, the LFOVIA dataset, and our proposed LFOVIAS3DPh2 S3D video dataset.

Index Terms—Stereoscopic video, subjective study, Full-HD, unsupervised algorithm, joint statistics.

I. INTRODUCTION

According to a survey by eMarketer [2], "people spend more time with digital video than with social media." Given the online availability of digital content, along with significant advancements in computing, communication and display technologies, consumer viewing of digital videos continues to increase tremendously [3]. This is also true of three-dimensional (3D) multimedia content. As with 2D digital video content, 3D content also undergoes several post-acquisition processing stages such as sampling, quantization and compression. These and other acquisition and post-processing steps can produce considerable degradation of the overall perceived stereoscopic 3D (S3D) video quality. In principle, each of these stages could be perceptually optimized against objective evaluations of the resulting perceived S3D picture quality.

Quality assessment (QA) can be of two types: subjective and objective. In subjective assessment, human subjects perform the quality assessment task, which is a cumbersome and time-consuming process. However, subjective assessment is important, since most S3D video content is meant for human consumption, and human opinion scores serve as valuable benchmarks for objective assessment algorithms. Objective assessment entails computing predicted scores that correlate well with subjective judgment.

In this work, we present two contributions: 1) a subjective study on full-HD (1920 × 1080) resolution S3D videos. The proposed dataset has 12 pristine S3D videos and 288 test stimuli, where the test stimuli are distorted videos whose distortions are caused by H.264 and H.265 compression, blur, and frame freeze. 2) A completely blind S3D NR video quality assessment (VQA) model based on measuring the joint statistical dependency strength between motion and disparity components, global motion, and spatial NIQE scores. Our algorithm is called the Motion and Disparity 3D video quality evaluator (MoDi3D).

The rest of the paper is organized as follows. Section II gives a short survey of the recent literature on subjective and objective studies of stereoscopic videos. Section III explains the subjective study experiment and Section IV describes the proposed objective algorithm. Section V describes the validation of the proposed LFOVIAS3DPh2 dataset and a performance evaluation of MoDi3D. Section VI presents concluding remarks and directions for future exploration.

1 The authors are with the Lab for Video and Image Analysis (LFOVIA), Department of Electrical Engineering, Indian Institute of Technology Hyderabad, Kandi, India, 502285. E-mail: {ee13m14p100001, ee16resch01003, ee12p1002, sumohana}@iith.ac.in.
2 The author is with the Laboratory for Image and Video Engineering (LIVE), Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712 USA. E-mail: bovik@ece.utexas.edu.

II. BACKGROUND

We first review recent subjective and objective studies on S3D video quality assessment.

A. Subjective Quality Assessment of 3D Videos

The factors underlying the design of a good 3D stereoscopic dataset cannot be overemphasized. The 2D video dataset landscape is rich with a diverse collection of datasets [4]–[6]. 2D video acquisition is inexpensive and ubiquitous, even at ultra HD resolution.


Fig. 1: One frame from each pristine video in the LFOVIAS3DPh2 dataset: (a) RMIT Courtyard View of Melbourne Central Tower, (b) RMIT University Courtyard, (c) State Library of Victoria (La Trobe Reading Room), (d) St. Kilda Rd Gardens (Water Fountain Jet-Streams), (e) Domain Parklands, (f) Melbourne Bicycle Stand, (g) Melbourne Bicycle Stand, (h) Flower Garden, (i) Swanston Street Tram Stop, (j) Southbank Art Sculpture and Melbourne Skyline, (k) Botanic Gardens and Government House, (l) Flinders Street Station.

By comparison, the creation and availability of 3D video datasets has been slower, owing to complexities that arise [7] in stereo video acquisition. Fang et al. [8] presented a survey of publicly available stereoscopic video datasets, and reported on the specifications and properties of the videos that have been used in several studies. They also highlighted the contributions made by these authors as well as existing gaps that limit the utility of publicly available datasets.

We briefly review frequently used S3D video datasets that involve compression artifacts and other common distortions. De Silva et al. [9] created an S3D video dataset containing H.264 and H.265 compression artifacts. The dataset has 14 reference and 116 test sequences of full HD resolution, downsampled to 960×1080. They concluded that higher quantization step sizes caused more significant perceptual quality differences than lower quantization step sizes. Hewage et al. [10] created an S3D video dataset which they used to explore the effects of random packet losses on the overall perception of S3D videos. They used 9 reference sequences, 54 test sequences, and 6 different packet loss rates. They concluded that S3D perceptual quality was significantly affected by the loss of packets from either the left or right views of an S3D video.

Urvoy et al. [11] created a symmetric stereoscopic video dataset with artifacts from H.264, image sharpening, resolution reduction, downsampling, and JP2K compression. They concluded that downsampling and sharpening did not affect subjective judgments of quality, but the tested objective models failed to reproduce this trend. Aflaki et al. [12] explored the effects of asymmetric encoding of S3D videos. This dataset contains both synthesized and natural sequences, which were viewed by 16 subjects. They observed that asymmetric encoding offers bitrate savings as compared to symmetric encoding. Chen et al. [13] created an H.264 S3D video dataset to study the perceptual properties of spatial quality, disparity quality and visual comfort of stereoscopic videos. They used video sequences drawn from the EPFL dataset and the ETRI Korea dataset as references. They compressed the videos using 9 different sets of quantization parameter (QP) values. They calculated the overall score by combining the spatial and disparity quality scores. They concluded that the subjective judgment of an S3D video correlates with reported spatial quality, but relates differently to disparity quality scores.

The Waterloo-IVC S3D video dataset [14] is a combination of two phases. They used 10 reference sequences afflicted by distortions from H.265 compression artifacts and post-processing effects to create 704 test video sequences. The video resolutions are either 1024 × 768 or 1920 × 1080, with durations of either 6 sec or 10 sec. The LFOVIA dataset [15] contains H.264 compressed stereoscopic video sequences, including 6 pristine and 144 distorted contents. They concluded that compression artifacts affect the perceptual quality of S3D videos containing smaller disparity ranges more strongly than those containing large ranges of disparity values. Wang et al. [16] created an S3D video dataset by applying H.264/AVC or H.264/MVC video coding on S3D videos. They also considered the effects of temporal and spatial resolution reduction on perceived quality. They arrived at a variety of conclusions regarding potential bandwidth savings on S3D videos, such as: a) frame rate reductions reduce the quality of videos, and b) Multi Video Coded (MVC) videos more effectively trade off quality against bitrate than does H.264 simulcast. Chen et al. [17] created an S3D video dataset to investigate the relationship between disparity quality, video quality and S3D quality of experience. The dataset is a combination of synthetic and natural video sequences, consisting of 6 reference and 126 HEVC distorted video sequences. Dumic et al. [18] created an S3D video dataset that consists of 8 pristine and 176 distorted S3D video sequences. The test video sequences are a combination of H.264, JP2K, long duration frame freezes, packet loss and scaling distortions. The video sequences have a resolution of 1920 × 1080 with a frame rate of 25 fps, and 35 subjects participated in the associated human study.


The subjective studies [9]–[18] are a significant contribution to the 3D research community's efforts to understand the perceptual quality of S3D videos. These datasets consider distortions due to compression (H.264, H.265, JP2K), RTP packet loss, post-processing, image sharpening, and scaling. However, none of these studies specifically performed subjective experiments on distortions due to blur or rendering errors, such as short duration frame freezes. In our study, we address a variety of commonly occurring artifacts, including those due to H.264 and H.265 compression, blur, and frame freezes. We created 288 test videos (144 symmetrically compressed videos and 144 asymmetrically compressed videos) derived from 12 pristine videos. These were used in a subjective study participated in by 20 human subjects. Our dataset is freely available to the research community at [19].

B. Objective Quality Assessment of 3D Videos

Several authors [20]–[25] have proposed objective models to assess S3D videos by reusing popular 2D image quality assessment (IQA) and video quality assessment (VQA) algorithms on the individual views (including the disparity view) of S3D videos. The IQA and VQA models are typically applied either on a frame-by-frame basis or on a view basis to estimate the quality of an S3D video. These studies have concluded that VQA models show better performance than IQA models, and that the use of depth information improves algorithm performance.

Yu et al. [26] proposed an S3D reduced reference (RR) VQA model based on motion vector strength, binocular fusion, and rivalry scores. Hewage and Martini [27] proposed an S3D RR VQA metric based on depth map edges and chrominance information of an S3D view. They computed the PSNR values of edge maps of the disparity map and chrominance map to estimate the quality of an S3D video. Several supervised S3D NR VQA algorithms [28]–[34] have been proposed that are based on spatiotemporal segmentation, spatial structural loss measurement, motion inconsistencies, and intra and inter disparity variations. Yang et al. [35] proposed an S3D supervised NR VQA model based on binocular perception and a multi-view model. They compute spatial texture features and temporal features computed on optical flow. Finally, they estimate the overall quality of an S3D video by pooling the spatial and temporal features using empirical weights. Chen et al. [17] proposed an S3D NR supervised VQA model based on a binocular energy mechanism. They perform auto-regressive prediction based on disparity measurements and estimated natural S3D video statistical model parameters to predict the quality of an S3D video. Jiang et al. [36] proposed an S3D NR supervised VQA model based on tensor decomposed motion feature extraction. They estimate univariate Generalized Gaussian Distribution (UGGD) and asymmetric GGD model parameters, and spatial and spectral entropies, from the tensor decomposition. A random forest classifier is used to predict S3D video quality. None of these supervised S3D VQA algorithms ([17], [28]–[36]) have utilized statistical dependencies between motion and depth components. Appina et al. [37] proposed an S3D NR VQA algorithm based on modeling the joint statistics of subband motion and disparity components of an S3D video. They used the BGGD to model the joint statistics, and computed off-the-shelf 2D NR IQA models on a frame-by-frame basis to estimate spatial quality. Finally, a support vector regressor was applied to estimate the overall quality of an S3D video.

Here we propose a robust and completely blind no reference S3D video quality assessment algorithm that is based on measurements of the joint statistical dependencies that exist between motion and disparity, and on measured motion variation. This new model extends the effort in [37] to the completely blind setting and is described in detail in Section IV.

III. SUBJECTIVE STUDY AND ANALYSIS

This section describes the design and execution of the LFOVIAS3DPh2 stereoscopic video dataset and the subjective experiment.

A. Reference Video Sequences

We selected the pristine S3D video sequences from the publicly available RMIT3D [38] uncompressed video dataset. To the best of our knowledge, these videos have not been used in any other stereoscopic subjective evaluation. The RMIT3D dataset consists of 46 left and right sequences that were captured using a professional stereoscopic camera (Panasonic AG-3DA1). All of the S3D videos in the dataset have full HD (1920×1080) resolution in YUV 422P 10-bit format, of varying durations, in .mov containers. Motivated by the studies in [39], [40], we conducted a pilot subjective study to choose a representative set of pristine S3D videos. Six subjects participated in the preliminary study by rating each video on a scale from 0 to 5 based on their perceptual senses of disparity, spatial activity, and motion information. A score of 0 represented very poor quality while 5 represented excellent quality. The other indices were 1 for poor, 2 for fair, 3 for good and 4 for very good. Based on subjective score agreement on the above properties, we selected 12 reference S3D videos, each clipped to 10 seconds duration. The first frame of each pristine left video is shown in Figure 1.

Fig. 2: Plots of spatial and temporal indices (SI and TI) of pristine videos, and disparity SI and TI (DSI and DTI) indices of corresponding reference S3D videos.

Figure 2 shows the Spatial and Temporal Indices (SI and TI), as well as the Disparity Spatial and Temporal Indices (DSI and DTI), of the reference S3D video sequences [9]. The S3D video SI and TI indices were computed as the mean of the two individual view SI and TI scores.
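As a point of reference, SI and TI are conventionally computed following ITU-T P.910 (Sobel-filtered luma for SI, frame differences for TI). The paper does not list its exact implementation, so the following sketch is only illustrative; DSI and DTI would apply the same operators to the estimated disparity maps instead of luma.

```python
# Illustrative sketch (not the authors' code): SI/TI per ITU-T P.910,
# computed on the luma channel of one view.
import numpy as np
from scipy import ndimage

def si_ti(frames):
    """frames: iterable of 2D float arrays (luma). Returns (SI, TI)."""
    si_vals, ti_vals, prev = [], [], None
    for f in frames:
        gx = ndimage.sobel(f, axis=1)            # horizontal gradient
        gy = ndimage.sobel(f, axis=0)            # vertical gradient
        si_vals.append(np.hypot(gx, gy).std())   # spatial information
        if prev is not None:
            ti_vals.append((f - prev).std())     # temporal information
        prev = f
    return max(si_vals), max(ti_vals)
```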


DSI and DTI were computed from the disparity maps of the reference videos using a simple SSIM based stereo matching algorithm [41]. From these plots, it is evident that the chosen reference videos contain a broad spectrum of spatial, temporal, and disparity information. The alphabet indices in both plots are labels assigned to the reference S3D videos, which are depicted in Figure 1.

We converted all of the reference videos from YUV 422P 10-bit format to YUV 420 8-bit format using ffmpeg [42], to ensure smooth playback on the TV. We performed a sample subjective study [43] to determine whether there were any negative effects on video quality introduced by the format conversion. Eight subjects participated in the study, none of whom were involved in the preliminary study. The subjects were shown the YUV 422P 10-bit and YUV 420 8-bit videos and asked to render opinions regarding perceptual quality degradation. The subjects showed good agreement that there was no quality degradation of the converted videos.

TABLE I: Distortion types (H.264, H.265, Blur, Frame Freeze) and levels (CRF, Radii, Position) of the test videos in the proposed LFOVIAS3DPh2 dataset. Frame freeze entries list dropped frames, with the CRF in parentheses.

S.No. | Distortion (level) | Left | Right || S.No. | Distortion (level) | Left | Right
1 | H.264 (CRF) | 35 | 35 || 13 | H.265 (CRF) | 35 | 35
2 | H.264 (CRF) | 35 | 45 || 14 | H.265 (CRF) | 35 | 45
3 | H.264 (CRF) | 35 | 50 || 15 | H.265 (CRF) | 35 | 50
4 | H.264 (CRF) | 45 | 45 || 16 | H.265 (CRF) | 45 | 45
5 | H.264 (CRF) | 45 | 50 || 17 | H.265 (CRF) | 45 | 50
6 | H.264 (CRF) | 50 | 50 || 18 | H.265 (CRF) | 50 | 50
7 | Blur (radii) | 3 | 3 || 19 | Frame Freeze (frames, CRF) | 5 (45) | 5 (45)
8 | Blur (radii) | 3 | 5 || 20 | Frame Freeze (frames, CRF) | 5 (45) | 7 (45)
9 | Blur (radii) | 3 | 7 || 21 | Frame Freeze (frames, CRF) | 5 (45) | 9 (50)
10 | Blur (radii) | 5 | 5 || 22 | Frame Freeze (frames, CRF) | 7 (45) | 7 (45)
11 | Blur (radii) | 5 | 7 || 23 | Frame Freeze (frames, CRF) | 7 (45) | 9 (50)
12 | Blur (radii) | 7 | 7 || 24 | Frame Freeze (frames, CRF) | 9 (50) | 9 (50)

B. Test Video Sequences

We created 288 distorted test video sequences by introducing H.264 and H.265 compression, blur, and short duration frame freezes. The distortion strengths were designed to cover a wide range of perceptual qualities. Each of the distortions and the corresponding ranges of quality levels are described in Table I.

1) H.264 and H.265 Compression Artifacts: We encoded the pristine left and right views of each S3D video using the H.264 and H.265 encoder libraries available in the ffmpeg application to generate the test stimuli. We were motivated by [44] to use the Constant Rate Factor (CRF) parameter as the quality affecting variable. The CRF setting maintains a nominally constant frame quality and has an inverse relationship with video quality: larger CRF values yield lower quality. Each pristine left and right video was encoded at three different CRF levels to generate a total of 72 symmetric and asymmetric H.264 and H.265 encoded videos from the 12 reference S3D videos.

2) Blur: Blur is a commonly occurring distortion caused by camera defocus, camera motion, object motion, poor lighting, etc. We used the 'box blur' parameter in ffmpeg to create blur distorted videos. We applied three different blur levels on the left and right views of the 12 pristine S3D videos to create 72 blur distorted S3D videos.

3) Frame Freezes: Frame freeze distortion frequently occurs when software renderers fail to decode voluminous video data streams at a specific frame rate. We were motivated by [45], [46] to include the frame freeze distortion in our study. To mimic this distortion, we dropped a sequence of frames and replaced each dropped frame with the immediately previous frame. H.264 encoding was then applied on the frame dropped videos. We considered freeze durations of 5, 7 and 9 dropped frames on each view of the S3D video, and chose CRF = 45 and 50. We created 72 frame freeze distorted videos from the 12 pristine S3D videos. Table II shows the starting position of each frame freeze occurrence in each pristine video.

TABLE II: Starting position (frame number) of each video frame freeze.

Seq. No. | Position || Seq. No. | Position
1 | 31  || 7  | 121
2 | 100 || 8  | 50
3 | 150 || 9  | 170
4 | 81  || 10 | 80
5 | 200 || 11 | 130
6 | 15  || 12 | 70

To facilitate playback on the TV, the left and right views were concatenated side-by-side using ffmpeg and encoded at a very high rate of 200 Mbps. Following the study [45], we used a rate that is at least 20 times higher than the best quality video bitrate (5 Mbps). No scaling was done, as the display could support both Ultra HD (UHD) resolution and the side-by-side views.

The subjective study was conducted in the Lab for Video and Image Analysis (LFOVIA) at the Indian Institute of Technology Hyderabad. We used an LG passive circularly polarized 3D display (LG49UF850T) for video playback. The display has ultra HD resolution and 3D projection based on Film-type patterned retarder (FPR) technology. The other display settings were based on ITU-R BT [47].
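For concreteness, the stimulus generation described above can be sketched with ffmpeg as follows. This is our illustration, not the authors' scripts: the CRF values, blur radii, and 200 Mbps side-by-side rate come from the text above, while the specific filter invocations (boxblur, hstack) are plausible assumptions.

```python
# Illustrative sketch (not the authors' scripts): generating test stimuli
# with ffmpeg. Exact filter settings in the study may differ.
import subprocess

def encode_crf(src, dst, codec="libx264", crf=45):
    # H.264/H.265 compression with CRF as the quality-affecting variable
    # (use codec="libx265" for H.265).
    subprocess.run(["ffmpeg", "-y", "-i", src, "-c:v", codec,
                    "-crf", str(crf), dst], check=True)

def box_blur(src, dst, radius=5):
    # Blur distortion via ffmpeg's boxblur filter.
    subprocess.run(["ffmpeg", "-y", "-i", src,
                    "-vf", f"boxblur={radius}", dst], check=True)

def side_by_side(left, right, dst, bitrate="200M"):
    # Concatenate the left and right views side-by-side for 3D playback,
    # encoded at a very high rate to avoid further quality loss.
    subprocess.run(["ffmpeg", "-y", "-i", left, "-i", right,
                    "-filter_complex", "hstack=inputs=2",
                    "-c:v", "libx264", "-b:v", bitrate, dst], check=True)
```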


C. Subjective Test

We involved 22 naive subjects (13 male and 9 female) in our study, with an average age of 24 years. We conducted a demo session of 3 minutes to familiarize each viewer with the S3D videos and distortion types, and to receive feedback on experienced visual discomfort. The demo session consisted of a collection of videos representative of the types of videos to be shown during the study; these videos were not involved in our subjective study. During this session, two viewers reported that they experienced visual discomfort when viewing the videos. We relieved these two subjects from the experiment and proceeded with the remaining 20 subjects. The study was conducted in two sessions of 30 minutes duration. We stipulated a break between the two sessions of at least 30 hours, to ensure that each subject could adequately recover from any experienced feelings of visual discomfort/fatigue. Since we had to rely on hardware renderers to achieve smooth playback of the S3D videos, incorporating a GUI was not plausible. Therefore, the subjects were trained to call out the scores at the end of each video. The subjective study was conducted using a Single Stimulus method and an Absolute Category Rating scale with hidden reference. As mentioned previously, we used the ITU-R BT [47], [48] ACR scale. Therefore, each subject rated each video as 'Bad,' 'Poor,' 'Fair,' 'Good' or 'Excellent' based on their perception of quality.

Fig. 3: Distribution of DMOS scores: (a) DMOS of all the video sequences, (b) histogram of DMOS, (c) histogram of the standard deviation of DMOS.

D. Subjective Data Analysis

Each of the 20 subjects scored all of the 300 videos (12 pristine + 288 distorted). The subjective scores were processed using the procedure in the ITU-R [47] recommendations. First, we computed difference scores between corresponding test and reference videos:

di_{q_i q_j} = sub^{ref}_{q_i q_j} − sub_{q_i q_j},   (1)

where q_i indicates the subject and q_j indicates the video id, sub^{ref}_{q_i q_j} is the reference score and sub_{q_i q_j} is the distorted score. Observers were discarded if they exhibited a strong shift of votes as compared to average behavior. The ITU-R BT [47] recommendation was followed to remove these outliers from the study. This procedure resulted in four outliers being identified and excluded from further analysis. The final step of subjective processing was the calculation of the DMOS scores. DMOS is calculated by taking the mean of the di_{q_i q_j} scores across all the subjects per video:

DMOS_{q_j} = \frac{1}{Z} \sum_{q_i=1}^{Z} di_{q_i q_j},   (2)

where Z = 16.

Figure 3(a) plots the recorded DMOS across all the distorted videos. Figure 3(b) plots the histogram of the DMOS scores over the entire stereo dataset. Figure 3(c) shows the histogram of the standard deviation of DMOS across all subjects. The average standard error of DMOS was 0.0520 across all videos.

We evaluated the efficacy of our subjective study by examining the internal structure of the dataset. This was done by randomly dividing all of the collected DMOS scores into two halves, where the human subjects associated with the two halves were mutually exclusive. We then computed the Linear Correlation Coefficient (LCC) and Spearman's Rank Order Correlation Coefficient (SROCC) between these two halves. Further, we quantified the statistical consistency of these results by repeating this computation over 100 random divisions, and computed the mean (µ), median (m) and standard deviation (σ) of the LCC and SROCC scores over the 100 splits. These quantities clarify the degree to which the subjects agreed on the video ratings. Table III provides clear evidence of the efficacy of our subjective study based on this analysis of the internal consistency of the obtained data.

IV. OBJECTIVE QUALITY ASSESSMENT

A wide variety of psychovisual experimental studies [49], [50] have been carried out on the mammalian visual cortex to explore disparity selectivity in visual area MT and the dependencies that exist between motion and disparity [51]. These studies have concluded that a large portion of area MT is responsible for disparity processing and that these components exhibit patchy, distributive and directional dependencies. Inspired by these experiments, Potetz and Lee [52] and Liu et al. [1] studied the scene statistics of natural S3D images. They concluded that the luminance and disparity subband coefficients of S3D pictures have sharp peaks and heavy tails that can be modeled using a UGGD. Appina et al. [37], [53], [54] performed a series of experiments on the S3D scene components (spatial, disparity and motion/temporal) of natural S3D images and videos to explore the statistical dependencies that arise among these scene components. They found that S3D scene components exhibit strong dependencies and that these dependencies can be well modeled as following a BGGD. We were motivated by the psychovisual studies [49]–[51] and the S3D scene component statistical studies [1], [37], [52]–[54] to propose a completely blind S3D NR VQA algorithm based on a BGGD model of the joint statistical dependencies between motion and disparity. We describe the proposed algorithm in the following.

Let the multivariate random vector x ∈ R^N follow a Multivariate Generalized Gaussian Distribution (MGGD) with density function given by

p(x | M, α, β) = \frac{1}{|M|^{1/2}} g_{α,β}(x^T M^{-1} x),   (3)

g_{α,β}(y) = \frac{β Γ(N/2)}{(2^{1/β} π α)^{N/2} Γ(N/(2β))} e^{-\frac{1}{2}(y/α)^{β}},   (4)

where M is an N × N covariance matrix, α is a scale parameter, β is a shape parameter, and g_{α,β}(·) is the density generator.
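To make (3)–(4) concrete, the following sketch (ours, not the authors' code) numerically evaluates the BGGD density for the N = 2 case given (α, β, M); in the proposed method these parameters are instead estimated from data via MLE, as described in the sequel.

```python
# Illustrative sketch: evaluating the MGGD/BGGD density of (3)-(4)
# for N = 2 with given parameters.
import numpy as np
from scipy.special import gammaln

def bggd_pdf(x, M, alpha, beta):
    """x: (..., 2) points; M: 2x2 covariance; returns density values."""
    N = M.shape[0]
    y = np.einsum('...i,ij,...j->...', x, np.linalg.inv(M), x)  # x^T M^-1 x
    # log of the density generator g_{alpha,beta}(y) from (4)
    log_g = (np.log(beta) + gammaln(N / 2)
             - (N / 2) * (np.log(np.pi * alpha) + np.log(2) / beta)
             - gammaln(N / (2 * beta))
             - 0.5 * (y / alpha) ** beta)
    return np.exp(log_g - 0.5 * np.log(np.linalg.det(M)))

M = np.array([[1.0, 0.6], [0.6, 1.0]])
print(bggd_pdf(np.array([0.2, -0.1]), M, alpha=1.0, beta=0.5))
```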


TABLE III: Mean (µ), median (m) and standard deviation (σ) of LCC and SROCC over 100 inter-subject trials across different distortions of the proposed dataset.

Score | H.264 (LCC, SROCC) | H.265 (LCC, SROCC) | Blur (LCC, SROCC) | Frame freeze (LCC, SROCC) | Overall dataset (LCC, SROCC)
µ | 0.9408, 0.9417 | 0.9357, 0.9304 | 0.9368, 0.9268 | 0.9388, 0.9382 | 0.9424, 0.9429
m | 0.9537, 0.9535 | 0.9429, 0.9322 | 0.9475, 0.9473 | 0.9472, 0.9431 | 0.9489, 0.9488
σ | 0.0564, 0.0536 | 0.0502, 0.0521 | 0.0509, 0.0519 | 0.0528, 0.0516 | 0.0445, 0.0447
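The split-half consistency analysis underlying Table III can be sketched as follows (our illustration with stand-in random scores, not the study's ratings; the actual analysis uses the outlier-cleaned subject scores):

```python
# Illustrative sketch: split-half inter-subject consistency (Table III).
# 'scores' is a (subjects x videos) array; random stand-in data here.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
scores = rng.normal(size=(16, 288))          # stand-in: 16 subjects, 288 videos

lcc, srocc = [], []
for _ in range(100):
    perm = rng.permutation(scores.shape[0])  # mutually exclusive halves
    a = scores[perm[:8]].mean(axis=0)        # mean rating, first half
    b = scores[perm[8:]].mean(axis=0)        # mean rating, second half
    lcc.append(pearsonr(a, b)[0])
    srocc.append(spearmanr(a, b)[0])

print(np.mean(lcc), np.median(lcc), np.std(lcc))
print(np.mean(srocc), np.median(srocc), np.std(srocc))
```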

Fig. 4: Illustration of α, β, Ψ and NIQE frame-wise scores and scatter plots of the pristine 'Domain Parklands' S3D video and its H.264 compressed versions (CL and CR represent the CRF compression rates of the left and right views). (a) Reference left view: DMOS = 0, α = 4 × 10⁻⁹, β = 0.1926, Ψ = 0.9377, NIQE = 3.8911. (b) H.264 compressed left view: DMOS = 2.9, α = 8.6 × 10⁻⁵, β = 0.3948, Ψ = 0.9459, NIQE = 6.1848. (c) H.265 compressed left view: DMOS = 2.7, α = 2.4 × 10⁻⁴, β = 0.4654, Ψ = 0.8710, NIQE = 5.7658. (d) Blur distorted left view: DMOS = 3, α = 3.5 × 10⁻⁸, β = 0.2129, Ψ = 0.9869, NIQE = 8.3790. (e)–(h) Frame-wise α, β, Ψ and NIQE scores of the reference and H.264 compressed S3D videos. (i)–(l) Scatter plots of the α, β, Ψ and NIQE scores of the reference and H.264 compressed S3D videos.

We utilized the popular Maximum Likelihood Estimation (MLE) method [55] to compute the parameters α, β and M of the BGGD.

In our model, motion and disparity provide the primary features, and this results in N = 2. Therefore, the multivariate GGD becomes a bivariate GGD (BGGD). The BGGD model parameters α and β, and the coherence score (Ψ), are used for quality prediction. The coherence score is defined as:

Ψ = \frac{(λ_{max} − λ_{min})^2}{(λ_{max} + λ_{min})^2},   (5)

where λ_{max} and λ_{min} represent the maximum and minimum eigenvalues of M. These eigenvalues are capable of accurately capturing directional dependencies between the disparity and motion components. We were motivated by [56] to compute the Ψ scores in the form (5) from the eigenvalues of M. We decompose the motion vector and disparity maps at multiple scales (3 scales) and at multiple orientations (0°, 30°, 60°, 90°, 120°, 150°) using the steerable pyramid decomposition [57]. Motion vectors and disparity maps are computed on a frame-by-frame basis in our analysis. Specifically, we employ the three-step search [58] to estimate the motion vectors and a SSIM based algorithm [41] to conduct disparity estimation. Each of the corresponding motion and disparity subbands is jointly modeled using a BGGD.
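A minimal sketch of the coherence computation in (5) follows, using the sample covariance of paired motion and disparity subband coefficients (in the proposed method, M is obtained from the BGGD fit of each subband pair):

```python
# Illustrative sketch: coherence score (5) from the 2x2 covariance of
# corresponding motion and disparity subband coefficients.
import numpy as np

def coherence(motion_band, disparity_band):
    """motion_band, disparity_band: 2D subband arrays of equal size."""
    samples = np.stack([motion_band.ravel(), disparity_band.ravel()])
    M = np.cov(samples)                  # 2x2 sample covariance matrix
    lam = np.linalg.eigvalsh(M)          # eigenvalues in ascending order
    lam_min, lam_max = lam[0], lam[-1]
    return ((lam_max - lam_min) / (lam_max + lam_min)) ** 2
```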


A. Distortion and Quality Discrimination

Figure 4(a) shows the 150th frame of the left video of the 'Domain Parklands' S3D video, and Figures 4(b), 4(c) and 4(d) show distorted versions of the same frame. The distortions are H.264 (CRF = 50), H.265 (CRF = 50) and Blur (radii = 7), respectively. Figures 4(e), 4(f) and 4(g) show the frame-wise α, β and Ψ scores of the 'Domain Parklands' pristine S3D video and its H.264 compressed versions, respectively. Figures 4(i), 4(j) and 4(k) show scatter plots of the α, β and Ψ scores of the same reference and distorted S3D video sequences. From the plots, it is clear that the features follow a number of trends: 1) the features are able to clearly discriminate videos having large perceptual quality differences, e.g., the computed BGGD features (α, β and Ψ) of the S3D video compressed at (CL = 35, CR = 35) significantly differ from those of the S3D video compressed at (CL = 50, CR = 50); 2) the features take similar values on videos that are perceptually similar in quality. For example, the BGGD features computed on the S3D videos compressed at (CL = 45, CR = 45), (CL = 45, CR = 50) and at (CL = 50, CR = 50) yield similar feature values. These observations further motivate us to use them as quality features in the proposed MoDi3D algorithm. The plots in Figure 4 correspond to the first scale, 0° orientation of the steerable pyramid decomposition, and the plots use the negative logarithmic scores of all features for better visualization. The x-axis represents the frame sequence number of the S3D video set. Additionally, we show the frame-wise average NIQE scores and scatter plots of the average NIQE scores of the left and right views of the 'Domain Parklands' pristine S3D video and its H.264 compressed versions in Figures 4(h) and 4(l), respectively. The plots clearly show quality variations with respect to the distortion levels.

B. Proposed Method

Fig. 5: Flowchart of the proposed MoDi3D algorithm. Motion and disparity features are extracted from the test S3D video; pristine MVG model parameters are learned from the RMIT S3D pristine videos [38]; the test feature sets and ∆ scores feed the MoDi2D computation, which is combined with spatial NIQE scores in the final MoDi3D computation.

The flowchart of the proposed algorithm is shown in Figure 5. The proposed algorithm has four stages. The first stage computes the motion and disparity features of an S3D video. The second stage performs the MoDi2D score computation. In the third stage, we evaluate the NIQE model on the individual views of an S3D video to compute the spatial features. In the last stage, we compute the MoDi3D score of an S3D video. We describe these stages next.

1) Motion and Disparity Feature Extraction:
• Motion Feature Set: In our model, we use the motion vector map of the left view of an S3D video. The motion vectors are computed using the three-step search motion estimation algorithm [58] with a macroblock size of 8×8. The magnitude of the motion vector is used as a motion feature in our algorithm:

T_t = \sqrt{T_H^2 + T_V^2},   (6)

where T_t represents the motion vector strength, and T_H and T_V are the horizontal and vertical motion vector components, respectively.
• Disparity Feature Set: The computation of the disparity map is complex and sensitive to distortion. We chose a SSIM based stereo matching algorithm [41] to compute the disparity, based on the trade-off between time complexity and accuracy. The algorithm finds, for each block of the left view, the best matching block in the right view to estimate the disparity, and the maximum pixel disparity was limited to 30. We compute the disparity map for a given S3D pair on a frame-wise basis.

The steerable pyramid decomposition was performed on the motion and disparity maps at multiple scales and multiple orientations. Since the motion vectors were estimated using a block size of 8 × 8, we downsampled the subbands of the disparity map to the same size by averaging over 8 × 8 blocks.

2) Spatial Feature Extraction: NIQE is an 'opinion unaware' and 'distortion unaware', i.e., 'completely blind', 2D NR IQA model. We computed the NIQE [59] scores on a frame-by-frame basis on both views and calculated the mean value of all frame level scores to estimate the spatial quality of each S3D video:

S = \frac{1}{Q} \sum_{j=1}^{Q} \frac{NIQE_j^L + NIQE_j^R}{2},   (7)

where j represents the frame number and Q represents the total number of S3D video frames. L and R represent the left and right views of an S3D video, NIQE^L and NIQE^R represent the frame level NIQE scores of the left and right views, and S denotes the overall spatial quality of an S3D video.

3) MoDi2D Computation: As stated previously, the motion vector maps and disparity maps of an S3D video are decomposed at three scales and six orientations using the steerable pyramid decomposition. We estimate the BGGD model parameters (α, β) and coherence score (Ψ) at each subband of an S3D view of a video, denoted as:

f^α = [α_{ji}], f^β = [β_{ji}], f^Ψ = [Ψ_{ji}],   (8)

where i represents the subband level (1 ≤ i ≤ 18). The total number of motion vector maps computed in an S3D video is Q − 1; therefore, 1 ≤ j ≤ Q − 1. f^α, f^β and f^Ψ are video level feature sets of the α, β and Ψ scores, respectively.
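The spatial pooling of (7) is straightforward; a sketch follows, where niqe_frame stands in for any per-frame NIQE implementation (assumed available; NIQE itself is defined in [59]):

```python
# Illustrative sketch: spatial quality pooling of (7). The per-frame
# NIQE scorer is assumed to be provided by an external implementation.
import numpy as np

def spatial_quality(left_frames, right_frames, niqe_frame):
    """left_frames, right_frames: lists of grayscale frames;
    niqe_frame: callable returning a NIQE score for one frame."""
    per_frame = [(niqe_frame(l) + niqe_frame(r)) / 2.0
                 for l, r in zip(left_frames, right_frames)]
    return float(np.mean(per_frame))     # S in (7)
```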


Fig. 6: Illustration of frame-wise logarithmic score distributions of the likelihood estimates of pristine and H.264 compressed versions of the 'Domain Parklands' S3D video (CL and CR represent the CRF compression rates of the left and right views): (a) L_j(f^{α^d}; ν_{α^p}, Σ_{α^p}), (b) L_j(f^{β^d}; ν_{β^p}, Σ_{β^p}), (c) L_j(f^{Ψ^d}; ν_{Ψ^p}, Σ_{Ψ^p}).

• Pristine Multivariate Gaussian Models: We used reference video sequences from the RMIT S3D video dataset [38] to estimate the parameters of the pristine MVG model. We excluded the videos which we used as pristine S3D sequences in our subjective study. Specifically, we used the 34 remaining uncompressed RMIT S3D videos as the reference set in our objective evaluations.
We estimate the BGGD model parameters α and β and the Ψ scores over all subbands of the motion vector and disparity maps from the 34 pristine S3D video sequences, and create three individual feature sets from the α, β and Ψ scores of the reference S3D video set:

f^{α^p} = [α^p_{ji}], f^{β^p} = [β^p_{ji}], f^{Ψ^p} = [Ψ^p_{ji}]; 1 ≤ p ≤ P,   (9)

where p indexes the pristine S3D videos and P represents the total number of pristine videos (P = 34).
As in NIQE [59], the pristine S3D video parameter sets (f^{α^p}, f^{β^p} and f^{Ψ^p}) are modeled using a Multivariate Gaussian (MVG) distribution denoted by N(ν, Σ), where ν and Σ are the mean vector and covariance matrix, respectively. Specifically, the means (ν_{α^p}, ν_{β^p}, ν_{Ψ^p}) and covariances (Σ_{α^p}, Σ_{β^p}, Σ_{Ψ^p}) correspond to the f^{α^p}, f^{β^p} and f^{Ψ^p} sets, respectively.
• Distorted Feature Set: We estimate the BGGD parameters α and β and the Ψ scores over all the subbands of the frame-wise motion vectors and disparity maps of each distorted S3D video. The feature sets of a distorted S3D video are:

f^{α^d} = [α^d_{ji}], f^{β^d} = [β^d_{ji}], f^{Ψ^d} = [Ψ^d_{ji}],   (10)

where the superscript d indicates that the S3D video is distorted.
To check whether a given test video frame is pristine or distorted, we evaluate the likelihood of its parameters coming from the pristine MVG distribution. This is evaluated as follows:

L_j(f^{α^d}; ν_{α^p}, Σ_{α^p}) = L(α^d_{ji}; ν_{α^p}, Σ_{α^p}),   (11)
L_j(f^{β^d}; ν_{β^p}, Σ_{β^p}) = L(β^d_{ji}; ν_{β^p}, Σ_{β^p}),   (12)
L_j(f^{Ψ^d}; ν_{Ψ^p}, Σ_{Ψ^p}) = L(Ψ^d_{ji}; ν_{Ψ^p}, Σ_{Ψ^p}),   (13)

where L is the likelihood. The likelihood estimate is computed on a frame level set (a 1 × 18 vector) of each feature, and it is a single value per frame for each feature. L_j(f^{α^d}; ν_{α^p}, Σ_{α^p}), L_j(f^{β^d}; ν_{β^p}, Σ_{β^p}) and L_j(f^{Ψ^d}; ν_{Ψ^p}, Σ_{Ψ^p}) represent the frame level likelihood estimates of each feature type of a test S3D view. Figure 6 shows the logarithmic scores of the likelihood estimates of the pristine and H.264 compressed versions of the 'Domain Parklands' S3D video. It is clear that the likelihood estimates of each feature set vary with respect to perceptual quality.

Fig. 7: Illustration of frame difference SSIM scores of the pristine 'Domain Parklands' S3D video and corresponding H.264 compressed versions of it: (a) SSIM scores computed between successive frames of videos at various quality levels; (b) scatter plot of SSIM scores computed between successive frames of videos at various quality levels.

Next, we compute the video level likelihood estimates by averaging the frame level estimates as follows:

γ^α = \frac{1}{Q−1} \sum_{j=1}^{Q−1} L_j(f^{α^d}; ν_{α^p}, Σ_{α^p}),   (14)
γ^β = \frac{1}{Q−1} \sum_{j=1}^{Q−1} L_j(f^{β^d}; ν_{β^p}, Σ_{β^p}),   (15)
γ^Ψ = \frac{1}{Q−1} \sum_{j=1}^{Q−1} L_j(f^{Ψ^d}; ν_{Ψ^p}, Σ_{Ψ^p}),   (16)

where γ^α, γ^β and γ^Ψ denote the mean values of the frame level likelihood estimation scores of the individual features.
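A compact sketch of the likelihood evaluations (11)–(16) follows (our illustration with stand-in arrays; in practice, the pristine corpus holds the 1 × 18 BGGD feature vectors from the 34 reference videos):

```python
# Illustrative sketch: fit the pristine MVG model and average frame-level
# likelihoods as in (11)-(16). 'pristine' and 'test' hold 1x18 per-frame
# feature vectors (e.g., alpha features); the data here is a stand-in.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
pristine = rng.normal(size=(500, 18))   # frames x subbands, pristine corpus
test = rng.normal(size=(299, 18))       # Q-1 frames of one test video

nu = pristine.mean(axis=0)              # MVG mean vector
Sigma = np.cov(pristine, rowvar=False)  # MVG covariance matrix

mvg = multivariate_normal(mean=nu, cov=Sigma, allow_singular=True)
frame_likelihoods = mvg.pdf(test)       # L_j for each frame, as in (11)
gamma_alpha = frame_likelihoods.mean()  # video-level estimate, as in (14)
```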

It may be observed that the α, β and Ψ scores equally affect the overall quality computation. Therefore, we compute the sum of pairwise products of the likelihood estimate scores γ^α, γ^β and γ^Ψ as

γ = −log(γ^α γ^β + γ^β γ^Ψ + γ^α γ^Ψ),   (17)

where γ represents the overall departure of a distorted video's statistics with respect to the pristine model.
• Global Motion Strength: We compute the SSIM score between successive frames of the left and right views to measure the degree of frame level motion variation of an S3D video:

∆_j^L = SSIM(w_j^L, w_{j−1}^L),   (18)
∆_j^R = SSIM(w_j^R, w_{j−1}^R),   (19)

where w_j and w_{j−1} are successive frames, and ∆_j^L and ∆_j^R denote the frame level motion variation of the left and right views, respectively. Finally, we compute the mean of the ∆_j^L and ∆_j^R scores from both views to estimate the global motion strength of an S3D video:

∆ = \frac{1}{2(Q−1)} \sum_{j=1}^{Q−1} (∆_j^L + ∆_j^R),   (20)

where ∆ represents the global motion strength of an S3D video. Figure 7 shows scatter plots of the frame-wise ∆ scores of the 'Domain Parklands' pristine and corresponding H.264 compressed S3D videos. It is clear that the computed ∆ scores are quality sensitive, and this observation motivated us to use this feature in our proposed S3D video quality prediction model.
We compute the product of the ∆ and γ scores to measure the joint quality of the motion and disparity components of each test S3D video:

MoDi2D = log(γ × ∆),   (21)

where MoDi2D is the quality estimated using the motion and disparity components. Since the ∆ and γ values have significantly different numerical ranges, we apply the logarithm to their product.

TABLE IV: LCC and SROCC scores of individual and overall pooling of MoDi2D features on the proposed LFOVIAS3DPh2 S3D video dataset.

      | γ^α    | γ^β    | γ^Ψ    | ∆      | MoDi2D
LCC   | 0.3523 | 0.1912 | 0.2403 | 0.2392 | 0.5785
SROCC | 0.3410 | 0.1370 | 0.2082 | 0.2252 | 0.5408

Table IV shows the efficacies of the individual features γ^α, γ^β, γ^Ψ and ∆, and of the proposed MoDi2D pooling, in terms of LCC and SROCC scores over the LFOVIAS3DPh2 S3D video dataset. It is clear that each feature contributes significantly to quality estimation.

4) Overall Quality Computation: The spatial feature S and the MoDi2D scores increase with the distortion strength of an S3D video, and these scores jointly impact the overall quality estimate of an S3D video. So, we compute the product of the spatial and MoDi2D scores to estimate the MoDi3D score of a test S3D video as

MoDi3D = MoDi2D × S.   (22)

We discuss the performance of MoDi3D next.

V. RESULTS AND DISCUSSION

The performance of the proposed MoDi3D objective algorithm was evaluated on the following datasets: the IRCCYN S3D video dataset [11], the WaterlooIVC Phase I dataset [14], the LFOVIA dataset [15], and our proposed LFOVIAS3DPh2 S3D video dataset.

The IRCCYN dataset [11] contains 10 full HD (1920 × 1080) pristine video sequences, and the videos are saved in .avi containers. The video sequences are of durations 16 sec. or 13 sec., with a frame rate of 25 fps. The pristine videos were subjected to H.264 encoding and JP2K compression to create the distorted videos. They used the JM reference software to apply the H.264 compression artifacts on the pristine left and right views, and varied the quantization parameter (QP = 32, 38, 44) to create quality variations. The JP2K compression artifacts were applied on a frame by frame basis on each view, and the bitrate (2, 8, 16, 32 Mb/s) was used as a quality affecting parameter. These distortions were symmetrically applied on each view of an S3D video.

The WaterlooIVC Phase I dataset [14] is a combination of natural and synthetic (computer generated or animated) S3D videos. The dataset has 4 pristine and 176 test S3D video sequences. The test video sequences were generated by applying blur on HEVC compressed videos. The video sequences have a resolution of 1024 × 768. The duration of each video varied from 6 sec. to 10 sec. at different frame rates. We used only the natural S3D videos in our analysis.

The LFOVIA dataset [15] has H.264 compressed stereoscopic video sequences. The dataset contains 6 pristine and 144 distorted video sequences. The compression artifacts were introduced using ffmpeg, with bitrate (100, 200, 350, 1200 Kbps) as the quality variation parameter. The video sequences are of resolution 1836 × 1056 pixels, with a frame rate of 25 fps and a duration of 10 sec. This dataset is a combination of symmetric and asymmetric stereoscopic video sequences. The LFOVIAS3DPh2 S3D video dataset details are explained in Section III.

The performance of the proposed MoDi3D algorithm is measured using the LCC, SROCC and the Root Mean Square Error (RMSE). LCC indicates the linear dependence between two quantities. SROCC measures the monotonic relationship between two input sets. RMSE measures the magnitude of the error between the estimated objective scores and the subjective DMOS scores. Higher LCC and SROCC values indicate good agreement between subjective and objective measures, and lower RMSE signifies more accurate prediction performance. All performance results were computed after performing a non-linear logistic fit. We followed the standard procedure recommended by the Video Quality Experts Group (VQEG) [74] to perform the non-linear regression using a 4-parameter logistic transform

f(ζ) = \frac{τ_1 − τ_2}{1 + \exp\left(\frac{ζ − τ_3}{|τ_4|}\right)} + τ_2,   (23)

where ζ denotes the raw objective score, and τ_1, τ_2, τ_3 and τ_4 are the free parameters selected to provide the best fit of the predicted scores to the DMOS values.
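For reproducibility, a sketch of the 4-parameter logistic fit (23) and the evaluation criteria follows (ours, using stand-in score arrays; the sign convention inside the exponential follows the form shown in (23)):

```python
# Illustrative sketch: the 4-parameter logistic fit of (23) and the
# LCC/SROCC/RMSE criteria, using stand-in data.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic4(zeta, t1, t2, t3, t4):
    return (t1 - t2) / (1.0 + np.exp((zeta - t3) / abs(t4))) + t2

rng = np.random.default_rng(2)
objective = rng.uniform(0, 10, 288)                   # raw objective scores
dmos = logistic4(objective, 0, 5, 5, 1) + rng.normal(0, 0.2, 288)

params, _ = curve_fit(logistic4, objective, dmos,
                      p0=[dmos.min(), dmos.max(), np.median(objective), 1.0],
                      maxfev=10000)
fitted = logistic4(objective, *params)

lcc = pearsonr(fitted, dmos)[0]
srocc = spearmanr(objective, dmos)[0]                 # rank-based, fit-free
rmse = np.sqrt(np.mean((fitted - dmos) ** 2))
```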


TABLE V: 2D & 3D I/VQA performance evaluation (LCC, SROCC and RMSE) on the IRCCYN S3D video dataset [11] (Bold indicates the best performance numbers and italic indicates the proposed algorithm's performance numbers).

Type | Algorithm | H.264: LCC, SROCC, RMSE | JP2K: LCC, SROCC, RMSE | Overall: LCC, SROCC, RMSE
2D FR IQA | SSIM [60] | 0.5786, 0.5464, 0.9223 | 0.6714, 0.5942, 0.9215 | 0.4754, 0.2465, 1.1898
2D FR IQA | MS-SSIM [61] | 0.7885, 0.6673, 0.6955 | 0.9439, 0.9299, 0.4327 | 0.8506, 0.8534, 0.5512
2D FR IQA | VIF [62] | 0.8950, 0.8723, 0.4774 | 0.9504, 0.9112, 0.3929 | 0.8914, 0.8652, 0.6217
2D NR IQA | BRISQUE [63] | 0.7915, 0.7637, 0.7912 | 0.8048, 0.8999, 0.5687 | 0.7535, 0.8145, 0.6535
2D NR IQA | NIQE [59] | 0.6403, 0.6617, 0.8686 | 0.8808, 0.7240, 0.6206 | 0.5524, 0.4183, 1.0326
2D FR VQA | STMAD [64] | 0.7641, 0.7354, 0.7296 | 0.8388, 0.7236, 0.7136 | 0.6400, 0.3495, 0.9518
2D FR VQA | VQM [65] | 0.8097, 0.7715, 0.6793 | 0.8352, 0.8021, 0.6851 | 0.7242, 0.7020, 0.7833
3D FR IQA | Chen et al. [41] | 0.6620, 0.5720, 0.6915 | 0.8817, 0.8724, 0.6182 | 0.7980, 0.7861, 0.7464
3D FR IQA | STRIQE [66] | 0.7913, 0.7167, 0.8433 | 0.9017, 0.8175, 0.5666 | 0.7931, 0.7734, 0.7544
3D FR VQA | FLOSIM3D [67] | 0.9589, 0.9478, 0.3863 | 0.9738, 0.9548, 0.2976 | 0.9178, 0.9111, 0.4918
3D FR VQA | Chen3D [67] | 0.7963, 0.8035, 2.5835 | 0.9358, 0.8884, 3.2863 | 0.8227, 0.8201, 2.9763
3D FR VQA | STRIQE3D [67] | 0.6836, 0.6263, 2.3683 | 0.8778, 0.8513, 3.2121 | 0.7599, 0.7525, 2.8374
3D FR VQA | PQM [68] | -, -, - | -, -, - | 0.6340, 0.6006, 0.8784
3D FR VQA | PHVS-3D [69] | -, -, - | -, -, - | 0.5480, 0.5146, 0.9501
3D FR VQA | 3D-STS [70] | -, -, - | -, -, - | 0.6417, 0.6214, 0.9067
3D FR VQA | SJND-SVA [71] | 0.5834, 0.6810, 0.6672 | 0.8062, 0.6901, 0.5079 | 0.6503, 0.6229, 0.8629
3D FR VQA | 3-D-PQI [72] | 0.9306, 0.9239, - | 0.9413, 0.9266, - | 0.9009, 0.8848, -
3D FR VQA | DeMo3D [54] | 0.9161, 0.9009, 0.4564 | 0.9505, 0.9326, 0.4074 | 0.9272, 0.9188, 0.4561
3D NR VQA (Supervised) | Yang et al. [35] | -, -, - | -, -, - | 0.8949, 0.8552, 0.4929
3D NR VQA (Supervised) | BSVQE [17] | 0.9168, 0.8857, - | 0.8953, 0.8383, - | 0.9239, 0.9086, -
3D NR VQA (Supervised) | MNSVQM [36] | 0.8850, 0.7714, 0.4675 | 0.9706, 0.8982, 0.2769 | 0.8611, 0.8394, 0.5634
3D NR VQA (Supervised) | VQUEMODES [37] | 0.9594, 0.9439, 0.1791 | 0.9859, 0.9666, 0.1210 | 0.9697, 0.9637, 0.2635
3D NR VQA (Unsupervised) | MoDi2D | 0.4269, 0.3937, 1.2428 | 0.6201, 0.5559, 1.0301 | 0.4848, 0.4195, 1.1433
3D NR VQA (Unsupervised) | MoDi3D | 0.6657, 0.6935, 0.8999 | 0.8991, 0.8540, 0.5971 | 0.6060, 0.6233, 0.9853

TABLE VI: Performances of objective metrics, including the proposed 'completely blind' MoDi2D and MoDi3D models, in terms of LCC on the proposed LFOVIAS3DPh2 S3D dataset (Bold indicates the best performance numbers and italic indicates the proposed algorithm's performance numbers).

Algorithm | H.264: Symm, Asymm, All | H.265: Symm, Asymm, All | Blur: Symm, Asymm, All | Frame Freeze: Symm, Asymm, All | All: Symm, Asymm, All
SSIM [60] 0.857 0.737 0.816 0.863 0.743 0.812 0.709 0.624 0.651 0.784 0.839 0.801 0.803 0.660 0.735
MS-SSIM [61] 0.925 0.838 0.901 0.914 0.771 0.873 0.892 0.749 0.802 0.891 0.913 0.897 0.885 0.716 0.819
VIF [62] 0.908 0.798 0.874 0.889 0.710 0.822 0.896 0.758 0.813 0.828 0.816 0.820 0.879 0.701 0.816
NIQE [59] 0.734 0.457 0.646 0.649 0.530 0.540 0.417 0.332 0.351 0.724 0.573 0.648 0.641 0.501 0.578
VQM [65] 0.921 0.815 0.887 0.929 0.810 0.907 0.918 0.841 0.785 0.870 0.781 0.815 0.868 0.799 0.837
STRIQE [66] 0.902 0.835 0.861 0.894 0.830 0.850 0.806 0.632 0.804 0.752 0.802 0.774 0.746 0.586 0.677
FI-PSNR [73] 0.776 0.719 0.722 0.756 0.689 0.677 0.514 0.443 0.464 0.788 0.765 0.717 0.688 0.648 0.660
VQUEMODES [37] 0.968 0.868 0.886 0.943 0.813 0.866 0.789 0.690 0.706 0.893 0.831 0.827 0.887 0.856 0.878
MoDi2D 0.584 0.443 0.512 0.773 0.651 0.699 0.386 0.363 0.370 0.457 0.636 0.627 0.607 0.532 0.578
MoDi3D 0.829 0.692 0.720 0.820 0.721 0.761 0.478 0.420 0.432 0.846 0.874 0.839 0.740 0.669 0.699

TABLE VII: Performances of objective metrics, including the proposed 'completely blind' MoDi2D and MoDi3D models, in terms of SROCC on the proposed LFOVIAS3DPh2 S3D dataset (Bold indicates the best performance numbers and italic indicates the proposed algorithm's performance numbers).

Algorithm | H.264: Symm, Asymm, All | H.265: Symm, Asymm, All | Blur: Symm, Asymm, All | Frame Freeze: Symm, Asymm, All | All: Symm, Asymm, All
SSIM [60] 0.831 0.664 0.795 0.836 0.671 0.798 0.636 0.269 0.480 0.771 0.851 0.807 0.744 0.585 0.682
MS-SSIM [61] 0.925 0.813 0.895 0.886 0.725 0.857 0.849 0.444 0.706 0.892 0.899 0.898 0.864 0.638 0.778
VIF [62] 0.922 0.784 0.875 0.873 0.652 0.810 0.887 0.522 0.781 0.805 0.800 0.807 0.862 0.652 0.784
NIQE [59] 0.771 0.440 0.667 0.561 0.305 0.388 0.413 0.264 0.349 0.640 0.388 0.445 0.559 0.443 0.501
VQM [65] 0.892 0.827 0.896 0.880 0.845 0.801 0.859 0.825 0.815 0.871 0.757 0.821 0.841 0.780 0.803
STRIQE [66] 0.879 0.757 0.851 0.858 0.688 0.836 0.759 0.407 0.663 0.673 0.616 0.632 0.705 0.532 0.652
FI-PSNR [73] 0.753 0.726 0.723 0.726 0.687 0.622 0.433 0.392 0.398 0.786 0.763 0.768 0.655 0.603 0.611
VQUEMODES [37] 0.879 0.787 0.864 0.851 0.696 0.825 0.700 0.592 0.606 0.792 0.767 0.772 0.857 0.835 0.839
MoDi2D 0.556 0.445 0.510 0.698 0.651 0.602 0.336 0.279 0.300 0.375 0.561 0.547 0.592 0.506 0.540
MoDi3D 0.778 0.582 0.687 0.746 0.662 0.671 0.435 0.357 0.396 0.660 0.636 0.627 0.682 0.593 0.661


TABLE VIII: Performances of objective metrics, including the proposed 'completely blind' MoDi2D and MoDi3D models, in terms of RMSE on the proposed LFOVIAS3DPh2 S3D dataset (Bold indicates the best performance numbers and italic indicates the proposed algorithm's performance numbers).

Algorithm | H.264: Symm, Asymm, All | H.265: Symm, Asymm, All | Blur: Symm, Asymm, All | Frame Freeze: Symm, Asymm, All | All: Symm, Asymm, All
SSIM [60] 0.564 0.462 0.528 0.537 0.459 0.521 0.584 0.446 0.546 0.434 0.331 0.393 0.596 0.557 0.596
MS-SSIM [61] 0.416 0.680 0.396 0.430 0.566 0.436 0.373 0.499 0.491 0.317 0.247 0.290 0.464 0.517 0.505
VIF [62] 0.459 0.412 0.444 0.486 0.483 0.510 0.367 0.464 0.444 0.391 0.352 0.375 0.476 0.529 0.508
NIQE [59] 0.771 0.609 0.698 0.809 0.581 0.753 0.688 0.509 0.627 0.482 0.499 0.500 0.768 0.642 0.718
VQM [65] 0.425 0.397 0.422 0.392 0.359 0.376 0.338 0.308 0.411 0.344 0.380 0.381 0.496 0.446 0.480
STRIQE [66] 0.473 0.376 0.464 0.476 0.382 0.471 0.490 0.484 0.533 0.461 0.364 0.416 0.665 0.601 0.647
FI-PSNR [73] 0.527 0.476 0.514 0.549 0.598 0.659 0.512 0.711 0.638 0.508 0.529 0.522 0.551 0.574 0.545
VQUEMODES [37] 0.400 0.302 0.355 0.311 0.344 0.395 0.451 0.383 0.461 0.377 0.277 0.350 0.442 0.379 0.444
MoDi2D 0.911 0.603 0.789 0.674 0.523 0.653 0.803 0.502 0.712 0.549 0.400 0.486 0.848 0.601 0.743
MoDi3D 0.662 0.562 0.672 0.595 0.578 0.642 0.632 0.478 0.566 0.385 0.372 0.393 0.677 0.587 0.657

TABLE IX: 2D & 3D I/VQA performance evaluation on the LFOVIA and WaterlooIVC Phase I S3D video datasets (Bold indicates the best performance numbers and italic indicates the proposed algorithm's performance numbers).

Algorithm | LFOVIA: LCC, SROCC, RMSE | WaterlooIVC: LCC, SROCC, RMSE
SSIM [60] 0.8816 0.8828 6.1104 0.3964 0.2872 20.1010
MS-SSIM [61] 0.8172 0.7888 8.9467 0.4072 0.2969 19.9969
VIF [62] 0.7321 0.6654 9.7885 0.7912 0.6321 13.3905
VQM [65] 0.8651 0.8552 7.0123 0.7582 0.7081 10.8915
VQUEMODES [37] 0.8943 0.8890 5.9124 0.8519 0.8266 7.1526
MoDi2D 0.4406 0.4037 19.1831 0.3900 0.3792 21.0021
MoDi3D 0.6759 0.6552 9.5929 0.4834 0.4265 18.1095

Fig. 8: Scatter plots of spatial NIQE scores, MoDi2D and MoDi3D objective scores versus DMOS: (a) NIQE scores, (b) MoDi2D, (c) MoDi3D.

TABLE X: Statistical analysis of algorithm performance. A value of '1' indicates that the row (algorithm) is statistically better than the column (algorithm), and vice-versa for a value of '0'. In each entry, the first four symbols correspond to the four distortions, while the last symbol represents the entire dataset.
Metric SSIM MS-SSIM VIF NIQE VQM STRIQE FI-PSNR VQUEMODES MoDi3D
SSIM ----- 00000 00000 11111 00000 00010 11111 00010 11111
MS-SSIM 11111 ----- 11010 11111 01010 11111 11111 11110 11111
VIF 11111 00100 ----- 11111 01000 11111 11111 10110 11111
NIQE 00000 00000 00000 ----- 00000 00000 00000 00000 00000
VQM 11111 10101 10111 11111 ----- 10111 11111 10111 11111
STRIQE 11101 00000 00000 11111 01000 ----- 11101 00100 11110
FI-PSNR 00000 00000 00000 11111 00000 00010 ----- 00000 10010
VQUEMODES 11101 00001 01001 11111 01000 11011 11111 ----- 11111
MoDi3D 00000 00000 00000 11111 00000 00001 01101 00000 -----
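The pairwise comparisons in Table X can be generated by a statistical test on prediction residuals. The paper does not state the exact test used, so the sketch below assumes a variance-based F-test on the residuals of the logistic-fitted scores, a common choice in the VQA literature:

```python
# Illustrative sketch: variance-based F-test on prediction residuals,
# one common way to build Table X-style significance matrices.
import numpy as np
from scipy.stats import f as f_dist

def f_test_better(resid_a, resid_b, alpha=0.05):
    """Return True if algorithm A's residual variance is significantly
    smaller than algorithm B's at level alpha."""
    var_a, var_b = np.var(resid_a, ddof=1), np.var(resid_b, ddof=1)
    F = var_b / var_a                        # > 1 when A fits better
    dfn, dfd = len(resid_b) - 1, len(resid_a) - 1
    return F > f_dist.ppf(1 - alpha, dfn, dfd)
```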


Tables V, VI, VII, VIII and IX show the performance evaluation results of the MoDi3D algorithm on the IRCCYN, WaterlooIVC Phase I, LFOVIA and LFOVIAS3DPh2 S3D video datasets. We also compared the performance of MoDi3D against popular state-of-the-art 2D and 3D IQA/VQA models. SSIM [60], MS-SSIM [61] and VIF [62] are 2D FR IQA models, and BRISQUE [63] is a supervised 2D NR IQA model. NIQE [59] is a completely blind 2D NR IQA model. Since these algorithms are IQA models, they use only spatial information (and do not include the temporal and disparity components) when used to estimate S3D quality. These IQA metrics were applied on a frame-by-frame basis to each view, and the final quality was computed by averaging the frame scores of both views. STMAD [64] and VQM [65] are 2D FR VQA models that utilize spatial and temporal features to estimate quality. These algorithms were applied to the individual views, and the final score was computed as the mean of the two view scores. Chen et al. [41], STRIQE [66] and FI-PSNR [73] are S3D FR IQA models that utilize spatial and disparity features to estimate quality. These metrics were applied on a frame-by-frame basis to each S3D video, and the final quality estimate was computed as the mean of the frame-level quality scores. FLOSIM3D [67], Chen3D [67], STRIQE3D [67], PQM [68], PHVS-3D [69], 3D-STS [70], SJND-SVA [71] and 3D-PQI [72] are popular S3D FR VQA models; Chen3D and STRIQE3D extend the Chen et al. [41] and STRIQE [66] algorithms by including temporal features. Yang et al. [35], BSVQE [17], MNSVQM [36] and VQUEMODES [37] are supervised S3D NR VQA models that use spatial, motion and disparity information to compute the quality of an S3D video. From the results, it is clear that our proposed model demonstrates competitive performance against state-of-the-art 2D and 3D FR/NR IQA/VQA algorithms on the IRCCYN [11], LFOVIA [15], WaterlooIVC Phase I [14] and proposed LFOVIAS3DPh2 S3D video datasets, even though it is a ‘completely blind’ method.
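
For concreteness, the frame-wise pooling used for these 2D baselines can be sketched as follows. This is a minimal Python illustration (not the authors' code), where iqa_fn stands in for any single-frame 2D IQA model such as NIQE.

    import numpy as np

    def s3d_score_2d(left_frames, right_frames, iqa_fn):
        """Apply a 2D IQA model frame-by-frame to each view, then average."""
        left = np.mean([iqa_fn(f) for f in left_frames])    # left-view average
        right = np.mean([iqa_fn(f) for f in right_frames])  # right-view average
        return 0.5 * (left + right)                         # final S3D estimate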
We performed a statistical analysis [6], [75] of the algorithm scores to determine whether the SROCC scores are significantly different from each other. Table X shows the results of this analysis. In the table, a value of ‘1’ indicates that the row (algorithm) performed significantly better than the column (algorithm), while a value of ‘0’ indicates the converse. It is clear that the proposed method achieved competitive performance against the other 2D and 3D IQA/VQA algorithms. Further, Figure 8 shows scatter plots of spatial NIQE scores, MoDi2D and MoDi3D scores on the S3D video dataset. These scatter plots provide corroborative evidence for the strength of our model.
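
A common way to carry out such a significance test (used, for example, in [75]) is an F-test on the variances of the DMOS prediction residuals of two competing metrics. The sketch below illustrates this under that assumption; it is not necessarily the exact test employed here.

    import numpy as np
    from scipy import stats

    def residuals_differ(res_a, res_b, alpha=0.05):
        """Two-sided F-test on the residual variances of two quality metrics."""
        F = np.var(res_a, ddof=1) / np.var(res_b, ddof=1)
        dfa, dfb = len(res_a) - 1, len(res_b) - 1
        lo = stats.f.ppf(alpha / 2.0, dfa, dfb)        # lower critical value
        hi = stats.f.ppf(1.0 - alpha / 2.0, dfa, dfb)  # upper critical value
        return bool(F < lo or F > hi)                  # True: significantly different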
We show the efficacy of the MoDi3D algorithm on the symmetrically and asymmetrically distorted S3D videos of the proposed dataset in Tables VI, VII and VIII. Although NIQE and other NR QA algorithms show competitive performance on symmetrically distorted videos, they fail to replicate this performance on asymmetrically distorted S3D videos. The proposed MoDi3D model (a combination of MoDi2D and the spatial NIQE score) shows consistent performance on both the symmetrically and asymmetrically distorted S3D videos. We attribute this robust performance to the choice of our distortion discriminable features.

Because frame freezes introduce blank frames amid H.264 compression artifacts, the joint dependencies between the motion and disparity components varied more than under H.264 and H.265 compression alone. The MoDi3D model effectively captures these statistical variations and delivers better performance on frame freezes than on the compression artifacts. Blur is a spatial distortion that does not significantly change the motion properties of an S3D video; therefore, the variation of the dependency between motion and disparity components is lower than for compression artifacts. Consequently, the proposed model is less able to capture the statistical dependencies between motion and disparity components, and MoDi3D shows slightly diminished performance compared to the compression-based distortions. Further, MoDi3D shows acceptable performance on the WaterlooIVC Phase I dataset, which is a combination of synthetic and natural S3D videos. In summary, we have demonstrated that the parameters of the BGGD used to model the joint statistics of motion and disparity of natural S3D videos are well suited to the NR VQA task. An MVG model of the BGGD parameters of pristine S3D videos was used to approach the NR VQA problem. Finally and importantly, the proposed algorithm is completely unaware of subjective opinion, hence is ‘completely blind’, and it delivers competitive performance over a variety of distortions and datasets.
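
As an illustration of how a pristine MVG model can be used to score a test video, a NIQE-style distance between a test video's BGGD feature vector and the pristine model may be computed as below. This is a hedged sketch: the pooled-covariance form follows the NIQE convention [59], and the variable names are our own, not the paper's exact formulation.

    import numpy as np

    def mvg_distance(feat, mu_pristine, cov_pristine, cov_test=None):
        """Mahalanobis-like distance to the pristine MVG model; larger = worse."""
        cov_test = cov_pristine if cov_test is None else cov_test
        d = np.asarray(feat, float) - np.asarray(mu_pristine, float)
        pooled = (np.asarray(cov_pristine) + np.asarray(cov_test)) / 2.0
        return float(np.sqrt(d @ np.linalg.pinv(pooled) @ d))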

VI. CONCLUSION

We presented two main contributions in this paper. First, we performed a comprehensive subjective quality evaluation on a symmetrically and asymmetrically distorted full HD S3D video dataset. The dataset contains 12 pristine S3D video sequences and 288 test stimuli; the test sequences are a combination of H.264 and H.265 compression, blur distortions, and frame freezes. Twenty subjects participated in the study, which was conducted using the ACR-HR method. Second, we proposed a completely blind S3D NR VQA algorithm based on computing the joint statistical dependencies between the motion and disparity subband coefficients of an S3D video. We estimated the BGGD parameters (α, β) and the coherence score (Ψ) from the eigenvalues of the covariance matrix, and showed that these features are distortion discriminable. We used an unsupervised 2D NR IQA model (NIQE) to estimate spatial quality. Finally, these features were pooled to predict the overall quality of an S3D video. We showed that the proposed objective algorithm MoDi3D delivers competitive performance compared to popular 2D and 3D FR and supervised NR image and video quality assessment models, even though it is not trained on any distorted S3D videos or on any annotations of them. In the future, we plan to improve the algorithm's performance and to extend the method to virtual reality (VR) and augmented reality (AR) quality assessment. We will make the dataset, the subjective study scores and the objective method freely available to the research community [19].

REFERENCES

[1] Y. Liu, L. K. Cormack, and A. C. Bovik, “Statistical modeling of 3-D natural scenes with application to Bayesian stereopsis,” IEEE Transactions on Image Processing, vol. 20, pp. 2515–2530, Sept 2011.
[2] “eMarketer: Better research. Better business decisions.” https://www.emarketer.com/.
[3] R. Tenniswood, L. Safonova, and M. Drake, “3D's effect on a film's box office and profitability,” 2010.
[4] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, “Study of subjective and objective quality assessment of video,” IEEE Transactions on Image Processing, vol. 19, pp. 1427–1441, June 2010.
[5] S. Winkler, A. Sharma, and D. McNally, “Perceptual video quality and blockiness metrics for multimedia streaming applications,” in Proceedings of the International Symposium on Wireless Personal Multimedia Communications, pp. 547–552, 2001.
[6] A. K. Moorthy, L. K. Choi, A. C. Bovik, and G. de Veciana, “Video quality assessment on mobile devices: Subjective, behavioral and objective studies,” IEEE Journal of Selected Topics in Signal Processing, vol. 6, pp. 652–671, Oct 2012.
[7] M. Lambooij, W. IJsselsteijn, D. G. Bouwhuis, and I. Heynderickx, “Evaluation of stereoscopic images: Beyond 2D quality,” IEEE Transactions on Broadcasting, vol. 57, pp. 432–444, June 2011.
[8] Y. Fang, J. Wang, J. Li, R. Pépion, and P. Le Callet, “An eye tracking database for stereoscopic video,” in Sixth International Workshop on Quality of Multimedia Experience, IEEE, pp. 51–52, Sept 2014.
[9] V. De Silva, H. K. Arachchi, E. Ekmekcioglu, and A. Kondoz, “Toward an impairment metric for stereoscopic video: A full-reference video quality metric to assess compressed stereoscopic video,” IEEE Transactions on Image Processing, vol. 22, pp. 3392–3404, Sept 2013.
[10] C. T. E. R. Hewage, M. G. Martini, M. Brandas, and D. V. S. X. De Silva, “A study on the perceived quality of 3D video subject to packet losses,” in IEEE International Conference on Communications Workshops, pp. 662–666, June 2013.
[11] M. Urvoy, M. Barkowsky, R. Cousseau, Y. Koudota, V. Ricordel, P. Le Callet, J. Gutierrez, and N. Garcia, “NAMA3DS1-COSPAD1: Subjective video quality assessment database on coding conditions introducing freely available high quality 3D stereoscopic sequences,” in Fourth International Workshop on Quality of Multimedia Experience, IEEE, pp. 109–114, July 2012.
[12] P. Aflaki, M. M. Hannuksela, J. Häkkinen, P. Lindroos, and M. Gabbouj, “Subjective study on compressed asymmetric stereoscopic video,” in 17th International Conference on Image Processing, IEEE, pp. 4021–4024, 2010.
[13] M. J. Chen, D. K. Kwon, and A. C. Bovik, “Study of subject agreement on stereoscopic video quality,” in Southwest Symposium on Image Analysis and Interpretation, IEEE, pp. 173–176, April 2012.
[14] J. Wang, S. Wang, and Z. Wang, “Asymmetrically compressed stereoscopic 3D videos: Quality assessment and rate-distortion performance evaluation,” IEEE Transactions on Image Processing, vol. 26, pp. 1330–1343, March 2017.
[15] B. Appina, K. Manasa, and S. S. Channappayya, “Subjective and objective study of the relation between 3D and 2D views based on depth and bitrate,” Electronic Imaging, vol. 2017, no. 5, pp. 145–150, 2017.
[16] K. Wang, M. Barkowsky, R. Cousseau, K. Brunnström, R. Olsson, P. Le Callet, and M. Sjöström, “Subjective evaluation of HDTV stereoscopic videos in IPTV scenarios using absolute category rating,” in Proc. SPIE, vol. 7863, 2011.
[17] Z. Chen, W. Zhou, and W. Li, “Blind stereoscopic video quality assessment: From depth perception to overall experience,” IEEE Transactions on Image Processing, vol. 27, no. 2, pp. 721–734, 2018.
[18] E. Dumić, S. Grgić, K. Šakić, P. M. R. Rocha, and L. A. da Silva Cruz, “3D video subjective quality: A new database and grade comparison study,” Multimedia Tools and Applications, vol. 76, no. 2, pp. 2087–2109, 2017.
[19] Lab for Video and Image Analysis (LFOVIA) Downloads, http://www.iith.ac.in/~lfovia/downloads.html.
[20] S. L. P. Yasakethu, C. T. E. R. Hewage, W. A. C. Fernando, and A. M. Kondoz, “Quality analysis for 3D video using 2D video quality models,” IEEE Transactions on Consumer Electronics, vol. 54, pp. 1969–1976, November 2008.
[21] C. T. E. R. Hewage, S. T. Worrall, S. Dogan, S. Villette, and A. M. Kondoz, “Quality evaluation of color plus depth map-based stereoscopic video,” IEEE Journal of Selected Topics in Signal Processing, vol. 3, pp. 304–318, April 2009.
[22] A. Benoit, P. Le Callet, P. Campisi, and R. Cousseau, “Quality assessment of stereoscopic images,” EURASIP Journal on Image and Video Processing, vol. 2008, 2009.
[23] C. D. M. Regis, J. V. de Miranda Cardoso, I. de Pontes Oliveira, and M. S. de Alencar, “Objective estimation of 3D video quality: A disparity-based weighting strategy,” in International Symposium on Broadband Multimedia Systems and Broadcasting, IEEE, pp. 1–6, June 2013.
[24] Y. Liu, J. Yang, Q. Meng, Z. Lv, Z. Song, and Z. Gao, “Stereoscopic image quality assessment method based on binocular combination saliency model,” Signal Processing, vol. 125, pp. 237–248, 2016.
[25] L. Ma, X. Wang, Q. Liu, and K. N. Ngan, “Reorganized DCT-based image representation for reduced reference stereoscopic image quality assessment,” Neurocomputing, vol. 215, pp. 21–31, 2016.
[26] M. Yu, K. Zheng, G. Jiang, F. Shao, and Z. Peng, “Binocular perception based reduced-reference stereo video quality assessment method,” Journal of Visual Communication and Image Representation, vol. 38, pp. 246–255, 2016.
[27] C. T. Hewage and M. G. Martini, “Reduced-reference quality assessment for 3D video compression and transmission,” IEEE Transactions on Consumer Electronics, vol. 57, no. 3, pp. 1185–1193, 2011.
[28] Z. P. Sazzad, S. Yamanaka, and Y. Horita, “Spatio-temporal segmentation based continuous no-reference stereoscopic video quality prediction,” in International Workshop on Quality of Multimedia Experience, IEEE, pp. 106–111, 2010.
[29] K. Ha and M. Kim, “A perceptual quality assessment metric using temporal complexity and disparity information for stereoscopic video,” in IEEE International Conference on Image Processing, pp. 2525–2528, 2011.
[30] S. A. Mahmood and R. F. Ghani, “Objective quality assessment of 3D stereoscopic video based on motion vectors and depth map features,” in Computer Science and Electronic Engineering Conference, IEEE, pp. 179–183, 2015.
[31] M. Solh and G. AlRegib, “A no-reference quality measure for DIBR-based 3D videos,” in International Conference on Multimedia and Expo, IEEE, pp. 1–6, 2011.
[32] M. M. Hasan, J. F. Arnold, and M. R. Frater, “No-reference quality assessment of 3D videos based on human visual perception,” in International Conference on 3D Imaging, IEEE, pp. 1–6, 2014.
[33] A. R. Silva, M. E. V. Melgar, and M. C. Farias, “A no-reference stereoscopic quality metric,” in Proc. SPIE, vol. 9393, 2015.
[34] W. Zhang, C. Qu, L. Ma, J. Guan, and R. Huang, “Learning structure of stereoscopic image for no-reference quality assessment with convolutional neural network,” Pattern Recognition, vol. 59, pp. 176–187, 2016.
[35] J. Yang, H. Wang, W. Lu, B. Li, A. Badii, and Q. Meng, “A no-reference optical flow-based quality evaluator for stereoscopic videos in curvelet domain,” Information Sciences, vol. 414, pp. 133–146, 2017.
[36] G. Jiang, S. Liu, M. Yu, F. Shao, Z. Peng, and F. Chen, “No reference stereo video quality assessment based on motion feature in tensor decomposition domain,” Journal of Visual Communication and Image Representation, 2017.
[37] B. Appina, A. Jalli, S. S. Battula, and S. S. Channappayya, “No-reference stereoscopic video quality assessment algorithm using joint motion and depth statistics,” in 25th International Conference on Image Processing, IEEE, pp. 2800–2804, 2018.
[38] E. Cheng, P. Burton, J. Burton, A. Joseski, and I. Burnett, “RMIT3DV: Pre-announcement of a creative commons uncompressed HD 3D video database,” in Fourth International Workshop on Quality of Multimedia Experience, pp. 212–217, July 2012.
[39] M. T. Pourazad, Z. Mai, P. Nasiopoulos, K. Plataniotis, and R. K. Ward, “Effect of brightness on the quality of visual 3D perception,” in International Conference on Image Processing, IEEE, pp. 989–992, Sept 2011.
[40] M. H. Pinson, M. Barkowsky, and P. Le Callet, “Selecting scenes for 2D and 3D subjective video quality tests,” EURASIP Journal on Image and Video Processing, vol. 2013, no. 1, p. 1, 2013.
[41] M.-J. Chen, C.-C. Su, D.-K. Kwon, L. K. Cormack, and A. C. Bovik, “Full-reference quality assessment of stereopairs accounting for rivalry,” Signal Processing: Image Communication, vol. 28, no. 9, pp. 1143–1155, 2013.
[42] FFmpeg, https://www.ffmpeg.org/.
[43] IEEE-SA S3D image database, http://grouper.ieee.org/groups/3dhf/.
[44] Constant Rate Factor Guide, http://slhck.info/articles/crf.
[45] A. K. Moorthy, L. K. Choi, A. C. Bovik, and G. de Veciana, “Video quality assessment on mobile devices: Subjective, behavioral and objective studies,” IEEE Journal of Selected Topics in Signal Processing, vol. 6, pp. 652–671, Oct 2012.
[46] D. Ghadiyaram, J. Pan, and A. C. Bovik, “Learning a continuous-time streaming video QoE model,” IEEE Transactions on Image Processing, vol. 27, pp. 2257–2271, Jan 2018.
[47] International Telecommunication Union, “Subjective methods for the assessment of stereoscopic 3DTV systems,” Recommendation ITU-R BT.2021, 2015.
[48] IEEE, “IEEE Standard for Quality of Experience (QoE) and Visual-Comfort Assessments of Three-Dimensional (3D) Contents Based on Psychophysical Studies,” IEEE Std 3333.1.1, 2015.
[49] J. H. Maunsell and D. C. Van Essen, “Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation,” Journal of Neurophysiology, vol. 49, no. 5, pp. 1127–1147, 1983.
[50] G. C. DeAngelis and W. T. Newsome, “Organization of disparity-selective neurons in macaque area MT,” The Journal of Neuroscience, vol. 19, no. 4, pp. 1398–1415, 1999.
[51] J.-P. Roy, H. Komatsu, and R. H. Wurtz, “Disparity sensitivity of neurons in monkey extrastriate area MST,” The Journal of Neuroscience, vol. 12, no. 7, pp. 2478–2492, 1992.
[52] B. Potetz and T. S. Lee, “Statistical correlations between two-dimensional images and three-dimensional structures in natural scenes,” JOSA A, vol. 20, no. 7, pp. 1292–1303, 2003.
[53] B. Appina, S. Khan, and S. S. Channappayya, “No-reference stereoscopic image quality assessment using natural scene statistics,” Signal Processing: Image Communication, vol. 43, pp. 1–14, 2016.
[54] B. Appina and S. Channappayya, “Full-reference 3-D video quality assessment using scene component statistical dependencies,” IEEE Signal Processing Letters, vol. 25, pp. 823–827, June 2018.
[55] F. Pascal, L. Bombrun, J.-Y. Tourneret, and Y. Berthoumieu, “Parameter estimation for multivariate generalized Gaussian distributions,” IEEE Transactions on Signal Processing, vol. 61, no. 23, pp. 5960–5971, 2013.
[56] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind prediction of natural video quality,” IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1352–1365, 2014.
[57] E. P. Simoncelli and W. T. Freeman, “The steerable pyramid: A flexible architecture for multi-scale derivative computation,” in International Conference on Image Processing, IEEE, vol. 3, pp. 444–447, Oct 1995.
[58] M. Jakubowski and G. Pastuszak, “Block-based motion estimation algorithms – a survey,” Opto-Electronics Review, Springer, vol. 21, no. 1, pp. 86–102, 2013.
[59] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a ‘completely blind’ image quality analyzer,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 209–212, 2013.
[60] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, pp. 600–612, April 2004.
[61] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Asilomar Conference on Signals, Systems and Computers, IEEE, vol. 2, pp. 1398–1402, Nov 2003.
[62] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Transactions on Image Processing, vol. 15, pp. 430–444, Feb 2006.
[63] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.
[64] P. V. Vu, C. T. Vu, and D. M. Chandler, “A spatiotemporal most-apparent-distortion model for video quality assessment,” in IEEE International Conference on Image Processing, pp. 2505–2508, 2011.
[65] VQM software, http://www.its.bldrdoc.gov/n3/video/vqmsoftware.htm.
[66] S. Khan Md, B. Appina, and S. Channappayya, “Full-reference stereo image quality assessment using natural stereo scene statistics,” IEEE Signal Processing Letters, vol. 22, pp. 1985–1989, Nov 2015.
[67] B. Appina, K. Manasa, and S. S. Channappayya, “A full reference stereoscopic video quality assessment metric,” in International Conference on Acoustics, Speech and Signal Processing, IEEE, pp. 2012–2016, March 2017.
[68] P. Joveluro, H. Malekmohamadi, W. A. C. Fernando, and A. M. Kondoz, “Perceptual video quality metric for 3D video quality assessment,” in 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video, pp. 1–4, June 2010.
[69] L. Jin, A. Gotchev, A. Boev, and K. Egiazarian, “Validation of a new full reference metric for quality assessment of mobile 3DTV content,” in 19th European Signal Processing Conference, pp. 1894–1898, Aug 2011.
[70] J. Han, T. Jiang, and S. Ma, “Stereoscopic video quality assessment model based on spatial-temporal structural information,” in Visual Communications and Image Processing, IEEE, pp. 1–6, Nov 2012.
[71] F. Qi, D. Zhao, X. Fan, and T. Jiang, “Stereoscopic video quality assessment based on visual attention and just-noticeable difference models,” Signal, Image and Video Processing, vol. 10, no. 4, pp. 737–744, 2016.
[72] W. Hong and L. Yu, “A spatio-temporal perceptual quality index measuring compression distortions of three-dimensional video,” IEEE Signal Processing Letters, vol. 25, no. 2, pp. 214–218, 2018.
[73] Y.-H. Lin and J.-L. Wu, “Quality assessment of stereoscopic 3D image compression by binocular integration behaviors,” IEEE Transactions on Image Processing, vol. 23, no. 4, pp. 1527–1542, 2014.
[74] Video Quality Experts Group, “Final report from the Video Quality Experts Group on the validation of objective quality metrics for video quality assessment,” 2000, http://www.its.bldrdoc.gov/vqeg/projects/frtv_phasei/.
[75] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440–3451, 2006.
