
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.


IEEE Transactions on Broadcasting ( Volume: 64, Issue: 2, June 2018 )

Toward Better Statistical Validation of Machine Learning-Based Multimedia Quality Estimators
Manish Narwaria

Manuscript received December 30, 2017; revised April 12, 2018; accepted April 19, 2018.
The author is with the Department of ICT, Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar 382007, India (e-mail: manish_narwaria@daiict.ac.in).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TBC.2018.2832441
0018-9316 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract—Objective assessment of multimedia quality using machine learning (ML) has been gaining popularity especially in the context of both traditional (e.g., terrestrial and satellite broadcast) and advanced (such as over-the-top media services, IPTV) broadcast services. Being data-driven, these methods obviously rely on training to find the optimal model parameters. Therefore, to statistically compare and validate such ML-based quality predictors, the current approach randomly splits the given data into training and test sets and obtains a performance measure (for instance mean squared error, correlation coefficient etc.). The process is repeated a large number of times and parametric tests (e.g., t test) are then employed to statistically compare mean (or median) prediction accuracies. However, the current approach suffers from a few limitations (related to the qualitative aspects of training and testing data, the use of improper sample size for statistical testing, possibly dependent sample observations, and a lack of focus on quantifying the learning ability of the ML-based objective quality predictor) which have not been addressed in literature. Therefore, the main goal of this paper is to shed light on the said limitations both from practical and theoretical perspectives wherever applicable, and in the process propose an alternate approach to overcome some of them. As a major advantage, the proposed guidelines not only help in a theoretically more grounded statistical comparison but also provide useful insights into how well the ML-based objective quality predictors exploit data structure for learning. We demonstrate the added value of the proposed set of guidelines on standard datasets by comparing the performance of a few existing ML-based quality estimators. A software implementation of the presented guidelines is also made publicly available to enable researchers and developers to test and compare different models in a repeatable manner.

Index Terms—Multimedia quality, machine learning, statistical analysis.

I. INTRODUCTION

MULTIMEDIA signals have become a part of our daily lives, thanks to the availability of low cost devices coupled with the rapid growth of traditional and advanced multimedia broadcast services. In particular, fixed/mobile advanced media delivery fueled by the emergence of IPTV, cloud services and over-the-top (OTT) media services has enabled consumers to enjoy the more immersive viewing experience of 3DTV, HDR, 4K etc., from the comfort of their premises. As a result, our interaction with multimedia has not only increased quantitatively but the quality of such interaction has also evolved. In particular, today's end users are more demanding in terms of their multimedia experience, and perceptual quality is one of the intrinsic factors that affects such interaction. Thus, assessment of perceptual quality is an important aspect in today's multimedia communication systems [1]. To that end, subjective assessment performed by human subjects is still considered the most reliable and accurate method, given appropriate laboratory conditions and a sufficiently large subject panel. However, subjective assessment may not be feasible in certain situations (e.g., real-time multimedia transmission), and an objective approach is more suitable in such scenarios.

Objective assessment of multimedia quality involves the use of computational models which are expected to predict quality scores in a repeatable fashion and such that the objective predictions align well with the subjective opinion of perceptual signal quality. It is however important to stress that objective approaches may not exactly mimic the subjective opinion in all situations, and are not meant to entirely replace subjective assessment. Instead they can provide approximate and relative estimates of perceptual quality, within the context of applications such as DTT broadcast, IPTV, multimedia compression etc.

While there has been substantial research effort towards developing objective quality estimators for multimedia signals (including single or multi modal signals such as image, video, speech, audiovisual, graphics etc.), issues related to how closely the human opinion can be mimicked and those related to computational efficiency (these have obvious consequences on practical deployment) exist. In that context, a data driven approach has also been viewed as a plausible solution. Even though interest in such methods has existed for several years, there have been renewed and concerted efforts to exploit such data driven methods for the said purpose [2]–[12].

The use of ML for objective quality estimation is particularly suitable for broadcast applications where the quality of the received or transmitted content needs to be assessed objectively based on limited signal information. Not surprisingly, ML has been exploited in the past for the said purpose. Gastaldo et al. [2] presented one of the first comprehensive methods for estimating quality of MPEG video streams, which is based on circular back propagation neural networks. A no reference method was presented in [3], which is based on mapping frame-level features into a spatial quality score followed by temporal pooling. The method developed in [4] is based on features extracted from the analysis of discrete cosine
transform (DCT) coefficients of each decoded frame in a video sequence, and subsequent quality prediction using a neural network. Another ML based video quality estimator was presented in [5] where a symbolic regression-based framework was trained on a set of features extracted from the received video bit stream. The ML based quality estimator proposed in [6] works on a similar principle of analyzing several features such as distinguishing the type of codec used (MPEG or H.264/AVC), DCT coefficients, estimation of the level of quantization used in the I-frames, etc. The next step is to apply support vector regression to predict video quality. The objective quality estimator proposed in [7] was based on a polynomial regression model, where the independent variables (or features) were based on spatial and temporal quantities derived from video spatiotemporal complexity, bitrate, and packet loss measurements. Mocanu et al. [11] employed deep learning (deep belief networks) and bit stream specific features to predict quality objectively in a video transmission network. Deep learning has also been employed for quality measurement in live video streaming [12]. Moreover, promising results from related disciplines such as computer vision and the availability of required hardware (e.g., GPU-accelerated computing) have opened up possibilities of developing efficient ML based implementations of quality predictors.

For the case of objective multimedia quality assessment, the use of ML methods is a two-stage process: feature extraction (representing the given multimedia data via a set of perceptually meaningful and possibly lower dimensional feature values) and feature pooling (combining or fusing of features into a quality score). The second stage typically uses regressors, and hence the objective quality predictions (scores) are continuous (such scores can of course be further binarized via thresholding or can be used for pairwise stimuli comparisons). More recently, deep networks (such as convolutional neural networks, deep belief networks etc.) have also gained popularity [11], [12] where the feature extraction process is implicitly handled by the ML method (instead of using hand-crafted features).

Irrespective of whether objective quality estimators use ML or not, statistical testing plays an important role in their validation and benchmarking. Such validation studies are obviously crucial before objective predictors can be deployed in practice. Note that statistical tests (both parametric and nonparametric) are extensively used not only to validate objective methods against subjective data but also to statistically compare two or more objective quality predictors, in order to find the better metric for the given application or to rank them.

Regarding the statistical comparison and validation of non-ML based estimators (i.e., which do not require any training) the procedure has been by and large standardized (e.g., the ITU recommendations P.1401 [13], J.149 [14], [15] or VQEG recommendations [16]) and uses a performance measure such as the correlation coefficient (Pearson, Spearman etc.), the root mean squared error (RMSE), outlier ratio etc., to quantify the agreement between subjective opinion and objective predictions (or comparing those metrics from several objective methods). Then, statistical inferences are drawn (or different methods compared) by using confidence intervals (CIs) for correlation coefficients, using F-test on the residuals, the use of ANOVA and t test [15], [17], determining the significance of the difference between the outlier ratios or RMSEs [13]. Likewise, recommendation ITU-T J.149 [14] uses classification errors of the quality metric.

As far as the statistical comparison and validation of ML based quality predictors is concerned, the current approach is based on repeated and random splits of data (i.e., predictions from ML based methods and the corresponding subjective scores for the given multimedia content) into training and test sets [2]–[12]. In each iteration, a performance measure (like mean squared error, correlation coefficient etc.) is obtained. Then, the means (or in some cases medians) of such repeated performance measures for each ML based estimator are statistically compared via pairwise t tests. However, because of the requirement of training, the current approach needs to be examined more closely in terms of the factors that can affect the validation process. These include qualitative aspects of training and testing data, determining the appropriate sample size when splitting the given data into training and test sets, the issue of possibly dependent sample observations and the analysis pertinent to the learning ability of the method (note that these issues are not relevant in case of statistical comparison of non-ML based predictors because there is no training involved and hence a question of train-test split typically does not arise). A survey of literature (e.g., refer to [2]–[12] for some existing efforts in ML based quality estimation for video or [13]–[16] for standardized recommendations) reveals that these important issues have not been thoroughly examined (either from theoretical or practical view points) although a few works such as [4], [9], and [11] have considered the practical implications of the first issue regarding the qualitative aspects of training and testing data (also refer to some related works on statistical comparison of classifiers [18] or analysis of their learning ability [19]).

Therefore, the main aim of the paper is to shed light on these factors, and in the process present a set of new guidelines to overcome the drawbacks of the current approach. The proposed guidelines offer the advantage of focusing on practical use-case scenarios and quantifying the learning ability of the ML based quality estimator. Therefore, the use of these guidelines helps to make more informed conclusions and recommendations about metric performance. In contrast, the existing approach tends to treat ML based methods as black boxes and focuses primarily on global, binary decisions about metric performance. Software implementing the presented guidelines is also made publicly available,¹ in order to achieve the goal of reproducible research.

¹ https://sites.google.com/site/narwariam/home/research

The remainder of the paper is organized as follows. Section II discusses the limitations and additional considerations in statistical validation of ML based quality predictors. Following this, we present in Section III a theoretical analysis concerning dependent (correlated) sample observations and how that affects the sampling distribution of the t test statistic. Next, Section IV proposes a framework (set of guidelines) for more accurate statistical comparison and validation

Fig. 1. The current approach for statistical comparison and validation of M ML based quality estimators. The drawbacks associated with this approach are
discussed in Section II.
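
As a rough illustration of the procedure summarized in Fig. 1, the following Python sketch builds a stand-in for the performance matrix A (one row per random train-test split, one column per method) and derives a pairwise decision matrix D whose entry (i, j) is 1 when the two methods are declared statistically different. It is a sketch under stated assumptions, not the implementation used in the cited works: the scores are synthetic placeholders rather than outputs of real estimators, the function names are hypothetical, a pooled-variance two-sample t statistic is assumed, and the critical value is approximated by the standard normal quantile, which is reasonable only when the number of iterations is large.

```python
import math
import random
import statistics


def performance_matrix(n_iter, n_methods, seed=0):
    """Synthetic stand-in for matrix A: n_iter rows x n_methods columns.

    Each entry plays the role of a performance measure (e.g., a correlation
    coefficient) obtained on one random split. Values are placeholders.
    """
    rng = random.Random(seed)
    # Hypothetical average performance of each method (assumption).
    base = [0.90 - 0.03 * j for j in range(n_methods)]
    return [[rng.gauss(base[j], 0.02) for j in range(n_methods)]
            for _ in range(n_iter)]


def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic for equal-sized samples."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2.0
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(2.0 * pooled_var / n)


def decision_matrix(A, alpha=0.05):
    """Return D with D[i][j] = 1 if methods i and j differ, 0 otherwise.

    Diagonal entries are None because a method is not compared to itself.
    """
    m = len(A[0])
    cols = [[row[j] for row in A] for j in range(m)]
    # Normal approximation to the two-sided t critical value (assumption).
    crit = statistics.NormalDist().inv_cdf(1.0 - alpha / 2.0)
    D = [[None] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j:
                D[i][j] = int(abs(two_sample_t(cols[i], cols[j])) > crit)
    return D


A = performance_matrix(n_iter=200, n_methods=3)
for row in decision_matrix(A):
    print(row)
```

Note that nothing in this procedure checks whether the rows of A are independent or whether the number of iterations was chosen on principled grounds; it simply reproduces the mechanics of the approach.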

of ML based quality predictors to ameliorate some of the discussed drawbacks. Section V presents a permutation test based method for systematic analysis of ML based quality predictors. Experimental results and analysis are reported in Section VI. We provide concluding thoughts in Section VII.

II. LIMITATIONS OF THE CURRENT APPROACH FOR STATISTICAL VALIDATION OF ML BASED QUALITY PREDICTORS

The main steps involved in the current approach for statistical comparison and validation of ML based quality estimators are shown in Figure 1. Note that most existing papers [2]–[12] follow this approach.

First, the given subjective data (comprising multimedia content and the corresponding subjective scores) is randomly split into train and test sets $N_{\mathrm{iter}}$ times (assume that the numbers of samples (stimuli) in the train and test sets are respectively $N_{\mathrm{train}}$ and $N_{\mathrm{test}}$). It is typically ensured that the content in training and test sets does not overlap. In each iteration, the ML based methods are trained on the training set and assessed for their prediction performance on the test set. This results in a set of $N_{\mathrm{iter}}$ performance measures (e.g., correlation coefficient, mean squared error, percentage of correctly classified content etc.) for each ML based quality estimator. Then, a descriptive statistic (mean, median etc.) of the performance measure across iterations is taken as the overall performance measure.

Suppose our goal is to compare the performance of $M$ ML based quality predictors. Denote the performance measure for such comparison as $\rho_i^{(j)}$ where $i$ is the iteration index, $i = 1, \ldots, N_{\mathrm{iter}}$, and $j = 1, \ldots, M$. Then, the matrix to be analyzed will be of the form

$$
A = \begin{bmatrix}
\rho_1^{(1)} & \cdots & \rho_1^{(M)} \\
\vdots & \ddots & \vdots \\
\rho_{N_{\mathrm{iter}}}^{(1)} & \cdots & \rho_{N_{\mathrm{iter}}}^{(M)}
\end{bmatrix} \tag{1}
$$

In order to carry out a statistical comparison, the t test is applied to compare the mean values (or the Wilcoxon test for the median) of each column of matrix A in a pairwise manner. This provides a symmetric $M \times M$ decision matrix D in which the entry $(i, j)$ is set to 0 if the null hypothesis is accepted (i.e., the methods in row $i$ and column $j$ are deemed to be statistically similar according to the t test) or 1 otherwise. Therefore, D is simply written as

$$
D = \begin{bmatrix}
- & \cdots & 1 \\
\vdots & \ddots & \vdots \\
1 & \cdots & -
\end{bmatrix} \tag{2}
$$

As the same method is not compared to itself via the t test, the diagonal entries in D are not considered. Inferences on relative method performance are then drawn from the decision matrix D. However, such an approach of comparing ML based quality estimators suffers from a few limitations that possibly render the conclusions and inferences questionable. Specifically, the following drawbacks can be identified in the current approach.

A. Violation of Assumption of Independent Observations

The application of statistical tests such as the t test, ANOVA etc. assumes that the observations in each test sample are independent and identically distributed random variables (iid) [20]. However, notice that while the values in each column of matrix A are generated from a random test sample, the population from which the said test sample is drawn is severely limited both in terms of the variety in content and the size. The reason is that typical subjective tests include limited source content for obvious practical reasons. Note that even with newer paradigms such as crowd sourcing, the population size may not grow exponentially so as to guarantee independence of the random test samples on which the ML based predictors are tested. Moreover, crowd sourcing can introduce additional factors into the subjective rating process (as it offers no control in terms of display, ambient light, viewing distance etc.). A limited population size will therefore almost always be a challenge for statistical testing. Particularly, it can possibly result in overlapping content, i.e., test sets across iterations can share one or more test conditions. For instance, images corresponding to a reference (source) image can appear in the
testing set more than once. Consequently, the $\rho_i^{(j)}$ values in the $j$-th column of A are not entirely independent. This can possibly result in correlated sample observations. That is, the performance measure in one iteration can be relatively easily predicted if the same or similar images appear in the test set in another iteration. This may render the statistical comparison and subsequent inferences questionable especially if the said correlation is high.

In general and to the best of our knowledge, the said issue has not received sufficient research attention especially from the view point of objective multimedia signal quality estimation. We will therefore examine it more closely from a theoretical perspective in Section III, and propose a solution in Section IV.

B. The Issue of Arbitrary Sample Size

The reader will note that the dimensionality of each column of A is $N_{\mathrm{iter}}$. It implies that the mean (or median) is computed over $N_{\mathrm{iter}}$ observations (i.e., $\rho_i^{(j)}$ values). Consequently, the sample size for the t test depends on the number of iterations $N_{\mathrm{iter}}$. Since sample size is used to compute the test statistic, it implies that the decision of the statistical test (i.e., acceptance or rejection of the null hypothesis) can be influenced by the arbitrary choice of $N_{\mathrm{iter}}$. In particular, as the standard error in the denominator of the test statistic varies inversely with the square root of the sample size (this can be easily seen from the definition of the test statistics in t test, ANOVA etc. [20]), the chances of rejecting the null hypothesis can be increased by simply choosing a large $N_{\mathrm{iter}}$. In the limiting case, as $N_{\mathrm{iter}} \to \infty$, all the entries of the decision matrix D will tend to 1. Practically, it implies that an infinitesimal increase (decrease) in the mean of $\rho_i^{(j)}$ values will lead to the conclusion that there is a statistically significant performance difference between the corresponding pair of ML based quality predictors. However, such a conclusion is not reliable as it may be due to a large $N_{\mathrm{iter}}$, and not necessarily reflect statistically meaningful differences in prediction performances (we provide an example later in Section VI to support our arguments).

C. Lack of Practical Considerations

The third drawback of the current approach concerns lack of practical considerations. Specifically, two points can be raised in this regard.

First, the random train-test splitting (used in the current approach) does not provide adequate control towards incorporating practically useful and independent conditions in the test set. As a result, the comparison of ML based quality predictors might be based on a non-exhaustive set of conditions. While increasing the number of randomizations by using a large value of $N_{\mathrm{iter}}$ may mitigate the said problem to some extent, it can influence the statistical decisions (as discussed in the previous drawback).

Second, we note that the current approach simply emphasizes theoretical (statistical) differences. That is, it provides the information if a treatment effect (in our case, the term treatment effect implies the magnitude of the difference in performance measures of ML based estimators) is present or not. However, it does not give an explicit insight into how large the treatment effect is, and whether the observed effect is practically meaningful or not. Obviously, such information is crucial to make a more balanced and practically reasonable conclusion about the performance of ML based quality predictors. It is therefore a recommended practice [21]–[24] that p values be supplemented by effect size so that the results of statistical tests are more fairly interpreted. This important point is, unfortunately, either completely ignored or not emphasized enough in the current approach of comparing ML based methods. In fact, this is also a limitation even in case of testing of non-ML based predictors as most existing recommendations (such as P.1401 [13], J.149 [14] or [15]) do not explicitly recommend the use of effect size.

D. Ignoring Data Uncertainties in Performance Evaluation

In the current approach, the most widely used performance measure ($\rho$) is the correlation coefficient (e.g., Pearson, Spearman or Kendall correlation coefficients) between the predicted and target (subjective) quality scores [2]–[16]. In many cases, the traditional root mean squared error (RMSE) is also employed. These measures, however, are less effective because they do not explicitly consider the uncertainties in the subjective rating process. Consequently, they penalize any deviation from the subjective quality score (MOS). This is not desirable in the context of multimedia quality estimation because each MOS represents the mean of individual opinion scores from a finite set of observers, i.e., a sample of observers, and is therefore a random variable. Thus, a more accurate performance measure $\rho$ should consider the said variability in the individual subjective scores, and at the same time emphasize differences in quality levels that are practically perceivable. We define such a measure in Section IV-D.

E. Lack of Focus on Learning Ability of the Model

A notable limitation of the current approach stems from its focus on prediction accuracies (e.g., maximizing correlation coefficient or minimizing mean squared error). Consequently, the aspect of learning ability (i.e., how well the model can learn structure from the training data and use it to predict quality on new test samples) is not explicitly considered in method comparison. We argue that the latter is also important because it can provide useful information about robustness of the model, and hence lead towards a more holistic assessment of ML based quality predictors (instead of treating them as black boxes). Our assertion is motivated by the idea that data pertaining to multimedia quality can be assumed to be structured because there is a relation (which is unknown and possibly non-linear) between the subjective quality scores and the multimedia content (possibly represented by a set of features). Then, the goal of ML based quality prediction is to exploit this structure to find an optimal set of weights. Hence, it is relevant to ask if a given model (in the context of this paper, the term model refers to the collective entity consisting of features and an ML algorithm) is able to learn the underlying patterns of perceptual quality, given a set of labeled data. In other words, it is important to systematically quantify the
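
The sample-size argument of Section II-B and the effect-size recommendation of Section II-C can be illustrated numerically. The Python sketch below holds a small mean difference and a pooled standard deviation fixed and varies only the number of iterations: the two-sample t statistic grows with the square root of the sample size, while an effect size such as Cohen's d stays the same. The numbers (a mean difference of 0.002 and a standard deviation of 0.02) are illustrative assumptions, the function names are hypothetical, and the critical value is again the standard normal approximation rather than the exact t quantile.

```python
import math
import statistics


def t_from_summary(mean_diff, pooled_sd, n):
    """Pooled two-sample t statistic for two samples of equal size n."""
    return mean_diff / (pooled_sd * math.sqrt(2.0 / n))


def cohens_d(mean_diff, pooled_sd):
    """Cohen's d: mean difference in units of the pooled standard deviation."""
    return mean_diff / pooled_sd


# Illustrative values only (assumptions, not taken from any experiment).
MEAN_DIFF = 0.002   # difference in mean performance of two methods
POOLED_SD = 0.02    # spread of the performance measure across iterations

# Two-sided 5% critical value under a normal approximation (assumption).
crit = statistics.NormalDist().inv_cdf(0.975)

results = {n: t_from_summary(MEAN_DIFF, POOLED_SD, n)
           for n in (10, 100, 1000, 10000)}

for n, t in results.items():
    verdict = "reject H0" if abs(t) > crit else "keep H0"
    print(f"N_iter={n:6d}  t={t:7.3f}  d={cohens_d(MEAN_DIFF, POOLED_SD):.2f}  {verdict}")
```

Under these assumptions the decision flips from acceptance to rejection purely because the number of iterations increases, whereas the effect size does not change. This is why reporting an effect size next to the p value gives a more balanced picture.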

weakness (if it exists) of the model in terms of its learning ability, and analyze its response to unstructured data.

It is also important to differentiate learning ability from prediction accuracy. As mentioned, the former focuses on the ability to learn real data patterns (assuming that such pattern or structure is present in the data). On the other hand, the latter refers to how close the objective predictions are to the given test data (subjective scores), and therefore does not provide explicit information about the ability of the model to distinguish real patterns of perceptual quality from noisy (or random) ones.

F. Summary of Limitations of Current Approach

As discussed, the current approach suffers from a few theoretical and practical drawbacks in the context of comparing ML based objective quality predictors. Therefore, the inferences and conclusions drawn from its use are possibly questionable. More importantly, it does not provide meaningful information about learning ability of the method under investigation and lacks practical considerations which are obviously critical for the purposes of method deployment in real applications. In light of these limitations, there is a need for a better approach towards more meaningful comparison and validation of ML based quality predictors. Such an approach will be useful in general but even more relevant in scenarios such as broadcast services where ML based quality estimation is expected to be a more common and plausible approach to quality monitoring. Thus, we will present in Sections IV and V an alternate set of guidelines to ameliorate some of the mentioned issues in the current approach. Before doing that, we analyze the theoretical implications of using dependent (i.e., correlated) sample observations for statistical testing.

III. EFFECT OF DEPENDENT SAMPLE OBSERVATIONS: A THEORETICAL PERSPECTIVE

Recall from Section II-A that the sample observations denoted by the values in each column of matrix A are likely to be dependent. This is because the random train-test split procedure used in the current approach can lead to sharing of test conditions across iterations. Therefore, the subsequent application of statistical tests such as the t-test using this data can affect the sampling distribution of the test statistic (note that most parametric statistical tests assume that the sample observations are independent and identically distributed, i.e., iid random variables). This in turn can lead to incorrect decisions since the test statistic may not exactly follow the expected theoretical distribution. In the following, we analyze the implications of dependent sample observations and specifically discuss the case of the t test as it is widely used in the current approach to validating ML based quality predictors.

A. Increased Variance of Sampling Distribution of Mean

The genesis of the problem lies in the fact that the t-test (or for that matter any other parametric test like ANOVA) used to compare the mean values of each column of matrix A in a pairwise manner is based on the fundamental central limit theorem (CLT) [20], [25]. The CLT states that for a sample $X = [X_1, \ldots, X_n]$ (where $n$ is the sample size) whose observations are assumed to be iid random variables, the sample mean $\bar{X} \sim N(\mu, \sigma^2/n)$ where $\mu$ and $\sigma^2$ respectively denote the mean and variance of the corresponding population. Note that this result holds only when the sample observations are iid. However, in the case of dependent² sample observations, one needs to take into consideration the covariances between the random variables. This follows from the standard rule that the variance of the sum of $n$ correlated (dependent) random variables is the sum of their covariances [20], i.e.,

$$
\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{Cov}(X_i, X_j)
= \frac{1}{n^2}\left(\sum_{i=1}^{n}\mathrm{Var}(X_i) + 2\sum_{i<j}\mathrm{Cov}(X_i, X_j)\right) \tag{3}
$$

where we have used the fact that $\mathrm{Cov}(X_i, X_j) = \mathrm{Var}(X_i)$ when $i = j$. To further simplify (3), we proceed as follows. First, under exchangeability [26], we can write $\mathrm{Cov}(X_i, X_j) = r_{i,j}\sigma^2$ where $r_{i,j}$ is the pairwise correlation coefficient. Moreover, assume that the variables are pair-wise equicorrelated, i.e., we let $r = r_{i,j}$. The reader will also note that there are $\binom{n}{2} = \frac{n(n-1)}{2}$ covariances in the second term of eq. (3) and that the variance of each $X_i$ is $\sigma^2$. Hence, we can write (3) as

$$
\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} + \frac{n-1}{n}\, r\sigma^2 \tag{4}
$$

Thus, we find that the variance of the sample mean increases with $r$ and the value corresponding to independent sample observations can be obtained by simply setting $r = 0$. The increased variance will result in a change of shape of the distribution of the t-test statistic and it may no longer follow the theoretical t distribution. Hence, in practice the aim is to keep the value of $r$ to a minimum. While typical subjective studies achieve this by employing independent human subjects (observers) or allowing sufficient breaks between test sessions, the aforementioned random training-testing split does not allow sufficient control and can lead to larger values of $r$. Hence, the issue of independent sample observations is more applicable in case of ML based quality predictors, and should be considered in order to obtain more accurate results from the subsequent use of statistical tests. The exact impact on the sampling distribution of the test statistic in the t-test is analyzed next.

² Quantification and analysis of data dependence is non-trivial in general. In our case, we make the simplifying assumption of using the correlation coefficient as a measure of dependence.

B. Effect on Sampling Distribution of t-Statistic

As the current approach uses pairwise t-tests to compare the mean values of each column of matrix A, it will be of interest to examine the impact of the increased variance of the sampling distribution of the mean as quantified by eq. (4).

We begin by considering two populations $p_1$ and $p_2$ with means $\mu_1$ and $\mu_2$ and the same variance $\sigma^2$ (homogeneity of variance). Let the corresponding samples be denoted by $x_1 = [x_{11}, \ldots, x_{1n_1}]$ and $x_2 = [x_{21}, \ldots, x_{2n_2}]$ where $n_1$ and $n_2$ are the sample sizes. In our case, the sample observations within
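
The variance inflation expressed by eq. (4), Var(X̄) = σ²/n + ((n − 1)/n) r σ², can be checked with a small simulation. The Python sketch below is an illustration under stated assumptions rather than part of the original analysis: equicorrelated unit-variance observations are generated with a shared-factor construction (each observation mixes one common Gaussian draw with an independent one), which yields pairwise correlation r only for 0 ≤ r ≤ 1, and all function names are hypothetical.

```python
import math
import random
import statistics


def equicorrelated_sample(n, r, rng):
    """Draw n unit-variance Gaussian values with pairwise correlation r.

    Construction (assumption): X_i = sqrt(r) * Z0 + sqrt(1 - r) * Z_i,
    where Z0 is shared by all observations and the Z_i are independent.
    """
    shared = rng.gauss(0.0, 1.0)
    return [math.sqrt(r) * shared + math.sqrt(1.0 - r) * rng.gauss(0.0, 1.0)
            for _ in range(n)]


def var_of_mean_theory(n, r, sigma2=1.0):
    """Variance of the sample mean according to eq. (4)."""
    return sigma2 / n + (n - 1) / n * r * sigma2


def var_of_mean_empirical(n, r, trials=20000, seed=1):
    """Monte Carlo estimate of the variance of the sample mean."""
    rng = random.Random(seed)
    means = [sum(equicorrelated_sample(n, r, rng)) / n for _ in range(trials)]
    return statistics.variance(means)


for r in (0.0, 0.1, 0.3):
    print(f"r={r:.1f}  theory={var_of_mean_theory(20, r):.4f}  "
          f"simulated={var_of_mean_empirical(20, r):.4f}")
```

With r = 0 the expression reduces to the familiar σ²/n, whereas for r > 0 the variance no longer vanishes as n grows but approaches rσ². A test that assumes independence therefore understates the uncertainty of the mean.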

each sample denote the performance measure (e.g., correlation coefficient) from two distinct ML based quality estimators. Therefore, it is reasonable to assume that the sample observations across the two samples are independent (uncorrelated). However, as discussed, the observations within each sample are now dependent (i.e., correlated), unlike the usual case where they are assumed to be independent. Finally, the sample observations are assumed to be identically distributed. This is not unreasonable since $P(x_{1i} < c) = P(x_{2i} < c)$, $\forall i$ and some fixed constant $c$, i.e., the probability that observations in samples $x_1$ and $x_2$ take a value less than (or equal to) a fixed constant is the same.

Let $\bar{x}_1$, $\bar{x}_2$ and $s_1^2$, $s_2^2$ denote the sample means and variances, respectively. Further, let $r_1$ and $r_2$ denote the pairwise correlation coefficient, i.e., dependency between observations within each sample. In addition, in the considered application, the sample sizes will be equal, i.e., $n_1 = n_2 = n$, because ML based quality estimators are tested on the same and equal number of test conditions. Then the goal of the analysis is to infer if $\mu_1 = \mu_2$ (the null hypothesis) or not. To define the t-statistic, we use the result from the CLT, i.e.,

$$\bar{x}_1 \sim N\left(\mu_1,\ \sqrt{\frac{\sigma^2}{n} + \frac{n-1}{n}\, r_1 \sigma^2}\right) \tag{5}$$

Notice that we have used eq. (4) to compute the standard deviation of the sampling distribution. Similarly we can obtain,

$$\bar{x}_2 \sim N\left(\mu_2,\ \sqrt{\frac{\sigma^2}{n} + \frac{n-1}{n}\, r_2 \sigma^2}\right) \tag{6}$$

Then, the difference between the sample means will also be normally distributed, i.e.,

$$\bar{x}_1 - \bar{x}_2 \sim N\left(\mu_1 - \mu_2,\ \sqrt{\frac{\sigma^2}{n}\bigl(2 + (n-1)(r_1 + r_2)\bigr)}\right) \tag{7}$$

By standardization, we have

$$\frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma^2}{n}\bigl(2 + (n-1)(r_1 + r_2)\bigr)}} \sim N(0, 1) \tag{8}$$

Further, assume that the null hypothesis $H_0$ is true, i.e., $\mu_1 = \mu_2$, and use the pooled variance $s_p^2$ as an unbiased estimator of the population variance $\sigma^2$. Thus, under $H_0$, the denominator in eq. (8) can be modified accordingly and the t-statistic defined as

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n}\bigl(2 + (n-1)(r_1 + r_2)\bigr)}} \tag{9}$$

where

$$s_p^2 = \frac{s_1^2 + s_2^2}{2} \tag{10}$$

We note that the statistic $t$ defined in eq. (9) will follow the theoretical t distribution with $n_1 + n_2 - 2 = 2(n-1)$ degrees of freedom only when $r_1 = r_2 = 0$, i.e., when observations within each sample are independent. Otherwise, it will deviate from the theoretical t distribution as $r_1$ and $r_2$ increase (we present further experimental evidence in Section VI-A). In other words, when the sample observations are correlated, the computed t statistic does not follow the theoretical t distribution, and hence using it for statistical inference can lead to incorrect decisions. Therefore, using a t test to compare the mean values of each column of matrix A as done in the current approach may be less accurate because $r_1$ and $r_2$ are wrongly set to zero despite the possible dependencies between sample observations.

In order to correct this, two solutions are possible: (1) a theoretical solution wherein one needs to find the modified theoretical t distribution which is a function of not just the degrees of freedom but also depends on $r_1$ and $r_2$, (2) a more practical approach where we attempt to alter the experimental design so that $r_1$ and $r_2$ are qualitatively minimized. The first solution is more challenging because it requires rigorous mathematical treatment and will typically lead to an application-specific rather than a general solution to the said problem (for instance refer to [27] for the case of financial data analysis or [28] which presented asymptotic solutions in case of spatial data where long range dependence is present). The second solution is more feasible since it involves a more careful experimental design and/or data collection (e.g., refer to [29] where necessary adjustments are made for more accurate medical data analysis), which is typically achievable when comparing ML based quality estimators. Therefore, in this paper, we propose to eliminate or reduce dependency among sample observations by proper data partitioning as detailed in the next section.

IV. PRACTICAL AND STATISTICAL SIGNIFICANCE IN COMPARING ML BASED QUALITY ESTIMATORS: THE PROPOSED APPROACH
In the previous section, we highlighted a few drawbacks of the current approach. With the goal of remedying them, we present a modified set of guidelines in Figure 2. The major steps are discussed below.

A. Meaningful Partitioning of Data
As discussed, the random train-test split used in the current approach ensures that training and testing sets do not overlap (or share content). This is, of course, desirable. However, this process does not necessarily constrain the different test sets (generated in each iteration) to be completely independent of each other. As a result, the independence of elements (observations) in each column of A is questionable, and violates one of the basic requirements for further statistical testing of such data. Hence, in the first step, we restrict $\rho_i^{(j)}$ values to be obtained on disjoint test sets (with each set representing a particular condition to be examined in the context of perceptual quality). For instance, in an image or video quality prediction task, the disjoint test sets can be obtained based on the number of reference contents used in the dataset. In this manner, each test set will consist of all images from one source/reference content. Likewise, in broadcast applications, the type of distortions induced by different codecs (at varying bit rates and/or packet loss rate etc.) or the type of content (e.g., based on color and temporal information) can be used
to obtain a physically meaningful set of conditions for which quality prediction is desired. In other words, we advocate a k fold cross validation (CV) wherein each fold comprises data that is typically from independent (disjoint) sets that can be conveniently defined according to the conditions to be tested and the eventual goal of analysis (e.g., testing a certain combination of conditions in joint audiovisual compression). By using the k fold CV approach meaningfully, we can modify matrix A as

$$A_{\mathrm{condition}} = \begin{bmatrix} \rho_1^{(1)} & \cdots & \rho_1^{(M)} \\ \vdots & \ddots & \vdots \\ \rho_{N_{\mathrm{condition}}}^{(1)} & \cdots & \rho_{N_{\mathrm{condition}}}^{(M)} \end{bmatrix} \tag{11}$$

Note that the dimensionality of each column of $A_{\mathrm{condition}}$ will not be arbitrary but equal to the number of meaningful or desired independent conditions/categories in the dataset, unlike that of A. Therefore, unlike the current approach, we seek to emphasize more the qualitative aspects of data partitioning (train-test split), and reduce dependency on an arbitrary number of randomizations.

Fig. 2. Schematic flow diagram of the steps involved in the proposed approach for comparing methods using a single dataset (i.e., splitting the same dataset for generating training and test sets).

B. Practical Considerations in Method Comparison
Practical significance is not always the same as statistical significance [21]–[24]. As mentioned in Section II-C, this aspect has been largely ignored in the current approach. Recall that statistical tests (like the t test in the current approach) only provide information if a treatment effect is present or not but do not explicitly answer the question whether the amplitude of the treatment effect (recall that in our case this can be difference in mean correlation coefficients or difference in median AUC values etc.) is practically meaningful or not. Thus, in the proposed set of guidelines, we seek to use effect size to quantify practical significance. It can be computed by comparing each column pair $(i, j)$ of $A_{\mathrm{condition}}$. A commonly used measure of effect size is Cohen's d which is computed as

$$d(i, j) = \frac{\bar{\rho}(i) - \bar{\rho}(j)}{\sqrt{\dfrac{(s_\rho(i))^2 + (s_\rho(j))^2}{2}}} \tag{12}$$

where $\bar{\rho}(i)$, $\bar{\rho}(j)$ and $s_\rho(i)$, $s_\rho(j)$ denote the means and standard deviations of the i-th and j-th column, respectively.

The effect sizes $d(i, j)$ for each comparison are stored in the matrix E. An effect size value higher than a threshold $th$ can be deemed practically significant (typically, $th \geq 0.8$ is considered a large effect). It is important to emphasize that we do not use effect size as an explicit statistical test. Rather it is used to assess how large the observed effect is, and thus provides a context about the possible practical differences between the methods being compared. In order to establish if the treatment effect is actually present or not, we will use either ANOVA or nonparametric confidence intervals (as explained in Section IV-C).

Lastly, since we advocate data partitioning according to physically meaningful conditions (instead of random splits), recall that the dimensionality of each column of $A_{\mathrm{condition}}$ will be $N_{\mathrm{condition}}$ (the number of meaningfully disjoint conditions in the dataset, which will depend on the dataset as well as the context of analysis). Thus, each element $\rho_i^{(j)}$ represents the performance on a practically more meaningful chunk of data which in turn provides useful initial information about the possible strength (or weakness) of ML based quality predictors for that
particular condition. Therefore, a comparison of $\rho_i^{(\cdot)}$ values across rows of matrix $A_{\mathrm{condition}}$ can reveal meaningful local information about method performance (instead of making global observations, as is the case with the current approach). Such information is also obviously useful from the viewpoint of method development and calibration.

C. Testing the Presence of Treatment Effect
Treatment effect, in our context, is the relative overall change (for instance, the average change) in performance measure $\rho_i^{(j)}$ as M different ML based predictors are tested. We say that the treatment effect is present if there is a systematic change in $\rho_i^{(j)}$ over a set of test conditions. Otherwise, we conclude that the treatment effect is absent, i.e., the observed differences in $\rho_i^{(j)}$ for different predictors are merely attributed to chance. To that end, a statistical test can be employed and a decision taken based on the chosen significance level $\alpha$ [20].

Recall that in the current approach, the t test is applied to compare the mean values (or the Wilcoxon test for median) of each column of matrix A in a pairwise manner [2]–[12]. This means there are multiple statistical comparisons which may suffer from drawbacks related to inflated type I error probability. While p-value or significance level adjustments have been proposed in literature to control the family-wise error rate [18], these adjustment procedures are debatable due to practical reasons (for instance refer to [30]–[32]). Perhaps the more important concern due to the use of multiple t-tests is the fact that each comparison simply ignores information from other groups/samples. Hence, the comparison is more localized and this is not in line with the main goal of analysis, which is to jointly compare a set of M ML based quality estimators. Therefore, in the proposed guidelines, we advocate the use of one-way ANOVA [20] on the data matrix $A_{\mathrm{condition}}$ to test the null hypothesis that the means of the populations represented by the samples (groups) in the columns of $A_{\mathrm{condition}}$ are equal.

1) ANOVA Based Statistical Significance Testing: If ANOVA rejects the null hypothesis, then it implies that a treatment effect is detected. It means that at least two out of the M ML based predictors exhibit systematically large differences in terms of the mean $\rho_i^{(j)}$ values. In order to find such pairs, one can use posthoc tests [20]. However, most posthoc tests mainly rely on adjusted significance (or p-values) to compensate for the inflated type I error rate. This, however, can have undesirable consequences including inflation of type II error [30]–[32]. Moreover, such posthoc tests (or for that matter ANOVA) do not reveal information about the magnitude of the observed treatment effect. Thus, in the proposed guidelines we use the information in matrix E to identify such statistically different pairs (i.e., we conclude that there is a statistical and practical difference between methods i and j if $E(i, j) > th$). Such an approach allows us to meaningfully consider practical differences and avoids multiple comparison adjusted post hoc procedures. On the other hand, if the null hypothesis is not rejected by ANOVA, we conclude that there is no treatment effect and the observed differences in mean values of $\rho_i^{(j)}$ are attributed to random sampling errors (i.e., method pairs are statistically similar). In this case, the effect size matrix E can still be computed and reported, especially if the p value for ANOVA is close to the chosen significance level $\alpha$, in order to have better interpretation of the results [21]–[24].

As a final note on the use of ANOVA, we point out that the often cited assumption of normality required in ANOVA (and other parametric tests such as the t-test or confidence interval for mean) is not on the data but only on the sampling distribution of mean (which is guaranteed by the central limit theorem) [25]. Moreover, these tests are accurate even in case of violation of homogeneity of variance provided that the sample (group) size is the same [25]. This is true in the considered application because ML based predictors will obviously be tested on the same and equal number of test conditions. Thus, in the proposed guidelines, we strongly recommend the use of ANOVA as an omnibus test. However, in occasional cases such as extremely skewed $\rho_i^{(j)}$ values, where the mean cannot be used, we recommend the use of nonparametric CIs, which are described in the next subsection (Section IV-C2). We however draw caution over the use of nonparametric tests (including the ones based on ranks such as the Mann-Whitney U test or the Kruskal-Wallis test) as these generally have lower statistical power [33] and higher sensitivity to the homogeneity of variance condition [34] as compared to parametric ones.

2) Construction of Nonparametric Confidence Intervals: Nonparametric CIs are popular and have been used in many applications [35]–[38] and are obtained empirically via bootstrapping [35]–[37]. In our case, we construct them for the chosen summary statistic for $\rho_i^{(j)}$ values. We begin by treating $\rho_i^{(j)}$ values in matrix $A_{\mathrm{condition}}$ as independent sample observations. In other words, each column of $A_{\mathrm{condition}}$ can be assumed to be a random sample drawn from the corresponding population. Let the true population parameter of interest be $\theta$. Then, the goal is to make inferences about $\theta$ by using $\hat{\theta}$ (the estimated parameter value from sample). However, practically since $\rho_i^{(j)}$ is computed on a limited number of stimuli, we can obtain an interval estimate of $\hat{\theta}$ (instead of a point estimate) by constructing its CI via resampling [35]–[37]. Such an interval estimate provides more descriptive information about $\theta$. Note that we will obtain the said CI for each column of $A_{\mathrm{condition}}$, and therefore drop the index j in the sequel for convenience in notation. The procedure to obtain nonparametric CIs is as follows:
• Obtain $\hat{\theta}$ from the given data $\{\rho_1, \rho_2, \ldots, \rho_{N_{\mathrm{condition}}}\}$.
• Obtain L repeated bootstrapped samples $\{\rho_1^*, \rho_2^*, \ldots, \rho_{N_{\mathrm{condition}}}^*\}$, and the ordered vector $\hat{\theta}^* = [\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_L^*]$ such that $\hat{\theta}_1^* < \hat{\theta}_2^* < \cdots < \hat{\theta}_L^*$.
• Using pivots to construct the desired CI, we compute the lower and upper bound for the CI as $\left[2\hat{\theta} - \hat{\theta}^*_{(1-\alpha/2)},\ 2\hat{\theta} - \hat{\theta}^*_{(\alpha/2)}\right]$ where $\alpha$ is the desired significance level, and $\hat{\theta}^*_{(h)}$ is the h-th percentile value [35]–[37].

As indicated in Figure 2, the methods for which the nonparametric CIs do not overlap and $E(i, j) > th$ can be deemed to be meaningfully different (practically and statistically). However, if the CIs overlap we conclude that the methods are statistically similar ($E(i, j)$ values can be computed in this case if the overlap is minimal).
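
The decision rule above (non-overlapping pivot-based bootstrap CIs together with $E(i, j) > th$) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the paper's implementation: the function names, the toy $\rho$ values, the choice of the mean as summary statistic, L = 2000 resamples, and the use of the magnitude of $d(i, j)$ in the threshold comparison are all hypothetical choices made here for the example.

```python
import random
import statistics
from math import sqrt


def cohens_d(col_i, col_j):
    """Effect size as in eq. (12): mean difference over the pooled SD."""
    pooled_sd = sqrt((statistics.variance(col_i) + statistics.variance(col_j)) / 2)
    return (statistics.mean(col_i) - statistics.mean(col_j)) / pooled_sd


def pivot_ci(rho, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Pivot bootstrap CI: [2*theta_hat - q(1-alpha/2), 2*theta_hat - q(alpha/2)]."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    theta_hat = stat(rho)
    # L = n_boot resamples with replacement; keep the estimates in sorted order.
    boot = sorted(stat(rng.choices(rho, k=len(rho))) for _ in range(n_boot))

    def quantile(p):
        # Nearest-rank style empirical quantile of the ordered estimates.
        idx = min(n_boot - 1, max(0, int(round(p * (n_boot - 1)))))
        return boot[idx]

    return 2 * theta_hat - quantile(1 - alpha / 2), 2 * theta_hat - quantile(alpha / 2)


def meaningfully_different(col_i, col_j, th=0.8, **ci_kwargs):
    """True if the two CIs do not overlap and |d(i, j)| exceeds th."""
    lo_i, hi_i = pivot_ci(col_i, **ci_kwargs)
    lo_j, hi_j = pivot_ci(col_j, **ci_kwargs)
    no_overlap = hi_i < lo_j or hi_j < lo_i
    # Assumption: the magnitude of d is compared with th, so column order is irrelevant.
    return no_overlap and abs(cohens_d(col_i, col_j)) > th


# Hypothetical columns of A_condition: one rho value per disjoint test condition.
method_a = [0.91, 0.93, 0.90, 0.94, 0.92, 0.95, 0.93, 0.91]
method_b = [0.80, 0.82, 0.79, 0.83, 0.81, 0.84, 0.80, 0.82]
print(pivot_ci(method_a), pivot_ci(method_b))
print(meaningfully_different(method_a, method_b))
```

For heavily skewed $\rho$ values, the `stat` argument could be switched to `statistics.median`, in line with the recommendation to use nonparametric CIs when the mean is not an appropriate summary.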

D. Performance Evaluation by Considering Data Uncertainty and Effect Size
As mentioned in Section II-D, the use of correlation coefficient or RMSE as the performance measure $\rho$ does not take into account the inherent subjective uncertainty (variabilities). Therefore, in the proposed guidelines, we recommend the use of AUC (area under curve) as $\rho$. The definition of AUC is motivated by the fact that a classification error is made when the subjective score and the objective score (in our case, from an ML based predictor) lead to different conclusions on a pair of data points [14]. The idea was further extended to include Receiver Operating Characteristics (ROC) [39] based comparison [40]–[42], and states that a better performance measure should be able to distinguish (classify) different quality levels by considering the dispersion (uncertainty) in the opinion scores. This can be treated as a binary classification problem and analyzed based on ROC analysis [40]–[42]. The Area Under Curve (AUC) is then used to evaluate discrimination abilities [39]. However, in most of the previous works [40]–[42], statistical significance has been used as the criterion to distinguish between quality levels of a pair of stimuli. That is, if a t test between the individual scores of two stimuli results in the rejection of the null hypothesis, then the corresponding pairs are said to be different. Otherwise, they are deemed the same. The AUC values from the ROC analysis performed only on statistically different pairs are then used to compare objective quality predictors. While the described AUC based performance measure takes into account subjective uncertainties, it suffers from two limitations.

First, the use of statistical significance (as done in [40]–[42]) to classify pairs as same or different is problematic because it may not always be the same as practical significance. As noted in Sections II-C and IV-B, practically insignificant differences may sometimes be statistically significant (and vice-versa) [21]–[24]. Thus, existing works [40]–[42] tend to rely exclusively on p values, which provide the information if an effect is present or not. However, a p value does not answer the question whether the treatment effect is large enough to be practically relevant or not. Second, the use of multiple comparisons (t tests in this case) is in many cases accompanied by adjustments to significance levels (or the p-values) in order to control the family-wise error rate. Such adjustments are, however, debatable due to practical reasons (for instance refer to [30]–[32]).

Therefore, we propose to rely less on a purely p-value based statistical decision, and rather focus on practically meaningful differences via the use of effect size. It also helps to avoid multiple comparisons as we do not explicitly make a statistical decision using the effect size. We proceed as follows.

1) Compute the effect size between stimuli pair $(i, j)$ based on Cohen's d,

$$d(i, j) = \frac{MOS_i - MOS_j}{\sqrt{\dfrac{s_i^2 (n_1 - 1) + s_j^2 (n_2 - 1)}{(n_1 - 1) + (n_2 - 1)}}} \tag{13}$$

where $MOS_i$, $MOS_j$ and $s_i^2$, $s_j^2$ denote the sample means (mean opinion score) and variances respectively, and $n_1$ and $n_2$ are the sample sizes (i.e., the number of human participants who rated the multimedia quality). As before, if $d(i, j) > th$, then the corresponding stimuli pair can be deemed to exhibit practically meaningful differences in terms of perceptual quality. As mentioned, this is the major difference from previous works [40]–[42]. As a result, each stimuli pair $(i, j)$ will be assigned a label of 1 if $d(i, j) > th$ and 0 otherwise. This forms the ground truth label vector.

2) Next, we define the difference in predicted scores from the ML based quality predictor (MLQP) for stimuli pair $(i, j)$ as $\Delta_{MLQP}(i, j) = MLQP(i) - MLQP(j)$. Then, for each stimuli pair $(i, j)$ a label of 1 is assigned if $\Delta_{MLQP}(i, j) > th_{roc}$ and 0 otherwise. Note that $th_{roc}$ is different from $th$ because the latter is used to fix a threshold based on which we conclude whether the stimuli pair $(i, j)$ has practically different quality levels. Thus, the value for $th$ is chosen once and fixed for the experiment, and is essentially employed to obtain the ground truth label vector as explained in the previous point. By contrast, $th_{roc}$ is allowed to vary so that we obtain a 2D plot of false positive rate against the true positive rate [39].

3) Then, the AUC value is obtained by computing the area under the curve in the said 2D plot. It will be higher if the ML based estimator is able to correctly recognize the stimulus of higher quality in the pair (only for pairs which have practically different quality levels).

V. ASSESSING LEARNING ABILITY OF ML BASED QUALITY PREDICTOR: A PERMUTATION TEST
While performance measures $\rho_i^{(j)}$ can be used to quantify and compare prediction accuracies, these do not always provide explicit information about the response of an ML based quality predictor to unstructured data. As discussed in Section II-E, this is important because ML based methods are typically treated as black boxes. Hence, the ability of a model to distinguish real data patterns from random ones can potentially provide insights into the learning process [19]. Ojala and Garriga [19] presented two such tests in the context of the classification problem. The first test is designed to assess whether the classifier has found a real class structure in the data (note that this test has already been used traditionally in computational biology [43] and attribute selection in decision trees [44]). The second test studies whether the classifier is exploiting the dependency between the features in classification. In our context, we are more interested in the second test primarily because the goal is to assess if the user-defined features share meaningful dependencies to capture the complex mapping between content and its quality. Moreover, it is also of interest to study if the chosen ML algorithm (regressor) is able to exploit the said feature dependency (if it exists) to learn a more accurate and generalizable mapping function $f$.

In the following, we describe a permutation test which is based on the second test in [19] but with two important modifications to test ML based quality predictors. First, unlike [19] which dealt with the classification problem, we deal
with regression since in most cases ML based predictors are designed and trained on continuous valued opinion scores. This is achieved via a data dependent binning process (as explained in Section V-B). Second, we define a modified criterion for the error metric so that only practically (or statistically) meaningful differences are penalized (and not simply the changes in point-based estimates as is the case with correlation coefficient or RMSE).

A. Motivation
As already mentioned, testing the response of an ML based quality predictor to unstructured data has been largely ignored in [2]–[12] on ML based multimedia quality assessment. We argue that such a test is useful as it provides insights into the learning ability of a given model, and helps to assess the competence of the model from the viewpoint of learning (training). To develop such a test, we can borrow the idea of a permutation test [19], [20], and assess systematically the effect of feature changes on the prediction performance. We assume that the given ML based quality predictor was trained on structured data of perceptual quality, and that it has been able to exploit the underlying structure (i.e., inter dependency between feature values and their relation to MOS) for learning the mapping function $f$. Hence, when we test this predictor on untrained but structured data $D$, the corresponding error $e(f, D)$ is expected to be lower than the error $e(f, D')$ on unstructured (randomized) data $D'$. This is because the interdependency between features has been disrupted in $D'$ due to random feature permutation. Accordingly, we can define the null hypothesis as $H_0$: the given ML based quality predictor has not learnt the dependency between feature values for predicting the quality score. The empirical p value in this case can be simply defined as the ratio of the number of times the error on random data is less than that on structured data to the total number of randomizations.

A high p value points to three possibilities [19]: (a) there are no dependencies between the features that are used in the ML based predictor; (b) there are some dependencies between the features in the data but they do not help to obtain a more accurate mapping function $f$; or (c) there are useful dependencies between the features in the data that help to find a better $f$ but the chosen ML algorithm (regressor) is not able to exploit them. In our context, if an ML based quality predictor obtains a high p value in the said permutation test, then it implies that either the feature set is sub-optimal (possibility (a) or (b)) or the chosen regressor is not competent enough, i.e., possibility (c). In either case, a high p value reveals a potential deficiency in the given ML based quality predictor. Remedial action might include using additional context specific features or modifying the existing features. In addition, a different regressor that is better at exploiting feature interdependency may be employed.

B. Data Dependent Binning of Continuous MOS
In order to tackle the problem of regression, we proceed as follows. Let $X$ be an $n \times m$ data matrix (the i-th row and j-th column vectors can be denoted as $X_i$ and $X^j$, respectively). In multimedia quality assessment, the elements of $X$ typically represent the value of an attribute (feature) that is related to perceptual quality. Further, a continuous value $y_i \in \mathbb{R}$ representing the MOS (average of ratings from some specified number of human subjects) is associated with each data point $X_i$. Then the labeled data set will be $D = \{(X_i, y_i)\}_{i=1}^{n}$.

In order to obtain randomized versions of $D$, feature values contained in $X$ need to be permuted. However, such permutation is restricted to feature values within each class label [19]. Since the target values in the considered case are continuous, we first employ a strategy to discretize them, and in the process bin them into $B$ bins. In order to preserve the distribution, we obtained the boundary points of the histogram of the continuous values. Next, thresholding was applied so that the continuous values between the specific boundary points are all assigned the same label. For instance, values smaller than the first boundary point were all assigned to class label 1, the values between the first and second boundary points assigned to class label 2 and so on. The main advantage of the said binning process is that it preserves the original data distribution by considering data dependent thresholds (instead of using pre-defined threshold values). It is also worth highlighting that binning of continuous MOS values into discrete classes (bins) is practically meaningful since the class labels can assume a similar meaning as those in standard subjective rating methodologies. For instance, the absolute category rating (ACR), which is a popular and standard rating method, employs discrete labels: 1 (bad), 2 (poor), 3 (fair), 4 (good) and 5 (excellent). Thus, the binning of continuous MOS offers a physically valid interpretation of the resultant class labels. As a result of binning, $y_i \to c$ where $c = [c_1, c_2, \ldots, c_B]$ represents the discrete class label vector. Note that such binning is used only for feature permutations, and subsequent error metric computations will be based on the continuous $y_i$ values.

C. Error Metric for Permutation Test
The goal of an ML based quality predictor is to predict the quality score (continuous) of new data points (i.e., new test signal) by training a regressor from $D$. The learned function $f$ maps the feature values into an objective quality score, i.e., $f : X \to \mathbb{R}$.

To compute the final error for the permutation test, we begin with the observation that any performance measure $\rho$ (including correlation coefficient, MSE, area under the curve AUC [14], etc.) is always computed on a finite sample. Specifically, in each iteration, the test set has $N_{test}$ stimuli for which quality is predicted and $\rho$ computed against the corresponding subjective ground truth. Thus, $\rho$ is a random variable and will be associated with a CI. The said CI can be computed via analytic methods. However, such methods depend on a few assumptions. For instance, some require a fixed distribution of positive (i.e., class label 1) and negative (i.e., class label 0) scores (recall that in our context, we assign a label 1 if $d(i, j) > th$ and 0 otherwise) [45], [46], while others assume a fixed classifier error rate [47]. Such assumptions may not be satisfied in the considered application. Moreover, CLT based approximations are applicable only in case of sums or means of iid samples (this is also true for computing CI around
correlation coefficient after Fisher's transformation [20]), and not to other performance measures such as the AUC or RMSE* values. Therefore, we propose the use of nonparametric CIs of $\rho$. The nonparametric CI on data $D$ can be denoted as $I_\rho(D) = [L_\rho(D)\ \ U_\rho(D)]$ where the subscript $\rho$ denotes the chosen performance measure (or error metric). An error (or failure) is said to have occurred if $U_\rho(D') \geq L_\rho(D)$, i.e., if the confidence intervals on structured data $D$ and unstructured (randomized) data $D'$ overlap or the lower bound of the former is less than (or equal to) the upper bound of the latter. For reasons outlined in Sections II-D and IV-D, we recommend the use of AUC (defined in Section IV-D) as the performance measure (error metric) for the permutation test.

D. Permutation Test Description
To apply the permutation test, we follow similar notation as [19], and proceed as follows.
• For the original data $D = \{(X_i, y_i)\}_{i=1}^{n}$, bin the continuous MOS, i.e., $y_i \to c$ to obtain $D = \{(X_i, c_j)\}$.
• Let $X(c)$ be the submatrix of $X$ in class label $c$, i.e., $X(c) = \{X_i \mid y_i = c\}$, of size $l_c \times m$.
• Let $\pi_1, \ldots, \pi_k$ $(k = 1, \ldots, m)$ be k independent permutations of $l_c$ elements.
• Let $X'(c)$ be a randomized version of $X(c)$ where each $\pi_j$ is applied independently to the column $X(c)^j$. Thus, $X'(c) = [\pi_1(X(c)^1), \ldots, \pi_k(X(c)^k)]$.
• Let $X' = \{X'(c) \mid c = c_1, c_2, \ldots, c_B\}$. Then, one randomized version $D' = \{(X'_i, y_i)\}_{i=1}^{n}$ can be obtained.
• The empirical p value for the permutation test (denoted as $p_1$ to distinguish it from the one obtained from ANOVA) can be computed as

$$p_1 = \frac{\bigl|\{D' \in \hat{D} : U_\rho(D') \geq L_\rho(D)\}\bigr| + 1}{N_{iter} \cdot N_{\mathrm{condition}} + 1} \tag{14}$$

where $\hat{D}$ denotes the set of $N_{iter}$ randomized versions $D'$ of the original data $D$.

VI. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we present experimental results to analyze and compare a few existing ML based methods by using both the current (the steps are shown in Figure 1) and the proposed set of guidelines (the steps involved are shown in Figure 2 and the details have been presented in Sections IV and V). Before that, we present experimental evidence to support the theoret-

its simplest form can be expressed as

$$x_t = \mu + r(x_{t-1} - \mu) + \epsilon_t \tag{15}$$

where $\epsilon_t \sim N(0, \sigma_\epsilon^2)$ is a zero mean error term, $r$ is the correlation coefficient and $\mu$ is the population mean. In order to obtain the sampling distribution of $t$, we generated two populations (having the same mean and variance) using the above Gaussian AR(1) process (we set $\mu_1 = \mu_2 = 0.5$, $\sigma^2 = 1$). Then, each population was repeatedly sampled to obtain $x_1$ and $x_2$ (sample size $n$ was set to 1000 which is equal to the typical number of iterations $N_{iter}$ used in the current approach) and the statistic $t$ computed in each iteration. Note that the said procedure assumes the null hypothesis $H_0$ to be true and ensures homogeneity of variance. In order to compare and see the effect of dependent observations, we obtained the said empirical distribution of $t$ in three cases, each time taking different values of $r_1$ and $r_2$.

The resulting sampling distributions for the three cases are shown in Figure 3. The first plot corresponds to $r_1 = r_2 = 0$, i.e., independent sample observations, and this is the most typical use case of the t-test. As can be noted, the empirical sampling distribution of the t statistic follows the theoretical t distribution with the corresponding degrees of freedom reasonably well. In contrast, when $r_1$ and $r_2$ are not zero, one can clearly see that the empirical sampling distribution does not follow the theoretical curve. Particularly, we notice that as $r_1$ and $r_2$ increase, the said deviation from the theoretical curve increases. This follows and supports the theoretical analysis made in Section III-B.

In summary, when the sample observations are correlated, the computed t statistic does not follow the theoretical t distribution, and hence using it for statistical inference can lead to incorrect decisions. This is one of the major drawbacks with the current approach. The reader will notice that the proposed guidelines take a step towards reducing or eliminating the said dependencies and hence allow more accurate statistical analysis (using ANOVA or nonparametric CIs).

B. Test Dataset and Methods Compared
For the experiments, we chose the specific domain of image quality assessment primarily due to easy availability of datasets (along with individual subjective scores) and implementations of a few ML based objective quality predictors. We use the CSIQ [48] database as a testbed for the
ical analysis made in Section III and examine the effect of experiments. There are no particular reasons for choosing this
dependent (correlated) samples observations on the sampling dataset apart from the fact that it provides subjective quality
distribution of t. scores (in the form of difference MOS, DMOS) and standard
deviations of raw scores. The images in the dataset were
A. Correlated Sample Observations: Effect on Empirical generated from 30 reference (original) images which were
Sampling Distribution of t-Statistic distorted by 6 different distortion types at 4 to 5 distortion
To examine the impact of correlated sample observations, levels. This resulted in 866 distorted images. Similarly, we
we generated the sampling distribution of the test statistic t chose 6 existing and representative ML based methods for
defined in eq. (9). As discussed in Section III-B, t will follow comparison keeping in mind the diversity of features used.
the theoretical t distribution with n1 + n2 − 2 = 2(n − 1) These include the methods which use features based on statis-
degrees of freedom only when r1 = r2 = 0, i.e., when tics of natural image [49], gradient magnitude and Laplace
observations within each sample are independent. To test this, transform [8], singular value decomposition [9] (SVD),
we first employed a Gaussian AR(1) process in order to gen- gradient-weighted histogram of local binary pattern [50],
erate the population with correlated elements. The process in natural scene statistics of contrast-distorted images [51] and
Fig. 3. Sampling distribution of t statistic (computed via eq. (9)) for different r1 and r2 values. In each plot, the continuous curve indicates the theoretical
t-distribution with 2(n − 1) = 1998 degrees of freedom.
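
The experiment behind Fig. 3 can be approximated with a short simulation. The sketch below is illustrative only and rests on several assumptions that are not taken from the paper: the AR(1) process is parametrized as x_t = mu + r (x_{t-1} - mu) + e_t with innovation variance sigma^2 (1 - r^2), eq. (9) is taken to be the standard pooled-variance two-sample t statistic, samples are drawn directly as consecutive AR(1) segments rather than from a stored population, and the function names, sample size, repetition count and the 1.96 threshold are hypothetical choices. It reports how often |t| exceeds the nominal 5% critical value when the null hypothesis holds by construction.

```python
import math
import random


def ar1_sample(n, mu, sigma, r, rng):
    """Draw n consecutive values of a stationary Gaussian AR(1) process.

    Assumed parametrization: x_t = mu + r * (x_{t-1} - mu) + e_t with
    e_t ~ N(0, sigma^2 * (1 - r^2)), so the marginal variance is sigma^2
    and r is the lag-1 correlation.
    """
    innov_sd = sigma * math.sqrt(1.0 - r * r)
    x = rng.gauss(mu, sigma)  # start from the stationary distribution
    out = [x]
    for _ in range(n - 1):
        x = mu + r * (x - mu) + rng.gauss(0.0, innov_sd)
        out.append(x)
    return out


def mean_and_var(x):
    """Sample mean and unbiased sample variance."""
    n = len(x)
    m = sum(x) / n
    return m, sum((v - m) ** 2 for v in x) / (n - 1)


def pooled_t(x1, x2):
    """Two-sample t statistic with pooled variance (assumed form of eq. (9))."""
    n1, n2 = len(x1), len(x2)
    m1, v1 = mean_and_var(x1)
    m2, v2 = mean_and_var(x2)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))


def rejection_rate(r1, r2, n=200, reps=400, crit=1.96, seed=0):
    """Fraction of repetitions with |t| > crit although both means are equal."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x1 = ar1_sample(n, 0.5, 1.0, r1, rng)
        x2 = ar1_sample(n, 0.5, 1.0, r2, rng)
        hits += abs(pooled_t(x1, x2)) > crit
    return hits / reps
```

Calling `rejection_rate(0.0, 0.0)` and then `rejection_rate(0.6, 0.6)` gives a quick numerical counterpart to the plots: with independent observations the rate should stay near the nominal level, while positive within-sample correlation inflates it, because the pooled variance underestimates the variability of the sample means.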

phase of Fourier transform [52]. We respectively denote these methods as BRISQUE, GMBIQA, SVDIQA, GLBPIQA, CDIQA and FTSIQ.

C. Application of the Current Approach and Analysis of the Drawbacks
First, we follow the current approach of comparing the 6 ML based methods. To that end, we employ an 80-20 train-test split, i.e., 80% of the data for training and the remaining 20% for testing. Since there are 866 images in our testbed, this leads to $N_{train} = 693$ and $N_{test} = 173$, and we ensure that the training and testing sets do not share content (i.e., no overlap). We choose AUC as the performance measure $\rho_i^{(j)}$, where $i = 1, \ldots, N_{iter}$ is the iteration index and $j = 1, \ldots, 6$. Hence, the matrix $A$ will be $N_{iter} \times 6$ (columns 1 to 6 respectively correspond to GLBPIQA, SVDIQA, GMBIQA, CDIQA, FTSIQ and BRISQUE). The mean AUC values over $N_{iter}$ iterations for the 6 methods were: GLBPIQA = 0.8201, SVDIQA = 0.9403, GMBIQA = 0.9143, CDIQA = 0.5439, FTSIQ = 0.9770 and BRISQUE = 0.8308. Next, statistical comparison is carried out by applying the t test pairwise on columns of $A$ to obtain the symmetric decision matrix $D$. We obtained the said decision matrix in two cases, $N_{iter} = 1000$ and $N_{iter} = 30$, respectively denoted as $D_{1000}$ and $D_{30}$.

$$D_{1000} = \begin{bmatrix}
- & 1 & 1 & 1 & 1 & 1 \\
1 & - & 1 & 1 & 1 & 1 \\
1 & 1 & - & 1 & 1 & 1 \\
1 & 1 & 1 & - & 1 & 1 \\
1 & 1 & 1 & 1 & - & 1 \\
1 & 1 & 1 & 1 & 1 & -
\end{bmatrix} \tag{16}$$

$$D_{30} = \begin{bmatrix}
- & 1 & 1 & 1 & 1 & 0 \\
1 & - & 0 & 1 & 1 & 1 \\
1 & 0 & - & 1 & 1 & 1 \\
1 & 1 & 1 & - & 1 & 1 \\
1 & 1 & 1 & 1 & - & 1 \\
0 & 1 & 1 & 1 & 1 & -
\end{bmatrix} \tag{17}$$

In line with the analysis in Section II, we can summarize the drawbacks of the current approach with regard to the experimental results on the CSIQ dataset.
• As a random train-test split is used (of course ensuring that the training and testing sets do not share content), it does not guarantee that the same content will not appear again in the test set in different iterations. In effect, the $\rho_i^{(j)}$ values on which the pairwise t test is applied are not necessarily independent. Then, in light of the analysis in Section III-B and the experimental results in Figure 3, the reliability of the t test result is questionable because the computed t statistic may not follow the theoretical t distribution.
• While the mean AUC values for each method allow a global comparison (for instance, the AUC value for FTSIQ is clearly better than that of CDIQA), they do not reveal method deficiencies for specific conditions. For instance, questions such as which ML based predictor performs the best for a particular distortion type or for a class of source images remain unanswered.
• The effect of the $N_{iter}$ value is evident by comparing $D_{1000}$ and $D_{30}$. In particular, notice that all entries of $D_{1000}$ are 1, implying that all the methods are statistically different from each other. However, the conclusions from $D_{30}$ obviously do not agree with those from $D_{1000}$. Thus, statistical conclusions (inferences) may simply depend on the arbitrary choice of $N_{iter}$.
• There is no explicit information about the learning ability of any of the methods.

D. Test Results Using Proposed Guidelines
As explained in Section IV-A, the first step in the proposed guidelines is meaningful data partitioning. To that end, one reasonable strategy is to partition according to reference images. Hence, we divide the data such that each test set consists of distorted images from one reference image while the distorted versions of the remaining reference images form the training set. This leads to a total of 30 test sets (one for each reference image). The said partitioning helps to assess the performance according to the type of source (reference) content. The AUC is chosen as the performance measure $\rho_i^{(j)}$. Therefore, the matrix $A_{condition}$ in this case will be $30 \times 6$. The entries of matrix $A_{condition}$ are shown in Figure 4, for the sake of better visualization. As noted in Section II-C, this provides meaningful initial information about method performances. We have indicated in Figure 4 the boundary at AUC = 0.7. As can be seen, the methods do not have consistent AUC values across conditions, and fall below 0.6 in certain cases. Because an AUC value equal to 0.5 corresponds to a random guess, these highlight the limitations or inability of the corresponding ML based method for that condition (content).

Fig. 4. Initial performance comparison based on AUC values according to source images in CSIQ dataset. Figure best viewed in color.

Fig. 5. Histograms of $\rho_i^{(j)}$ values for the methods GLBPIQA, SVDIQA, GMBIQA (first row) and CDIQA, FTSIQ, BRISQUE (second row). Figure best viewed in color.


Such information is useful not just for method comparison
but also during method development as it points to specific
deficiencies which for instance could be mitigated by use of
additional features and/or ML method.
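
The content-based partitioning and the per-condition performance matrix described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the released software: the function names, the `fit_predict` callable interface and the binary `labels` are hypothetical, and `rank_auc` is a generic rank-based AUC used as a stand-in because the AUC definition of Section IV-D is not reproduced here.

```python
from collections import defaultdict


def leave_one_reference_out(ref_ids):
    """Yield (reference, train_idx, test_idx), one split per reference image.

    ref_ids[i] names the reference image that distorted image i came from,
    so training and test sets never share source content.
    """
    groups = defaultdict(list)
    for idx, ref in enumerate(ref_ids):
        groups[ref].append(idx)
    for ref in sorted(groups):
        test_idx = groups[ref]
        train_idx = [i for other in sorted(groups) if other != ref
                     for i in groups[other]]
        yield ref, train_idx, test_idx


def rank_auc(scores, labels):
    """Probability that a positive item outscores a negative one (ties = 0.5).

    Generic stand-in metric, not the Section IV-D definition.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def condition_matrix(ref_ids, fit_predict_fns, features, targets, labels):
    """Build a conditions-by-methods matrix (30 x 6 in the setting above).

    Each entry of fit_predict_fns is a hypothetical callable
    fit_predict(train_X, train_y, test_X) -> list of predicted scores.
    """
    rows = []
    for _, train_idx, test_idx in leave_one_reference_out(ref_ids):
        train_X = [features[i] for i in train_idx]
        train_y = [targets[i] for i in train_idx]
        test_X = [features[i] for i in test_idx]
        test_labels = [labels[i] for i in test_idx]
        rows.append([rank_auc(fn(train_X, train_y, test_X), test_labels)
                     for fn in fit_predict_fns])
    return rows
```

Each row of the returned matrix describes one source content, so a low entry can be traced back to a specific reference image and method, which a mean over random splits does not allow.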
We can also see from Figure 4 that FTSIQ achieves AUC values of approximately 0.5 for conditions 7 and 20 (i.e., source images 7 and 20 in this case). Hence, it may not be suitable for discriminating perceptual quality levels in such content. However, the current approach may not reveal such information. Indeed, as analyzed in the previous sub-section, the decision matrix $D_{1000}$ (or $D_{30}$) simply leads to the conclusion that FTSIQ is statistically the best performer out of the 6 methods (this has its limitations, as also highlighted in Section IV-C).

In order to get further insights into practical significance, the effect size matrix $E$ (based on Cohen's d) is given below. We find that CDIQA exhibits large (negative) values, implying that its performance is much lower than that of the other methods. This is also reflected in Figure 4, where this method has lower AUC values (< 0.6) for several conditions. On the other hand, FTSIQ exhibits a relatively large positive effect size in comparison to the other methods, and this is in line with the observations from Figure 4, where it achieves AUC values > 0.7 on the majority of test conditions.3 It may be emphasized that we are not making statistical inferences from Figure 4 or from matrix $E$. Rather, the goal is to gain insights into the local performance comparison of different methods, and to assess if one method might be practically better than another for a given condition. Additionally, such analysis can help during method development.

3 For the purposes of this paper, we are not overtly interested in analyzing or comparing specific methods. Indeed FTSIQ is a reduced-reference method while others like CDIQA are no-reference. Thus, the performance differences are not entirely unexpected.

$$E = \begin{bmatrix}
- & -1.10 & -0.84 & 3.34 & -2.75 & 0.48 \\
1.10 & - & 0.37 & 5.65 & -2.08 & 1.91 \\
0.84 & -0.37 & - & 5.46 & -2.58 & 1.62 \\
-3.34 & -5.65 & -5.46 & - & -8.33 & -3.32 \\
2.75 & 2.08 & 2.58 & 8.33 & - & 3.99 \\
-0.48 & -1.91 & -1.62 & 3.32 & -3.99 & -
\end{bmatrix} \tag{18}$$

Next, in order to draw statistical comparisons, we can either use ANOVA or nonparametric CIs, depending on whether the mean can be used as a descriptive statistic (if yes, ANOVA is recommended, as explained in Section II-C and highlighted in Figure 2). To that end, we can visualize the histograms of $\rho_i^{(j)}$ (shown in Figure 5) for the different ML based quality predictors. As can be seen, these are somewhat skewed to the right (except for the method CDIQA). Thus, to demonstrate, we choose the median of the $\rho_i^{(j)}$ values for each method as a summary statistic and obtain the nonparametric CIs. These are shown in Figure 6 (we used a significance level of $\alpha = 0.05$ and $L = 1000$).

Fig. 6. Nonparametric confidence intervals (significance level $\alpha = 0.05$) for median of $\rho_i^{(j)}$ values for the methods GLBPIQA, SVDIQA, GMBIQA, CDIQA, FTSIQ and BRISQUE (left to right). Figure best viewed in color.

With regard to the application of the proposed permutation test, the $p_1$ values (we used $B = 5$, $\alpha = 0.05$ and $th = 0.8$) obtained were as follows: GLBPIQA = 0.43, SVDIQA = 0.31, GMBIQA = 0.35, CDIQA = 0.78, FTSIQ = 0.15 and BRISQUE = 0.51. We note that the resultant $p_1$ values are higher than 0.05 (the chosen significance level) for the 6 ML based quality predictors. As discussed in Section V-A, the high $p_1$ values indicate that either there is no meaningful interdependency between feature values that can improve prediction, or the employed regressor (i.e., ML algorithm) is unable to exploit feature interdependency (if it exists), or a combination of these two. In any case, the test reveals deficiencies which are not revealed with the current approach.

Overall, we find from Figure 6 that the nonparametric CIs overlap for 4 of the 6 methods. Therefore, according to the proposed guidelines in Figure 2, we can conclude that these 4 methods (GLBPIQA, SVDIQA, GMBIQA and BRISQUE) are statistically at par in terms of quality prediction according to source content (i.e., image type), and the observed differences in their mean $\rho_i^{(j)}$ are attributed to chance (or
sampling error). In this case, though some of the effect sizes (refer to rows 1, 2, 3 and 6 of the matrix $E$ in eq. (18), which correspond to these 4 methods) are large (say > 0.9), such treatment effects can be merely attributed to chance (as the CIs overlap). On the other hand, for CDIQA we find that the pairwise effect sizes are quite high in magnitude and its nonparametric CI does not overlap. This implies that the other methods are statistically and practically better than CDIQA. Likewise, we can conclude that FTSIQ is statistically and practically better than the remaining methods.

Thus, in the proposed guidelines, our conclusions about relative performances are well supplemented by additional information from Figures 4 and 5, the effect size matrix $E$, Figure 6 (or ANOVA if the mean is used as the summary statistic) and the $p_1$ values from the permutation test. This allows methods to be compared and tested on different aspects, leading to a more grounded inference-making process.

VII. FINAL REMARKS
With the growing demands for more immersive quality of experience from consumers, quality monitoring in multimedia content delivery, especially via broadcast services, assumes a significant role in today's scenario. To that end, ML based quality predictors offer a plausible solution. Moreover, promising results from related disciplines such as computer vision and the availability of required hardware (e.g., GPU-accelerated computing) have opened up possibilities of developing efficient ML based implementations of quality predictors. However, proper validation and benchmarking of such ML based quality estimators is important prior to deployment. In that context, the main goal of the paper was to highlight a few drawbacks associated with the current approach of statistical comparison and validation. These stem primarily from a lack of consideration of theoretical and practical aspects of statistical testing.

Therefore, the main goal of the paper was to raise awareness about some of the identified issues in the current approach. We also provided theoretical analysis concerning dependent (correlated) sample observations. Further, we discussed several other limitations related to sample size, the lack of assessment of the magnitude of treatment effect and an almost exclusive reliance on p values to compare ML based quality predictors. We also argued that assessment of learning ability is an important aspect of validating such learning based predictors, and discussed the use of a permutation test to that end.

Essentially, the proposed guidelines treat statistical comparison of ML based quality estimators as a multi-dimensional problem. Accordingly, we seek to assess the predictors more holistically in terms of their local performance on specific test conditions, their learning ability and the magnitude of treatment effect (in order to quantify the practical significance of the observed differences). In contrast, the current approach tends to reduce this task to binary and global statistical decision making, and does not reveal systematic weaknesses (or strengths) of the predictors. In order to provide a tool for practical use, software implementing the proposed guidelines is made publicly available.4

4 https://sites.google.com/site/narwariam/home/research

REFERENCES
[1] P. Coverdale, S. Moller, A. Raake, and A. Takahashi, “Multimedia quality assessment standards in ITU-T SG12,” IEEE Signal Process. Mag., vol. 28, no. 6, pp. 91–97, Nov. 2011.
[2] P. Gastaldo, S. Rovetta, and R. Zunino, “Objective quality assessment of MPEG-2 video streams by using CBP neural networks,” IEEE Trans. Neural Netw., vol. 13, no. 4, pp. 939–947, Jul. 2002.
[3] J. Xu, P. Ye, Y. Liu, and D. Doermann, “No-reference video quality assessment via feature learning,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Paris, France, Oct. 2014, pp. 491–495.
[4] K. Zhu, C. Li, V. Asari, and D. Saupe, “No-reference video quality assessment based on artifact measurement and statistical analysis,” IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 4, pp. 533–546, Apr. 2015.
[5] N. Staelens et al., “Constructing a no-reference H.264/AVC bitstream-based video quality metric using genetic programming-based symbolic regression,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 8, pp. 1322–1333, Aug. 2013.
[6] J. Søgaard, S. Forchhammer, and J. Korhonen, “No-reference video quality assessment using codec analysis,” IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 10, pp. 1637–1650, Oct. 2015.
[7] B. Konuk, E. Zerman, G. Nur, and G. B. Akar, “A spatiotemporal no-reference video quality assessment model,” in Proc. IEEE Int. Conf. Image Process., Melbourne, VIC, Australia, Sep. 2013, pp. 54–58.
[8] W. Xue, X. Mou, L. Zhang, A. C. Bovik, and X. Feng, “Blind image quality assessment using joint statistics of gradient magnitude and Laplacian features,” IEEE Trans. Image Process., vol. 23, no. 11, pp. 4850–4862, Nov. 2014.
[9] M. Narwaria and W. Lin, “SVD-based quality metric for image and video using machine learning,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 347–364, Apr. 2012.
[10] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
[11] D. C. Mocanu et al., “No-reference video quality measurement: Added value of machine learning,” J. Electron. Imag., vol. 24, no. 6, 2015, Art. no. 061208, doi: 10.1117/1.JEI.24.6.061208.
[12] M. T. Vega, D. C. Mocanu, J. Famaey, S. Stavrou, and A. Liotta, “Deep learning for quality assessment in live video streaming,” IEEE Signal Process. Lett., vol. 24, no. 6, pp. 736–740, Jun. 2017.
[13] “Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models,” Int. Telecommun. Union, Geneva, Switzerland, Rep. ITU-T Recommendation P.1401, Jul. 2012.
[14] “Method for specifying accuracy and cross-calibration of video quality metrics (VQM),” Int. Telecommun. Union, Geneva, Switzerland, Rep. ITU-T Recommendation J.149, Jul. 2004.
[15] ITU-T Tutorial, “Objective perceptual assessment of video quality: Full reference television,” Int. Telecommun. Union, Geneva, Switzerland, Rep. JSTP-OAVQ (2004), May 2005.
[16] “Final report from the video quality experts group on the validation of objective quality metrics for video quality assessment,” Video Qual. Experts Group, NTIA/ITS, Boulder, CO, USA, Rep. FR-TV Phase II, Mar. 2003.
[17] “Method for the subjective assessment of intermediate quality levels of coding systems,” Int. Telecommun. Union, Geneva, Switzerland, Rep. ITU-R Recommendation BS.1534-3, Oct. 2015.
[18] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006.
[19] M. Ojala and G. C. Garriga, “Permutation tests for studying classifier performance,” J. Mach. Learn. Res., vol. 11, pp. 1833–1863, Aug. 2010.
[20] G. Roussas, An Introduction to Probability and Statistical Inference. London, U.K.: Academic Press, 2015.
[21] G. M. Sullivan and R. Feinn, “Using effect size-or why the P value is not enough,” J. Grad. Med. Educ., vol. 4, no. 3, pp. 279–282, Sep. 2012.
[22] R. Hoekstra, S. Finch, H. A. L. Kiers, and A. Johnson, “Probability as certainty: Dichotomous thinking and the misuse of P values,” Psychonomic Bull. Rev., vol. 13, no. 6, pp. 1033–1037, 2006.
[23] M. Baker, “Statisticians issue warning over misuse of P values,” Nature, vol. 531, no. 7593, p. 151, Mar. 2016.
[24] K. Kelley and K. Preacher, “On effect size,” Psychol. Methods, vol. 17, no. 2, pp. 137–152, 2012.
[25] M. Narwaria, L. Krasula, and P. L. Callet, “Data analysis in multimedia quality assessment: Revisiting the statistical tests,” IEEE Trans. Multimedia, to be published, doi: 10.1109/TMM.2018.2794266.
[26] B. De Finetti, G. Koch, and F. Spizzichino, “Exchangeability in probability and statistics,” in Proc. Int. Conf. Exchangeability Probability Stat., Rome, Italy, Apr. 1982, pp. 97–112.
[27] T. F. Crack and O. Ledoit, “Central limit theorems when data are dependent: Addressing the pedagogical gaps,” J. Financ. Educ., vol. 36, nos. 1–2, pp. 38–60, 2010.
[28] S. Lahiri and P. M. Robinson, “Central limit theorems for long range dependent spatial linear processes,” Bernoulli, vol. 22, no. 1, pp. 345–375, 2016.
[29] K. Sainani, “Statistically speaking,” PM R, vol. 2, no. 9, pp. 858–861, 2010.
[30] K. J. Rothman, “No adjustments are needed for multiple comparisons,” Epidemiology, vol. 1, no. 1, pp. 43–46, 1990.
[31] T. V. Perneger, “What’s wrong with Bonferroni adjustments,” BMJ, vol. 316, no. 7139, pp. 1236–1238, 1998.
[32] R. J. Feise, “Do multiple outcome measures require P-value adjustment?” BMC Med. Res. Methodol., vol. 2, no. 1, p. 8, Jun. 2002.
[33] P. J. Mumby, “Statistical power of non-parametric tests: A quick guide for designing sampling strategies,” Marine Pollution Bull., vol. 44, no. 1, pp. 85–87, 2002.
[34] E. Kasuya, “Mann–Whitney U test when variances are unequal,” Animal Behav., vol. 61, no. 6, pp. 1247–1249, Jun. 2001.
[35] B. Efron and T. Hastie, Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, 1st ed. New York, NY, USA: Cambridge Univ. Press, 2016.
[36] T. J. DiCiccio and B. Efron, “Bootstrap confidence intervals,” Stat. Sci., vol. 11, no. 3, pp. 189–212, 1996.
[37] J. Carpenter and J. Bithell, “Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians,” Stat. Med., vol. 19, no. 9, pp. 1141–1164, 2000.
[38] A. C. Davison and D. V. Hinkley, Bootstrap Methods and Their Application (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge, U.K.: Cambridge Univ. Press, 1997.
[39] T. Fawcett, “An introduction to ROC analysis,” Pattern Recogn. Lett., vol. 27, no. 8, pp. 861–874, Jun. 2006.
[40] P. Hanhart, L. Krasula, P. L. Callet, and T. Ebrahimi, “How to benchmark objective quality metrics from paired comparison data?” in Proc. 8th Int. Conf. Qual. Multimedia Exp. (QoMEX), Lisbon, Portugal, Jun. 2016, pp. 1–6.
[41] L. Krasula, K. Fliegel, P. L. Callet, and M. Klíma, “On the accuracy of objective image and video quality models: New methodology for performance evaluation,” in Proc. 8th Int. Conf. Qual. Multimedia Exp. (QoMEX), Lisbon, Portugal, Jun. 2016, pp. 1–6.
[42] L. Krasula, M. Narwaria, K. Fliegel, and P. L. Callet, “Preference of experience in image tone-mapping: Dataset and framework for objective measures comparison,” IEEE J. Sel. Topics Signal Process., vol. 11, no. 1, pp. 64–74, Feb. 2017.
[43] R. Maglietta et al., “Selection of relevant genes in cancer diagnosis based on their prediction accuracy,” Artif. Intell. Med., vol. 40, no. 1, pp. 29–44, 2007.
[44] E. Frank and I. H. Witten, “Using a permutation test for attribute selection in decision trees,” in Proc. 15th Int. Conf. Mach. Learn., San Francisco, CA, 1998, pp. 152–160.
[45] D. Bamber, “The area above the ordinal dominance graph and the area below the receiver operating characteristic graph,” J. Math. Psychol., vol. 12, no. 4, pp. 387–415, 1975. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0022249675900012
[46] J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve,” Radiology, vol. 143, no. 1, pp. 29–36, 1982, doi: 10.1148/radiology.143.1.7063747.
[47] C. Cortes and M. Mohri, “Confidence intervals for the area under the ROC curve,” in Proc. 17th Int. Conf. Neural Inf. Process. Syst. (NIPS), Vancouver, BC, Canada, 2004, pp. 305–312. [Online]. Available: http://dl.acm.org/citation.cfm?id=2976040.2976079
[48] E. C. Larson and D. M. Chandler, “Most apparent distortion: Full-reference image quality assessment and the role of strategy,” J. Electron. Imag., vol. 19, no. 1, pp. 1–21, 2010.
[49] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
[50] Q. Li, W. Lin, and Y. Fang, “No-reference quality assessment for multiply-distorted images in gradient domain,” IEEE Signal Process. Lett., vol. 23, no. 4, pp. 541–545, Apr. 2016.
[51] Y. Fang et al., “No-reference quality assessment of contrast-distorted images based on natural scene statistics,” IEEE Signal Process. Lett., vol. 22, no. 7, pp. 838–842, Jul. 2015.
[52] M. Narwaria, W. Lin, I. V. McLoughlin, S. Emmanuel, and L.-T. Chia, “Fourier transform-based scalable image quality measure,” IEEE Trans. Image Process., vol. 21, no. 8, pp. 3364–3377, Aug. 2012.

Manish Narwaria received the Ph.D. degree in computer engineering from Nanyang Technological University, Singapore, in 2012. He was a Researcher with IRCCyN-IVC Lab, France, before joining DA-IICT, India, as an Assistant Professor in 2015. His major research interests include the area of multimedia signal processing with focus on perceptual aspects toward content capture, processing, and transmission.
