
A NEW VIDEO QUALITY ASSESSMENT DATASET FOR VIDEO SURVEILLANCE APPLICATIONS

Azeddine Beghdadi⋆, Muhammad Ali Qureshi†, Borhen-eddine Dakkar⋆, Hammad Hassan Gillani†,
Zohaib Amjad Khan††, Mounir Kaaniche⋆, Mohib Ullah‡, Faouzi Alaya Cheikh‡

⋆ University Sorbonne Paris Nord, France
† The Islamia University of Bahawalpur, Pakistan
†† L2S, CentraleSupélec, University Paris Saclay, France
‡ Norwegian University of Science and Technology (NTNU), Norway

ABSTRACT

In this paper, we propose a new comprehensive Video Surveillance Quality Assessment Dataset (VSQuAD) dedicated to Video Surveillance (VS) systems. In contrast to other public datasets, this one contains many more videos with distortions and diversified content from common video surveillance scenarios. These videos have been artificially degraded with various types of distortions (single distortions or multiple distortions simultaneously) at different severity levels. In order to improve the efficiency of surveillance systems and the versatility of the video quality assessment dataset, night vision CCTV videos are also included. Furthermore, a comprehensive analysis of the content in terms of diversity and challenging problems is also presented in this study. The interest of such a database is twofold. First, it will serve for benchmarking different video distortion detection and classification algorithms. Second, it will be useful for the design of learning models for various challenging VS problems such as identification and removal of the most common distortions. The complete dataset is made publicly available as part of a challenge session in this conference through the following link: https://www.l2ti.univ-paris13.fr/VSQuad/.

Index Terms— Video Surveillance, Video Quality Assessment, Video Dataset, Distortion generation.

1. INTRODUCTION

Video Surveillance (VS) is a subject of applied research that is currently undergoing a lot of technological change and progress, especially with emerging technologies including high-performance computing, artificial intelligence, and smart sensors. Problems with public security have sparked significant concern in recent years. Security and monitoring systems are more and more demanding in terms of quality, reliability, and flexibility, especially those dedicated to video surveillance. However, despite the tremendous progress already made towards the development of efficient security systems, the existing solutions have limitations, especially in complex and cluttered environments. Video quality is one of the critical factors that impact the performance of video surveillance systems. Indeed, various in-capture distortions (blur, Additive White Gaussian Noise (AWGN), uneven illumination, smoke, atmospheric turbulence, etc.), transmission impairments (packet loss, channel noise, etc.), and compression artifacts are among the limiting factors of video surveillance systems. These distortions may affect the performance of prominent high-level tasks such as face/event detection, recognition, and tracking [1]. Therefore, it is important to integrate the video quality aspect into the design of any video surveillance system [2]. It is worth noting that the lack of databases dedicated to the evaluation of video quality in the context of video surveillance has somewhat hindered the advancement of research in the field of intelligent video surveillance systems.

Furthermore, apart from a few very limited studies, most databases for the evaluation of video quality have focused for more than two decades on coding and transmission effects. Nevertheless, some interesting video databases including other distortions have been proposed for video quality evaluation, such as CVD2014 [3], KoNViD-1k [4], LIVE-Qualcomm [5], and LIVE-VQC [6]. We have also seen some recent interesting work on Video Quality Assessment (VQA) based on deep learning [7]. However, to the best of our knowledge, there is little or no work devoted to the evaluation of video quality in the context of video surveillance [8]. One of the issues that arises in the design of an intelligent video surveillance system is the robustness of the algorithms dedicated to high-level tasks, such as abnormal event detection [9], people re-identification [10], and visual tracking [11], against the effect of in-capture distortions [2].

With the renewed interest in artificial intelligence-based approaches to solve computer vision and image processing and analysis problems, we are witnessing the development of huge datasets. This is, for example, the case of video datasets dedicated to the benchmarking of VQA methods [12]. The video surveillance context generates huge amounts of data, which is naturally in line with the current trend of big data analysis. The quality of a video, especially in the surveillance context, plays a prime role in the design of a smart surveillance system. The proposed dataset, namely the Video Surveillance Quality Assessment Dataset (VSQuAD), is the continuation of [8]. This new database is not only much larger (about five times), but also contains more scenarios and other types of distortions, especially those related to atmospheric conditions. Apart from its primary objective of evaluating the performance of distortion detection and classification algorithms and of video quality metrics, this database is also of primary interest for the benchmarking of various high-level task algorithms.
2. THE PROPOSED DATASET-VSQUAD

The proposed dataset contains 36 original high-definition videos with diverse contents covering 28 scenarios. Nine common single distortions are applied to these videos at four levels of severity. The dataset also contains simultaneous distortions at different levels of severity, for a total of 1,576 videos affected by single or simultaneous distortions. Each video is 10 seconds long, captured at 30 frames per second. A detailed summary of the proposed dataset is provided in Table 1.

3. DISTORTION GENERATION

Here, we briefly discuss the methods used to generate the different distortions. In our experiments, the distortions are generated in a semi- or fully artificial way at four severity levels, namely, (1) just noticeable, (2) visible but not annoying, (3) annoying, and (4) very annoying. Note that the distortions are applied separately to each color channel of the RGB video frames.

3.1. Artificial distortions
Noise: Noise is one of the most frequently encountered distortions, arising during video acquisition or transmission, and especially in night-video or low-light environments. The most probable cause of noise during acquisition is the sensitivity of the camera sensor. We simulated noise in our dataset using the basic additive white Gaussian noise (AWGN) model. By changing the variance, various levels of noise are generated on each frame of the video stream.
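For illustration, a minimal NumPy sketch of this AWGN model; the function name and parameters are ours, not part of the dataset tooling, and frames are assumed to be floating-point RGB arrays in [0, 1]:

```python
import numpy as np

def add_awgn(frame, sigma, rng=None):
    """Add white Gaussian noise to one RGB frame (float array in [0, 1]).

    sigma sets the severity level; the noise is drawn independently for
    each pixel and color channel, as the distortions here are applied
    per channel.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = frame + rng.normal(0.0, sigma, size=frame.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep values in the valid range
```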
Defocus Blur: A video may suffer from blur due to defocus during capture. This results in a decrease of the image contrast [13], which manifests itself as a loss of sharpness. Defocus blur can be generated at various levels by applying a low-pass Gaussian filter to the input image, with a mask size that varies with the standard deviation of the impulse response of the filter.
Motion Blur: In the case of video, motion blur is mainly due to the relatively low image acquisition rate compared to the speed of the filmed objects, the instability of the camera, or the photonic sensitivity of the image sensor. A simple way to generate this type of blur is to perform directional low-pass filtering, such as with a directional moving-average filter. By varying the parameters of the kernel, i.e., its direction and extent, we generated directional blur at four severity levels.
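One possible realization of such a directional moving-average filter, sketched with SciPy; the kernel construction (a rotated line kernel) is our assumption, consistent with the description above:

```python
import numpy as np
from scipy import ndimage

def directional_blur(frame, extent, angle_deg):
    """Directional moving-average blur of a given extent and direction."""
    kernel = np.zeros((extent, extent))
    kernel[extent // 2, :] = 1.0                      # horizontal line kernel
    kernel = ndimage.rotate(kernel, angle_deg, reshape=False)
    kernel /= kernel.sum()                            # normalize to an average
    # Filter each color channel separately.
    return np.stack([ndimage.convolve(frame[..., c], kernel)
                     for c in range(frame.shape[-1])], axis=-1)
```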
Low Illumination: Despite the progress made in the sensitivity and resolution of image sensors, the problem of lighting remains a limiting factor in the performance of video surveillance systems. Low lighting conditions can make high-level tasks very challenging. To synthesize this kind of distortion, we adopted a simple multiplicative model that consists of uniformly attenuating the pixel values.
Uneven Illumination: This distortion is due to non-uniform illumination (UI), which manifests itself through the appearance of unbalanced dark and light areas, making object detection and interpretation of the observed scene difficult. It can be simulated by weighting the pixel values of each frame by a fading mask of the same size as the image, with weights between 0 and 1 following a bell-shaped spatial distribution of Gaussian or log-Gaussian type. By moving the principal lobe of the fading mask towards the sides rather than the center of the frame, various UI effects can be generated to produce a realistic effect. Figure 1 (a) presents video frames with single artificial distortions.
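As an illustration, a sketch of the Gaussian variant of this fading mask; the center position and spread are hypothetical parameters:

```python
import numpy as np

def uneven_illumination(frame, center, sigma):
    """Weight pixel values by a bell-shaped (Gaussian) fading mask.

    The principal lobe of the mask sits at `center` (row, col); weights
    lie in (0, 1], so regions far from the lobe are darkened.
    """
    h, w = frame.shape[:2]
    rows, cols = np.mgrid[0:h, 0:w]
    mask = np.exp(-((rows - center[0]) ** 2 + (cols - center[1]) ** 2)
                  / (2.0 * sigma ** 2))
    return frame * mask[..., None]  # broadcast the mask over color channels
```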
3.2. Semi-artificial distortions

The generation of some atmospheric distortions or other physical turbulence is a very challenging problem. For example, the use of a Perlin noise model for smoke generation may be ineffective, because it is difficult to simulate the movement of the particles while controlling the phenomenon of light scattering in a realistic way. The same problem arises for the generation of other similar atmospheric degradations such as haze or rain. Here we use a semi-artificial scheme, based on the screen blending technique [14], to generate these distortions in a realistic way. The original frames were blended with a real distortion video using the blending model given in (1):

F_out = 1 − (1 − F_orig)(1 − α F_dist)    (1)

where F_out, F_orig, and F_dist are the resultant, reference, and distortion frames, respectively, and α is the blending factor.
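In code, the screen blending model (1) reduces to one line per frame; a minimal sketch, assuming frames normalized to [0, 1]:

```python
def screen_blend(f_orig, f_dist, alpha):
    """Screen-blend a distortion layer (smoke, haze, rain) onto a frame.

    f_orig, f_dist: float arrays in [0, 1]; alpha in (0, 1] is the blending
    factor of Eq. (1). A larger alpha gives a denser, more severe distortion.
    """
    return 1.0 - (1.0 - f_orig) * (1.0 - alpha * f_dist)
```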
Smoke: Smoke is one of the distortions that may affect the quality of video in outdoor environments, and it is the most challenging distortion to model. A video containing only smoke on a black background is blended with the original video frames. By changing the opacity level of the smoke, different levels of smoke distortion are generated.
Haze: Haze is another distortion widely studied in computer vision, particularly in navigation problems. This distortion affects the whole image in a more or less uniform way. Here again, we used the blending method to produce videos affected by haze in a realistic way. By varying the opacity of the haze-only video, we can generate different severity levels.
Table 1. Summary of the proposed surveillance video quality assessment dataset

Year: 2022
Number of reference videos: 36
Categories (vision): 2 RGB (day-light / night-light)
Camera type (modalities): 2 (fixed / moving)
Scenario types: 2 (indoor, outdoor)
Number of scenarios: 28
Resolution of each video: FHD (1920 × 1080)
Duration of each video: 10 seconds
Frame rate: 30 fps
Number of distortion types: 9
Number of distortion levels: 4
Distortion types: (D1) Defocus blur, (D2) Haze, (D3) Low illumination, (D4) Motion blur (due to camera instability), (D5) AWGN, (D6) Rain, (D7) Smoke, (D8) Uneven illumination, (D9) Compression artifact
Multi-distortion: Yes, 9 common mixtures
Multi-distortion order: D1D5, D3D5, D5D8, D3D6, D1D9, D3D5D6, D5D8D9, D4D6D8, D2D5D9
Number of videos with a single distortion: 964
Number of videos affected by more than one distortion: 612
File type: .mp4
Total size: 34 GB

Fig. 1. Examples of distorted versions of videos from VSQuAD: (a) video frames with single artificial distortions (D1, D3, D4, D5, D8); (b) video frames with single semi-artificial distortions (D2, D6, D7, D9); (c) video frames with multi-distortions (D1D9, D3D6, D2D5D9).

Rain: Significant intensity variations in videos due to rain may degrade the performance of outdoor surveillance. The generation of videos containing rain at different density levels follows the same process as for haze and smoke: by varying the value of the α parameter, we generated different levels of rain density.

Compression Artifacts: Various artifacts due to compression or to the transmission channel can affect the quality of videos. In this study we mainly consider the blocking artifact resulting from FFmpeg coding at four quality levels. Figure 1 (b) presents video frames with single semi-artificial distortions.

3.3. Multiple Distortion Generation

The process of generating mixtures of simultaneous distortions is purely intuitive and is based on an additive or multiplicative model where the different terms are weighted according to the importance of one distortion component over the others. It is worth noticing that the order of application of the different synthetic distortions must take into account the physical reality, and in particular the way in which the signal is acquired by the sensors in real scenarios; e.g., if we generate noise on a video and then apply blur, the noise would be attenuated by the low-pass filtering effect of the blurring. Figure 1 (c) presents examples of video frames with multiple distortions.
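To make the role of ordering concrete, here is a hypothetical composition sketch (the helper names are ours): applying blur before noise keeps the noise from being low-pass filtered, matching the acquisition chain.

```python
def compose(frame, distortions):
    """Apply distortion functions in the given order.

    The order encodes the physical acquisition chain: optical effects
    (e.g., defocus blur) come before sensor effects (e.g., AWGN).
    """
    for distort in distortions:
        frame = distort(frame)
    return frame

# e.g., the D1 D5 mixture: defocus blur first, then additive noise, so the
# noise is not smoothed away by the blur (hypothetical helper functions):
# distorted = compose(frame, [lambda f: gaussian_blur(f, sigma=2.0),
#                             lambda f: add_awgn(f, sigma=0.02)])
```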
4. DATASET CHARACTERISTICS ANALYSIS

It is important to have not only different scenarios but also visual contents rich in spatio-temporal structures at different scales of observation and under various lighting conditions and viewing angles. The richness, or diversity, of the videos is computed through spatial and spatio-temporal descriptors. A set of criteria and measures to quantify and analyze the richness and representativeness of image and video databases dedicated to perceptual quality assessment has been proposed in [15, 16]. In what follows, we recall some of these descriptors and apply them to the analysis of the database.

Fig. 2. Scatter plots for VSQuAD: (a) SI versus CF; (b) GCF versus CF; (c) TI versus SI. The symbol '⋆' represents the keyframes of reference videos.
4.1. Spatial descriptors

Spatial descriptors can be defined in the spatial or a transformed domain and at different resolution levels. In what follows, we limit ourselves to a few simple spatial descriptors, namely Colorfulness (CF), Spatial Information (SI), and the Global Contrast Factor (GCF), associated with each frame as done in [16]. Figures 2 (a) and (b) show the scatter plots for the keyframes of the reference videos, demonstrating that the proposed dataset is enriched with diversified colors, contrast, and texture.

4.2. Spatio-temporal descriptors

The perceptual quality of a video is strongly influenced by spatio-temporal information, and in particular by moving and visually attractive objects and structures [17]. One of the characteristics that allows analyzing the video signal is the motion parameter. There are several methods for estimating the motion field in dynamic scenes. Here we use a global measure of motion information characterizing the richness of the video signal in terms of time-varying visual content, namely the spatio-temporal perceptual information proposed in the VQEG group recommendation [17].

4.2.1. Spatial perceptual information:

As a spatial descriptor, we consider the edginess information that represents the most visually salient component of the image signal at a given scale. The spatial information is thus defined from the response of an operator enhancing the salient features of the image signal, particularly the contours. Here, we adopt the Sobel operator as a filter to enhance these components, as done in [17]. A Sobel filter is applied to each video frame F_n, and the standard deviation of the magnitude of the Sobel response is computed over each frame. The maximum value over time is used to represent the spatial information content of the scene. The SI measure is:

SI = max_n {std[Sobel(F_n)]}    (2)

4.2.2. Temporal perceptual information:

The local temporal perceptual information, denoted M_n(i, j), captures the motion difference feature at location (i, j). It is defined as the difference between the pixel values at the same location in successive frames:

M_n(i, j) = F_n(i, j) − F_{n−1}(i, j)    (3)

where F_n(i, j) is the pixel value at location (i, j) of the nth frame. The TI measure is computed as the maximum over time (n) of the standard deviation over space of the inter-frame difference M_n:

TI = max_n {std[M_n]}    (4)

Videos with high motion result in higher TI values.
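The SI and TI measures of Eqs. (2)-(4) can be computed directly from the frames; a minimal sketch following the ITU-T recommendation [17], assuming the video is given as a sequence of 2-D luma arrays:

```python
import numpy as np
from scipy import ndimage

def si_ti(frames):
    """Compute the SI and TI measures of Eqs. (2)-(4) for a video.

    `frames` is an iterable of 2-D float arrays (luma of each frame).
    """
    si_vals, ti_vals, prev = [], [], None
    for f in frames:
        # Sobel gradient magnitude, then its spatial standard deviation.
        gx = ndimage.sobel(f, axis=1)
        gy = ndimage.sobel(f, axis=0)
        si_vals.append(np.hypot(gx, gy).std())
        if prev is not None:
            ti_vals.append((f - prev).std())  # std of M_n = F_n - F_{n-1}
        prev = f
    return max(si_vals), max(ti_vals)  # maxima over time, Eqs. (2) and (4)
```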
Figure 2 (c) presents a scatter plot of the spatio-temporal perceptual information measures SI and TI on videos from the proposed dataset. It can be observed that our dataset contains scenes with very limited motion as well as scenes with a lot of motion. It is clear that this database contains rather diversified spatio-temporal information. The richness of this database guarantees the reliability of the study and its possible use in various applications, particularly studies of the impact of common distortions on the quality of videos and on high-level tasks.

5. CONCLUSION AND PERSPECTIVES

The proposed database fills a gap in this very active field of research. Indeed, the consideration of video quality in the context of video surveillance has been somewhat neglected. This unique database will be used not only for applications related to video quality but also as a benchmark to evaluate video classification methods, especially in very challenging contexts, thanks to the specificity of its visual content and to the distortions within the videos.
6. REFERENCES

[1] A. Beghdadi, M. Asim, N. Almaadeed, and M. A. Qureshi, "Towards the design of smart video-surveillance system," in NASA/ESA Conference on Adaptive Hardware and Systems (AHS). IEEE, 2018, pp. 162–167.

[2] A. Beghdadi, I. Bezzine, and M. A. Qureshi, "A perceptual quality-driven video surveillance system," in 23rd International Multitopic Conference (INMIC). IEEE, 2020, pp. 1–6.

[3] M. Nuutinen, T. Virtanen, M. Vaahteranoksa, T. Vuori, P. Oittinen, and J. Häkkinen, "CVD2014—A database for evaluating no-reference video quality assessment algorithms," IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3073–3086, 2016.

[4] V. Hosu, F. Hahn, M. Jenadeleh, H. Lin, H. Men, T. Szirányi, S. Li, and D. Saupe, "The Konstanz natural video database (KoNViD-1k)," in 9th International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 2017, pp. 1–6.

[5] D. Ghadiyaram, J. Pan, A. C. Bovik, A. K. Moorthy, P. Panda, and K.-C. Yang, "In-capture mobile video distortions: A study of subjective behavior and objective algorithms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2061–2077, 2017.

[6] Z. Sinno and A. C. Bovik, "Large-scale study of perceptual video quality," IEEE Transactions on Image Processing, vol. 28, no. 2, pp. 612–627, 2018.

[7] D. Li, T. Jiang, and M. Jiang, "Quality assessment of in-the-wild videos," in 27th ACM International Conference on Multimedia, 2019, pp. 2351–2359.

[8] I. Bezzine, Z. A. Khan, A. Beghdadi, N. Al-Maadeed, M. Kaaniche, S. Al-Maadeed, A. Bouridane, and F. A. Cheikh, "Video quality assessment dataset for smart public security systems," in 23rd International Multitopic Conference (INMIC). IEEE, 2020, pp. 1–5.

[9] P. Bouttefroy, A. Bouzerdoum, S. Phung, and A. Beghdadi, "Abnormal behavior detection using a multi-modal stochastic learning approach," in 2008 International Conference on Intelligent Sensors, Sensor Networks and Information Processing. IEEE, 2008, pp. 121–126.

[10] Z. Mortezaie, H. Hassanpour, and A. Beghdadi, "People re-identification under occlusion and crowded background," Multimedia Tools and Applications, pp. 1–21, 2022.

[11] P. L. M. Bouttefroy, A. Bouzerdoum, S. L. Phung, and A. Beghdadi, "Vehicle tracking using projective particle filter," in 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance. IEEE, 2009, pp. 7–12.

[12] F. Götz-Hahn, V. Hosu, H. Lin, and D. Saupe, "KonVid-150k: A dataset for no-reference video quality assessment of videos in-the-wild," IEEE Access, vol. 9, pp. 72139–72160, 2021.

[13] A. Beghdadi, M. A. Qureshi, S. A. Amirshahi, A. Chetouani, and M. Pedersen, "A critical analysis on perceptual contrast and its use in visual information analysis and processing," IEEE Access, vol. 8, pp. 156929–156953, 2020.

[14] Z. A. Khan, A. Beghdadi, F. A. Cheikh, M. Kaaniche, E. Pelanis, R. Palomar, Å. A. Fretland, B. Edwin, and O. J. Elle, "Towards a video quality assessment based framework for enhancement of laparoscopic videos," in Medical Imaging: Image Perception, Observer Performance, and Technology Assessment, vol. 11316. International Society for Optics and Photonics, 2020, p. 113160P.

[15] S. Winkler, "Analysis of public image and video databases for quality assessment," IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 616–625, 2012.

[16] A. Beghdadi, M. A. Qureshi, B. Sdiri, M. Deriche, and F. Alaya-Cheikh, "CEED: A database for image contrast enhancement evaluation," in Colour and Visual Computing Symposium (CVCS). IEEE, 2018, pp. 1–6.

[17] ITU-T Recommendation P.910, "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, 1999.
