
Automatic Analysis of News Videos

Jonathan Attard

Supervisor: Dr Dylan Seychell

May 2023

Submitted in partial fulfilment of the requirements for the degree of
Bachelor of Science in Information Technology (Hons) (Artificial Intelligence).

Abstract
As the volume of video content in the media continues to grow, it can be challenging
for viewers to fully analyse the video content they consume. This project aims to
address this issue by developing a tool for automatic news video analysis. The
proposed system leverages computer vision techniques, including face tracking,
detection, encoding extraction, recognition, and Optical Character Recognition (OCR),
to extract key information from news videos.
The main objective of this project is to develop a system that can identify
individuals who appear in news videos and provide users with a better understanding
of the video’s content. The system uses unsupervised methods to match individuals
against a pre‐existing database, or to extract face encodings and names directly from the
video itself so that they can be used for future videos.
This project successfully demonstrates a system that extracts individuals and the
timestamps of their appearances, which can also be reused in future analyses. Results
show that the proposed system can identify individuals in news videos and provide
users with relevant information about them. The system is presented through a
user‐friendly Graphical user interface (GUI), which allows users to interact with the
system and explore the extracted information.
Overall, this project offers a novel approach to automatic news video analysis
that can help users better comprehend and evaluate the information presented in news
videos. The created system was able to achieve an 83% accuracy for name extraction
and 63% accuracy for face recognition. While the project’s approach to automatic
news video analysis is promising, further work is needed to address accuracy, increase
the variety of news sources, and incorporate additional features that expand the
system’s functionality, such as emotion detection or new ways to interact with news
videos.

Acknowledgements
I express profound gratitude to my family for their unwavering support and
encouragement throughout my academic journey, particularly my parents. I am
genuinely grateful to Jeanine Attard for her excellent work in designing the program’s
logo.
My close friends also deserve my thanks for keeping me motivated during this
research; in particular, Jean Sacco provided me with invaluable perspectives and advice
on the challenges encountered.
My supervisor, Dr Dylan Seychell, deserves special thanks for his steadfast
commitment to my academic success, shaping the direction and scope of my research
with his feedback, insights, and expertise.
Additionally, I appreciate the assistance of the students who aided me with the
data gathering.

Contents

Abstract i

Acknowledgements ii

Contents v

List of Figures vii

List of Tables viii

List of Abbreviations ix

Glossary of Symbols 1

1 Introduction 1
1.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background and Literature Review 4


2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Face Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Scene detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.1 News videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5 Video‐based Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.1 FaceRec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.2 Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Optical Character Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.7 Object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3 Methodology 13

3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 News Video Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 News Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.2 News Video Segments: Reducing video duration . . . . . . . . . . 14
3.2.3 Transcript Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.4 Video Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Video Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.2 Name Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.3 Face Detection, Encoding, and Recognition . . . . . . . . . . . . . . 20
3.3.4 Face Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.5 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 Analysis Visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4 Evaluation 25
4.1 News Video Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1.1 Reduced dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Video Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.3 Larger database results . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Analysis Visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5 Conclusion 34
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

A Level 0 DFD 41

B Video Frame/Scene Categories 42

C Video Analysis Class Diagram 46

D Analysis Timeline 47

E Timeline of Analysis vs Validation 48

F Visualisation GUI 49

G Incremental Database 51

H File Structure 52

I Comparison of frame intervals of 0.5 and 1 seconds 55
I.1 Case 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
I.2 Case 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
I.3 Case 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

J Survey 61

K Code 67

List of Figures

Figure 2.1 FaceRec training pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 9


Figure 2.2 Architecture of Tesseract OCR . . . . . . . . . . . . . . . . . . . . . . . 11

Figure 3.1 Level 1 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


Figure 3.2 Article Links Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure 3.3 News Segment Extraction DFD . . . . . . . . . . . . . . . . . . . . . . . 16
Figure 3.4 Video Analysis Level 2 DFD . . . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 3.5 Validation Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Figure 3.6 Analysis Excel Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Figure 3.7 Frame timeline example . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Figure 3.8 Analysis timeline example . . . . . . . . . . . . . . . . . . . . . . . . . . 24

Figure 4.1 Incremental Database Results . . . . . . . . . . . . . . . . . . . . . . . . 32

Figure A.1 Level 0 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

Figure B.1 Frame Category 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42


Figure B.2 Frame Category 1 (Presenter variation) . . . . . . . . . . . . . . . . . . 43
Figure B.3 Frame Category 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Figure B.4 Frame Category 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Figure B.5 Frame Category 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Figure B.6 Frame Category 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Figure C.1 Video Analysis Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . 46

Figure F.1 Main window GUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


Figure F.2 Export window GUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

Figure G.1 Incremental database with 0.6 thresholds . . . . . . . . . . . . . . . . . 51

Figure H.1 Analysis file structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52


Figure H.2 Analysis Images example . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Figure H.3 Database file structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Figure H.4 Database Images example . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Figure I.1 Case 1: 0.5‐second frame intervals . . . . . . . . . . . . . . . . . . . . . 55

Figure I.2 Case 1: 1‐second frame intervals . . . . . . . . . . . . . . . . . . . . . . 56
Figure I.3 Case 2: 0.5‐second frame intervals . . . . . . . . . . . . . . . . . . . . . 57
Figure I.4 Case 2: 1‐second frame intervals . . . . . . . . . . . . . . . . . . . . . . 57
Figure I.5 Case 2: Problematic frames with text transition animation . . . . . . . 58
Figure I.6 Case 3: 0.5‐second frame intervals . . . . . . . . . . . . . . . . . . . . . 59
Figure I.7 Case 3: 1‐second frame intervals . . . . . . . . . . . . . . . . . . . . . . 59
Figure I.8 Case 3: Problematic frames with text transition animation . . . . . . . 60

List of Tables

Table 3.1 Video Database Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

Table 4.1 Video Analysis Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

List of Abbreviations
AI Artificial Intelligence.

CNN Convolutional Neural Network.

FCM Fuzzy C‐means Clustering.

GTC Graph‐Theoretical Method.

GUI Graphical user interface.

HDM Histogram Difference Metric.

HOG Histogram of Oriented Gradients.

KCF Kernelized Correlation Filter.

MAE Mean Absolute Error.

MOSSE Minimum Output Sum of Squared Error.

MSER Maximally Stable Extremal Region.

MTCNN Multi‐Task Cascaded Convolutional Neural Network.

NLP Natural Language Processing.

OCR Optical Character Recognition.

PCA Principal Component Analysis.

SDM Spatial Difference Metric.

SORT Simple Online and Realtime Tracking.

SVM Support vector machine.

1 Introduction
1.1 Problem definition
In today’s world, the abundance of easily accessible information has led to the problem
of information overload [1]. This overload often results in non‐optimal
decision‐making, particularly in the realm of media, where staying well‐informed [2]
while maintaining a balanced perspective on the content we consume is crucial [3].
Numerous studies [3–6] have examined the impact of media bias. As highlighted
by Bernhardt et al. [3], “Even if citizens are completely rational and take media bias into
account, they cannot recover all of the missing information.” Additionally, research has
shown that media bias influences political voting and policy outcomes. For instance,
DellaVigna et al. [5] found a significant positive correlation between exposure to Fox
News and the Republican vote share in the 2000 Presidential elections compared to
1996, suggesting that media bias can have a substantial political effect on general
beliefs and voter turnout. Similarly, Eberl et al. [4] identified media bias as influencing
voter behaviour based on data from the 2013 Austrian parliamentary election
campaign. Additionally, Hopmann et al. [6] observed that both the tone towards a
political party and the visibility of candidates impact voting outcomes.
Given the rise of video content in the media, it is imperative to analyse and
evaluate the information we consume on a daily basis.

1.2 Motivation
As discussed in the previous section, the widespread availability of information poses
challenges such as information overload and media bias. Fortunately, the rapid
advancements in Artificial Intelligence (AI) technology offer opportunities to develop
tools that aid users in identifying key information and enhancing their awareness of
consumed content, especially in the context of news media. Leveraging AI algorithms,
these tools can analyse and evaluate video content, including news videos, to provide
valuable insights and identify potential biases.
Although several tools have been developed to address specific challenges
[7–10], few focus on analysing and identifying individuals in news videos. Therefore,
an ideal tool would directly analyse the visibility of individuals in news videos,
especially considering the growing use of video content in the media.

1.3 Proposed Solution


This study proposes a system to analyse a news video and output analytics on the
exposure of individuals in the same video, in this case, TVM news bulletins. The system
can identify individuals, detect the timestamps of their occurrence, and determine their
duration. Since the name of the speaking individual is usually shown in news videos as
a caption, the system can extract the individual’s name and either match the face against
a database of known individuals or add it as a new entry. This automatic annotation from
captions renders the proposed solution unsupervised, meaning that it can keep learning
from new videos without any manual labelling. The system utilises different computer
vision techniques, including face recognition and Optical Character Recognition (OCR),
to read the names of the individuals.

1.4 Aims and Objectives


The aim of this study is to develop a computer vision method to analyse daily news
videos and provide users with valuable insights about the individuals featured in the
videos. To achieve this aim, the following objectives will be pursued:

1. Collect news videos featuring people named in captions.

2. Create a program that automatically extracts names from the news video frames
containing name captions.

3. Design and implement a system that can automatically recognise and identify
individuals shown in news videos, retrieve their names from the captions, and
store newly encountered individuals for future use.

4. Implement a machine learning and computer vision method that identifies and
tracks individuals to create a timeline of appearances as a camera‐time report.

5. Create a user‐friendly interface to facilitate the interactions between the user
and the implemented system.

Each objective will be evaluated using appropriate metrics to assess the
accuracy and effectiveness of the proposed method. The resulting analysis system will
assist users in identifying important insights from news videos that could be easily
overlooked. By achieving these objectives, this study aims to provide a valuable
contribution to the field of video analysis and provide added transparency to news
consumption.

1.5 Document Structure


This dissertation is structured as follows: Chapter 2 provides an explanation of the
employed Computer Vision techniques and discusses differences in methodologies
found in the literature. Chapter 3 presents the design and implementation of the
proposed system, covering video extraction, the core system, video analysis, and
analysis visualisation, while also highlighting the reasons behind the choices made. In
Chapter 4, the final system is evaluated and compared with related papers, with a
comprehensive and detailed assessment conducted by segmenting different parts of
the Video Analysis system. Finally, Chapter 5 concludes the dissertation by
summarising the system’s benefits and limitations and suggesting future directions for
research building upon this study.

2 Background and Literature Review
This chapter provides an essential foundation for understanding the proposed system
by introducing the relevant background concepts and exploring the technologies
utilised. It aims to enhance comprehension and clarity for readers, ensuring a solid
understanding of the techniques employed.
The chapter begins with an overview of the background concepts, providing the
necessary context for the subsequent discussions. It then explores the employed
technologies, offering a concise background, examining relevant papers and research,
discussing strengths, weaknesses, and challenges, and highlighting state‐of‐the‐art
advancements. Furthermore, similar systems will be discussed to gain insights into
existing solutions.

2.1 Background
The concepts of neural networks and Convolutional Neural Network (CNN)s will be
used throughout the literature review and the proposed system. This section aims to
provide an introduction to these concepts so that the mentioned technologies of face
recognition, face tracking, scene detection, and OCR, could be better understood.
Deep learning is a subfield of machine learning that utilises neural networks to
learn and make predictions. Neural networks, computational models inspired by the
human brain, serve as powerful tools for approximating complex functions. They
consist of interconnected nodes known as neurons, working together to learn and
make predictions by adjusting the weights and biases of their connections. Deep
learning has found success in diverse areas like computer vision, image analysis,
information retrieval, Natural Language Processing (NLP), and speech recognition [11].
Common types of neural networks used in deep learning include feed‐forward neural
networks, and CNNs.
CNNs are a type of deep learning model specifically designed for analysing
visual data, such as images or videos [12]. CNNs consist of multiple layers, which
include convolutional layers, pooling layers, and fully connected layers. Convolutional
layers apply filters to extract patterns, pooling layers reduce spatial dimensions, and
fully connected layers enable the network to learn complex relationships and make
predictions [12]. These networks are capable of automatically learning feature
representations from raw data, without the need for explicit feature engineering [12].
By leveraging large amounts of labelled data and powerful computational resources,
CNNs can learn complex patterns and relationships, making them highly effective in
tasks such as image recognition, object detection, and NLP [13–20].


2.2 Face Detection


Face recognition is the process of identifying people within an image by locating the
face, extracting features, and comparing them with stored features for a match or
unknown classification [21]. However, face recognition is challenging due to
inconsistencies in facial appearance caused by factors like pose, expression, occlusion,
illumination, image quality, and motion blur [21–23]. To overcome these challenges,
researchers have developed techniques such as face alignment [24, 25] and deep
learning‐based approaches using CNNs to extract robust features from faces [14–20].
Face detection is a crucial step before face recognition [14], as it enables the
localisation and tracking of faces on which recognition can then be carried out.
Over the years, multiple face detection algorithms and techniques have been proposed,
ranging from classical image processing techniques to deep learning approaches.
One of the most widely used traditional face detection methods is the
Viola‐Jones algorithm [26], which uses Haar‐like features and a cascade of classifiers
to detect faces in real time. However, the algorithm has limitations in handling
variations in pose, lighting, and occlusion [27].
Another widely used approach is the Histogram of Oriented Gradients (HOG)
method proposed by Dalal and Triggs [13], which extracts gradient‐based feature
descriptors, typically classified with a linear SVM, to detect faces in images. This
method achieves good accuracy and handles partial occlusion and lighting variations [27].
More recently, deep learning‐based methods have shown great promise in improving the
accuracy and robustness of face detection, since they can learn more complex feature
representations.
A popular deep learning‐based method is the Multi‐Task Cascaded
Convolutional Neural Network (MTCNN) proposed by Zhang et al. [14], which uses a
cascade of CNNs to detect faces and facial landmarks in images. The method achieves
state‐of‐the‐art performance on several benchmarks and is widely used in many
applications, including news video analysis.
More recently, one‐stage detection methods such as RetinaFace [28] and
BlazeFace [16] have shown great potential in achieving high accuracy and processing
speed. RetinaFace uses a ResNet [15] backbone and a focal loss function to detect
faces at different scales and angles, while BlazeFace is a lightweight and efficient
algorithm designed for mobile devices and real‐time applications.
Overall, face detection is a critical component of face recognition, especially in
the case of video analysis. While traditional methods such as Viola‐Jones [26] and
HOG can achieve good results, deep learning‐based methods such as MTCNN [14],
RetinaFace [28], and BlazeFace [16] have shown great potential in improving accuracy
and processing speed.


2.3 Face Recognition


Face recognition is a computer vision technique used to identify faces within an image,
which can either be a static image or a frame within a video. This identification can be
done using different methods, which usually involve extracting features from the image,
normalising them so that they can be compared, and finally computing a similarity score
that describes how closely the person in question matches each person within the
normalised database.
As described in [30], face recognition gives rise to multiple issues when
real‐world noise is considered. Such issues could arise from differences in lighting,
facial expression, head pose, hairstyle or facial hair, and age. To combat them, one
option is to use features that are unaffected by such variations; for example, the
Quotient Image [31] is less affected by lighting changes. Other solutions are to
normalise the image as much as possible and to include more variation in the data so
that edge cases are covered thoroughly.
One of the most popular and powerful approaches to face recognition is deep
learning‐based methods. For example, Deng et al. have demonstrated this power
through the Additive Angular Margin Loss (ArcFace) [18]. ArcFace employs a CNN to
learn highly discriminative feature representations for each face. By optimising the
feature space with an additive angular margin loss function, ArcFace makes face
recognition more robust to variations in pose and age gaps. The method uses a
large‐scale face recognition dataset, MS1MV2, which is a modified version of the
MS‐Celeb‐1M dataset [32], containing over 5 million images of celebrities and
ResNet‐100 [17], a neural network architecture with 100 convolutional layers, to
extract highly discriminative features from facial images. The final classification layer is
replaced with a cosine similarity function that measures the similarity between the
input face’s feature vector and a set of reference feature vectors. This optimisation
helps the network to learn features that are robust to variations in pose and age gaps,
leading to improved accuracy in face recognition and achieving the highest reported
accuracy to date on the LFW dataset [33], with a verification accuracy of 99.83%.
Another popular approach to face recognition is the use of traditional machine
learning algorithms such as Support vector machine (SVM)s and Principal Component
Analysis (PCA). These methods have been widely used in the past and are still being
used in many practical applications. For example, Dlib [34], a popular C++ library for
computer vision, provides tools and algorithms for face detection, facial landmark
detection, and face recognition. Dlib uses a modified version of the HOG algorithm
optimised for face detection and an SVM classifier for face recognition. Dlib has been
demonstrated to achieve high accuracy on benchmark face recognition datasets such
as LFW [33].


In a recent study by Zhang et al. [35], the Dlib toolkit was utilised to create a
face identification system. The system first detects the face using the HOG algorithm,
aligns it by estimating facial landmarks, and then extracts face encodings using the
CNN model. Instead of using an SVM for classification, the Euclidean distance
between the extracted features is calculated, and if the distance is below a threshold,
the system identifies the closest matching feature. However, if the distance exceeds
the threshold, the system cannot identify the person and may classify them as an
unknown individual. This same method can also be seen in the face_recognition library
[36]. One of the significant advantages of the Dlib toolkit is that it is freely available
and can be easily installed with Python, making it accessible for face recognition tasks.
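
As an illustration of this Dlib‐based approach, the sketch below uses the face_recognition package [36] to detect faces, extract 128‐dimensional encodings, and match them against known encodings using a Euclidean distance threshold. The file names and the 0.6 threshold are illustrative defaults, not values taken from the cited study.

```python
# A minimal sketch of HOG detection, encoding extraction, and distance-based
# matching with the face_recognition package. File names are illustrative.
import face_recognition
import numpy as np

TOLERANCE = 0.6  # default distance threshold used by face_recognition

# Build a small database of known encodings from a reference image.
known_image = face_recognition.load_image_file("known_person.jpg")
known_encodings = face_recognition.face_encodings(known_image)  # one 128-d vector per face
known_names = ["Known Person"] * len(known_encodings)

# Detect and encode faces in a query image (e.g. a video frame).
frame = face_recognition.load_image_file("frame.jpg")
locations = face_recognition.face_locations(frame, model="hog")
encodings = face_recognition.face_encodings(frame, known_face_locations=locations)

for encoding in encodings:
    distances = face_recognition.face_distance(known_encodings, encoding)
    best = int(np.argmin(distances)) if len(distances) else None
    if best is not None and distances[best] <= TOLERANCE:
        print("Match:", known_names[best], "distance:", distances[best])
    else:
        print("Unknown individual")
```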

2.4 Scene detection


Scene detection involves dividing a video into segments based on closely related
frames, facilitating a better understanding and distinction between scenes. Various
methods, such as histogram‐based, contour‐based, motion‐based techniques, and
clustering algorithms, can be used for scene boundary detection [8, 37]. The main
challenge in scene detection is handling unnormalised scene changes, requiring the
evaluation and selection of algorithms suited for detecting specific types of transitions
[38].
Gao et al. [37] outlined the initial step for video analysis, which is to group
related frames together as scenes. They used a Histogram Difference Metric (HDM)
and a Spatial Difference Metric (SDM) to measure the variance between neighbouring
frames, which were then applied in a Fuzzy C‐means Clustering (FCM) algorithm to
identify scene boundaries. Key frames were extracted from each scene, and the
Graph‐Theoretical Method (GTC) algorithm was used to analyse the proximity
relationship between key frames and identify potential anchorperson frames. The
study achieved good results with high precision and recall rates using their new binary
pattern‐matching strategy.
Lupatini et al. [8] tested three different scene detection methodology
categories: histogram‐based, motion‐based, and contour‐based. The performance of
various algorithms within these categories was evaluated for different video types. The
histogram‐based algorithms, including the one named ‘H5’, generally performed the
best, with high recall and precision for news videos, while ‘H7’ achieved the best results
across all video categories. ‘H5’ was based on a global histogram computed over 9‐bit
colour code information, while ‘H7’ was also based on a global histogram but computed
over hue information with 256 bins per histogram. The histogram method
detects scene boundaries by measuring the difference between two frames and
applying a threshold. Additional techniques, such as using histograms on individual
colour channels and analysing frames in a grid, were suggested to address motion and
opacity transitions [29, 39].
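
As an illustration of this histogram‐based family of methods, the following sketch compares the hue histograms of consecutive frames and flags a boundary when their difference exceeds a threshold. The bin count, distance measure, and threshold are illustrative choices, not values from the cited papers.

```python
# A minimal sketch of histogram-based shot boundary detection: compute a hue
# histogram per frame and mark a boundary when consecutive histograms differ
# by more than a threshold.
import cv2

def detect_boundaries(video_path, bins=256, threshold=0.4):
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180])  # hue channel only
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Bhattacharyya distance: large values indicate a likely scene change.
            diff = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if diff > threshold:
                boundaries.append(index)
        prev_hist, index = hist, index + 1
    cap.release()
    return boundaries
```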

2.4.1 News videos


Qi et al. [38] had a different approach towards scene detection, specifically for news
videos. In their paper, they used both audio and frame images to segment the news
video into shots. Furthermore, NLP was used to classify the video segments into
categories for a more contextual description. The structure of the video described in
their paper was such that each video contained stories, where each story had a group
of shots. Each shot, or scene, was segmented by detecting boundaries where both
visual and audio acknowledge a common boundary. The audio was also used as a
classification tool where each scene was classified into music, environment sound,
silence, or speech, while also further classifying each speaker. Results as high as
98% were achieved in the study by Lu et al. [10], which focused solely on the audio
part of the analysis. On the other hand, visual detection was done using colour
correlation analysis to detect shots. Furthermore, drawing on the work of Shahraray et
al. [40], key frames were extracted so that not every frame in a shot needed to be
processed when the anchorperson was detected. As both [38] and [10] show, audio
data can also be used for segmentation as well as for identifying the speaker.

2.5 Video‐based Face Recognition

2.5.1 FaceRec
In their paper, Lisena et al. [41] proposed a system for recognising faces in videos using
a combination of MTCNN [14], FaceNet [19], and SVM classifiers as shown in 2.1. To
train the system, images were obtained using crawlers and processed through MTCNN
and FaceNet to extract embeddings of the faces. These embeddings were then passed
into the SVM classifier to retrieve the most likely match of known faces, including a
confidence score. For video analysis, each frame was individually analysed, and if a face
was detected by MTCNN, it was cropped and aligned, and its embeddings were fed into
the SVM classifier to identify the person. In order to speed up computation, they did
not analyse every single frame of a video, but rather one frame per second of
video. The Simple Online and Realtime Tracking (SORT) algorithm was
used to track each individual face and determine whether the same person appeared in
subsequent images. To label a face in a frame sequence, a simple algorithm calculated
the weighted average of the confidence scores for each possible label and selected the
label with the highest weighted average confidence score as the final prediction.

Figure 2.1 FaceRec prediction pipeline [41]

To identify new unknown faces, all the FaceNet encodings were kept and input
into a new model for future matching. Hierarchical clustering was used to group similar
encodings based on a distance threshold, and the clusters were filtered to exclude
clusters with a side face or ones that can already be assigned a label. Clusters with a
duration longer than a second had a stricter distance threshold to limit the number of
encodings.
In order to evaluate their system, they used two datasets of news videos, the
ANTRACT and MeMAD, so that then a ground truth could be created manually by
selecting segments containing the faces of the most present celebrities. What is
important to note is that the resolutions for both datasets were quite low, with the
ANTRACT dataset containing shots in black‐and‐white with a resolution of 512x384
pixels, while the MeMAD dataset had news videos in colour with a resolution of
455x256 pixels.
The ground‐truth datasets were created by following a process where domain
experts first provided a list of historically well‐known people. The next step was to
search for segments in the videos where these people appeared and divide them into
shots. Then, face recognition was performed on the central frame of each shot,
resulting in a large number of shots. To ensure accuracy, the presence of the person in
each selected segment was manually checked. Furthermore, some shots not involving
any of the specified people were also added to the dataset. This iterative process
continued until a final set of shots was obtained, which included people from the list
provided by domain experts, as well as some additional shots. By using this method,
the ground‐truth datasets were carefully curated to include shots where historically
well‐known people were present, allowing for more accurate and reliable analysis. It
should be noted that since the MeMAD dataset was made up of videos instead of
shots, face recognition was instead performed on a frame at every quarter of each segment.
The system’s precision, recall, and F‐score were calculated based on both ground‐truth
datasets, which were used to compare the appearances of each person.


Overall, the system’s performance was good as indicated by the results, with
better performance achieved on the MeMAD dataset compared to the ANTRACT
dataset. Across both datasets, the F‐scores varied for each individual, ranging from
0.37 to 0.96. Despite this, issues arose in cases with short scenes, which could be
addressed by using scene boundaries. Suggestions for improving the system included
incorporating contextual information such as the date of the video and other people
who appeared [41]. Furthermore, the system’s recall could also be improved, especially
for side faces, which require a proper strategy for handling.
Additionally, the number of individuals included in a face recognition system
and datasets is an important factor affecting the system’s performance. Upon analysing
the study, it is evident that the system used in the study was limited in terms of the
number of individuals included. Specifically, the system only searched for 19
celebrities, and there were only 82 fragments with unknown faces, which may not
provide sufficient diversity for testing the system’s ability to recognise the different
faces. Therefore, while the study’s findings are informative, it is important to consider
the limitations of the system and datasets used when interpreting the results.

2.5.2 Survey
Another approach for identifying faces within a video was discussed by Wang et al.
[22]. Similar to the previous paper [41], it was highlighted that the different
components of face recognition in a video include face detection, face tracking and
face recognition. Wang et al. discussed different algorithms and techniques for each
component within the system and the strength and problems that come with each
method. In the survey, two main issues were identified, which were the lack of
standardised video databases since these tend to take a lot of storage, and also that
currently there are not a lot of methods that deal with a sequence of images or frames
within in a video, but rather more on just still images. Regarding this last point, utilising
a sequence of images or frames could help to extract more information such as a 3D
face model or a better normalised and standardised 2D face model. In the survey, they
also provided some databases for video analysis which include the faces of different
subjects, and specifically discussed the use of CAMSHIFT [42], condensation [43], and
adaptive Kalman filter algorithms [44] for facial tracking.

2.6 Optical Character Recognition


Optical Character Recognition (OCR) is a process used to read text from an image. The
OCR process involves image preprocessing, character segmentation, feature
extraction, character classification, and post‐processing techniques such as NLP and
dictionary‐based approaches [45]. Challenges in OCR include different fonts or
handwriting, text‐background contrast, and indistinguishable characters, which can be
mitigated by using a consistent and generic font and applying adequate preprocessing
techniques [46].
Several studies using OCR to analyse news videos [47–49] have achieved
promising results. One of the most powerful tools for this computer vision technique is
Tesseract [50], an open‐source OCR engine used in multiple studies [47, 51–53],
demonstrating its effectiveness. The architecture of Tesseract can be seen in Figure 2.2.

Figure 2.2 Architecture of Tesseract OCR (Source: [52])

Wattanarachothai et al. [53] presented a novel approach to retrieving video
content using text‐based techniques, including key frame extraction, text localisation,
and keyword matching. This study builds upon previous work by Gao et al. [37] on
unsupervised video analysis and key frame extraction. The authors employed
Maximally Stable Extremal Region (MSER) features to extract key frames, enabling the
segmentation of video shots with diverse text contents. Text localisation involved
clustering MSERs in each key frame based on their similarity in position, size, colour,
and stroke width to identify text regions. The Tesseract OCR engine was then utilised
to recognise text within these regions, where four images obtained from various
preprocessing methods were used to enhance recognition results. Finally, an
approximate string search approach was employed to match OCR outcomes with a
target keyword for query purposes. Experimental results demonstrated that the MSER
feature facilitated efficient video segmentation into shots, outperforming a sum of
absolute difference and edge‐based method in terms of precision and recall.


2.7 Object Tracking


Object tracking involves identifying and following objects across frames in a video or
image sequence. It has applications in surveillance, autonomous driving, robotics, and
human‐computer interaction [54, 55]. Various tracking algorithms, including
correlation‐based, feature‐based, and deep learning‐based methods, can be employed
and tested based on the specific requirements of the application. Challenges in object
tracking include occlusion, motion blur, and changes in lighting conditions [55].
Template matching is an early approach to object tracking that uses a fixed
template of the object being tracked and matches it with subsequent frames in the
video [56]. However, this approach is limited by appearance changes of the object,
such as scale, rotation, illumination, and occlusion changes [55].
To overcome the limitations of template matching, more advanced techniques
such as correlation filters have been developed. The Minimum Output Sum of Squared
Error (MOSSE) filter introduced by Bolme et al. [56] uses a linear correlation filter
trained with positive and negative samples. It has shown superior performance
compared to other correlation filter‐based methods. Another variant of correlation
filter‐based algorithms is the Kernelized Correlation Filter (KCF) proposed by
Henriques et al. [60], which uses a kernel function to map images into a higher‐dimensional
feature space. The KCF algorithm has demonstrated high accuracy and computational
efficiency, making it a popular choice for real‐time tracking applications.
Deep learning‐based techniques, such as the Siamese network, have also shown
promising results in object tracking due to their ability to learn robust features for
object representation [57]. The Siamese network learns a similarity metric between a
template image and a search image, enabling effective tracking by searching for the
matching section in each frame. It can handle changes in appearance, scale,
orientation, occlusion, and deformation. These techniques have also been applied in
real‐time scenarios [58].
In conclusion, object tracking has practical applications, and researchers have
developed advanced techniques to address the limitations of early methods like
template matching. Correlation filters like MOSSE and KCF offer superior performance,
while deep learning‐based techniques like the Siamese network can learn robust
features and handle various challenges in object tracking. Each technique provides
a different option for object tracking, with its own advantages and limitations.

3 Methodology
The methodology chapter provides a detailed explanation of the design and
implementation process used to create the proposed solution, linking the research
discussed in the literature review to the chosen solution. It offers a critical analysis of
the system’s operation, detailing any unforeseen problems encountered during
implementation and discussing how they were addressed. The chapter begins with a
high‐level overview of the system design and specifications, which is then followed by
a detailed description of each component, providing a link to the big picture presented
earlier. Design choices made throughout the project are justified by discussing their
implications and the reasons behind them.
For video analysis, the website of TVM 1 was selected; TVM is one of the
most popular news stations in Malta [59]. Moreover, its extensive collection of daily
news broadcasts, available for both live streaming and later online viewing, solidifies its
position as an ideal choice. The implementation of the system is specifically catered
towards this video format.
For this system, the Python programming language was used throughout, together
with external Python packages. The main packages used are face_recognition, OpenCV,
and Pytesseract, which will be discussed further in the next sections.

3.1 System Overview


This section provides a high‐level overview of the system’s design and its components.
The system consists of three main components: Video Analysis, News Video
Extraction, and Analysis Visualisation. The Video Analysis component serves as the
core, responsible for detecting and recognising faces, matching them with a database
of known individuals, extracting names, and generating a timeline of occurrences. The
News Video Extraction component provides input data, while the Analysis
Visualisation component handles user interaction and presents the results in a
comprehensive format. The interaction between these components is depicted in
Figure 3.1. Figure 3.7 and Figure 3.8 demonstrate an example of frames that are
analysed and then converted into a timeline of appearing individuals.
In the following sections, detailed descriptions of each component’s design and
implementation will be provided. Any encountered problems and their solutions will
also be discussed.
1 https://tvmi.mt/series/117


Figure 3.1 Level 1 DFD showing the interaction between the 3 main components.

3.2 News Video Extraction


The purpose of this section is to outline the process by which the news videos were
acquired, as well as the reasoning behind segmenting them into smaller and more
pertinent clips. Additionally, the composition of the videos will be explained in order to
provide a more comprehensive understanding of their content.

3.2.1 News Videos


The initial collection of videos was obtained from the chosen domain, TVM, by
downloading them on a daily basis. The intended duration for this process was
approximately one month although a few additional videos were acquired beyond the
original target. In total, 52 videos were extracted from July to September 2022, which
also included non‐consecutive dates.
Although efforts were made to download the videos at their highest resolution,
the resulting quality was not optimal, with the maximum resolution being 1280 × 720
pixels. However, considering the requirements of this task and the need for faster
computational speed, this resolution was deemed adequate to start the Video Analysis
component.

3.2.2 News Video Segments: Reducing video duration


Since the 52 videos contained approximately 32 hours of footage, and most of them
contained random scenes that do not include faces, analysing all of the videos was very
time‐consuming. To overcome this challenge, a limit was set on the number of videos,
and measures were taken to make video analysis faster and more efficient. However,
even with these limitations, the sections containing no faces still added considerable
extra processing time that could be reduced further.
Further investigation revealed that TVM has news articles with clips directly
from news videos. These clips usually contain interviews related to the particular
article. However, these clips were missing the captions containing the names of
the interviewees, which is an issue since name extraction is an important function of
the current system. A system was therefore devised to locate these clips within the
news videos that were extracted.
To accomplish this, a list of articles, called ’Article Links’, was manually linked to
news videos through external sources. This linking contained the start time and
duration of each clip, mapped to the respective video. Furthermore, for each article
there was usually a Maltese and an English version, both of which were mapped to the
same clip. A sample of this data can be seen in Figure 3.2.

Figure 3.2 A sample of the article links mapped with the video date 29.07.2022

Using the timestamps in the article links data, a simple system, shown in
Figure 3.3, was designed to extract parts of the videos starting at the given
timestamp and lasting for the given duration. For this system, Python’s package for FFmpeg was
used, which is a widely used open‐source program for handling video and audio files.
The statistics for the videos extracted by this method can be seen in table 3.1 under
the column ’News Video Segments’.
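
As an illustration of this extraction step, the sketch below cuts a segment out of a downloaded news video given a start time and duration. The exact FFmpeg wrapper used in the project is not specified, so this example assumes the ffmpeg-python package, and the file names and timestamps are placeholders.

```python
# A minimal sketch of extracting a news-video segment from a start time and
# duration, assuming the ffmpeg-python package.
import ffmpeg

def extract_segment(video_path, start_seconds, duration_seconds, output_path):
    (
        ffmpeg
        .input(video_path, ss=start_seconds)                 # seek to the clip's start time
        .output(output_path, t=duration_seconds, c="copy")   # copy streams, no re-encoding
        .overwrite_output()
        .run(quiet=True)
    )

# Example: a 2-minute clip starting at 12 minutes 30 seconds.
extract_segment("news_2022-07-29.mp4", 750, 120, "segment_001.mp4")
```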

3.2.3 Transcript Extraction


For the purpose of validation, an additional system was created to extract a transcript
for every URL within the same news article links dataset explained in Section 3.2.2.
This system incorporated tailored web scraping techniques, specifically designed for
TVM news articles.
These transcripts were extracted using Selenium 2 due to its reliability and the
author’s extensive experience with it in Python. Although Puppeteer was also considered, it was
2 https://selenium‐python.readthedocs.io/

Figure 3.3 Showing the interactions of data within the video segment extraction
system. Furthermore, an additional system was made to extract transcripts from the
article URLs.

Table 3.1 Video Database Statistics

Basic Statistics                          News Videos     News Video Segments
Number of videos                          52              215
Number of News Video Dates                52              41
Total Video Duration                      31.74 hours     8.48 hours
Average Video Duration                    36.63 minutes   2.37 minutes
Total Frame Count                         3,256,423       858,398
Average Frame Count                       62,623          3,993

Frequencies                               News Videos     News Video Segments
Frames per Second (fps)
‐ 30                                      36              135
‐ 25                                      16              80
Resolutions in pixels (width × height)
‐ 1280 × 720                              40              150
‐ 1024 × 576                              8               43
‐ 960 × 540                               4               22

ultimately decided to keep everything in Python to ensure the seamless integration of
each component. Both tools allow for automating mouse clicks, reading elements, and
any other necessary web interactions.
Ultimately, a different validation method was employed, rendering these
transcripts unused. Nonetheless, they possess potential value for future system
advancements, such as keyword extraction or similar methods.


3.2.4 Video Structure


The extraction process resulted in two sets of news videos: the full news videos and
the segments. Understanding the content of these videos is crucial, especially since
the segments are derived from the full news videos.
The content of the news videos can be categorised into five main scenes: a
person speaking with their name displayed in a caption, a person speaking without
their name shown, a presenter inside the studio, a presenter outside the studio, and
scene footage. Some examples of these can be seen in Appendix B.
The frames selected for Video Analysis specifically include those that feature
faces, falling into categories 1‐4. Additionally, Name Extraction exclusively focuses on
extracting names when they are displayed as captions (category 1). It is worth noting
that, in certain instances, presenters were also provided with name captions. However,
as the system does not recognise presenters separately, they were treated in the same
manner as other individuals.
The ’News Video Segments’ primarily featured interviews with multiple
individuals rather than just presenters. While some segments included scene footage,
there were instances where they exclusively consisted of scene footage. In such cases,
there would be no individuals to extract or recognise from those specific video
segments.

3.3 Video Analysis


The Video Analysis component is the backbone of the system and is comprised of
several sub‐components that work together to extract the faces and names of
individuals and create a timeline. Similar to [22], these sub‐components include face
detection, face recognition, object tracking, and the additional name extraction. The
face detection algorithm locates faces in the video frames, and the face recognition
algorithm compares these faces with a database of known individuals to determine
their identities. Object tracking is then used to follow these identified individuals as
they move through the video. Finally, the name extraction component uses OCR to
extract the names of individuals from the video. Together, these sub‐components
enable the Video Analysis component to accurately identify and track individuals in
news videos and provide a detailed timeline of their appearances. These
sub‐component interactions can be seen in Figure 3.4.
To store the analysis, the system generates analysis files in two formats: JSON
and Pickle. JSON files are for user viewing and debugging purposes, while Pickle files
are used to load saved analysis. Additionally, the system stores the database used for
face recognition, along with the images used for each individual. More details about


Figure 3.4 The level 2 DFD for the Video Analysis, showing the interaction between each
sub‐component

the files can be found in Appendix H with examples.
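
For illustration, a minimal sketch of persisting an analysis in both formats might look as follows; the structure of the analysis dictionary is hypothetical and not the project’s actual schema.

```python
# Save an analysis result as JSON (for human inspection) and Pickle (for
# reloading into the system later). The dictionary contents are illustrative.
import json
import pickle

analysis = {
    "video": "news_2022-07-29.mp4",
    "scenes": [
        {"name": "Example Person", "start": 12.0, "end": 19.0},
    ],
}

with open("analysis.json", "w", encoding="utf-8") as f:
    json.dump(analysis, f, indent=2, ensure_ascii=False)   # readable, for debugging

with open("analysis.pkl", "wb") as f:
    pickle.dump(analysis, f)                               # compact, for reloading
```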

3.3.1 Overview
The Video Analysis system breaks down a video into scenes and analyses each frame
to identify the occurrences of individuals in the video. It consists of several key classes.
The Face Info class stores individual data, including names and facial encodings, while
the Face Info Database class manages these instances, facilitating storage, loading,
sorting, and merging. The Scene class represents occurrence data about an individual
and comprises multiple Scene Instances, which hold the data of each analysed frame.
The News Analysis class combines the functionalities of the Face Info Database and
Scene classes, populating scenes with individual occurrence information and updating
the database with new individuals. This design allows for efficient and accurate
analysis of individual occurrences in the video, while utility functions facilitate data
retrieval and calculations. For a clearer overview of these classes, refer to Appendix C.
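
To make this structure more concrete, the following simplified sketch models the described classes as Python dataclasses. The class names follow the description above, while the attribute names and types are assumptions rather than the project’s actual implementation.

```python
# A simplified sketch of the classes described above: Face Info, Scene Instance,
# Scene, and Face Info Database. Attribute names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class FaceInfo:
    name: Optional[str]                                         # extracted or database name
    encodings: List[np.ndarray] = field(default_factory=list)   # 128-d face encodings

@dataclass
class SceneInstance:
    timestamp: float                                            # seconds into the video
    face_location: Tuple[int, int, int, int]                    # (top, right, bottom, left)
    encoding: Optional[np.ndarray] = None

@dataclass
class Scene:
    instances: List[SceneInstance] = field(default_factory=list)
    extracted_name: Optional[str] = None                        # from the caption OCR
    matched_face: Optional[FaceInfo] = None                     # set after recognition

    @property
    def duration(self) -> float:
        if not self.instances:
            return 0.0
        return self.instances[-1].timestamp - self.instances[0].timestamp

@dataclass
class FaceInfoDatabase:
    faces: List[FaceInfo] = field(default_factory=list)

    def add(self, face: FaceInfo) -> None:
        self.faces.append(face)
```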
The video analysis process involves frame‐by‐frame analysis using face
detection. When a single face is detected, the system tracks it, forming a ”Scene.” Face
encodings are extracted for each detected face in a scene, and name extraction is
employed to attempt name retrieval. At the end of a scene, face recognition is
performed by comparing the face encodings with a database of known individuals.

New faces are added to the database with their respective names and encodings. The
specifics of the system’s sub‐components and their interactions are depicted in Figure
3.4.
The Video Analysis class offers various parameters to customise the analysis
for optimal accuracy, efficiency, and speed. The specific parameters used for the final
analysis include:
• Video Resolution: The resolution of the video was set to 640 × 360 pixels,
allowing for faster face detection, encoding extraction, and tracking.

• Interval for Frame Analysis: The system analyses each frame at an interval of 1
second, balancing the need for accuracy with the efficiency of processing time.

• Minimum Scene Duration: The minimum length of a scene is set to 0 seconds,
allowing for the system to detect and track short‐lived faces.

• Tracker Type: The system uses the KCF tracker algorithm for face tracking, which
provides a balance between accuracy and speed.

• Overwrite Tracker: This feature allows the face detection component to
overwrite the tracked face location, providing more accurate detection of faces in
the image. The setting was set to true.

• Face Recognition Tolerance: The system has a tolerance of 0.6 for Face
Recognition, meaning that a match is only considered valid if the distance metric
between the face encodings in the scene and the database is smaller than or
equal to 0.6. This was the default value, which achieved an accuracy score of
99.38% on the LFW dataset [33].

• Face Encoding Selection: Since multiple face encodings are extracted from a
scene, this parameter gives flexibility in choosing the selection process. The
chosen method is to average all the face encodings together.

When analysing a batch of videos, a multi‐processing approach was
implemented to simultaneously analyse multiple videos, reducing the overall analysis
time. It’s important to note that no shared facial database was utilised to avoid
conflicts; instead, an internal database was used. Additionally, for analysing a single
video, multi‐processing was also employed, with face encodings being extracted
simultaneously with frame analysis.
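
For reference, the parameter values listed above can be collected into a single configuration, sketched below; the parameter names are descriptive assumptions and the project’s actual interface may differ.

```python
# An illustrative configuration gathering the analysis parameters listed above.
ANALYSIS_PARAMS = {
    "video_resolution": (640, 360),     # downscaled frame size in pixels
    "frame_interval_seconds": 1.0,      # analyse one frame per second
    "minimum_scene_duration": 0.0,      # keep even very short scenes
    "tracker_type": "KCF",              # OpenCV tracker used for faces
    "overwrite_tracker": True,          # let detection correct the tracker box
    "face_recognition_tolerance": 0.6,  # maximum encoding distance for a match
    "encoding_selection": "average",    # average all encodings in a scene
}
```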

3.3.2 Name Extraction


The name extraction component was designed to extract names from video data that
followed a specific structure. This structure involved interviews where the name of the
interviewee was displayed in a white box with blue text. The text could appear in two
separate lines or locations, and sometimes non‐name text appeared in a red box with
white text, which marked the current frame as containing no name. Moreover, it was
observed that the name usually appeared on the topmost line containing this white box
with blue text.
To accurately extract the name from the video frames, adequate preprocessing
was carried out using OpenCV 3. This involved thresholding HSV colours, applying
morphological closing and opening to the image, and identifying box‐like contours
through conditional checks, with extensive testing to ensure that optimal values and
configurations were used. The preprocessed image was then passed through Pytesseract
OCR 4, which is based on Google’s Tesseract‐OCR Engine [50]. Pytesseract was selected
for its accuracy in recognising simple, standardised text, such as the white box with
blue text that appeared in the videos.
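
A minimal sketch of this preprocessing and OCR step is shown below. The HSV thresholds, kernel size, and the size checks used to identify a caption‐like box are illustrative values only, not the tuned values used in the project.

```python
# Threshold the frame for the white caption box, clean it up with morphological
# operations, find a box-like contour, and read its text with Pytesseract.
import cv2
import numpy as np
import pytesseract

def extract_caption_text(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Low saturation and high value roughly isolates a white caption box.
    mask = cv2.inRange(hsv, (0, 0, 200), (180, 40, 255))
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w > 150 and 20 < h < 80:          # crude check for a caption-like box
            crop = frame_bgr[y:y + h, x:x + w]
            gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
            text = pytesseract.image_to_string(gray, config="--psm 7").strip()
            if text:
                return text
    return None
```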
Challenges arose in cases where text transition animations occurred, making it
difficult to extract the name correctly. To address this issue, the most common
non‐empty text was selected as the name of the individual in the scene, and if the
individual was matched with the database, their original name was used instead. This
approach helped to solve the problem of extracting the correct name from the videos
and provided a reliable method for name extraction.

3.3.3 Face Detection, Encoding, and Recognition


For both the face detection and recognition components, the Python package ”Face
Recognition” [36] was used. This open‐source package allows for face detection,
extracting face encodings, and calculating a distance metric for the face encodings. It is
based on the Dlib [34] toolkit, which has been shown to achieve accurate and efficient
results for face detection and recognition tasks [35].
The ”Face Recognition” package utilises the HOG algorithm for face detection
and extracts facial features, known as face encodings, for each detected face. These
encodings consist of 128 features that represent the unique characteristics of a face.
Recognition is achieved by calculating the distance between the face’s encoding and
the encodings of known faces in a database using the Euclidean distance. If the
distance falls below a predefined threshold, the face is considered a match, and the
associated name from the database is assigned to the detected face. This approach
enables accurate face recognition even in challenging conditions such as varying
lighting and facial expressions. The package also offers the capability to train custom
models for specific use cases, further enhancing recognition accuracy. To match a
3 https://opencv.org/
4 https://github.com/madmaze/pytesseract
detected face with a known individual, the distance between the face and the database
encodings is calculated for each frame. The average distance across all frames is then
compared to a given threshold, inspired by FaceRec [41].
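
The following sketch illustrates this scene‐level matching: distances to the database encodings are computed for every analysed frame of a scene, averaged per known individual, and the best average is compared against the tolerance. Variable names are illustrative.

```python
# Match a scene's face encodings against a database by averaging per-frame
# distances and applying the 0.6 tolerance.
import numpy as np
import face_recognition

def match_scene(scene_encodings, database_encodings, database_names, tolerance=0.6):
    # scene_encodings: 128-d encodings, one per analysed frame of the scene
    # database_encodings: 128-d encodings, one per known individual
    if not scene_encodings or not database_encodings:
        return None
    # Rows: frames in the scene; columns: known individuals.
    distances = np.array([
        face_recognition.face_distance(database_encodings, enc)
        for enc in scene_encodings
    ])
    mean_distances = distances.mean(axis=0)     # average across frames
    best = int(np.argmin(mean_distances))
    if mean_distances[best] <= tolerance:
        return database_names[best]
    return None  # unknown individual; a new database entry can be created
```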
To address the challenge of selecting the facial encoding of a person when
multiple encodings are extracted due to appearing in multiple frames, three different
approaches were used: taking the first frame as the encoding, taking the middle frame
as the encoding, inspired by Gao et al.’s [37] paper on scene detection, and calculating
the average extracted encodings. These strategies help in selecting the most suitable
encoding for each person, improving facial recognition accuracy.
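A small sketch of the three selection strategies, assuming the per‐frame encodings for a person have already been extracted as 128‐dimensional vectors; the strategy names are illustrative labels for the approaches described above.

import numpy as np

def select_encoding(encodings, strategy="average"):
    """Choose a single 128-d encoding for a person who appeared in several frames."""
    if strategy == "first":            # encoding from the first frame the face appeared in
        return encodings[0]
    if strategy == "middle":           # encoding from the middle frame, as in scene-based selection
        return encodings[len(encodings) // 2]
    return np.mean(encodings, axis=0)  # element-wise average of all extracted encodings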
However, the system faced challenges when using a larger database of faces: as the
number of stored individuals grew, new faces were increasingly matched to existing
entries, decreasing the system’s accuracy. Figure 4.1 demonstrates this decrease in
performance as the database size increases.

3.3.4 Face Tracking


This particular sub‐component was designed to identify the start and end times of an
individual in a given video. Initially, the system relied on scene detection, which
performed reasonably well. However, this approach had a significant drawback, as it
compromised the accuracy of the timestamps and the individual’s duration of
appearance. Therefore, the system was revamped to use a tracker instead of scene
detection. This modification had several benefits, including improved accuracy of the
timestamps and duration, as well as reduced computation time. Unlike scene
detection, the tracker didn’t require prior video processing, allowing it to track faces
and analyse the video simultaneously.
To segment a video into different scenes, the PySceneDetect package was
utilised, which offers various algorithms to efficiently perform this task. The default
algorithm was selected, which works by calculating the average HSV colour of each
frame to determine the difference between two subsequent frames. If the difference
exceeds a predefined threshold, a boundary is detected, and a new scene begins.
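A minimal sketch of this step using the newer PySceneDetect API is shown below; the file name and threshold value are illustrative, and the default ContentDetector threshold may differ between library versions.

from scenedetect import detect, ContentDetector

# Content-based detection compares the HSV content of consecutive frames
# against a threshold; a new scene starts whenever the difference exceeds it.
scene_list = detect("news_segment.mp4", ContentDetector(threshold=27.0))

for index, (start, end) in enumerate(scene_list):
    print(f"Scene {index}: {start.get_timecode()} -> {end.get_timecode()}")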
For tracking individuals in the video, the system employed the KCF [60]
algorithm, which is a popular and relatively accurate tracker that operates at high
speeds. To minimise the need for additional libraries and ensure ease of integration, a
tracker from OpenCV was selected. Although several other trackers were briefly tested
from OpenCV, it was found that KCF performed better than the alternatives. The use
of KCF enabled the system to track individuals accurately and efficiently, resulting in
improved accuracy of timestamps and duration, while also reducing computation time.
However, tracking still faced some challenges. Sliding transitions caused the
tracker to track a non‐face, and fast motion between frames resulted in offsetting the
tracker, leading to sub‐optimal face encoding extraction. To address these issues, a
parameter was added that allows the tracker’s bounding box to be overwritten. This
parameter, listed in Section 3.3.1, results in more accurate face encodings and
mitigates the issues caused by fast motion and sliding transitions.
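The sketch below illustrates the general idea of combining a KCF tracker with periodic re‐detection so that the bounding box can be overwritten, assuming OpenCV’s tracking module (opencv-contrib-python) is available; the re‐detection interval, function name, and overall structure are illustrative rather than the project’s exact implementation.

import cv2
import face_recognition

def track_face(video_path, redetect_every=30, overwrite_box=True):
    """Track one face with KCF, periodically re-initialising from a fresh detection."""
    capture = cv2.VideoCapture(video_path)
    tracker = None
    tracked = False
    frame_index = 0

    while True:
        ok, frame = capture.read()
        if not ok:
            break

        if tracker is not None:
            tracked, box = tracker.update(frame)  # follow the face between detections

        # (Re)initialise the tracker from a fresh detection every N frames,
        # or whenever tracking fails, to correct drift from fast motion.
        if tracker is None or not tracked or (overwrite_box and frame_index % redetect_every == 0):
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            locations = face_recognition.face_locations(rgb)
            if locations:
                top, right, bottom, left = locations[0]
                tracker = cv2.TrackerKCF_create()
                tracker.init(frame, (left, top, right - left, bottom - top))
                tracked = True

        frame_index += 1

    capture.release()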

3.3.5 Validation
To evaluate the performance of the Video Analysis system, three main parts will be
measured: OCR name extraction, face recognition, and the duration of occurrence for
each person. The accuracy of name extraction and face recognition indicates how well
the system extracts meaningful information from a video, while the duration of
occurrence indicates how accurately individuals are tracked over time. By analysing
the performance of each component, the overall effectiveness of the Video Analysis
system can be assessed, and its strengths, weaknesses, and areas for improvement
identified.
Initially, the Video Analysis system’s results were obtained using the transcript
generated in Section 3.2.2. However, upon closer inspection, it became evident that
these results were flawed. The transcripts did not include timestamps for individuals,
and in a single clip, there could be multiple instances of different people or no people
at all. As a result, the results obtained using the transcripts were deemed insufficient to
meet the validation requirements. More detailed annotations were required to
accurately evaluate the system’s performance, prompting the need for more
comprehensive and accurate data sets.
To address the issues with the initial results obtained from the transcript‐based
evaluation, a new validation set was created thanks to independent third‐party
annotators. The validation set was carefully annotated and included all occurrences of
individuals in the news video segments in Excel format. The annotations included the
timestamps, names, and flags indicating whether the name was shown in the current
scene and whether the person was a presenter. The creation of this validation set
provided a more accurate and comprehensive dataset for evaluating the system’s
performance and allowed for a more detailed analysis of the results.
The annotations for all 215 video segments, as presented in Table 3.1, comprised a
total of 1008 annotations, involving 244 distinct individuals. Out of these annotations,
580 featured named individuals, while 428 featured unnamed individuals. Additionally,
218 annotations indicated the presence of presenters, and 262 annotations indicated
that a name was displayed on the screen.

Figure 3.5 An example of validation annotation for a news video segment.


The generated metrics will then be used to compare system variations that use
scene detection, face tracking, a cumulative database, and different parameters. It
should be noted that the calculations are performed on the ’News Video Segments’
videos, as explained in Section 3.2.2.

Figure 3.6 This example demonstrates scene data generated from video analysis. It is
presented in Excel format for comparison with the validation in Figure 3.5. Many
unnamed faces are detected, indicating that the program is detecting background faces
rather than identifiable individuals.

3.4 Analysis Visualisation


A GUI was incorporated into the system to enhance its usability. The GUI offers a
user‐friendly interface that allows users to analyse videos, view results, adjust the
facial database, and import or export the database. By introducing the GUI, the system
becomes more accessible and user‐friendly, aiming to streamline the video analysis
process and enhance the overall user experience.
To create a user‐friendly interface, the Matplotlib and PyQt packages were utilised.
Matplotlib is a popular Python package for generating visualisations such as graphs and
charts. PyQt is a Python binding for the Qt framework, which offers a wide range of
GUI components and tools for building interactive applications. The combination of
Matplotlib and PyQt enabled the development of a customised and responsive GUI that
can handle user input and display output from the News Analysis class. See Appendix
F for the program and Appendix D for additional timelines.
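As a rough illustration of how such a GUI can be assembled, the sketch below embeds a Matplotlib timeline inside a PyQt5 window; the window title, layout, and appearance data are placeholder values and do not reproduce the actual program.

import sys
from PyQt5.QtWidgets import QApplication, QMainWindow, QVBoxLayout, QWidget
from matplotlib.backends.backend_qt5agg import FigureCanvasQTAgg as FigureCanvas
from matplotlib.figure import Figure

class TimelineWindow(QMainWindow):
    """Minimal sketch: a PyQt window containing a Matplotlib appearance timeline."""

    def __init__(self, appearances):
        super().__init__()
        self.setWindowTitle("News Video Analysis")  # placeholder title

        figure = Figure(figsize=(8, 3))
        canvas = FigureCanvas(figure)
        axis = figure.add_subplot(111)

        # Draw one horizontal bar per person: (name, start_sec, end_sec).
        for row, (name, start, end) in enumerate(appearances):
            axis.barh(row, end - start, left=start)
        axis.set_yticks(range(len(appearances)))
        axis.set_yticklabels([name for name, _, _ in appearances])
        axis.set_xlabel("Time (s)")

        container = QWidget()
        layout = QVBoxLayout(container)
        layout.addWidget(canvas)
        self.setCentralWidget(container)

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = TimelineWindow([("Presenter", 0, 30), ("Interviewee", 30, 75)])  # sample data
    window.show()
    sys.exit(app.exec_())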


Figure 3.7 A frame timeline showing 3 people who appeared, along with the real‐time
analysis annotations. The generated analysis timeline can be seen in Figure 3.8

Figure 3.8 The analysis timeline of Figure 3.7, showing 3 people along with their
timestamps and durations.

4 Evaluation
In the previous chapter, a system was developed to automatically analyse news videos
by extracting a video database, analysing individual information using video analysis,
and integrating the results into a user‐friendly GUI. This chapter will evaluate the full
system, including its strengths, weaknesses, and performance.
Since the core system is built from the Video Analysis component, it is
important to correctly analyse and evaluate this system especially. While there have
been studies on news videos [8, 37, 38, 47–49] and face recognition in videos [22, 41],
few studies have simultaneously addressed the task of live face recognition and name
extraction. Therefore, sub‐components needed to be calculated and evaluated
separately to assess the system’s overall performance.
The evaluation will follow a similar structure to the previous chapter, with each
component and sub‐component presented in nested sections. The following sections
will provide a detailed evaluation of the system.

4.1 News Video Dataset


The news extraction component achieved its goal, which was to gather enough videos
for the video analysis stage to start. One obvious flaw is that, since websites are
dynamic, the extraction code must be modified whenever the website changes. In fact,
this happened while working on this project, although it was not very problematic,
since only simple changes to the code were needed.
Ideally, for a more standardised evaluation of video analysis systems, established
databases could be used; one such example is the MeMAD dataset used in [41].
However, as argued by Gao et al. [37], news videos tend to have multiple formats
depending on the source, and having a more niche video database such as the TVM
news videos might aid future developments in considering different approaches and
applications.

4.1.1 Reduced dataset


The “News Video Segments” dataset has significantly improved the video analysis
process by removing extraneous sections present in the full video. This reduction has
shortened the analysis duration from a day to just a couple of hours, enabling multiple
in‐depth tests to be conducted.
Although the reduced dataset has significantly expedited the video analysis process
by eliminating extraneous sections, it may not provide a complete representation of
the analysis that could be conducted on the full
news video. In future work, this limitation could be addressed by providing annotations
for the full news videos. Additionally, using a larger dataset with annotations from
multiple annotators could lead to more accurate annotations, as the current
annotations, which were not perfect, had to be modified in some cases.

4.2 Video Analysis


To evaluate the overall system, a comprehensive set of metrics was employed to
assess the performance of each sub‐component: name extraction, face recognition,
facial tracking, and the duration of the analysis.
To achieve the desired system performance, extensive testing was conducted to
identify the optimal parameters and methods. The results of these tests can be seen in
Table 4.1, which provides an overview of the performance achieved with each different
variation.
Additionally, to assess the system’s scalability, an incremental database test was
conducted. The results of this test are presented in Figure 4.1, which illustrates the
performance of the system as the database size increases.
In the following sections, the method for each metric is explained, followed by the
evaluation of those metrics. This evaluation will help in gaining deeper insight into the
performance of the system and identifying any areas for improvement.

4.2.1 Metrics

Name Extraction

To evaluate the accuracy of the name extraction sub‐component, a confusion matrix
will be generated for the names encountered by the system against the names in the
validation set.
The four parts of the confusion matrix represent:

• True Positives (TP): The number of actual names that were correctly identified.

• False Positives (FP): The number of names that were incorrectly identified.

• True Negatives (TN): The number of non‐names that were correctly identified by
the name extraction sub‐component.


• False Negatives (FN): The number of actual names that were missed by the name
extraction sub‐component.

By analysing the confusion matrix, various performance metrics can be calculated,
such as precision, recall, and F1 score, that will help in evaluating the
accuracy of the name extraction sub‐component.
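These metrics follow directly from the four counts; a small sketch of the calculation, with illustrative counts, is shown below.

def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1-score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn) if (tp + fp + tn + fn) else 0.0
    return precision, recall, f1, accuracy

# Example with illustrative counts: 10 names correct, 2 spurious, 3 missed, 5 correct non-names.
print(classification_metrics(tp=10, fp=2, tn=5, fn=3))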

Face Recognition

Similar to the Name Extraction sub‐component discussed in Section 4.2.1, a confusion
matrix will be generated to evaluate the accuracy of the Face Recognition
sub‐component. The only difference is that instead of using the names extracted from
the OCR, the names of the matched individuals from the Face Recognition
sub‐component will be used in the confusion matrix.

Duration of Individuals

To evaluate the accuracy of the scene prediction sub‐component, or the tracker, for
each person, the Mean Absolute Error (MAE) will be calculated using the following
equation:

\mathrm{MAE}_{\text{duration}} = \frac{1}{n}\sum_{i=1}^{n} \left|\hat{y}_i - y_i\right| \qquad (4.1)
Here, n is the number of common individuals between the actual and predicted
individuals, yi is the actual duration of the i‐th individual, ŷi is the predicted duration of
the i‐th individual, and | · | denotes the absolute value function.
The MAE duration represents the average absolute difference between the
predicted durations and the actual durations across all individuals. The smaller the
MAE duration, the better our predictions are.
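A minimal sketch of the calculation in Equation (4.1), assuming the actual and predicted durations are stored as name‐to‐seconds dictionaries; the names and values below are illustrative.

def duration_mae(actual, predicted):
    """Mean absolute duration error over individuals present in both dictionaries."""
    common = set(actual) & set(predicted)
    if not common:
        return None
    return sum(abs(predicted[name] - actual[name]) for name in common) / len(common)

# Example: predictions off by 2 s and 4 s give an MAE of 3 s.
print(duration_mae({"Alice": 20.0, "Bob": 35.0}, {"Alice": 22.0, "Bob": 31.0}))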

Large Databases

To demonstrate the impact of a sizeable facial database on face recognition, a chart will
display the metrics following each video analysis. As each video is processed,
individuals with identified names are added to a shared database, gradually expanding
it over time. The facial database will persist throughout subsequent video analyses,
while metrics will be collected after each video to illustrate the effect of the growing
database.
The chart will consist of four separate graphs, each showcasing a different
metric. The first graph will display the metrics for Name Extraction, while the second
will present the metrics for Face Recognition. The third graph will showcase the
Duration of Individuals, using the average MAE. Finally, the fourth graph will depict the
time taken for analysis completion. Together, these four graphs will provide an
overview of how increasing the facial database size will affect the performance of the
system.

4.2.2 Results
The results presented in Table 4.1 demonstrate the performance of the Video Analysis
system for each variation tested. The table is divided into two sections: one for the
scene detection methods, and the other for the most recent method using face
tracking. The numbers in parentheses represent the number of seconds skipped
between frames during analysis. For the scene detection methods, two values are
shown: the number of seconds skipped during analysis, and the number of seconds
skipped during scene detection. ’Skips’ does not have any direct significance other
than indicating that the same default parameters were used. ’Def.’ indicates the default
resolution was used, while 640 × 360 pixels was used to speed up frame analysis in
other cases. ’Fir.’, ’Mid.’, and ’Avg.’ refer to the first, middle, and average face encodings,
respectively, as explained in Section 3.3.3. For scene detection, only the first
encountered face encoding was used for each individual. The values of P (Precision), R
(Recall), F1 (F1‐score), and A (Accuracy) were calculated as explained in Sections 4.2.1
and 4.2.1. The MAE was calculated for the duration of each individual, as described in
Section 4.2.1. The Time column shows the number of hours required to analyse all of
the ’News Video Segments’ videos. Note that multiprocessing was used in some
analyses, which reduced the processing time in some cases. The best metric achieved
for each calculation is shown in bold and underlined.
At first glance, it can be seen that using the default resolution for the video
analysis achieved the best results. However, this slight improvement in the metrics
comes at the cost of roughly halving the processing speed, as can be seen when
comparing ’Def. (2, 2)’ with ’Skips (2, 2)’ and ’Def. (2)’ with ’Avg. (2)’. These variations
use exactly the same parameters, except that one uses the default resolution of the
video and the other 640 × 360. Note that the resolutions of all the videos can be
seen in Table 3.1.
The observed discrepancy between the face recognition metrics and the name
extraction metrics is worth noting. It is important to understand that the validation
process was primarily focused on the extracted names of the recognised individuals.
Consequently, if an incorrect name was extracted, it would not only impact the
accuracy of the name extraction metrics but also influence the face recognition
metrics. Therefore, any inaccuracies or errors in the name extraction process would
have a cascading effect on the overall evaluation of the face recognition system.


Table 4.1 The performance of the Video Analysis system on the News Video Segments
dataset is evaluated using the precision (P), recall (R), F1‐score (F1), and accuracy (A)
metrics for name extraction and face recognition. Higher values of these metrics
indicate better system performance. Conversely, for mean absolute error (MAE) and
processing time, lower values are preferable. These results highlight the differences for
each variation, such as using scene detection instead of face tracking, the encoding
selection process as well as the frequency of frames used for analysis.

Variation         Name extraction            Face recognition           MAE      Time
                  P     R     F1    A        P     R     F1    A        (secs)   (hrs)
Scene Detection
Skips (1, 2)      0.78  0.67  0.72  0.56     0.81  0.65  0.72  0.57     10.20    4.71
Skips (2, 1)      0.81  0.67  0.73  0.58     0.84  0.65  0.73  0.57     05.73    4.27
Skips (2, 2)      0.79  0.66  0.72  0.56     0.82  0.65  0.72  0.56     09.49    2.95
Skips (2, 3)      0.81  0.67  0.73  0.58     0.82  0.63  0.71  0.56     08.45    2.42
Skips (3, 2)      0.81  0.62  0.70  0.54     0.82  0.60  0.70  0.53     05.93    1.94
Def. (2, 2)       0.83  0.72  0.77  0.62     0.84  0.69  0.76  0.61     09.87    4.51
Face Tracking
Fir. (2)          0.81  0.66  0.73  0.57     0.83  0.65  0.73  0.57     07.40    2.46
Mid. (0.5)        0.78  0.72  0.75  0.60     0.82  0.68  0.74  0.59     04.22    7.59
Mid. (2)          0.81  0.68  0.74  0.59     0.83  0.63  0.72  0.56     06.89    1.98
Avg. (0.5)        0.78  0.70  0.74  0.58     0.80  0.66  0.72  0.57     04.74    7.20
Avg. (1)          0.80  0.70  0.75  0.60     0.83  0.68  0.75  0.60     05.72    4.48
Avg. (2)          0.81  0.66  0.72  0.57     0.83  0.61  0.70  0.54     07.18    1.76
Def. (1)          0.83  0.72  0.77  0.83     0.86  0.71  0.78  0.63     05.47    5.91
Def. (2)          0.82  0.70  0.75  0.60     0.85  0.67  0.75  0.60     07.03    4.79

Upon further investigation, it was discovered that the name extraction process
encountered several issues. Firstly, names containing punctuation marks were not
properly detected, resulting in their omission. Secondly, there was a specific problem
with the letter ’L’ where the right side of the caption box was mistakenly identified as
this letter, and sometimes ’L’ was incorrectly interpreted as the end of a name. Thirdly,
due to optimisation for faster computation, there was a limit on the frame view,
causing some names to exceed this limit and be partially read. Finally, frames were
skipped during video analysis for increased speed, inadvertently leading to instances
where frames containing names were also skipped. These shortcomings resulted in
reduced accuracy and incomplete name extraction. Addressing these issues would be
crucial to enhance the effectiveness of the extraction process.
Regarding the numbers in parentheses, these represent the interval (in seconds) at
which the video frames are analysed. Using a lower interval means that more frames
need to be analysed, which increases the duration of the analysis. This is clearly visible
for ’Mid. (0.5)’, where an interval of 0.5 seconds took more than 7 hours to complete
instead of the usual 2‐5 hours. Furthermore, in the case of Scene Detection, using a
lower interval for the scene detection sub‐component seems to decrease the MAE and
thus achieve better performance. From the table, the ideal interval for scene detection
appears to be 1 second (one frame per second), the lowest configuration tested,
represented by the variation ’Skips (2, 1)’. This also aligns with the paper by Lisena et
al. [41]. The frame‐analysis interval, on the other hand, does not seem to matter much
for Scene Detection, as long as the system is able to identify a face within a scene. For
Face Tracking, however, a large interval causes the tracker to lose the face due to the
large amount of movement between analysed frames, increasing the MAE of the
individual durations. This can be seen with ’Mid. (0.5)’, which achieved the lowest MAE,
whereas ’Fir. (2)’, ’Mid. (2)’, ’Avg. (2)’, and ’Def. (2)’ use the largest interval and reached
higher, sub‐optimal MAE values. Despite this, a more specific interval value emerges:
’Avg. (1)’ and ’Def. (1)’, with an interval of 1 second, achieved the best results for the
system. Similar to the scene detection interval and Lisena et al. [41], 1 second seems
to be the right value for such a system.
Analysing fewer frames usually improves speed at the cost of accuracy.
Surprisingly, however, the 1‐second interval performs better than the 0.5‐second
interval. This is because the 0.5‐second system is more susceptible to transition
animations, leading to incorrect name detection and higher error rates. Additionally,
analysing more frames with a lower interval sometimes results in misread names,
whereas the 1‐second system simply does not read a name in those frames at all. As a
result, the accuracy of the 0.5‐second system is lower in the confusion matrix
calculations. Appendix I provides a detailed overview of this issue with specific cases.
The ’Skips (2, 1)’ variation appears to be the most effective system for Scene
Detection, disregarding the default resolution. This approach involves analysing frames
at a 2‐second interval and processing scenes at a 1‐second interval. The higher
accuracy of this method can be attributed to more precise scene detection, as
explained previously, leading to improved identification and storage of individuals for
future reference. However, using a shorter interval results in a longer analysis duration,
as indicated in the table.
The comparison between Scene Detection and Face Tracking reveals that Face
Tracking achieves a lower MAE value, implying better results in selecting individual
timestamps. Overall, Face Tracking outperforms Scene Detection, and although further
improvements can be made, as evidenced by the FaceRec [41] system, it appears to be
the more promising approach.
There is no noticeable difference in face recognition performance when
different face encodings are used. It is worth noting that, similarly to the approach
taken by Gao et al. [37] who used middle frames for classification, using the middle
face encoding tends to produce slightly better results. However, relying solely on the
middle face or the first face encoding may not always be reliable if the face is unclear
or not well represented in that particular frame. Therefore, it is preferable to use an
average calculation of all face encodings for improved accuracy.
Based on the results presented in Table 4.1, the optimal configuration for the
Video Analysis system, supported by [37, 41], is to use Face Tracking and analyse
frames at 1‐second intervals. While the default resolutions performed slightly better,
this advantage is not significant enough to justify the longer processing time required
for those higher resolutions. Therefore, the 640 × 360 resolution is recommended
instead. Overall, the system performs relatively well, although there is room for
improvement in each of the sub‐components.

4.2.3 Larger database results


The Video Analysis component had a major problem: its performance decreased
significantly as the number of individuals in the database increased. This occurred
because, during the face recognition stage, the system matched new individuals to
existing people in the database instead of identifying them as new individuals, leading
to a marked drop in overall performance.
It is important to note that the decrease in accuracy could be attributed to
individual video‐based annotations. In this case, if a person appeared in a previous
video, they were treated as a new encounter, potentially leading to misleading
accuracy. For instance, if a person was identified by name in a previous video but not
in a subsequent one, the annotator would classify the new instance as unnamed, while
the system might still recognise the individual from the previous videos. This disparity
between the annotations and the system’s recognition could contribute to the
observed decrease in accuracy.
Figure 4.1 demonstrates how the system’s performance varies as the number of
individuals in the database increases over time. The best‐performing parameters,
shown in Section 4.2.2 and in Section 3.3.1, are used with the exception that a
matching tolerance of 0.5 was used instead of 0.6 to allow more new individuals. The
performance of the 0.6 tolerance can be seen in Appendix G, which performed much
worse. These parameters are used to evaluate performance, with each metric plotted
against the number of individuals in the database.

Figure 4.1 The Incremental Database Results display metrics as the facial database size
increases. For Name Extraction and Face Recognition, higher values are better, ranging
from 0 to 1. Conversely, for the time graph, lower values indicate faster processing,
which is preferable. The Average MAE measures time error in seconds, where lower
values are desirable.

Figure 4.1 highlights that while the name extraction and MAE remain relatively
unchanged, achieving the same results as shown in Section 4.2.2, this is not the case
for the other metrics. Although not significantly, the time taken tends to increase due
to the additional computation needed for face recognition. For the face recognition
sub‐component, all metrics show a downward trend, indicating that with the increasing
size of the database, the system is less likely to identify new people, even in cases
where new individuals are present. To resolve this problem, the distance threshold for
matching could be adjusted to be stricter. However, this could result in a lower
likelihood of identifying already known individuals.
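A small sketch of this adjustment using the “Face Recognition” package’s compare_faces call is shown below; the 0.5 tolerance mirrors the value used for the incremental test, while the wrapper function itself is illustrative.

import face_recognition

def is_new_person(known_encodings, new_encoding, tolerance=0.5):
    """Return True if the encoding matches nobody already in the facial database.

    Lowering the tolerance from the library default of 0.6 to 0.5 makes matching
    stricter: fewer false matches to existing entries, but a higher chance of
    missing a genuine re-appearance of a known individual.
    """
    if not known_encodings:
        return True
    matches = face_recognition.compare_faces(known_encodings, new_encoding, tolerance=tolerance)
    return not any(matches)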
A more robust solution might require investing in better face recognition methods,
such as ArcFace [18], which is capable of extracting more robust and meaningful
features. However, one of the strengths of this tool (Section 3.3) is its ability to
perform fast analysis of news videos. Training a model on a face database for each
video would be time‐consuming, and specific parameters might need to be changed for
each case to achieve optimal results. Therefore, finding a balance between accuracy
and speed would be crucial in improving the performance of the Video Analysis
component.


4.3 Analysis Visualisation


This sub‐component involved mapping the Video Analysis system with a user‐friendly
GUI. While this was a straightforward task, there is room for improvement in terms of
the GUI’s appearance and functionality. For example, implementing better graphics,
customised buttons and inputs, and adhering to more standardised practices could
enhance the overall user experience.
Additionally, providing users with more options to customise the video analysis
parameters would enhance their ability to explore the analysis portion of the system
and adjust the system to meet their specific needs. Displaying more comprehensive
analysis statistics, such as video duration, extraction date, and the number of frames
analysed, would also provide users with valuable information about the video and its
analysis. Furthermore, improving the timeline functionality to be more user‐interactive
would be a useful enhancement, allowing users not only to view the timeline but also
to interact with it by clicking on specific points and navigating the video in a more
intuitive way.
Furthermore, a future improvement for the system could be to extend its
capabilities to include other applications such as emotion detection, object recognition,
and transcript extractions. The current system provides a useful pipeline for analysing
news videos, which can serve as a foundation for further advancements in these areas.

5 Conclusion
This study aimed to create a system in which individuals could be identified in
news videos using computer vision. This was achieved by first extracting a number of
news videos from the TVM website. The system was then built using face detection,
face encoding extraction, face recognition, and OCR. Finally, a GUI was created to
facilitate interaction between the user and the news video analysis system.
The best variation of the system achieved an 83% accuracy for name extraction
and a 63% accuracy for face recognition. It had an average error of 5.47 seconds in
identifying individuals’ timestamps. The system was designed to run in real‐time and
featured a user‐friendly GUI. However, there are ongoing challenges in improving
accuracy, especially when dealing with a large facial database, as accuracy tends to
decrease with a larger number of faces.
The developed tool has the potential to facilitate efficient analysis of news
videos, enabling quick extraction of relevant information. This can contribute to raising
awareness of biases present in news videos, both intended and unintended, benefiting
users, news sources, and the general public. Furthermore, the tool can assist journalists
in quickly retrieving pertinent information and provide the general public with an
unbiased summary of news, streamlining the process of staying up‐to‐date with
current events.
However, there are ethical concerns that need to be taken into account,
especially regarding the extraction of individual facial information. It is crucial to
handle this data responsibly and prioritise the protection of privacy, particularly when
it comes to non‐public figures. The potential misuse of such information highlights the
need for cautious and ethical practices in its utilisation.

5.1 Future Work


Moving forward, there are several areas for potential improvement and expansion of
the developed video analysis system. This section will explore and elaborate on some
of these possibilities.

• Enhancing the Video Analysis component by employing advanced algorithms
such as MTCNN [14] for face detection, as well as newer techniques like
RetinaFace [28] or BlazeFace [16]. Additionally, investigating the use of ArcFace
[18] for face recognition could further enhance accuracy while maintaining
efficiency. Addressing these limitations would make the system more robust and
effective in identifying individuals in news videos.


• Expanding the system to work with a wider range of news sources by developing
a more robust video parsing algorithm capable of handling different video
formats and styles.

• Adding additional components, such as emotion detection and object detection,
to provide a more comprehensive analysis of news videos. Emotion detection
can offer insights into the sentiment and tone of news reports, while object
detection can help identify important objects and events in videos.

• Enhancing the GUI by incorporating additional statistics on the video analysis
and video itself, as well as allowing for parameter adjustments tailored to the
user’s needs. An interactive timeline could also improve user understanding and
interpretation of the results.

These improvements would enhance the accuracy, efficiency, and applicability
of the system, providing a more comprehensive and insightful analysis of news videos.

References
[1] A. Edmunds and A. Morris, “The problem of information overload in business
organisations: A review of the literature,” International journal of information
management, vol. 20, no. 1, pp. 17–28, 2000.
[2] A. A. Naz and R. A. Akbar, “Use of media for effective instruction its importance:
Some consideration,” Journal of elementary education, vol. 18, no. 1‐2, pp. 35–40,
2008.
[3] D. Bernhardt, S. Krasa, and M. Polborn, “Political polarization and the electoral
effects of media bias,” Journal of Public Economics, vol. 92, no. 5‐6,
pp. 1092–1104, 2008.
[4] J.‐M. Eberl, H. G. Boomgaarden, and M. Wagner, “One bias fits all? three types of
media bias and their effects on party preferences,” Communication Research,
vol. 44, no. 8, pp. 1125–1148, 2017.
[5] S. DellaVigna and E. Kaplan, “The fox news effect: Media bias and voting,” The
Quarterly Journal of Economics, vol. 122, no. 3, pp. 1187–1234, 2007.
[6] D. N. Hopmann, R. Vliegenthart, C. De Vreese, and E. Albæk, “Effects of election
news coverage: How visibility and tone influence party choice,” Political
communication, vol. 27, no. 4, pp. 389–405, 2010.
[7] K. Choroś, “Video structure analysis and content‐based indexing in the
automatic video indexer avi,” in Advances in Multimedia and Network Information
System Technologies, Springer, 2010, pp. 79–90.
[8] S. Lee and K. Jo, “Strategy for automatic person indexing and retrieval system in
news interview video sequences,” in 2017 10th International Conference on
Human System Interactions (HSI), IEEE, 2017, pp. 212–215.
[9] H. Zhang, Y. Gong, S. Y. Tan, et al., “Automatic parsing of news video,” in 1994
Proceedings of IEEE International Conference on Multimedia Computing and
Systems, IEEE, 1994, pp. 45–54.
[10] L. Lu, H.‐J. Zhang, and H. Jiang, “Content analysis for audio classification and
segmentation,” IEEE Transactions on speech and audio processing, vol. 10, no. 7,
pp. 504–516, 2002.
[11] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A survey of deep neural
network architectures and their applications,” Neurocomputing, vol. 234,
pp. 11–26, 2017.


[12] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, “A survey of convolutional neural
networks: Analysis, applications, and prospects,” IEEE transactions on neural
networks and learning systems, 2021.
[13] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in
2005 IEEE computer society conference on computer vision and pattern recognition
(CVPR’05), IEEE, vol. 1, 2005, pp. 886–893.
[14] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using
multitask cascaded convolutional networks,” IEEE signal processing letters, vol. 23,
no. 10, pp. 1499–1503, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2016, pp. 770–778.
[16] V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran, and M. Grundmann,
“Blazeface: Sub‐millisecond neural face detection on mobile gpus,” arXiv preprint
arXiv:1907.05047, 2019.
[17] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual
transformations for deep neural networks,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2017, pp. 1492–1500.
[18] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss
for deep face recognition,” in Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 2019, pp. 4690–4699.
[19] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for
face recognition and clustering,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2015, pp. 815–823.
[20] S. R. Boyapally and K. Supreethi, “Facial recognition and attendance system using
dlib and face recognition libraries,” 2021 International Research Journal of
Modernization in Engineering Technology and Science, pp. 409–417, 2021.
[21] P. F. De Carrera and I. Marques, “Face recognition algorithms,” Master’s thesis in
Computer Science, Universidad Euskal Herriko, vol. 1, 2010.
[22] H. Wang, Y. Wang, and Y. Cao, “Video‐based face recognition: A survey,”
International Journal of Computer and Information Engineering, vol. 3, no. 12,
pp. 2809–2818, 2009.
[23] M. Everingham and A. Zisserman, “Automated person identification in video,” in
International Conference on Image and Video Retrieval, Springer, 2004,
pp. 289–298.


[24] A. Geitgey, Machine learning is fun! part 4: Modern face recognition with deep
learning, Sep. 2020. [Online]. Available:
https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-
face-recognition-with-deep-learning-c3cffc121d78.
[25] T. Shan, B. C. Lovell, and S. Chen, “Face recognition robust to head pose from
one sample image,” in 18th International Conference on Pattern Recognition
(ICPR’06), IEEE, vol. 1, 2006, pp. 515–518.
[26] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple
features,” in Proceedings of the 2001 IEEE computer society conference on computer
vision and pattern recognition. CVPR 2001, IEEE, vol. 1, 2001, pp. I–I.
[27] C. Zhang and Z. Zhang, “A survey of recent advances in face detection,” 2010.
[28] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “Retinaface: Single‐shot
multi‐level face localisation in the wild,” in Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 2020, pp. 5203–5212.
[29] H. Zhang, A. Kankanhalli, and S. W. Smoliar, “Automatic partitioning of
full‐motion video,” Multimedia systems, vol. 1, no. 1, pp. 10–28, 1993.
[30] X. Zhao, “3d face analysis: Landmarking, expression recognition and beyond,”
Ph.D. dissertation, Ecully, Ecole centrale de Lyon, 2010.
[31] V. Bruce and A. Young, “Understanding face recognition,” British journal of
psychology, vol. 77, no. 3, pp. 305–327, 1986.
[32] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms‐celeb‐1m: A dataset and
benchmark for large‐scale face recognition,” in Computer Vision–ECCV 2016:
14th European Conference, Amsterdam, The Netherlands, October 11‐14, 2016,
Proceedings, Part III 14, Springer, 2016, pp. 87–102.
[33] G. B. Huang, M. Ramesh, T. Berg, and E. Learned‐Miller, “Labeled faces in the
wild: A database for studying face recognition in unconstrained environments,”
University of Massachusetts, Amherst, Tech. Rep. 07‐49, Oct. 2007. [Online].
Available: http://vis-www.cs.umass.edu/lfw/results.html.
[34] D. King, Dlib‐ml: A machine learning toolkit, 2010. [Online]. Available:
http://dlib.net/.
[35] D. Zhang, J. Li, and Z. Shan, “Implementation of dlib deep learning face
recognition technology,” in 2020 International Conference on Robots & Intelligent
System (ICRIS), IEEE, 2020, pp. 88–91.
[36] Ageitgey, Ageitgey/face_recognition: The world’s simplest facial recognition api for
python and the command line. [Online]. Available:
https://github.com/ageitgey/face_recognition.


[37] X. Gao and X. Tang, “Unsupervised video‐shot segmentation and model‐free
anchorperson detection for news video story parsing,” IEEE Transactions on
circuits and systems for video technology, vol. 12, no. 9, pp. 765–776, 2002.
[38] W. Qi, L. Gu, H. Jiang, X.‐R. Chen, and H.‐J. Zhang, “Integrating visual, audio and
text analysis for news video,” in Proceedings 2000 International Conference on
Image Processing (Cat. No. 00CH37101), IEEE, vol. 3, 2000, pp. 520–523.
[39] A. Nagasaka and Y. Tanaka, “Automatic video indexing and full‐video search for
object appearances,” Journal of Information Processing, vol. 15, no. 2, p. 316, 1992.
[40] B. Shahraray and D. C. Gibbon, “Automated authoring of hypermedia documents
of video programs,” in Proceedings of the third ACM international conference on
Multimedia, 1995, pp. 401–409.
[41] P. Lisena, J. Laaksonen, and R. Troncy, “Facerec: An interactive framework for
face recognition in video archives,” in DataTV 2021, 2nd International Workshop
on Data‐driven Personalisation of Television, 2021.
[42] D. Comaniciu, V. Ramesh, and P. Meer, “Real‐time tracking of non‐rigid objects
using mean shift,” in Proceedings IEEE Conference on Computer Vision and Pattern
Recognition. CVPR 2000 (Cat. No. PR00662), IEEE, vol. 2, 2000, pp. 142–149.
[43] M. Isard and A. Blake, “Condensation–conditional density propagation for visual
tracking,” International journal of computer vision, vol. 29, no. 1, p. 5, 1998.
[44] R. E. Kalman, “A new approach to linear filtering and prediction problems,” 1960.
[45] N. Islam, Z. Islam, and N. Noor, “A survey on optical character recognition
system,” arXiv preprint arXiv:1710.05703, 2017.
[46] R. Mithe, S. Indalkar, and N. Divekar, “Optical character recognition,” International
journal of recent technology and engineering (IJRTE), vol. 2, no. 1, pp. 72–75, 2013.
[47] R. Chatterjee and A. Mondal, “Effects of different filters on text extractions from
videos using tesseract,”
[48] T. Sato, T. Kanade, E. K. Hughes, and M. A. Smith, “Video ocr for digital news
archive,” in Proceedings 1998 IEEE International Workshop on Content‐Based
Access of Image and Video Database, IEEE, 1998, pp. 52–60.
[49] J. Yang and A. G. Hauptmann, “Naming every individual in news video
monologues,” in Proceedings of the 12th annual ACM international conference on
Multimedia, 2004, pp. 580–587.
[50] Tesseract‐Ocr, Tesseract‐ocr/tesseract: Tesseract open source ocr engine (main
repository). [Online]. Available: https://github.com/tesseract-ocr/tesseract.


[51] R. Smith, “An overview of the tesseract ocr engine,” in Ninth international
conference on document analysis and recognition (ICDAR 2007), IEEE, vol. 2, 2007,
pp. 629–633.
[52] C. Patel, A. Patel, and D. Patel, “Optical character recognition by open source ocr
tool tesseract: A case study,” International Journal of Computer Applications,
vol. 55, no. 10, pp. 50–56, 2012.
[53] W. Wattanarachothai and K. Patanukhom, “Key frame extraction for text based
video retrieval using maximally stable extremal regions,” in 2015 1st International
Conference on Industrial Networks and Intelligent Systems (INISCom), IEEE, 2015,
pp. 29–37.
[54] S. Kamate and N. Yilmazer, “Application of object detection and tracking
techniques for unmanned aerial vehicles,” Procedia Computer Science, vol. 61,
pp. 436–441, 2015.
[55] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” Acm computing
surveys (CSUR), vol. 38, no. 4, 13–es, 2006.
[56] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object tracking
using adaptive correlation filters,” in 2010 IEEE computer society conference on
computer vision and pattern recognition, IEEE, 2010, pp. 2544–2550.
[57] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr, “End‐to‐end
representation learning for correlation filter based tracking,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2017, pp. 2805–2813.
[58] A. He, C. Luo, X. Tian, and W. Zeng, “A twofold siamese network for real‐time
object tracking,” in Proceedings of the IEEE conference on computer vision and
pattern recognition, 2018, pp. 4834–4843.
[59] R. Agius, Tvm news bulletin is the most widely‐followed programme in malta, Dec.
2021. [Online]. Available: https://tvmnews.mt/en/news/tvm-news-bulletin-
is-the-most-widely-followed-programme-in-malta/.
[60] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High‐speed tracking with
kernelized correlation filters,” IEEE transactions on pattern analysis and machine
intelligence, vol. 37, no. 3, pp. 583–596, 2014.

Appendix A Level 0 DFD

Figure A.1 The level 0 Data Flow Diagram (DFD) illustrates the user’s interactions with
the GUI system. Users can input a video path for video analysis and a facial database
path for facial matching. They can also specify an analysis path to save the analysis
results. If no analysis path is provided, the program will automatically generate one.
Additionally, if only the analysis path is entered without the video path, the program
will load an existing analysis for further evaluation or export a facial database.

Appendix B Video Frame/Scene Categories

Figure B.1 A person speaking with their name displayed in a caption.


Figure B.2 A presenter speaking with their name displayed in a caption. This is a
variation to category 1.

Figure B.3 A person speaking without their name shown.


Figure B.4 A presenter inside the studio.

Figure B.5 A presenter outside the studio.


Figure B.6 Scene footage.

Appendix C Video Analysis Class Diagram

Figure C.1 The Video Analysis Class Diagram shows the classes used in Python along
with all the interactions.

Appendix D Analysis Timeline

Appendix E Timeline of Analysis vs Validation

Appendix F Visualisation GUI

Figure F.1 The main GUI showing analysis for video segment ’09.08.2022 segment
#0.mp4’. The timeline shows that 2 people appear along with their respective
timestamps and durations.


Figure F.2 The export window GUI shows the individuals’ names, timestamps, duration,
and images, along with the option to select and export individuals to an external facial
database.

Appendix G Incremental Database

Figure G.1 The graph shows the analysis with an incremental database. The analysis
uses a threshold of 0.6 for face recognition. The graph shows how worse this performs
as more individuals are added to the face recognition database.

Appendix H File Structure
The system was designed with the purpose of creating two different file structures to
improve the convenience of saving, loading, and debugging.
The first file structure, called the “analysis data,” contains all the relevant
information about the analysed video. This includes the parameters used, the
generated analysis results, precomputed calculations like the timeline, and other
related data. The analysis data is stored in both Python pickle format (’.pkl’ and ’.dat’)
and a human‐readable format using JSON.
The second file structure, known as the “facial database,” stores comprehensive
information about individuals. It includes their names, facial encodings, and profile
images. Similar to the analysis data, the facial database is stored in both Python pickle
format and JSON.
Additionally, the facial database maintains a JSON log that keeps track of any
errors encountered when adding individuals to the database.
In both the analysis data and facial database cases, the system also stores
images of the respective individuals. In the analysis data, these images represent the
individuals who appeared in the analysed video. In the facial database, the images are
associated with individuals stored in the database.
The general file structure can be seen as follows:
root
analysis
<video_name>
Images
analysis_data.json
analysis_data.pkl
database
Images
face_database.dat
face_database.json
log.json

Figure H.1 This figure shows an example of the generated files for video analysis.


Figure H.2 This image displays the extracted images obtained from video analysis. It is
worth noting that when individuals are named, these images correspond directly to the
frame in which the person’s name was extracted. It is important to mention that this
example aligns with the content presented in Figure 3.7 and Figure 3.8, where the
same identified individuals can be observed in these images.

Figure H.3 This figure shows an example of the generated files for the video analysis
facial database.


Figure H.4 This image displays images that have been extracted through video analysis
of individuals and subsequently stored in the database. These images represent all the
individuals currently stored in the database. It is worth mentioning that these
individuals correspond to the results shown in Figure 4.1.

Appendix I Comparison of frame intervals of 0.5 and 1 seconds
I.1 Case 1
The 0.5‐second skip system found an incorrect name, while the 1‐second skip system
did not find a name at all. This meant an accuracy of 33% for the 0.5‐second skips and
50% for the 1‐second skips. The inaccurate name extraction happened because the
OCR has trouble with punctuation such as apostrophes.

Figure I.1 The 0.5‐second skip system achieved seemingly better results than the
1‐second system. Despite this, the extracted name was incorrect, resulting in 33%
accuracy: one name was matched, one name was extra, and one name was missing.


Figure I.2 The 1‐second skip system has a missing name instead of an incorrect one,
giving 50% accuracy: one name matched and another was missing. This yields a
nominally better accuracy score, even though the output is arguably slightly worse.


I.2 Case 2
As in Case 1, the 0.5‐second skip system found an incorrect name, while the 1‐second
skip system did not find a name at all. In this case, the inaccuracy was caused by a
transition animation on the text caption, as shown in Figure I.5.

Figure I.3 Wrong name found (0.5‐second skips)

Figure I.4 No names found (1 second skips)


Figure I.5 Problematic frame for 0.5‐second skip system, which is extracting the
incorrect name (reading ”LDA SE TINHATAR PRIM MINISTRU”). Note that this frame is
skipped in the 1‐second skip system, thus resulting in no found name.


I.3 Case 3
This case had the same face recognition accuracy for both systems; however, because
a transition animation on the text caption was captured, name extraction received a
lower score for the 0.5‐second skips.

Figure I.6 Correct face recognition (0.5‐second skip system)

Figure I.7 Correct face recognition (1‐second skip system)


Figure I.8 Incorrect Name Extraction for 0.5‐second skip system. In this case, the face
recognition successfully matched the individual, although name extraction was still
inaccurate (reading ”YDE CARUANA”).

Appendix J Survey
This survey explored the importance, benefits, and concerns of analysing news videos
automatically. It was distributed as a Google Form titled “Automatic Analysis of News
Video Interviews” and received 17 responses. The results below are reproduced from
the form’s analytics export; where a question was reported only as a pie chart, the
percentage shares are listed without assigning them to specific options, as the mapping
is not legible in the export.

News

How frequently do you watch the news? (17 responses)
Options: Multiple times a day, Once a day, A few times a week, Once a week, Less than
once a week, Never. Visible shares: 35.3%, 29.4%, 17.6%, 11.8%.

On which platform do you watch the news? (16 responses)
• Television: 4 (25%)
• Online news websites: 9 (56.3%)
• Social media (e.g., Facebook, Twitter, Instagram): 12 (75%)

What type of news content do you prefer? (Select all that apply.) (17 responses)
• Text‐based news articles: 9 (52.9%)
• Video‐based news articles: 3 (17.6%)
• Short video clips (e.g., news highlights, summari…): 6 (35.3%)
• Image‐based news content (e.g., in a Facebook or In…): 5 (29.4%)
• No preference: 2 (11.8%)

Automatic Analysis of News Videos

These questions are all related to Automatic Analysis of News Videos. Please select
whether using such a technology would help in the following. (Response options: Yes,
No, Not sure. The per‐item results chart is not legible in this export; the row labels
are truncated.)

Would you be interested in using a tool that provides automatic analysis of news
videos? (17 responses)
Options: Yes, No, Not sure. Visible shares: 82.4%, 17.6%.

Which of the following automatic analysis features would be most helpful to you when
watching news video? (Select all that apply.) (17 responses)
• Automatic transcription (automatically showing te…): 13 (76.5%)
• Sentiment analysis (detecting emotions expr…): 2 (11.8%)
• Named entity recognition (identifying people, place…): 11 (64.7%)
• Showing the main topics discussed in the video: 14 (82.4%)
• Giving a brief summary of the video: 16 (94.1%)

In what ways would you use automatic analysis of news videos? (Select all that apply.)
(17 responses)
• Quickly understand what the interview is about: 15 (88.2%)
• Easily find and share interesting parts of the int…: 7 (41.2%)
• Fact‐check or verify information mentioned in…: 10 (58.8%)
• Better understand the context of the interview: 8 (47.1%)
• See how people feel about a particular topic: 6 (35.3%)

Would you pay for a service that provides automatic analysis of news videos?
(17 responses)
Options: Yes, No, Unsure. Visible shares: 76.5%, 11.8%, 11.8%.

Which approach do you believe is best for analyzing and critiquing news (17 responses)
Options: Human journalists, AI automation, Hybrid (AI and reviewed by Humans).
Visible share: 94.1%.

Are there any concerns you have about the use of automatic analysis of news videos?
If so, please specify. (7 responses)
• no
• will AI be verifying the sources of information? as there are a lot of sites and news
sources that post fake news.
• May not always be accurate and perhaps eventually human journalists would loses
their jobs
• I am wary of the risks done by AI when trained on biased data. This can lead to AI
mirroring the same implicit biases as people
• Reliability of information sharing
• The main problem with News today is the bias, and AI alone cannot help with that
• It can be biased as well

Demographic

To which gender do you relate? (17 responses)
Options: Male, Female, Other, Prefer not to say. Visible shares: 58.8%, 41.2%.

What is your age? (17 responses)
Options: 18‐24, 25‐34, 35‐44, 45‐54, 55‐64, 65 and over, Prefer not to say. Visible
shares: 70.6%, 23.5%.

Do you live in Malta? (17 responses)
Options: Yes, No, Prefer not to say. A single option accounted for 100% of responses.

What is your current employment status? (17 responses)
Options: Full‐time employed, Part‐time employed, Self‐employed, Student, Unemployed,
Retired, Prefer not to say. Visible shares: 70.6%, 17.6%.
Appendix K Code
The code is available at the following GitHub repository:
https://github.com/OpNoob/Automatic-Analysis-of-News-Videos. Setup
instructions are provided within the repository. Please note that the execution code
and the results generation code have been separated into different branches.
