Jonathan Attard
May 2023
Acknowledgements
I express profound gratitude to my family for their unwavering support and
encouragement throughout my academic journey, particularly my parents. I am
genuinely grateful to Jeanine Attard for her excellent work in designing the program’s
logo.
My close friends also deserve my thanks for keeping me motivated during this
research; in particular, Jean Sacco provided invaluable perspectives and advice
on the challenges encountered.
My supervisor, Dr Dylan Seychell, deserves special thanks for his steadfast
commitment to my academic success, shaping the direction and scope of my research
with his feedback, insights, and expertise.
Additionally, I appreciate the assistance of the students who aided me with the
data gathering.
Contents
Abstract i
Acknowledgements ii
Contents v
List of Abbreviations ix
Glossary of Symbols 1
1 Introduction 1
1.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3 Methodology 13
3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 News Video Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 News Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.2 News Video Segments: Reducing video duration . . . . . . . . . . 14
3.2.3 Transcript Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2.4 Video Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Video Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.2 Name Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.3 Face Detection, Encoding, and Recognition . . . . . . . . . . . . . . 20
3.3.4 Face Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.5 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 Analysis Visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 Evaluation 25
4.1 News Video Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1.1 Reduced dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Video Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.3 Larger database results . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Analysis Visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5 Conclusion 34
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
A Level 0 DFD 41
D Analysis Timeline 47
F Visualisation GUI 49
G Incremental Database 51
H File Structure 52
I Comparison of frame intervals of 0.5 and 1 seconds 55
I.1 Case 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
I.2 Case 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
I.3 Case 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
J Survey 61
K Code 67
List of Figures
Figure I.2 Case 1: 1‐second frame intervals . . . . . . . . . . . . . . . . . . . . . . 56
Figure I.3 Case 2: 0.5‐second frame intervals . . . . . . . . . . . . . . . . . . . . . 57
Figure I.4 Case 2: 1‐second frame intervals . . . . . . . . . . . . . . . . . . . . . . 57
Figure I.5 Case 2: Problematic frames with text transition animation . . . . . . . 58
Figure I.6 Case 3: 0.5‐second frame intervals . . . . . . . . . . . . . . . . . . . . . 59
Figure I.7 Case 3: 1‐second frame intervals . . . . . . . . . . . . . . . . . . . . . . 59
Figure I.8 Case 3: Problematic frames with text transition animation . . . . . . . 60
List of Tables
List of Abbreviations
AI Artificial Intelligence.
1 Introduction
1.1 Problem definition
In today’s world, the abundance of easily accessible information has led to the problem
of information overload [1]. This overload often results in non‐optimal
decision‐making, particularly in the realm of media, where staying well‐informed [2]
while maintaining a balanced perspective on the content we consume is crucial [3].
Numerous studies [3–6] have examined the impact of media bias. As highlighted
by Bernhardt et al. [3], “Even if citizens are completely rational and take media bias into
account, they cannot recover all of the missing information.” Additionally, research has
shown that media bias influences political voting and policy outcomes. For instance,
DellaVigna et al. [5] found a significant positive correlation between exposure to Fox
News and the Republican vote share in the 2000 Presidential elections compared to
1996, suggesting that media bias can have a substantial political effect on general
beliefs and voter turnout. Similarly, Eberl et al. [4] identified media bias as influencing
voter behaviour based on data from the 2013 Austrian parliamentary election
campaign. Additionally, Hopmann et al. [6] observed that both the tone towards a
political party and the visibility of candidates impact voting outcomes.
Given the rise of video content in the media, it is imperative to analyse and
evaluate the information we consume on a daily basis.
1.2 Motivation
As discussed in the previous section, the widespread availability of information poses
challenges such as information overload and media bias. Fortunately, the rapid
advancements in Artificial Intelligence (AI) technology offer opportunities to develop
tools that aid users in identifying key information and enhancing their awareness of
consumed content, especially in the context of news media. Leveraging AI algorithms,
these tools can analyse and evaluate video content, including news videos, to provide
valuable insights and identify potential biases.
Although several tools have been developed to address specific challenges
[7–10], few focus on analysing and identifying individuals in news videos. Therefore,
an ideal tool would facilitate the direct visibility analysis of individuals in news videos,
considering the growing use of video content in the media.
2. Create a program that automatically extracts names from the news video frames
containing name captions.
3. Design and implement a system that can automatically recognise and identify
individuals shown in news videos, retrieve their names from the captions, and
store newly encountered individuals for future use.
4. Implement a machine learning and computer vision method that identifies and
tracks individuals to create a timeline of appearances as a camera‐time report.
2 Background and Literature Review
This chapter provides an essential foundation for understanding the proposed system
by introducing the relevant background concepts and exploring the technologies
utilised. It aims to enhance comprehension and clarity for readers, ensuring a solid
understanding of the techniques employed.
The chapter begins with an overview of the background concepts, providing the
necessary context for the subsequent discussions. It then explores the employed
technologies, offering a concise background, examining relevant papers and research,
discussing strengths, weaknesses, and challenges, and highlighting state‐of‐the‐art
advancements. Furthermore, similar systems will be discussed to gain insights into
existing solutions.
2.1 Background
The concepts of neural networks and Convolutional Neural Networks (CNNs) will be
used throughout the literature review and the proposed system. This section
introduces these concepts so that the technologies discussed later (face
recognition, face tracking, scene detection, and OCR) can be better understood.
Deep learning is a subfield of machine learning that utilises neural networks to
learn and make predictions. Neural networks, computational models inspired by the
human brain, serve as powerful tools for approximating complex functions. They
consist of interconnected nodes known as neurons, working together to learn and
make predictions by adjusting the weights and biases of their connections. Deep
learning has found success in diverse areas like computer vision, image analysis,
information retrieval, Natural Language Processing (NLP), and speech recognition [11].
Common types of neural networks used in deep learning include feed‐forward neural
networks and CNNs.
CNNs are a type of deep learning model specifically designed for analysing
visual data, such as images or videos [12]. CNNs consist of multiple layers, which
include convolutional layers, pooling layers, and fully connected layers. Convolutional
layers apply filters to extract patterns, pooling layers reduce spatial dimensions, and
fully connected layers enable the network to learn complex relationships and make
predictions [12]. These networks are capable of automatically learning feature
representations from raw data, without the need for explicit feature engineering [12].
By leveraging large amounts of labelled data and powerful computational resources,
CNNs can learn complex patterns and relationships, making them highly effective in
tasks such as image recognition, object detection, and NLP [13–20].
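The convolution and pooling operations described above can be sketched in a few lines of NumPy. This is a minimal illustration of the two layer types, not a full CNN: a small filter slides over the input to produce a feature map, which max pooling then downsamples.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most deep learning libraries)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling, halving each spatial dimension."""
    h, w = feature_map.shape
    return feature_map[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1., 0., -1.]] * 3)  # crude vertical-edge filter
features = conv2d(image, edge_kernel)        # feature map of shape (4, 4)
pooled = max_pool(features)                  # downsampled to shape (2, 2)
```

A real CNN stacks many such layers with learned (rather than hand-picked) kernels, followed by fully connected layers for the final prediction.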
In a recent study by Zhang et al. [35], the Dlib toolkit was utilised to create a
face identification system. The system first detects the face using the HOG algorithm,
aligns it by estimating facial landmarks, and then extracts face encodings using the
CNN model. Instead of using an SVM for classification, the Euclidean distance
between the extracted features is calculated, and if the distance is below a threshold,
the system identifies the closest matching feature. However, if the distance exceeds
the threshold, the system cannot identify the person and may classify them as an
unknown individual. This same method can also be seen in the face_recognition library
[36]. One of the significant advantages of the Dlib toolkit is that it is freely available
and can be easily installed with Python, making it accessible for face recognition tasks.
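The distance-threshold matching used by Dlib and the face_recognition library reduces to a nearest-neighbour search with a cut-off, as the following sketch shows (random vectors stand in for real 128-dimensional face encodings):

```python
import numpy as np

def identify(probe, known_encodings, known_names, threshold=0.6):
    """Return the name of the closest known face, or None when no stored
    encoding lies within the Euclidean-distance threshold."""
    distances = np.linalg.norm(known_encodings - probe, axis=1)
    best = int(np.argmin(distances))
    if distances[best] <= threshold:
        return known_names[best]
    return None  # classified as an unknown individual

rng = np.random.default_rng(0)
db = rng.normal(size=(3, 128)) * 0.1          # three stored encodings
names = ["alice", "bob", "carol"]
probe = db[1] + rng.normal(size=128) * 0.01   # slightly perturbed copy of "bob"
print(identify(probe, db, names))             # → bob
print(identify(np.full(128, 5.0), db, names)) # far from everything → None
```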
2.5.1 FaceRec
In their paper, Lisena et al. [41] proposed a system for recognising faces in videos using
a combination of MTCNN [14], FaceNet [19], and SVM classifiers, as shown in Figure 2.1. To
train the system, images were obtained using crawlers and processed through MTCNN
and FaceNet to extract embeddings of the faces. These embeddings were then passed
into the SVM classifier to retrieve the most likely match of known faces, including a
confidence score. For video analysis, each frame was individually analysed, and if a face
was detected by MTCNN, it was cropped and aligned, and its embeddings were fed into
the SVM classifier to identify the person. To speed up computation, they analysed
only one frame per second of video rather than every frame. The Simple Online and
Realtime Tracking (SORT) algorithm was
used to track each individual face and determine whether the same person appeared in
subsequent images. To label a face in a frame sequence, a simple algorithm calculated
the weighted average of the confidence scores for each possible label and selected the
label with the highest weighted average confidence score as the final prediction.
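One reading of that labelling step is sketched below. The per-frame observations and the choice of weights are hypothetical, since the paper does not fix them here; each observation pairs a candidate label with a confidence and a weight, and the label with the highest weighted-average confidence wins.

```python
from collections import defaultdict

def label_scene(observations):
    """observations: (label, confidence, weight) tuples, one per frame.
    Returns the label whose weighted-average confidence is highest."""
    totals = defaultdict(lambda: [0.0, 0.0])  # label -> [sum(w*c), sum(w)]
    for label, conf, weight in observations:
        totals[label][0] += weight * conf
        totals[label][1] += weight
    return max(totals, key=lambda lbl: totals[lbl][0] / totals[lbl][1])

frames = [("alice", 0.90, 1.0), ("alice", 0.92, 1.0), ("bob", 0.50, 1.0)]
print(label_scene(frames))  # → alice
```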
To identify new unknown faces, all the FaceNet encodings were kept and input
into a new model for future matching. Hierarchical clustering was used to group similar
encodings based on a distance threshold, and the clusters were filtered to exclude
clusters with a side face or ones that can already be assigned a label. Clusters with a
duration longer than a second had a stricter distance threshold to limit the number of
encodings.
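The grouping of leftover encodings can be approximated as follows. This is a simplified greedy single-link stand-in for the hierarchical clustering used in the paper, and it omits the side-face and duration filters:

```python
import numpy as np

def cluster_encodings(encodings, threshold=0.5):
    """Greedy single-link grouping: each encoding joins the first cluster
    whose representative lies within the distance threshold, otherwise
    it starts a new cluster."""
    clusters = []  # list of (representative, member-index list) pairs
    for idx, enc in enumerate(encodings):
        for rep, members in clusters:
            if np.linalg.norm(enc - rep) <= threshold:
                members.append(idx)
                break
        else:
            clusters.append((enc, [idx]))
    return [members for _, members in clusters]

rng = np.random.default_rng(1)
a, b = rng.normal(size=128), rng.normal(size=128)
encs = [a, a + 0.01, b, a - 0.01, b + 0.02]
print(cluster_encodings(encs, threshold=0.5))  # → [[0, 1, 3], [2, 4]]
```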
To evaluate their system, they used two news video datasets, ANTRACT and
MeMAD, from which a ground truth was created manually by
selecting segments containing the faces of the most present celebrities. What is
important to note is that the resolutions for both datasets were quite low, with the
ANTRACT dataset containing shots in black‐and‐white with a resolution of 512x384
pixels, while the MeMAD dataset had news videos in colour with a resolution of
455x256 pixels.
The ground‐truth datasets were created by following a process where domain
experts first provided a list of historically well‐known people. The next step was to
search for segments in the videos where these people appeared and divide them into
shots. Then, face recognition was performed on the central frame of each shot,
resulting in a large number of shots. To ensure accuracy, the presence of the person in
each selected segment was manually checked. Furthermore, some shots not involving
any of the specified people were also added to the dataset. This iterative process
continued until a final set of shots was obtained, which included people from the list
provided by domain experts, as well as some additional shots. By using this method,
the ground‐truth datasets were carefully curated to include shots where historically
well‐known people were present, allowing for more accurate and reliable analysis. It
should be noted that since the MeMAD dataset was made up of videos instead of
shots, face recognition was instead performed at quarter intervals of each segment.
The system’s precision, recall, and F‐score were calculated based on both ground‐truth
datasets, which were used to compare the appearances of each person.
Overall, the system’s performance was good as indicated by the results, with
better performance achieved on the MeMAD dataset compared to the ANTRACT
dataset. Across both datasets, the F‐scores varied for each individual, ranging from
0.37 to 0.96. Despite this, issues arose with short scenes, which could be
addressed by using scene boundaries. Suggestions for improving the system included
incorporating contextual information such as the date of the video and other people
who appeared [41]. Furthermore, the system’s recall could also be improved, especially
for side faces, which require a proper strategy for handling.
Additionally, the number of individuals covered by a face recognition system and
its datasets is an important factor affecting performance. In this respect, the
study was limited: the system searched for only 19
celebrities, and there were only 82 fragments with unknown faces, which may not
provide sufficient diversity for testing the system’s ability to recognise the different
faces. Therefore, while the study’s findings are informative, it is important to consider
the limitations of the system and datasets used when interpreting the results.
2.5.2 Survey
Another approach for identifying faces within a video was discussed by Wang et al.
[22]. Similar to the previous paper [41], it was highlighted that the different
components of face recognition in a video include face detection, face tracking and
face recognition. Wang et al. discussed different algorithms and techniques for each
component within the system and the strength and problems that come with each
method. The survey identified two main issues: the lack of standardised video
databases, since these require substantial storage, and the scarcity of methods
that operate on sequences of images or frames within a video rather than on
still images alone. Regarding this last point, utilising
a sequence of images or frames could help to extract more information such as a 3D
face model or a better normalised and standardised 2D face model. In the survey, they
also provided some databases for video analysis which include the faces of different
subjects, as well as discussing specifically the exploits of CAMSHIFT [42],
condensation [43], and adaptive Kalman filter algorithms [44] for facial tracking.
Figure 2.2 Architecture of Tesseract OCR. Source: [52]
3 Methodology
The methodology chapter provides a detailed explanation of the design and
implementation process used to create the proposed solution, linking the research
discussed in the literature review to the chosen solution. It offers a critical analysis of
the system’s operation, detailing any unforeseen problems encountered during
implementation and discussing how they were addressed. The chapter begins with a
high‐level overview of the system design and specifications, which is then followed by
a detailed description of each component, providing a link to the big picture presented
earlier. Any design choices made throughout the project are justified, discussing the
implications of different design choices and then giving reasons for making the choices
made.
For video analysis, the TVM website was selected, as TVM is one of the
most popular news stations in Malta [59]. Moreover, its extensive collection of daily
news broadcasts, available for both live streaming and later online viewing, solidifies its
position as an ideal choice. The implementation of the system is specifically catered
towards this video format.
For this system, the Python programming language was used throughout as well
as the use of external Python packages. The main packages that were used are
face_recognition, OpenCV, and Pytesseract, which will be discussed further in the next
sections.
Figure 3.1 Level 1 DFD showing the interaction between the 3 main components.
time‐consuming. To overcome this challenge, a limit was set on the number of videos,
and measures were taken to make video analysis faster and more efficient. However,
even with such limitations, sections containing no faces were still adding
considerable processing time that could be reduced further.
Further investigation revealed that TVM has news articles with clips directly
from news videos. These clips usually contain interviews related to the particular
article. However, these clips were missing the captions containing the names of
the interviewees, and this is an issue since name extraction is an important function of
the current system. A system was therefore devised to locate these clips within the
news videos that were extracted.
To accomplish this, a list of articles, called ’Article Links’, was linked to news
videos manually through external sources. This linking contained the start time and
duration of each clip mapped to the respective video. Furthermore, for each article,
there was usually a Maltese and English version, both of which were mapped with the
same clip. A sample of this data can be seen in Figure 3.2.
Figure 3.2 A sample of the article links mapped with the video date 29.07.2022
Using these time stamps for the article links data, a simple system, shown by
Figure 3.3, was designed to extract parts of the videos starting from the given
timestamp until the given duration. For this system, Python’s package for FFmpeg was
used, which is a widely used open‐source program for handling video and audio files.
The statistics for the videos extracted by this method can be seen in Table 3.1 under
the column ’News Video Segments’.
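The clip extraction can be sketched as a thin wrapper around the ffmpeg command line. The paths and timestamps below are hypothetical; `-ss` seeks to the clip start, `-t` limits its duration, and `-c copy` stream-copies instead of re-encoding, keeping extraction fast.

```python
import subprocess

def extract_clip(video_path, start, duration, out_path, run=True):
    """Build (and optionally run) an ffmpeg command that cuts a clip of
    `duration` seconds starting at `start` seconds into the video."""
    cmd = ["ffmpeg", "-y",
           "-ss", str(start),     # seek to the clip start
           "-i", video_path,
           "-t", str(duration),   # keep only this many seconds
           "-c", "copy",          # stream copy: fast, no re-encode
           out_path]
    if run:
        subprocess.run(cmd, check=True)
    return cmd

# Build the command without executing it (hypothetical file names)
cmd = extract_clip("news_2022-07-29.mp4", 754, 142, "segment_01.mp4", run=False)
print(" ".join(cmd))
```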
Figure 3.3 The flow of data within the video segment extraction system. An
additional system was also built to extract transcripts from the article URLs.
Table 3.1 Statistics for the extracted news videos and news video segments.

Basic Statistics                          News Videos     News Video Segments
Number of videos                          52              215
Number of News Video Dates                52              41
Total Video Duration                      31.74 hours     8.48 hours
Average Video Duration                    36.63 minutes   2.37 minutes
Total Frame Count                         3,256,423       858,398
Average Frame Count                       62,623          3,993

Frequencies                               News Videos     News Video Segments
Frames per Second (fps)
‐ 30                                      36              135
‐ 25                                      16              80
Resolutions in pixels (width × height)
‐ 1280 × 720                              40              150
‐ 1024 × 576                              8               43
‐ 960 × 540                               4               22
Figure 3.4 The level 2 DFD for the Video Analysis, showing the interaction between
each sub‐component.
3.3.1 Overview
The Video Analysis system breaks down a video into scenes and analyses each frame
to identify the occurrences of individuals in the video. It consists of several key classes.
The Face Info class stores individual data, including names and facial encodings, while
the Face Info Database class manages these instances, facilitating storage, loading,
sorting, and merging. The Scene class represents occurrence data about an individual
and comprises multiple Scene Instances, which hold the data of each analysed frame.
The News Analysis class combines the functionalities of the Face Info Database and
Scene classes, populating scenes with individual occurrence information and updating
the database with new individuals. This design allows for efficient and accurate
analysis of individual occurrences in the video, while utility functions facilitate data
retrieval and calculations. For a clearer overview of these classes, refer to Appendix C.
The video analysis process involves frame‐by‐frame analysis using face
detection. When a single face is detected, the system tracks it, forming a “Scene.” Face
encodings are extracted for each detected face in a scene, and name extraction is
employed to attempt name retrieval. At the end of a scene, face recognition is
performed by comparing the face encodings with a database of known individuals.
New faces are added to the database with their respective names and encodings. The
specifics of the system’s sub‐components and their interactions are depicted in Figure
3.4.
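The frame-by-frame flow just described can be summarised in a short sketch, with `detect_face`, `encode_face`, `extract_name`, and `match` passed in as stand-ins for the real detection, encoding, OCR, and recognition components (the stubs in the usage example are purely illustrative):

```python
def analyse_video(frames, database, detect_face, encode_face, extract_name, match):
    """Walk the frames; a run of consecutive detections forms a scene.
    When a scene ends, its collected encodings are matched against the
    database, and unseen faces are added with their extracted name.
    Frames are assumed to end with a non-face frame, closing the last scene."""
    scenes, current = [], None
    for frame in frames:
        face = detect_face(frame)
        if face is not None:
            if current is None:
                current = {"encodings": [], "names": []}
            current["encodings"].append(encode_face(face))
            name = extract_name(frame)
            if name:
                current["names"].append(name)
        elif current is not None:            # the scene has just ended
            identity = match(current["encodings"], database)
            if identity is None:             # new individual: store it
                identity = current["names"][0] if current["names"] else "unknown"
                database[identity] = current["encodings"]
            scenes.append((identity, len(current["encodings"])))
            current = None
    return scenes

# Toy run: strings stand in for frames and encodings
frames = ["face:anna", "face:anna", "cut", "face:anna", "cut"]
db = {}
scenes = analyse_video(
    frames, db,
    detect_face=lambda f: f if f.startswith("face") else None,
    encode_face=lambda f: f,
    extract_name=lambda f: f.split(":")[1],
    match=lambda encs, d: next((n for n, s in d.items() if encs[0] in s), None))
print(scenes)  # → [('anna', 2), ('anna', 1)]
```

The second scene is matched against the database populated by the first, mirroring how the real system recognises previously stored individuals.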
The Video Analysis class offers various parameters to customise the analysis
for optimal accuracy, efficiency, and speed. The specific parameters used for the final
analysis include:
• Video Resolution: The resolution of the video was set to 640 × 360 pixels,
allowing for faster face detection, encoding extraction, and tracking.
• Interval for Frame Analysis: The system analyses each frame at an interval of 1
second, balancing the need for accuracy with the efficiency of processing time.
• Tracker Type: The system uses the KCF tracker algorithm for face tracking, which
provides a balance between accuracy and speed.
• Face Recognition Tolerance: The system has a tolerance of 0.6 for Face
Recognition, meaning that a match is only considered valid if the distance metric
between the face encodings in the scene and the database is smaller than or
equal to 0.6. This was the default value, which achieved an accuracy score of
99.38% on the LFW dataset [33].
• Face Encoding Selection: Since multiple face encodings are extracted from a
scene, this parameter controls how a single encoding is selected from them.
The chosen method is to average all the face encodings together.
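The last two parameters combine as follows: the scene's encodings are averaged into a single vector, and a match is accepted only within the 0.6 tolerance. A sketch with toy 4-dimensional vectors in place of real 128-dimensional encodings:

```python
import numpy as np

def scene_encoding(frame_encodings):
    """Collapse a scene's per-frame encodings into one vector by averaging."""
    return np.mean(frame_encodings, axis=0)

def matches(scene_encs, db_encoding, tolerance=0.6):
    """A match is valid only if the averaged scene encoding lies within
    `tolerance` (Euclidean distance) of the stored database encoding."""
    return bool(np.linalg.norm(scene_encoding(scene_encs) - db_encoding) <= tolerance)

stored = np.array([0.2, 0.4, 0.1, 0.7])
scene = [stored + 0.05, stored - 0.05, stored + 0.02]  # noisy observations
print(matches(scene, stored))           # → True (averaging cancels the noise)
print(matches([stored + 2.0], stored))  # → False (far outside tolerance)
```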
When analysing a batch of videos, a multi‐processing approach was
implemented to simultaneously analyse multiple videos, reducing the overall analysis
time. It’s important to note that no shared facial database was utilised to avoid
conflicts; instead, an internal database was used. Additionally, for analysing a single
video, multi‐processing was also employed, with face encodings being extracted
simultaneously with frame analysis.
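The batch set-up can be sketched with a worker pool. A thread pool is used here so the example stays self-contained; the actual system used separate processes, and the stub below only hints at the per-video work, with each worker keeping its own internal database rather than sharing one.

```python
from multiprocessing.pool import ThreadPool

def analyse(video_path):
    """Stand-in for the full per-video analysis (hypothetical stub)."""
    internal_db = {}
    # ... detection / tracking / recognition would populate internal_db ...
    return video_path, len(internal_db)

videos = [f"segment_{i:02d}.mp4" for i in range(4)]
with ThreadPool(processes=4) as pool:   # one worker per video
    results = pool.map(analyse, videos)
print(results)
```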
interviewee was displayed in a white box with blue text. The text could appear in two
separate lines or locations, and sometimes a non‐name appeared as a red box with
white text, which marked the current frame with no name. Moreover, it was observed
that the name usually appeared at the topmost line which included this white box with
blue text.
To accurately extract the name from the video frames, adequate preprocessing
was carried out using OpenCV. This involved closing and opening the image,
thresholding HSV colours, and identifying box‐like contours through conditional
statements and extensive testing to ensure the optimal values and configurations were
used. The preprocessed image was then passed through Pytesseract OCR, which is
based on Google’s Tesseract‐OCR Engine [50]. Pytesseract was selected based on its
accuracy in recognising simple formalised text, such as the white box with blue text
that appeared in the videos.
Challenges arose in cases where text transition animations occurred, making it
difficult to extract the name correctly. To address this issue, the most common
non‐empty text was selected as the name of the individual in the scene, and if the
individual was matched with the database, their original name was used instead. This
approach helped to solve the problem of extracting the correct name from the videos
and provided a reliable method for name extraction.
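The tie-breaking just described (keep the most common non-empty OCR reading across a scene's frames) reduces to a few lines; the frame readings below are hypothetical OCR outputs during a text transition animation:

```python
from collections import Counter

def scene_name(ocr_readings):
    """Pick the most frequent non-empty OCR result for a scene,
    or None if OCR never produced any text."""
    non_empty = [t.strip() for t in ocr_readings if t.strip()]
    if not non_empty:
        return None
    return Counter(non_empty).most_common(1)[0][0]

readings = ["", "J0hn Borg", "John Borg", "John Borg", ""]  # animation noise
print(scene_name(readings))  # → John Borg
```

If the individual is later matched against the database, the stored name overrides this OCR guess, as the text above explains.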
detected face with a known individual, the distance between the face and the database
encodings is calculated for each frame. The average distance across all frames is then
compared to a given threshold, inspired by FaceRec [41].
To address the challenge of selecting the facial encoding of a person when
multiple encodings are extracted due to appearing in multiple frames, three different
approaches were used: taking the first frame as the encoding, taking the middle frame
as the encoding, inspired by Gao et al.’s [37] paper on scene detection, and calculating
the average extracted encodings. These strategies help in selecting the most suitable
encoding for each person, improving facial recognition accuracy.
However, the system faced challenges when using a larger database of faces, as
matching an increasing number of individuals with other individuals decreased the
system’s accuracy. Figure 4.1 demonstrates the decrease in performance as the
database size increased.
3.3.5 Validation
To evaluate the performance of the Video Analysis system, three main parts will be
measured: OCR name extraction, face recognition of the system, and duration of
occurrence for each person. These metrics will help assess the effectiveness of the
system in accurately identifying and tracking individuals in a video. By measuring the
accuracy of name extraction and face recognition, the system’s ability to extract
meaningful information from the video will be evaluated. Additionally, the duration of
occurrence measurement will provide insight into the system’s ability to accurately
track individuals over time. By analysing the performance of each component, the
overall effectiveness of the Video Analysis system can be assessed. These evaluations
will be crucial in determining the system’s strengths and weaknesses and identifying
areas for improvement.
Initially, the Video Analysis system’s results were obtained using the transcript
generated in Section 3.2.2. However, upon closer inspection, it became evident that
these results were flawed. The transcripts did not include timestamps for individuals,
and in a single clip, there could be multiple instances of different people or no people
at all. As a result, the results obtained using the transcripts were deemed insufficient to
meet the validation requirements. More detailed annotations were required to
accurately evaluate the system’s performance, prompting the need for more
comprehensive and accurate data sets.
To address the issues with the initial results obtained from the transcript‐based
evaluation, a new validation set was created with the help of independent third‐party
annotators. The validation set was carefully annotated and included all occurrences of
individuals in the news video segments in Excel format. The annotations included the
timestamps, names, and flags indicating whether the name was shown in the current
scene and whether the person was a presenter. The creation of this validation set
provided a more accurate and comprehensive dataset for evaluating the system’s
performance and allowed for a more detailed analysis of the results.
The annotations for all 215 video segments, as presented in Table 3.1, comprised a
total of 1008 annotations, involving 244 distinct individuals. Out of these annotations,
580 featured named individuals, while 428 featured unnamed individuals. Additionally,
218 annotations indicated the presence of presenters, and 262 annotations indicated
Figure 3.6 This example demonstrates scene data generated from video analysis. It is
presented in Excel format for comparison with the validation in Figure 3.5. Many
unnamed faces are detected, indicating that the program is detecting background faces
rather than identifiable individuals.
Figure 3.7 A frame timeline showing 3 people who appeared, along with the real‐time
analysis annotations. The generated analysis timeline can be seen in Figure 3.8
Figure 3.8 The analysis timeline of Figure 3.7, showing 3 people along with their
timestamps and durations.
4 Evaluation
In the previous chapter, a system was developed to automatically analyse news videos
by extracting a video database, analysing individual information using video analysis,
and integrating the results into a user‐friendly GUI. This chapter will evaluate the full
system, including its strengths, weaknesses, and performance.
Since the core system is built around the Video Analysis component, it is
especially important to evaluate this component thoroughly. While there have
been studies on news videos [8, 37, 38, 47–49] and face recognition in videos [22, 41],
few studies have simultaneously addressed the task of live face recognition and name
extraction. Therefore, sub‐components needed to be calculated and evaluated
separately to assess the system’s overall performance.
The evaluation will follow a similar structure to the previous chapter, with each
component and sub‐component presented in nested sections. The following sections
will provide a detailed evaluation of the system.
provide a complete representation of the analysis that could be conducted on the full
news video. In future work, this limitation could be addressed by providing annotations
for the full news videos. Additionally, using a larger dataset with annotations from
multiple annotators could lead to more accurate annotations, as the current
annotations, which were not perfect, had to be modified in some cases.
4.2.1 Metrics
Name Extraction
• True Positives (TP): The number of actual names that were correctly identified.
• False Positives (FP): The number of names that were incorrectly identified.
• True Negatives (TN): The number of non‐names that were correctly identified by
the name extraction sub‐component.
• False Negatives (FN): The number of actual names that were missed by the name
extraction sub‐component.
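From these counts, the standard precision, recall, and F-score follow directly. The counts in the usage example are illustrative only, not results from the thesis:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (TN is not needed here)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# → precision=0.90 recall=0.75 f1=0.82
```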
Face Recognition
Duration of Individuals
To evaluate the accuracy of the scene prediction sub‐component, or the tracker, for
each person, the Mean Absolute Error (MAE) will be calculated using the following
equation:
MAE_duration = (1/n) ∑_{i=1}^{n} |ŷᵢ − yᵢ|   (4.1)
Here, n is the number of common individuals between the actual and predicted
individuals, yi is the actual duration of the i‐th individual, ŷi is the predicted duration of
the i‐th individual, and | · | denotes the absolute value function.
The MAE duration represents the average absolute difference between the predicted
and actual durations across all individuals: the smaller the MAE duration, the better
the predicted durations match the ground truth.
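Equation 4.1 translates directly into code. The sketch below assumes durations are stored in dictionaries keyed by individual name; this layout is illustrative, not necessarily the one used in the implementation:

```python
def mae_duration(actual, predicted):
    """Mean absolute duration error (in seconds) over the individuals
    common to the actual and predicted results (name -> duration)."""
    common = actual.keys() & predicted.keys()
    if not common:
        return 0.0
    return sum(abs(predicted[name] - actual[name]) for name in common) / len(common)
```

For example, actual durations {"A": 12.0, "B": 30.0} against predictions {"A": 10.0, "B": 33.0, "C": 5.0} give an MAE of 2.5 seconds over the two common individuals.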
Large Databases
To demonstrate the impact of a sizeable facial database on face recognition, a chart will
display the metrics following each video analysis. As each video is processed,
individuals with identified names are added to a shared database, gradually expanding
it over time. The facial database will persist throughout subsequent video analyses,
while metrics will be collected after each video to illustrate the effect of the growing
database.
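The procedure can be sketched as a loop in which a single facial database object persists across every analysis; `analyse_video` and its inputs are illustrative stand‐ins for the real Video Analysis component:

```python
def analyse_video(video, database):
    """Illustrative stand-in for the analysis of one video: every individual
    whose name was extracted is added to the shared database, and a metrics
    record is returned for plotting."""
    for name, encoding in video["named_faces"]:
        database.setdefault(name, encoding)
    return {"db_size": len(database)}  # real records also hold P, R, F1, A, MAE, time

shared_database = {}  # persists across all video analyses
history = []          # one metrics record per video, collected after each run
for video in [{"named_faces": [("Person A", "enc-a")]},
              {"named_faces": [("Person A", "enc-a"), ("Person B", "enc-b")]}]:
    history.append(analyse_video(video, shared_database))
```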
The chart will consist of four separate graphs, each showcasing a different
metric. The first graph will display the metrics for Name Extraction, while the second
will present the metrics for Face Recognition. The third graph will showcase the
Duration of Individuals, using the average MAE. Finally, the fourth graph will depict the
time taken for analysis completion. Together, these four graphs will provide an
overview of how increasing the facial database size will affect the performance of the
system.
4.2.2 Results
The results presented in Table 4.1 demonstrate the performance of the Video Analysis
system for each variation tested. The table is divided into two sections: one for the
scene detection methods, and the other for the most recent method using face
tracking. The numbers in parentheses represent the number of seconds skipped
between frames during analysis. For the scene detection methods, two values are
shown: the number of seconds skipped during analysis, and the number of seconds
skipped during scene detection. The label ’Skips’ has no particular significance beyond
indicating that the default parameters were otherwise used. ’Def.’ indicates that the default
resolution was used, while 640 × 360 pixels was used to speed up frame analysis in
other cases. ’Fir.’, ’Mid.’, and ’Avg.’ refer to the first, middle, and average face encodings,
respectively, as explained in Section 3.3.3. For scene detection, only the first
encountered face encoding was used for each individual. The values of P (Precision), R
(Recall), F1 (F1‐score), and A (Accuracy) were calculated as explained in Sections 4.2.1
and 4.2.1. The MAE was calculated for the duration of each individual, as described in
Section 4.2.1. The Time column shows the number of hours required to analyse all of
the ’News Video Segments’ videos. Note that multiprocessing was used in some
analyses, reducing their processing time. The best result for each metric is shown in
bold and underlined.
At first glance, it can be seen that using the video’s default resolution achieved the
best results. However, this slight improvement in the metrics comes at the cost of
roughly doubling the processing time, as can be seen when comparing ’Def. (2, 2)’ with
’Skips (2, 2)’ and ’Def. (2)’ with ’Avg. (2)’. These variations use identical parameters
except for resolution: one uses the video’s default resolution and the other 640 × 360.
Note that the resolutions of all the videos are listed in Table 3.1.
The observed discrepancy between the face recognition metrics and the name
extraction metrics is worth noting. It is important to understand that the validation
process was primarily focused on the extracted names of the recognised individuals.
Consequently, if an incorrect name was extracted, it would not only impact the
accuracy of the name extraction metrics but also influence the face recognition
metrics. Therefore, any inaccuracies or errors in the name extraction process would
have a cascading effect on the overall evaluation of the face recognition system.
Table 4.1 The performance of the Video Analysis system on the News Video Segments
dataset is evaluated using the precision (P), recall (R), F1‐score (F1), and accuracy (A)
metrics for name extraction and face recognition. Higher values of these metrics
indicate better system performance. Conversely, for mean absolute error (MAE) and
processing time, lower values are preferable. These results highlight the differences
between variations, such as scene detection versus face tracking, the encoding
selection process, and the frequency of frames used for analysis.
Upon further investigation, it was discovered that the name extraction process
encountered several issues. Firstly, names containing punctuation marks were not
properly detected, resulting in their omission. Secondly, there was a specific problem
with the letter ’L’ where the right side of the caption box was mistakenly identified as
this letter, and sometimes ’L’ was incorrectly interpreted as the end of a name. Thirdly,
due to optimisation for faster computation, there was a limit on the frame view,
causing some names to exceed this limit and be partially read. Finally, frames were
skipped during video analysis for increased speed, inadvertently leading to instances
where frames containing names were also skipped. These shortcomings resulted in
reduced accuracy and incomplete name extraction. Addressing these issues would be
crucial to enhance the effectiveness of the extraction process.
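Some of these issues, particularly punctuation breaking name detection, could be mitigated by post‐processing the OCR output. The following is a hypothetical cleanup sketch, not the processing actually used by the system:

```python
import re

def clean_ocr_name(raw):
    """Hypothetical OCR post-processing: keep letters (including accented
    ones), apostrophes, hyphens and spaces, collapse whitespace, and
    title-case the result."""
    kept = re.sub(r"[^A-Za-zÀ-ž'\- ]", " ", raw)
    kept = re.sub(r"\s+", " ", kept).strip()
    return kept.title()
```

For instance, a caption edge misread as punctuation, such as "| JEAN  SACCO |", would be reduced to "Jean Sacco" (the name is illustrative).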
Regarding the numbers in parentheses, these represent the interval at which the
video frames are analysed. A lower interval means that more frames need to be
analysed, which increases the duration of the analysis. This is clearly seen in the case
of ’Mid. (0.5)’, where the chosen interval of 0.5 seconds took more than 7 hours to
complete instead of the usual 2–5 hours. Furthermore, in the case of Scene Detection,
using a lower interval for the
scene detection sub‐component seems to decrease the MAE and thus achieve better
performance. From the table, the ideal interval for scene detection seems to be 1
frame per second, which is the lowest configuration that was tested by the variation
’Skips (2, 1)’. This also aligns with the findings of Lisena et al. [41]. The frame‐analysis
interval, on the other hand, does not seem to matter much for Scene Detection, as long
as the system is able to identify a face within each scene. For Face Tracking, however,
a large interval causes the tracker to lose the face when there is substantial movement,
increasing the MAE of the individual durations. This can be seen with ’Mid. (0.5)’,
which achieved the lowest value, whereas ’Fir. (2)’, ’Mid. (2)’, ’Avg. (2)’, and ’Def. (2)’
have the largest interval and consequently reached higher, sub‐optimal MAE values.
Even so, a more specific interval value emerges: ’Avg. (1)’ and ’Def. (1)’, with an
interval of 1 second, achieved the best results for the system. In line with the scene
detection interval and Lisena et al. [41], 1 second appears to be the right value for
such a system.
Processing fewer frames usually improves speed but reduces accuracy. Surprisingly,
the 1‐second interval performs better than the 0.5‐second interval. This is because
the 0.5‐second system is more susceptible to caption transition animations, leading to
incorrect name detection and higher error rates. Additionally, analysing more frames
sometimes results in misread names, while the 1‐second system simply misses those
names altogether. As a result, the accuracy of the 0.5‐second system is lower in the
confusion‐matrix calculations. Appendix I examines this issue through specific cases.
The ’Skips (2, 1)’ variation appears to be the most effective system for Scene
Detection, disregarding the default resolution. This approach involves analysing frames
at a 2‐second interval and processing scenes at a 1‐second interval. The higher
accuracy of this method can be attributed to more precise scene detection, as
explained previously, leading to improved identification and storage of individuals for
future reference. However, using a shorter interval results in a longer analysis duration,
as indicated in the table.
The comparison between Scene Detection and Face Tracking reveals that Face
Tracking achieves a lower MAE value, implying better results in selecting individual
timestamps. Overall, Face Tracking outperforms Scene Detection, and although further
improvements can be made, as evidenced by the FaceRec [41] system, it appears to be
the more promising approach.
There is no noticeable difference in face recognition performance between the
different face encodings. It is worth noting that, similarly to Gao et al. [37], who used
middle frames for classification, using the middle face encoding tends to produce
slightly better results. However, relying solely on the
middle face or the first face encoding may not always be reliable if the face is unclear
or not well represented in that particular frame. Therefore, it is preferable to use an
average calculation of all face encodings for improved accuracy.
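Averaging the encodings collected for one individual is straightforward; a NumPy sketch, assuming fixed‐length encoding vectors such as dlib’s 128‐dimensional face encodings:

```python
import numpy as np

def average_encoding(encodings):
    """Average all face encodings gathered for one individual, which tends to
    be more robust than a single (first or middle) encoding when some frames
    are blurry or poorly lit."""
    return np.mean(np.stack(encodings), axis=0)
```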
Based on the results presented in Table 4.1, the optimal configuration for the
Video Analysis system, supported by [37, 41], is to use Face Tracking and analyse
frames at 1‐second intervals. While the default resolutions performed slightly better,
this advantage is not significant enough to justify the longer processing time they
require. Therefore, the 640 × 360 resolution is recommended instead. Overall, the
system performs relatively well, although there is room for improvement in each of the
sub‐components.
Figure 4.1 The Incremental Database Results display metrics as the facial database size
increases. For Name Extraction and Face Recognition, higher values are better, ranging
from 0 to 1. Conversely, for the time graph, lower values indicate faster processing,
which is preferable. The Average MAE measures time error in seconds, where lower
values are desirable.
size of the database, the system is less likely to identify new people, even in cases
where new individuals are present. To resolve this problem, the distance threshold for
matching could be adjusted to be stricter. However, this could result in a lower
likelihood of identifying already known individuals.
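The matching step being tuned here can be sketched as a nearest‐neighbour search over encoding distances, where 0.6 is the default Euclidean‐distance threshold of the face_recognition library; the data layout is illustrative:

```python
import numpy as np

def match_face(encoding, database, threshold=0.6):
    """Return the name of the closest known face, or None when even the
    closest one is farther than `threshold`; lowering the threshold makes
    matching stricter."""
    best_name, best_dist = None, float("inf")
    for name, known in database.items():
        dist = np.linalg.norm(encoding - known)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None
```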
Improving on this might require investing in better face recognition methods, such as
ArcFace [18], which extracts more robust and meaningful features. However, one of
the strengths of this tool (Section 3.3) is its ability to perform fast analysis of news
videos. Training a model on a face database for each video would be time‐consuming,
and specific parameters might need to be tuned for each case to achieve optimal
results. Therefore, finding a balance between accuracy and speed would be crucial in
improving the performance of the Video Analysis component.
5 Conclusion
This study aimed to create a system in which individuals appearing in news videos
could be identified using computer vision. This was achieved by first extracting a
number of news videos from the TVM website. The system was then built using face
detection, face encoding extraction, face recognition, and OCR. Finally, a GUI was
created to facilitate interaction between the user and the news video analysis system.
The best variation of the system achieved an 83% accuracy for name extraction
and a 63% accuracy for face recognition. It had an average error of 5.47 seconds in
identifying individuals’ timestamps. The system was designed to run in real‐time and
featured a user‐friendly GUI. However, there are ongoing challenges in improving
accuracy, especially when dealing with a large facial database, as accuracy tends to
decrease with a larger number of faces.
The developed tool has the potential to facilitate efficient analysis of news
videos, enabling quick extraction of relevant information. This can contribute to raising
awareness of biases present in news videos, both intended and unintended, benefiting
users, news sources, and the general public. Furthermore, the tool can assist journalists
in quickly retrieving pertinent information and provide the general public with an
unbiased summary of news, streamlining the process of staying up‐to‐date with
current events.
However, there are ethical concerns that need to be taken into account,
especially regarding the extraction of individual facial information. It is crucial to
handle this data responsibly and prioritise the protection of privacy, particularly when
it comes to non‐public figures. The potential misuse of such information highlights the
need for cautious and ethical practices in its utilisation.
• Expanding the system to work with a wider range of news sources by developing
a more robust video parsing algorithm capable of handling different video
formats and styles.
References
[1] A. Edmunds and A. Morris, “The problem of information overload in business
organisations: A review of the literature,” International journal of information
management, vol. 20, no. 1, pp. 17–28, 2000.
[2] A. A. Naz and R. A. Akbar, “Use of media for effective instruction its importance:
Some consideration,” Journal of elementary education, vol. 18, no. 1‐2, pp. 35–40,
2008.
[3] D. Bernhardt, S. Krasa, and M. Polborn, “Political polarization and the electoral
effects of media bias,” Journal of Public Economics, vol. 92, no. 5‐6,
pp. 1092–1104, 2008.
[4] J.‐M. Eberl, H. G. Boomgaarden, and M. Wagner, “One bias fits all? three types of
media bias and their effects on party preferences,” Communication Research,
vol. 44, no. 8, pp. 1125–1148, 2017.
[5] S. DellaVigna and E. Kaplan, “The fox news effect: Media bias and voting,” The
Quarterly Journal of Economics, vol. 122, no. 3, pp. 1187–1234, 2007.
[6] D. N. Hopmann, R. Vliegenthart, C. De Vreese, and E. Albæk, “Effects of election
news coverage: How visibility and tone influence party choice,” Political
communication, vol. 27, no. 4, pp. 389–405, 2010.
[7] K. Choroś, “Video structure analysis and content‐based indexing in the
automatic video indexer avi,” in Advances in Multimedia and Network Information
System Technologies, Springer, 2010, pp. 79–90.
[8] S. Lee and K. Jo, “Strategy for automatic person indexing and retrieval system in
news interview video sequences,” in 2017 10th International Conference on
Human System Interactions (HSI), IEEE, 2017, pp. 212–215.
[9] H. Zhang, Y. Gong, S. Y. Tan, et al., “Automatic parsing of news video,” in 1994
Proceedings of IEEE International Conference on Multimedia Computing and
Systems, IEEE, 1994, pp. 45–54.
[10] L. Lu, H.‐J. Zhang, and H. Jiang, “Content analysis for audio classification and
segmentation,” IEEE Transactions on speech and audio processing, vol. 10, no. 7,
pp. 504–516, 2002.
[11] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A survey of deep neural
network architectures and their applications,” Neurocomputing, vol. 234,
pp. 11–26, 2017.
[12] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, “A survey of convolutional neural
networks: Analysis, applications, and prospects,” IEEE transactions on neural
networks and learning systems, 2021.
[13] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in
2005 IEEE computer society conference on computer vision and pattern recognition
(CVPR’05), IEEE, vol. 1, 2005, pp. 886–893.
[14] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using
multitask cascaded convolutional networks,” IEEE signal processing letters, vol. 23,
no. 10, pp. 1499–1503, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2016, pp. 770–778.
[16] V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran, and M. Grundmann,
“Blazeface: Sub‐millisecond neural face detection on mobile gpus,” arXiv preprint
arXiv:1907.05047, 2019.
[17] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual
transformations for deep neural networks,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2017, pp. 1492–1500.
[18] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss
for deep face recognition,” in Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 2019, pp. 4690–4699.
[19] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for
face recognition and clustering,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2015, pp. 815–823.
[20] S. R. Boyapally and K. Supreethi, “Facial recognition and attendance system using
dlib and face recognition libraries,” 2021 International Research Journal of
Modernization in Engineering Technology and Science, pp. 409–417, 2021.
[21] P. F. De Carrera and I. Marques, “Face recognition algorithms,” Master’s thesis in
Computer Science, Universidad Euskal Herriko, vol. 1, 2010.
[22] H. Wang, Y. Wang, and Y. Cao, “Video‐based face recognition: A survey,”
International Journal of Computer and Information Engineering, vol. 3, no. 12,
pp. 2809–2818, 2009.
[23] M. Everingham and A. Zisserman, “Automated person identification in video,” in
International Conference on Image and Video Retrieval, Springer, 2004,
pp. 289–298.
[24] A. Geitgey, Machine learning is fun! part 4: Modern face recognition with deep
learning, Sep. 2020. [Online]. Available:
https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-
face-recognition-with-deep-learning-c3cffc121d78.
[25] T. Shan, B. C. Lovell, and S. Chen, “Face recognition robust to head pose from
one sample image,” in 18th International Conference on Pattern Recognition
(ICPR’06), IEEE, vol. 1, 2006, pp. 515–518.
[26] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple
features,” in Proceedings of the 2001 IEEE computer society conference on computer
vision and pattern recognition. CVPR 2001, IEEE, vol. 1, 2001, pp. I–I.
[27] C. Zhang and Z. Zhang, “A survey of recent advances in face detection,” 2010.
[28] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “Retinaface: Single‐shot
multi‐level face localisation in the wild,” in Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 2020, pp. 5203–5212.
[29] H. Zhang, A. Kankanhalli, and S. W. Smoliar, “Automatic partitioning of
full‐motion video,” Multimedia systems, vol. 1, no. 1, pp. 10–28, 1993.
[30] X. Zhao, “3d face analysis: Landmarking, expression recognition and beyond,”
Ph.D. dissertation, Ecully, Ecole centrale de Lyon, 2010.
[31] V. Bruce and A. Young, “Understanding face recognition,” British journal of
psychology, vol. 77, no. 3, pp. 305–327, 1986.
[32] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms‐celeb‐1m: A dataset and
benchmark for large‐scale face recognition,” in Computer Vision–ECCV 2016:
14th European Conference, Amsterdam, The Netherlands, October 11‐14, 2016,
Proceedings, Part III 14, Springer, 2016, pp. 87–102.
[33] G. B. Huang, M. Ramesh, T. Berg, and E. Learned‐Miller, “Labeled faces in the
wild: A database for studying face recognition in unconstrained environments,”
University of Massachusetts, Amherst, Tech. Rep. 07‐49, Oct. 2007. [Online].
Available: http://vis-www.cs.umass.edu/lfw/results.html.
[34] D. King, Dlib‐ml: A machine learning toolkit, 2010. [Online]. Available:
http://dlib.net/.
[35] D. Zhang, J. Li, and Z. Shan, “Implementation of dlib deep learning face
recognition technology,” in 2020 International Conference on Robots & Intelligent
System (ICRIS), IEEE, 2020, pp. 88–91.
[36] Ageitgey, Ageitgey/face_recognition: The world’s simplest facial recognition api for
python and the command line. [Online]. Available:
https://github.com/ageitgey/face_recognition.
[51] R. Smith, “An overview of the tesseract ocr engine,” in Ninth international
conference on document analysis and recognition (ICDAR 2007), IEEE, vol. 2, 2007,
pp. 629–633.
[52] C. Patel, A. Patel, and D. Patel, “Optical character recognition by open source ocr
tool tesseract: A case study,” International Journal of Computer Applications,
vol. 55, no. 10, pp. 50–56, 2012.
[53] W. Wattanarachothai and K. Patanukhom, “Key frame extraction for text based
video retrieval using maximally stable extremal regions,” in 2015 1st International
Conference on Industrial Networks and Intelligent Systems (INISCom), IEEE, 2015,
pp. 29–37.
[54] S. Kamate and N. Yilmazer, “Application of object detection and tracking
techniques for unmanned aerial vehicles,” Procedia Computer Science, vol. 61,
pp. 436–441, 2015.
[55] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” Acm computing
surveys (CSUR), vol. 38, no. 4, 13–es, 2006.
[56] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object tracking
using adaptive correlation filters,” in 2010 IEEE computer society conference on
computer vision and pattern recognition, IEEE, 2010, pp. 2544–2550.
[57] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr, “End‐to‐end
representation learning for correlation filter based tracking,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2017, pp. 2805–2813.
[58] A. He, C. Luo, X. Tian, and W. Zeng, “A twofold siamese network for real‐time
object tracking,” in Proceedings of the IEEE conference on computer vision and
pattern recognition, 2018, pp. 4834–4843.
[59] R. Agius, Tvm news bulletin is the most widely‐followed programme in malta, Dec.
2021. [Online]. Available: https://tvmnews.mt/en/news/tvm-news-bulletin-
is-the-most-widely-followed-programme-in-malta/.
[60] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High‐speed tracking with
kernelized correlation filters,” IEEE transactions on pattern analysis and machine
intelligence, vol. 37, no. 3, pp. 583–596, 2014.
Appendix A Level 0 DFD
Figure A.1 The level 0 Data Flow Diagram (DFD) illustrates the user’s interactions with
the GUI system. Users can input a video path for video analysis and a facial database
path for facial matching. They can also specify an analysis path to save the analysis
results. If no analysis path is provided, the program will automatically generate one.
Additionally, if only the analysis path is entered without the video path, the program
will load an existing analysis for further evaluation or export a facial database.
Appendix B Video Frame/Scene Categories
Figure B.2 A presenter speaking with their name displayed in a caption. This is a
variation of category 1.
Appendix C Video Analysis Class Diagram
Figure C.1 The Video Analysis Class Diagram shows the Python classes used, along
with their interactions.
Appendix D Analysis Timeline
Appendix E Timeline of Analysis vs Validation
Appendix F Visualisation GUI
Figure F.1 The main GUI showing analysis for video segment ’09.08.2022 segment
#0.mp4’. The timeline shows that 2 people appear along with their respective
timestamps and durations.
Figure F.2 The export window GUI shows the individuals’ names, timestamps, duration,
and images, along with the option to select and export individuals to an external facial
database.
Appendix G Incremental Database
Figure G.1 The graph shows the analysis with an incremental database, using a face
recognition threshold of 0.6. It illustrates how performance degrades as more
individuals are added to the face recognition database.
Appendix H File Structure
The system was designed with the purpose of creating two different file structures to
improve the convenience of saving, loading, and debugging.
The first file structure, called the ”analysis data,” contains all the relevant
information about the analysed video. This includes the parameters used, the
generated analysis results, precomputed calculations like the timeline, and other
related data. The analysis data is stored in both Python pickle format (’.pkl’ and ’.dat’)
and a human‐readable format using JSON.
The second file structure, known as the ”facial database,” stores comprehensive
information about individuals. It includes their names, facial encodings, and profile
images. Similar to the analysis data, the facial database is stored in both Python pickle
format and JSON.
Additionally, the facial database maintains a JSON log that keeps track of any
errors encountered when adding individuals to the database.
In both the analysis data and facial database cases, the system also stores
images of the respective individuals. In the analysis data, these images represent the
individuals who appeared in the analysed video. In the facial database, the images are
associated with individuals stored in the database.
The general file structure can be seen as follows:
root
    analysis
        <video_name>
            Images
            analysis_data.json
            analysis_data.pkl
    database
        Images
        face_database.dat
        face_database.json
        log.json
Figure H.1 This figure shows an example of the generated files for video analysis.
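The dual‐format saving used by both file structures can be sketched as follows; the function name and paths are illustrative:

```python
import json
import pickle

def save_analysis(data, base_path):
    """Write the same analysis data twice: as a pickle for fast reloading and
    as JSON for human-readable debugging, mirroring the layout above."""
    with open(base_path + ".pkl", "wb") as f:
        pickle.dump(data, f)
    with open(base_path + ".json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, default=str)
```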
Figure H.2 This image displays the extracted images obtained from video analysis.
When individuals are named, each image corresponds directly to the frame in which
the person’s name was extracted. This example aligns with Figure 3.7 and Figure 3.8,
where the same identified individuals can be observed.
Figure H.3 This figure shows an example of the generated files for the video analysis
facial database.
Figure H.4 This image displays the face images extracted through video analysis and
subsequently stored in the database, representing all individuals currently stored
there. These individuals correspond to the results shown in Figure 4.1.
Appendix I Comparison of frame intervals of 0.5
and 1 seconds
I.1 Case 1
The 0.5‐second skips found an incorrect name, while the 1‐second skips did not find a
name at all. This gave an accuracy of 33% for the 0.5‐second skips and 50% for the
1‐second skips. The inaccurate name extraction happened because the OCR has
trouble with punctuation such as apostrophes.
Figure I.1 The 0.5‐second skip system detected more names than the 1‐second
system. However, the extracted name was incorrect, resulting in 33% accuracy: one
name was matched, one was extra, and one was missing.
Figure I.2 The 1‐second skip system has a missing name instead of an incorrect one.
This gives 50% accuracy, since one name matched and another was missing. This
nominally better accuracy is misleading, as the outcome is actually slightly worse.
I.2 Case 2
As in Case 1, the 0.5‐second skips found an incorrect name, while the 1‐second skips
did not find a name at all. Here, the inaccuracy was caused by a transition animation on
the text caption, as shown in Figure I.5.
Figure I.5 Problematic frame for the 0.5‐second skip system, which extracts an
incorrect name (reading ”LDA SE TINHATAR PRIM MINISTRU”). This frame is skipped
by the 1‐second skip system, resulting in no name being found.
I.3 Case 3
This case had the same face recognition accuracy for both systems, but once again a
captured transition animation on the text caption gave the 0.5‐second skips a lower
name extraction score.
Figure I.8 Incorrect Name Extraction for 0.5‐second skip system. In this case, the face
recognition successfully matched the individual, although name extraction was still
inaccurate (reading ”YDE CARUANA”).
Appendix J Survey
This survey explored the importance, benefits, and concerns of analysing news videos
automatically. The charts were exported from Google Forms; only the information
recoverable from the export is summarised below.
[Chart residue preceding the first question: slices labelled News (17.6% / 29.4%) and a
bar Television 4 (25%); the originating question was lost in the export.]
[Chart: ”What type of news content do you prefer? (Select all that apply.)” —
17 responses; legible value: No preference 2 (11.8%).]
[Chart: ”These questions are all related to Automatic Analysis of News Videos. Please
select whether using such a technology would help in the following:” — the category
labels were garbled in the export.]
[Chart: ”Would you be interested in using a tool that provides automatic analysis of
news videos?” — 17 responses; options Yes / No / Not sure, with slices of 82.4% and
17.6%.]
[Chart: ”In what ways would you use automatic analysis of news videos? (Select all
that apply.)” — 17 responses; legible values: Automatic transcription 13 (76.5%),
Sentiment analysis 2 (11.8%), Fact‐check or verify information 10 (58.8%).]
[Chart: ”Would you pay for a service that provides automatic analysis of news
videos?” — 17 responses; options Yes / No / Unsure, with slices of 76.5%, 11.8%, and
11.8%.]
[Chart: ”Which approach do you believe is best for analysing and critiquing news?” —
17 responses; options Human journalists / AI automation / Hybrid (AI and reviewed by
humans); the dominant slice is 94.1%.]
Are there any concerns you have about the use of automatic analysis of news videos?
If so, please specify. (7 responses; answers reproduced verbatim)
• no
• will AI be verifying the sources of information? as there are a lot of sites and news
sources that post fake news.
• May not always be accurate and perhaps eventually human journalists would loses
their jobs
• I am wary of the risks done by AI when trained on biased data. This can lead to AI
mirroring the same implicit biases as people
• The main problem with News today is the bias, and AI alone cannot help with that
[Charts: demographics — gender (Male / Female / Other / Prefer not to say; slices of
58.8% and 41.2%), age (18‐24 / 25‐34 / 35‐44 / 45‐54 / 55‐64 / 65 and over / Prefer
not to say; slices of 70.6% and 23.5%), one question answered 100% one way, and
employment status (Full‐time employed / Part‐time employed / Self‐employed /
Student / Unemployed / Retired / Prefer not to say; slices of 70.6% and 17.6%).]
Appendix K Code
The code is available at the following GitHub repository:
https://github.com/OpNoob/Automatic-Analysis-of-News-Videos. Setup
instructions are provided within the repository. Please note that the execution code
and the results generation code have been separated into different branches.