de Sa Lowande, R. (2022). Visual Question Answering (VQA) Analyses for Post-Disaster Damage Detection and Identification Using Aerial Footage [University of West Florida Libraries]. https://ircommons.uwf.edu/esploro/outputs/graduate/Visual-Question-Answering-vqa-Analyses-for/99380090596106600
VISUAL QUESTION ANSWERING (VQA) ANALYSES FOR POST-DISASTER DAMAGE DETECTION AND IDENTIFICATION USING AERIAL FOOTAGE
By
Rafael De Sa Lowande
May 2022
THESIS CERTIFICATION
Rafael de Sa Lowande defended this thesis on March 25, 2022. The members of the thesis
committee were:
The University of West Florida Graduate School verifies the names of the committee members
and certifies that the thesis has been approved in accordance with university requirements.
Dr. Kuiyuan Li, Dean, Graduate School
Copyright © by RAFAEL DE SA LOWANDE 2022
I would like to thank my supervising professor Dr. Hakki Erhan Sevil for all the support
and mentoring he has provided me throughout this time. All of his invaluable advice during these
past few years provided me with the necessary knowledge I needed to be able to complete this
thesis. I started working with Dr. Sevil in 2019 doing a research assignment regarding damage
detection using computer vision. It was my first time being introduced to this possible form of
using computer vision and robotics in order to help other people in crucial times. Since then, we
have published a few papers together, and I have realized that this is the type of research I would like to continue conducting throughout my life. I am deeply grateful for the guidance Dr. Sevil provided during this time. He was always available to help me whenever I had questions and helped me revise and write all my papers. I would also like to thank the knowledgeable academic committee members, Dr. Gilbar, Dr. Mahyari, and Dr. Khabou, for their interest in my research and for taking the time to serve on my thesis committee.
I am grateful to all the teachers who have taught me, from elementary school in Brazil to graduate school in the United States. I am also grateful to the Department of Electrical and
Computer Engineering, at UWF, for all the support and attention they provided me in order to
help me complete my studies. I would like to thank my girlfriend Milena Ghtait for encouraging
and inspiring me to pursue graduate studies. Without her help, I would not have been able to move forward and finish my thesis. I deeply appreciate all her help,
encouragement and motivation throughout this time.
Finally, I would like to express my deep gratitude to my parents and my sister, who have
encouraged, inspired, assisted and sponsored me through my entire life. I am also extremely
grateful to my family for their support and patience.
May 07, 2022
TABLE OF CONTENTS
Acknowledgments ……………………...………………………………………………..iv
List of Tables ………………………...………………………………………………….vii
List of Figures …………………………………………………………………………..viii
List of Abbreviations ……………………...……………………………………………..xi
Abstract ……………………………...…………………………………………………..xii
Chapter 1 Introduction ………...……………………………………………………….…1
1.1 Motivation and Problem Statement …………………………………...……………...1
1.2 Literature Survey ……………………………………………...…………...…. …..…2
1.2.1 Unmanned Vehicles ………………………………………………………2
1.2.2 Cascade Classifier …………………………………………………...……2
1.2.3 Convolutional Neural Network ……………………………………….…. 3
1.2.4 Visual Question Answering……………………...………………………. 3
1.3 Organization of the Thesis ……………………………...…………………….………4
1.4 Original Contribution ……………………………………………….……….………..5
Chapter 2 Cascade Classifier ……………………………………………………………..6
2.1 Overview ……………………………………………………………………...6
2.2 Cascade Classifier Model Using Haar Features ………………………………7
2.2.1 Background Theory ……………………………………..………….7
2.2.2 Methodology ………………………………………………….…….8
2.3 Training …………………………………………………………………..….11
2.3.1 Positive Images ……………………………...…………………….12
2.3.2 Negative Images ………………………………………..………….15
2.3.3 Computation Stages ………………...……………………….…….18
2.4 Results ……………………………………………………………………….18
Chapter 3 Convolutional Neural Network …………………………………………….24
3.1 Overview …………………………………………………………………….24
3.2 Faster R-CNN model ……...…………………………………………..…….28
3.2.1 Background Theory ……………………...………………….…….28
3.2.2 Methodology ………………………………………………...…….29
3.3 Training ……………………………………………………………..……….30
3.4 Results ……………………………………………………………………….34
Chapter 4 Visual Question Answering ………………………………………………….39
4.1 Overview …….………………………………………………...…………….39
4.2 Annotations ………………………………………………………………….41
4.3 CNN + BoW ……………………………………………………..………….44
4.3.1 Methodology ………...…………………………………………….44
4.3.2 CNN model …………………………………………….………….44
4.3.3 BoW model ………………………………………………………..46
4.3.4 Results …………………………………………………………….49
Chapter 5 Analyses and Comparison ………………………………………………….55
5.1 Cascade Classifier ……………………………...………………...………….55
5.2 Convolutional Neural Network …………………………………..………….59
5.3 Visual Question Answering ………………………...……………………….62
5.4 Comparison ………………………………………………………………….63
Chapter 6 Conclusion and Future Work ……………………………………………….65
References …………………………………………………………………….….…….67
LIST OF TABLES
Table 3. Damage Detection Contents with Cascade Classifier – Selected Sections. .................... 57
LIST OF FIGURES
Figure 1. Object detection example obtained using a screenshot picture from the Runescape game……...6
Figure 2. Representation of Haar Features for the Cascade Classifier model …….....…………………… 7
Figure 3. Example of pixel values results for each specific pixel.. .............................................................. 9
Figure 4. Full example of the usage of a Haar Cascade Classifier model. ................................................ 11
Figure 5. Example 1 for positive damage image using the CC model. .......................................................13
Figure 6. Example 2 for positive damage image using the CC model. .......................................................13
Figure 7. Example 3 for positive damage image using the CC model. .......................................................14
Figure 8. Example 4 for positive damage image using the CC model. .......................................................14
Figure 9. Example 5 for positive damage image using the CC model. .......................................................15
Figure 10. Example 1 for negative damage image using the CC model. ....................................................15
Figure 11. Example 2 for negative damage image using the CC model. ....................................................16
Figure 12. Example 3 for negative damage image using the CC model. ....................................................16
Figure 13. Example 4 for negative damage image using the CC model. ....................................................17
Figure 14. Example 5 for negative damage image using the CC model. ....................................................17
Figure 23. Convolutional Neural Network process example using kernels. .............................................. 25
Figure 24. Example of Convolved Feature. ............................................................................................... 26
Figure 42. Simple block diagram demonstrating the VQA process. ......................................................... 42
Figure 50. Example 4 of VQA result image............................................................................................... 52
Figure 53. Precision-Recall for Argo Hall using the Cascade Classifier. .................................................. 58
Figure 54. Precision-Recall for Martin Hall using the Cascade Classifier. ............................................... 58
Figure 55. Precision-Recall for Martin Hall using the CNN. .................................................................... 61
Figure 56. Precision-Recall for Argo Hall using the CNN. ....................................................................... 61
LIST OF ABBREVIATIONS
BoW Bag-of-Words
CC Cascade Classifier
ID Identification Number
ABSTRACT
Visual Question Answering (VQA) Analyses for Post-Disaster Damage Detection and Identification Using Aerial Footage
Rafael de Sa Lowande
Natural disasters are a major source of significant damage and costly repairs around the world. After a natural disaster occurs, there is usually an insurmountable amount of damage, along with substantial costs for repairing and aiding all the people involved. Moreover, the occurrence of natural phenomena has increased significantly in the past decade.
Traditionally, damage assessment after a disaster is carried out by human operators. Taking into consideration all the areas one has to closely look into, as well as the difficult terrain and places with hard access, it becomes easy to understand how incredibly difficult it is for a surveyor to identify and annotate every single instance of damage out there. Because of that, it has become essential to find new, creative solutions for damage detection and identification.
This thesis focuses on assessing the feasibility of using different types of computer vision techniques, with the help of a UAV, to conduct post-disaster damage detection and identification, comparing the results obtained from each model.
CHAPTER 1
Introduction
It has been clear in the past few decades how natural disasters are a major source of
significant damage and costly repairs around the world. In 2020 alone, more than $43 billion
dollars of damage resulted from the Atlantic hurricane season in North America [1]. In
perspective, according to The Wall Street Journal, 31% of all hurricane damage from 1980 to 2018 occurred in 2017, with a total of $268 billion in damages [2]. This shows how the occurrence of natural disasters is constantly increasing, especially in the last decade. Taking
this information into consideration, the need for means to quickly identify and respond to these disasters becomes clear. Traditionally, damage assessment is carried out by human surveyors. Taking into consideration all the areas one has to closely look into, as well as the difficult terrain and places with hard access, it becomes easy to understand how incredibly difficult it is for a surveyor to identify and annotate every single instance of damage out there. It is also reasonable to assume that the surveyor will miss some information in the field that could be essential. With that in mind, it is possible to conclude that this method of damage identification has become obsolete, being both slow and inefficient.
Taking all of this into consideration, this thesis proposes to utilize Unmanned Aerial
Vehicles with an attached standard color camera to capture footage of post-storm condition of
structures using computer vision techniques. In this study, three main methodologies are
customized to specifically aid in the efforts of damage recovery and identification. These
methods are the Cascade Classifier, the Convolutional Neural Network, and the Visual Question
Answering. The goal is to showcase all the possible forms of using the advanced computer vision
technology in order to provide assistance to workers and first responders when it comes to
extreme scenarios. Minutes can be the difference between escalating and resolving a crisis, and using this technology would significantly accelerate the process and make it more efficient.
This study will analyze and provide the positive and negative aspects of each
methodology by comparing them with each other. It will also provide a detailed analysis of each
model and will provide reasoning for using one model instead of the other when trying to perform a specific task.
1.2 Literature Survey
1.2.1 Unmanned Vehicles
Unmanned Aerial Vehicles (UAVs) have been used frequently in the aid for damage detection and identification in the case of natural disasters, such as hurricanes and earthquakes, among others. As UAV technology rapidly becomes popular and
widely accessible, their application spectrum has broadened, ranging from aerial refueling [4]–[12] to 3D reconstruction [13]. Besides the application field, the employed UAV platform type for these applications also varies in previous studies, e.g., quadrotor [3], [13]–[20], fixed wing [21]–[23], and even airship [24]. In this study, we focus on the use of a quadrotor UAV for the application of post-disaster damage detection and localization, as quadrotors possess higher mobility, which allows capturing different points of view of the scene of interest.
Image processing, computer vision, and pattern recognition research studies have been getting a lot of attention in recent years due to advancements in the algorithms and equipment used in their applications, which vary from object detection [25] and object tracking [26] to 3D modeling [27]. (Material regarding the Convolutional Neural Network and Cascade Classifier methodologies was also presented at the AIAA SCITECH 2022 Conference, held in San Diego, January 3-7, 2022.)
1.2.2 Cascade Classifier
The Cascade Classifier, the first method we analyzed for object detection, has
shown significant success for object detection tasks in the past. Viola and Jones [28] present a
comparison of several different cascade of classifier detectors with high detection rates for face
detection, with low false positive rates, which is typically a significant issue with cascade of
classifier models. A similar study performed by Lienhart et al. [29] also produced marked
success for different classifier boosting strategies applied to cascade of classifier type models
trained for face detection. Wang et al. [30] expands the usage of cascade classifiers to the general
case with the PASCAL VOC datasets (20 object classes) and the ImageNet dataset (200 object
classes).
1.2.3 Convolutional Neural Network
The second method analyzed for object detection in this thesis is the Convolutional Neural Network (CNN). CNNs have become widely popular for carrying out object detection
tasks in recent years. A notable study conducted by Zhu et al. [31] for roof damage detection,
which also is the source of our training data, uses their own CNN model to perform the detection
with excellent accuracy. A more general building damage study by Nex et al. [32] utilizes CNNs
with a morphological filter method for damage candidate region proposals. Pi et. al [33] compare
a number of CNN architectures for post-hurricane damage detection with high mean average
precision. Some recent methods have combined the cascading strategy with CNNs, as presented
by Cai and Vasconcelos [34]. In their paper, the researchers use cascading region proposals with sequentially higher intersection-over-union thresholds to filter out false positive samples.
1.2.4 Visual Question Answering
Looking into the Visual Question Answering (VQA) methodology now, there has
been considerable research conducted on this new model recently. Studies [35] – [42] made
significant efforts in order to develop and study the algorithm. These methodologies propose
different approaches for the union of semantic image and question features. However, in the
literature there are not many papers that address the usage of VQA paired with UAVs in order to
identify damages caused by natural disasters. At the time this thesis is being written, there is only one work addressing this issue [43]. In their study, the researchers propose a simple baseline and a Multimodal Factorized Bilinear baseline model, paired with their own dataset, in order to conduct their experiment and obtain results. In this thesis, a VQA model paired with a new custom dataset is introduced.
1.3 Organization of the Thesis
This study aims to investigate the accuracy and precision when implementing the VQA
for post-disaster damage detection and identification using UAVs. A few different
methodologies are studied throughout this thesis in order to have a good comparing base when
testing the actual feasibility of implementing the VQA model for damage detection. The first
methodology, cascade classifier, used for damage detection, is presented and analyzed in Chapter
2. Background theory, testing, and analysis of this methodology are presented in this chapter. The second methodology, the Convolutional Neural Network, is presented and analyzed in Chapter 3. Similar to Chapter 2, in this chapter a background theory, testing, and analysis of this
methodology is observed. In Chapter 4, the visual question answering is introduced and studied.
A VQA model is introduced and studied, and the feasibility of its use in post-disaster damage
scenarios is observed. Chapter 5 demonstrates the analyses and comparison between all the
methodologies of this research thesis, their pros and cons, and explains why they should be used
for damage detection. In Chapter 6, the conclusions of this study are presented and the planned future work is discussed.
1.4 Original Contribution
The original contribution of this thesis includes a unique dataset based on aerial footage that was gathered after Hurricane Sally. Besides the custom dataset (Sally-
UWF), custom annotations and images are created for roof damage, and three different classes
demonstrating low, medium and heavy damage, as well as the instance and frequency of the
presence of roof damages, are introduced. Lastly, the performance of three object detection
methods are compared. The first method is a Cascade Classifier. The strategy behind cascading
classifiers is to pass the output of one classifier to a following classifier as additional input. This
process is repeated as many times as necessary to improve detection results. The second method
is a Convolutional Neural Network (CNN) model. CNNs learn object feature maps produced by
the process of convolution. In doing so, these models can localize the learned features to provide
the object prediction. The third method is the Visual Question Answering (VQA). Firstly, the
VQA dataset for post-disaster damage assessment based on UAV imagery is introduced. Then, analyses of the feasibility of using the VQA model for damage detection are conducted.
CHAPTER 2
Cascade Classifier
2.1 Overview
Object detection is one of the main applications of Computer Vision. We often see it in
display in day-to-day applications (e.g., self-driving cars, cellphone cameras, videogames). There are many different software packages and methodologies nowadays that are used for object detection. The first methodology analyzed in this thesis is the Cascade Classifier. The cascade
classifier is a machine learning methodology that uses positive and negative images to train its
model and obtain the final detection. Positive images are considered to be all the images
containing the object the model is searching for, while negative images are considered to be all
the images that do not contain the object that the model is searching for. Figure 1 demonstrates
an example of object detection and identification using the Cascade Classifier methodology. In
this example, a screenshot from the Runescape game [44] was taken, and the image was passed
through the CC model proposed by this thesis in order to provide an example of its functionality.
Figure 1. Object detection example obtained using a screenshot picture from the
Runescape game [44].
2.2 Cascade Classifier Model Using Haar Features
2.2.1 Background Theory
Analyzing this approach, the algorithm for the Cascade Classifier takes a large set of positive
images and negative images and separates them into two files. It then extracts features from each
file; for this paper, the Haar features are used in order to separate and extract features from
images. The Haar features work similarly to the convolutional kernel that we are going to
observe later. A picture representing the Haar features can be observed in Figure 2. Haar features
are a sequence of rescaled square shape functions first introduced in the literature by Alfred Haar
in 1909 [28]. Each feature is a single value obtained by subtracting the sum of pixels under the white rectangle from the sum of pixels under the black rectangle.
2.2.2 Methodology
Looking more into what the Haar features are, as it can be observed in Figure 2, the first
and second squares represent edge feature detection while the third square represents line
features. Also, each white feature is assigned a pixel value “0” and each black feature is assigned
a pixel value “1”. With that in mind, Viola-Jones [28] introduced an algorithm in which it is
possible to identify the Haar feature in an image. According to them, the closer their algorithm
comes to a “1” result, the more likely it has identified the feature it was searching for. In Figure
3, the pixel results obtained from an image can be observed. The closer the pixel is to the white
color, the closer its value will be to 0, while the closer the pixel is to a darker color, the closer its
value will be to a 1.
Δ = dark − white = (1/n) ∑_dark I(x) − (1/n) ∑_white I(x)    (1)
Figure 3. Example of pixel values results for each specific pixel.
The next step in this methodology is to define a threshold. In a perfect scenario, when
identifying an image, the algorithm would return a “1” result. However, since it is very unlikely
that the model will find the final result with 100% certainty every time the object is in the image,
a threshold is determined. For example, setting the threshold to 0.6 (60%) would mean that every
time the equation proposed by Viola-Jones returns a value equal or greater than 0.6, the model
would consider that the feature has been identified. On the other hand, if it returns a value below
0.6, that means the model has not identified the feature it has been looking for. One has to be careful when setting the threshold. Setting the threshold too low could lead to many false positives after the training of the model. On the other hand, setting it too high could lead to false negatives. Due to this, it is important to keep these issues in mind when choosing the threshold value.
After calculating all the features, the algorithm starts a process that applies the features to
all the collected images, with the goal of predicting which are the positive and which are the
negative images. The algorithm does so by sliding each feature across the image window. When a value above the threshold is obtained at some location, it determines that the object the model is looking for may be present at that place in the window. The algorithm then classifies each image as positive or negative, with the positive ones being those that contain the “object” and the negative ones being those that do not, as was seen before.
The more data used, the better the prediction will be, and therefore, the final results will be more
accurate.
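As a concrete illustration of Equation (1), the feature value can be computed with an integral image so that each rectangle sum costs only four lookups. This is a minimal sketch in Python, not the implementation used in this thesis; the 4x4 window, the pixel values, and the 0.6 threshold are illustrative assumptions.

```python
# Minimal sketch of a Haar-like edge feature using an integral image.
# Pixel values are normalized to [0, 1]; darker pixels are closer to 1,
# matching the convention described above.

def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0.0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle, via four lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total

def edge_feature(ii, top, left, h, w):
    """Two-rectangle edge feature: mean(dark half) - mean(white half)."""
    half = w // 2
    n = h * half
    white = rect_sum(ii, top, left, top + h - 1, left + half - 1)
    dark = rect_sum(ii, top, left + half, top + h - 1, left + w - 1)
    return dark / n - white / n   # Equation (1)

# Synthetic 4x4 window: bright left half (0.1), dark right half (0.9).
img = [[0.1, 0.1, 0.9, 0.9] for _ in range(4)]
ii = integral_image(img)
delta = edge_feature(ii, 0, 0, 4, 4)
print(round(delta, 3))   # 0.8 -> above a 0.6 threshold: feature found
```

Because each rectangle sum takes constant time regardless of its size, the same feature can be evaluated at every window position cheaply, which is what makes sliding Haar features practical.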
Figure 4. Full example of the usage of a Haar Cascade Classifier model [56].
2.3 Training
In our study, it is expected that the majority of areas in an image will not have visible
damages. Therefore, the cascade classifier is introduced to speed up the process of detection, as well
as to identify the location of the detected damage inside a positive image. Thus, instead of
applying all the features collected to an image, the features are grouped into different stages of
the classifier and applied individually. If an image fails the first “layer” of features, that image is
discarded. If an image passes the first stage, it means that the image being analyzed potentially has the damage the algorithm is trying to find. Therefore, the second
stage is applied to it, followed by the third stage, and so on, until the algorithm can detect, with
certainty, if there is damage on that image or not. To do the training of our model, it is necessary
to first separate the training data into positive and negative images.
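The staged rejection described above can be sketched as follows. This is an illustrative toy, not the trained classifier: the two "features" and the stage thresholds are made-up placeholder values.

```python
# Sketch of cascaded evaluation: a window must pass every stage;
# failing any stage rejects it immediately, which is what makes the
# cascade fast on the many windows that contain no damage.

def run_cascade(feature_fns, stages, window):
    """
    stages: list of (feature_indices, stage_threshold) tuples.
    feature_fns: list of per-feature scoring functions.
    Returns True only if the window passes all stages.
    """
    for feature_indices, stage_threshold in stages:
        stage_score = sum(feature_fns[i](window) for i in feature_indices)
        if stage_score < stage_threshold:
            return False      # early rejection: later stages never run
    return True

# Toy example: two "features" that just read fixed values of the window.
features = [lambda w: w[0], lambda w: w[1]]
stages = [([0], 0.5),         # stage 1: cheap single-feature check
          ([0, 1], 1.2)]      # stage 2: stricter combined check

print(run_cascade(features, stages, (0.9, 0.8)))  # True  (passes both)
print(run_cascade(features, stages, (0.2, 0.9)))  # False (rejected at stage 1)
```

The early `return False` is the key design point: most image regions are damage-free, so they are discarded after evaluating only the cheap first stage.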
2.3.1 Positive Images
The positive images, as discussed before, are all the images that contain any type
of damage caused by a natural disaster. For this research, the training data was obtained from the
Hurricane Sally-UWF dataset developed for the purpose of this research. In the proposed
approach, the UAV would fly over an affected area post-disaster where an onboard camera
would capture footage of any potential damages. Roof damage was the specific focus in this
study, but the ability to expand to other damage types is also possible. Footage of the UWF campus was recorded following the hurricane using a UAV with an attached camera. The
videos were captured at 3840x2160p with 24 frames per second. All frames have been
downsized to 1920x1080p for testing in this study. In total, 24 videos ranging from 15 seconds to
5 minutes were captured. From this collection, two videos with high damage instance counts
were selected and used for testing. Some examples of positive images obtained from the Sally-UWF dataset can be observed in Figures 5 through 9.
Figure 5. Example 1 for positive damage image using the CC model.
Figure 7. Example 3 for positive damage image using the CC model.
Figure 9. Example 5 for positive damage image using the CC model.
2.3.2 Negative Images
Contrary to the positive images, negative images are all the images that do not contain any type of damage. The same Sally-UWF dataset was used to separate and identify the negative images. Some examples of negative images obtained from the Sally-UWF dataset can be observed in Figures 10 through 14.
Figure 10. Example 1 for negative damage image using the CC model.
Figure 11. Example 2 for negative damage image using the CC model.
Figure 12. Example 3 for negative damage image using the CC model.
Figure 13. Example 4 for negative damage image using the CC model.
Figure 14. Example 5 for negative damage image using the CC model.
2.3.3 Computation Stages
In our study, 1000 negative images and 1100 positive images from the Hurricane Sally
Dataset, obtained using footage from the DJI quadcopter at UWF, were used as the training data for this method. After that, all the images acquired are analyzed by the algorithm, which classifies each image as positive or negative based on the Haar features technique observed before. The greater the number of stages used for training, the more precise the results will be.
The data was trained in a total of twenty stages, which was the maximum number of
training stages possible without leading the algorithm to over-training. In the case of object
detection, the model prediction consists of two parts: the bounding boxes and the corresponding
class label. The bounding box is the area of interest for our model, with a range from X1 to X2
and Y1 to Y2 in pixel coordinates. Ideally this box perfectly surrounds the damage to be
detected. Any given image can have anywhere from zero prediction boxes to as many as are detected.
Along with each box comes the class label, which is the description of what kind of damage the
bounding box represents, as well as the confidence rating. Another technique, called grouping rectangles, was also used in this part to facilitate the identification of damage. By grouping all the bounding boxes that are relatively near each other, it becomes easier and faster to locate each damage instance.
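The grouping idea above is what OpenCV's `cv2.groupRectangles` provides; the following is a simplified stand-alone sketch of the same effect, not the function actually used here. Merging boxes by their enclosing rectangle whenever they overlap is an illustrative criterion.

```python
# Simplified sketch of grouping nearby bounding boxes into one detection.
# Boxes are (x1, y1, x2, y2). Overlapping boxes are merged into their
# enclosing rectangle, mimicking the effect of grouping rectangles.

def overlaps(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def merge(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def group_boxes(boxes):
    merged = list(boxes)
    changed = True
    while changed:                      # repeat until no pair overlaps
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    merged[i] = merge(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged

# Two overlapping detections on the same damage plus one far away:
boxes = [(10, 10, 50, 50), (40, 12, 80, 55), (200, 200, 240, 240)]
print(group_boxes(boxes))   # [(10, 10, 80, 55), (200, 200, 240, 240)]
```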
2.4 Results
For this study, the main focus when analyzing the results is on the detection accuracy
aspect of this methodology. In the aerial footage, as discussed before, two different videos
including footage of two different buildings at UWF, the Martin Hall and Argo Hall, were
studied. The reason for choosing these specific videos is that both of the buildings filmed have high volumes of damage instances and both roofs have similar types of damage. The model
prediction was performed and individual frames for analysis were saved. For the analysis, the
true positive, false positive and false negative rates for model predictions on sampled video
frames from two videos were recorded for this methodology. The frames were sampled using
evenly spaced time intervals. A prediction is deemed a true positive if the bounded area contains
at least 50% of the ground truth area for that damage; in other words, the overlap threshold set for this model was 0.5. If the bounded area covers less than 50% of the
ground truth area or there is no relevant damage at the bounded location, the prediction becomes
a false positive. When no prediction is made for a relevant damage instance, this is a false negative. Figures 15 through 22 demonstrate the results obtained for this methodology. The analysis of this model, as well as the comparison of its results with those from the other models, is presented in Chapter 5.
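The scoring rule described above can be sketched as a small routine. The boxes below are made-up examples, and the 50% coverage is computed as intersection area over ground-truth area, which is one reasonable reading of the rule, not necessarily the exact implementation used in this study.

```python
# Sketch of the evaluation rule: a prediction is a true positive if it
# covers at least 50% of a ground-truth damage area; otherwise it is a
# false positive. Unmatched ground truths are false negatives.

def area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def intersection(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return area((x1, y1, x2, y2)) if x1 < x2 and y1 < y2 else 0

def score_frame(predictions, ground_truths, threshold=0.5):
    tp, fp = 0, 0
    matched = set()
    for p in predictions:
        hit = False
        for i, g in enumerate(ground_truths):
            if i not in matched and intersection(p, g) / area(g) >= threshold:
                matched.add(i)
                hit = True
                break
        tp, fp = (tp + 1, fp) if hit else (tp, fp + 1)
    fn = len(ground_truths) - len(matched)
    return tp, fp, fn

gts = [(0, 0, 10, 10), (50, 50, 60, 60)]
preds = [(0, 0, 10, 6),        # covers 60% of the first ground truth -> TP
         (80, 80, 90, 90)]     # covers no ground truth               -> FP
tp, fp, fn = score_frame(preds, gts)
print(tp, fp, fn)              # 1 1 1
precision = tp / (tp + fp)     # 0.5
recall = tp / (tp + fn)        # 0.5
```

Precision and recall computed this way per sampled frame are what the precision-recall figures in Chapter 5 summarize.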
Figure 16. Example 2 of results obtained using the CC model.
Figure 18. Example 4 of results obtained using the CC model.
Figure 20. Example 6 of results obtained using the CC model.
Figure 22. Example 8 of results obtained using the CC model.
CHAPTER 3
Convolutional Neural Network
3.1 Overview
The next methodology analyzed in this paper is the Convolutional Neural Network
(CNN). The CNN is a type of Deep Learning algorithm that takes an input image, assigns
importance to each aspect inside that image, and can differentiate each aspect from one another.
The architecture behind the CNN is analogous to that of the connectivity pattern of Neurons in
the human brain, and it was inspired by the Visual Cortex. The goal of a CNN model is to reduce
the images into a form that is easier to process, without losing features that are critical for getting a good prediction [45]. The basic process of using a neural network starts with
building a model, training the network, and finally, testing the network on validation or on real-
world data. The training process takes in some labelled training data and gives a prediction for
that data. When the model prediction does not match the expected output, internal model weights
are adjusted through the process of backpropagation, which is a backwards traversal through the
network, updating each layer of weights along the way with the goal of changing values to
provide a better prediction in the next iteration. When enough high quality, representative
training data is used, the network is likely to provide high quality outputs which match what is
desired.
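The train-predict-adjust loop described above can be illustrated in miniature with a single weight updated by gradient descent. This toy is an assumption-laden sketch for intuition only; the thesis's network has many layers and weights, not one.

```python
# Toy illustration of the training loop described above: a single weight
# is adjusted via gradient descent so the prediction approaches the label.
# This mirrors, in miniature, what backpropagation does at every layer.

def train_weight(inputs, labels, lr=0.1, epochs=200):
    w = 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            pred = w * x                  # forward pass
            grad = 2 * (pred - y) * x     # d(squared error)/dw
            w -= lr * grad                # weight update
    return w

# Data generated by the rule y = 3x; the learned weight should approach 3.
w = train_weight([1.0, 2.0, 3.0], [3.0, 6.0, 9.0])
print(round(w, 3))   # ~3.0
```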
Convolutional neural networks use all of these principles and apply them to image
frames. The process of convolution (Fig. 23 and Fig. 24) involves using kernels, which are small
square matrices with specific values. These kernels are multiplied by every area of an image’s
pixel value sequentially to create an output matrix called a “feature map.” Typically, many
different kernels are used to create many feature maps for each image frame. These feature maps
undergo additional processing, such as padding and pooling, which reduces the size of the representation while losing as little accuracy as possible, and further iterations of convolution strengthen the feature representation. The main goal of the convolution operation is to extract high-level features, such as edges, from the input image. CNNs do not need to be limited to only one convolutional layer. Usually, the first layer is responsible for capturing low-level features such as colors, gradient orientations, and edges. By adding more layers, the architecture adapts to high-level features as well, providing a network with a holistic understanding of the images in the dataset.
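The kernel-sliding process can be illustrated with a minimal valid-mode 2D convolution (strictly, a cross-correlation, as most CNN libraries implement it). The vertical-edge kernel below is an example choice for illustration, not one of the model's learned kernels.

```python
# Minimal sketch of producing a feature map by sliding a kernel over an
# image (valid mode, stride 1). Real CNNs learn the kernel values during
# training; the vertical-edge kernel below is just an illustrative choice.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            # Multiply the kernel by the image patch under it, then sum.
            row.append(sum(kernel[i][j] * image[y + i][x + j]
                           for i in range(kh) for j in range(kw)))
        feature_map.append(row)
    return feature_map

# 3x3 image with a bright right column; 2x2 vertical-edge kernel.
image = [[0, 0, 1],
         [0, 0, 1],
         [0, 0, 1]]
kernel = [[-1, 1],
          [-1, 1]]     # responds where intensity increases left-to-right
print(conv2d(image, kernel))   # [[0, 2], [0, 2]]
```

The strong responses in the right column of the feature map show the kernel "detecting" the vertical edge, which is exactly the localized feature evidence later layers build on.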
Figure 24. Example of Convolved Feature.
When the convolutional process is complete, the resulting feature maps are flattened and
input as vectors into what is essentially a standard neural network for classification purposes.
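The pooling-then-flattening step mentioned above can be sketched the same way. This is illustrative only; a 2x2 max-pool, a common configuration, is assumed:

```python
import numpy as np

def max_pool2x2(fmap):
    """Downsample a feature map by keeping the maximum of each 2x2 block,
    shrinking the representation while preserving the strongest responses."""
    h2, w2 = fmap.shape[0] // 2, fmap.shape[1] // 2
    return fmap[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2x2(fmap)   # 2x2 summary of the 4x4 map
flat = pooled.ravel()        # flattened vector fed to the dense classifier
```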
Figure 25. First example of Convolutional Neural Network model [46].
3.2 Faster R-CNN model
3.2.1 Overview
The region-based convolutional neural network (R-CNN) was able to detect eighty different types of objects in
images. Compared to the general CNN model observed before, the main contribution of the R-CNN
is extracting features using a CNN. However, this method had several drawbacks: each stage is
trained as an independent component, so the model cannot be trained end-to-end, and it caches the extracted features from the
pre-trained CNN on the disk to later train the SVMs. This process requires hundreds of gigabytes
of storage [46].
To fix these issues, the Fast R-CNN method was proposed. Developed by Ross Girshick,
this methodology solves several issues observed before. For example, compared to R-CNN,
which has multiple stages (region proposal generation, feature extraction, and classification
using SVM), Fast R-CNN builds a network with only a single stage. Fast R-CNN also
shares computations, such as convolutional layer calculations, across all proposals rather than
performing the calculations for each proposal independently. This is done by using the new ROI
Pooling layer, which makes Fast R-CNN faster than the original R-CNN [46].
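The ROI Pooling idea, mapping an arbitrarily sized region of a feature map to a fixed-size output, can be sketched as follows. This is a simplified max-pooling illustration, not the exact layer implementation:

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Crude ROI max-pooling sketch: crop the region from the feature map,
    split it into an out_size x out_size grid, and keep the max of each cell,
    so any region size maps to the same fixed-length output."""
    x1, y1, x2, y2 = roi                      # region in feature-map coords
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r0, r1 = i * h // out_size, (i + 1) * h // out_size
            c0, c1 = j * w // out_size, (j + 1) * w // out_size
            out[i, j] = region[r0:r1, c0:c1].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)
pooled = roi_max_pool(fmap, roi=(0, 0, 4, 4))  # any ROI size -> fixed 2x2
```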
Lastly, the Faster R-CNN, which is the model used in this study, was introduced as an
extension of the Fast R-CNN model. As the name implies, this model is faster than both the
previous models. This model is faster because of the region proposal network (RPN) introduced
by this methodology, which is a fully convolutional network that creates proposals with various
scales and aspect ratios. It introduces attention to the neural network, which means that it tells
the network where to look.
3.2.2 Methodology
The Faster R-CNN methodology follows a set pattern. It first generates the region
proposals using the RPN. It then extracts a fixed-length feature vector from each region using
the ROI pooling layer. After that, all the extracted feature vectors are classified using the Fast R-
CNN approach. Lastly, the class scores of the detected objects and the bounding boxes around
each object are presented. An example of a Faster R-CNN model can be observed in Figure 27.
In this study, the TensorFlow Object Detection API for Python is utilized. This API
features a suite of convolutional neural network models designed for object detection. All models
in the TensorFlow model zoo come pre-trained on the COCO 2017 dataset as a starting point but
can be re-trained on appropriate data to fit any object detection
task. As discussed, the Faster R-CNN Inception ResNet V2 640x640p model is used and re-
trained on the ISBDA dataset. More discussion about the dataset and training data appears in the
following section. This model is one of the more precise in the model zoo, scoring a 37.7%
mean average precision. This accuracy comes at the cost of a fairly high processing time of
206 milliseconds per frame. If this system required immediate analysis of UAV footage, this
could lead to a delay, but in a post-processing scenario it is not critical. Additional Python
libraries used for this study include OpenCV for video reading and frame extraction.
In the case of object detection, the output is similar to the one used in the previous
methodology (CC). The model prediction consists of three parts: the bounding boxes, the
corresponding class labels, and the corresponding confidence ratings. The bounding box is the
area of interest for the model, a range from X1 to X2 and Y1 to Y2 in pixel coordinates. Ideally,
this box perfectly surrounds the object to be detected. Any given image can have anywhere from
zero prediction boxes to as many as necessary. Along with each box comes the class label, which
is the description of what object the bounding box represents, as well as the confidence ratings.
Each bounding box gets a percentage rating for every possible object class. These are tied
together because the chosen class label is always the class with the highest confidence rating for
that bounding box.
3.3 Training
For this model, the training data comes from the Instance Segmentation in
Building Damage Assessment (ISBDA) dataset [48]. This dataset consists of 1030 total images,
908 of which were selected for use in this study. Segmentation annotations, damage bounding
box annotations, and house bounding box annotations are provided with the dataset, but these
annotations were not used in this study. Instead, these images were re-annotated to bound each
damage instance. Instances were labelled into three distinct classes: “Light,” “Medium,” and
“Heavy,” corresponding to the level of damage present. “Light” damage may refer to single or
patchy shingle damage, small debris, water damage, major discoloration, or slight bending of
metal roofs. “Medium” damage may refer to exposed wooden portions or open areas of the
structure without collapse, major debris, or significant bending of metal roofs. “Heavy” damage
may refer to structural collapse of the roof, complete removal of the roof, or destruction of the
building as a whole. The trained model was then evaluated on the Hurricane Sally-UWF dataset,
which was developed for the purpose of this research. The Sally-UWF dataset uses the same
image frames discussed in the previous methodology. Examples of images from the ISBDA
dataset can be seen in Figures 28 - 32.
Figure 28. Example 1 of dataset image (sample obtained from the ISBDA open-source dataset [48]).
Figure 29. Example 2 of dataset image (sample obtained from the ISBDA open-source dataset [48]).
Figure 30. Example 3 of dataset image (sample obtained from the ISBDA open-source dataset [48]).
Figure 31. Example 4 of dataset image (sample obtained from the ISBDA open-source dataset [48]).
Figure 32. Example 5 of dataset image (sample obtained from the ISBDA open-source dataset [48]).
3.4 Results
As with the previous methodology, the main focus for this model is on the detection
accuracy aspect of the proposed pipeline. Two videos were identified containing footage of two
buildings on the UWF campus with high volumes of damage instances, Argo Hall and Martin
Hall. Both roofs have similar types of damage. As before, the true positive, false positive, and
false negative rates for model predictions on video frames sampled from the two videos were
recorded for this methodology. The frames were sampled at evenly spaced time intervals. A
prediction is deemed a true positive if the bounded area contains at least 50% of the ground truth
area for that damage. If the bounded area covers less than 50% of the ground truth area, or there
is no relevant damage at the bounded location, the prediction becomes a false positive. When a
prediction is not made for a relevant damage instance, it is a false negative. Figures 33 - 40
demonstrate the results for this methodology. The analysis for this model, as well as the
comparison of its results with those from the other models, is discussed in Chapter 5.
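The 50%-coverage rule can be sketched in Python, together with the precision and recall bookkeeping applied to the recorded counts in Chapter 5. This is an illustrative sketch: the box coordinates and TP/FP/FN counts below are hypothetical, not values from the study:

```python
def coverage(pred, truth):
    """Fraction of the ground-truth damage area covered by the predicted
    bounding box; boxes are (x1, y1, x2, y2) in pixel coordinates."""
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    truth_area = (truth[2] - truth[0]) * (truth[3] - truth[1])
    return inter / truth_area if truth_area else 0.0

def classify(pred, truth):
    """True positive if the prediction covers at least 50% of the ground
    truth area, otherwise a false positive."""
    return "TP" if coverage(pred, truth) >= 0.5 else "FP"

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

label = classify(pred=(0, 0, 50, 100), truth=(0, 0, 100, 100))  # covers exactly 50%
p, r = precision_recall(tp=41, fp=13, fn=46)  # hypothetical frame counts
```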
Figure 34. Example 2 of results using the CNN methodology.
Figure 36. Example 4 of results using the CNN methodology.
Figure 38. Example 6 of results using the CNN methodology.
Figure 40. Example 8 of results using the CNN methodology.
CHAPTER 4
Visual Question Answering
4.1 Overview
This chapter focuses on the last methodology of this study, Visual Question Answering
(VQA). Visual Question Answering is a relatively new concept in today’s computer vision
literature. Introduced in the past decade, VQA is an area in which an AI is trained to answer
questions posed by the user. Unlike previous concepts, this system must demonstrate a more
profound knowledge of images. After being questioned by the user in real time, it has to run an
analysis and answer completely different questions based on an image. With that, the user does
not have to predetermine what the program will be looking for in an image. Using this
algorithm, the user can simply ask questions about an image, and the program will run an
analysis based on the user’s question and build an answer upon it. This capability can contribute to the
advancement of damage identification and assessment in the case of natural disasters, such as
hurricanes. Since aiding damaged areas is an activity that is heavily dependent on real-time
evaluation and estimation, the introduction to a VQA model can prove to be essential when
dealing with high-risk situations. The VQA model is considered a complicated, multimodal
research problem in which the aim is to address an image-specified question [49] - [51]. Visual
Question Answering can be considered a type of comprehensible activity that differentiates itself
from other types of activities, like the identification of images. Since the VQA model needs to
have high understanding of the attributes of an image, and it also has to be able to find all the
relevant objects based on natural language questions, it can prove to be an important aspect in
the support for damage detection and identification after natural disasters. An example of the
VQA process can be seen in Figure 41.
When thinking about what is essential after a hurricane passes, some of the
first questions that come to mind are “Is there anyone in the area?” or “How many houses were
destroyed?”, among others. Being able to answer these questions in real time is one of the many
benefits provided by this methodology. Since the success of this model is heavily dependent on
the data collection, it can be paired with a UAV, which reduces the risk of unnecessary injuries
by allowing first responders to avoid surveying the affected area themselves. In this study, the
data was obtained using footage of the UWF campus, which was recorded following Hurricane
Sally using a UAV with an attached camera. The same videos used for the previous methodology are also
used in this one. Overall, two videos with high damage instance counts were selected and used
for testing. For each video, 1000 frames were selected to be part of the dataset. In total, 2000
images, 4000 training questions, and 1000 training annotations were used to compose this
dataset.
Figure 41. VQA Example.
4.2 Annotations
For the annotations, this study based its approach on the VQA API, introduced on
the VisualQA website [52]. For this approach, a few requirements need to be met in order
to have a working model. First, there is the data collection step, in which images need to be
collected to train the model. All images were collected following Hurricane Sally using a UAV
with an attached camera. Next, there is the question section, in which questions need to be input
and paired with an image. Last comes the annotation section, in which an answer is introduced
and put together with both an image and a question. To facilitate this process, identification
numbers (IDs) are provided for all questions, answers, and images in the dataset. This process is
the standard approach for performing a visual question answering methodology. The image is
processed first; then the question is processed. After that, both features, the image and the
question, are combined, and probabilities are assigned to each possible answer. A simple block
diagram demonstrating the process for a VQA model can be seen in Figure 42. The JSON file
format that the annotations were based on, which is required to be filled in for this approach, can
be seen in Figures 43 and 44.
Figure 43. Annotations for JSON file 1 [52].
Figure 44. Annotations for JSON file 2 [52].
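Following the VQA-API convention of linking questions, answers, and images by shared IDs, a question record and its annotation record might look like the sketch below. The field names here are illustrative assumptions, not the exact schema of the files shown in the figures:

```python
import json

# Hypothetical records in the VQA-API style: a question entry and its
# annotation entry are linked to an image by shared identification numbers.
question = {"question_id": 1001, "image_id": 17,
            "question": "Is there any damage on this image?"}
annotation = {"question_id": 1001, "image_id": 17,
              "answers": [{"answer_id": 1, "answer": "yes"}]}

# Serialized the way the dataset JSON files would store them
record = json.dumps({"questions": [question], "annotations": [annotation]})
```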
4.3 CNN + Bag of Words (BoW)
4.3.1 Overview
There are many different ways to approach the Visual Question Answering model, as
seen before in this research. The first one focused on in this study uses the Convolutional
Neural Network methodology paired with a Bag of Words. To be able to answer open-ended
questions, it is necessary to combine both visual and language understanding. With that in mind,
the most common approach to this problem is to use two different methodologies: one for the
understanding and analysis of images (the visual aspect), and one for the analysis and
understanding of language (the question-and-answering aspect). In summary, the VQA model
needs to be able to observe and understand what is being displayed in an image in order to give
an appropriate answer to what is being asked.
Because of this, for this study, the approach that seems best when performing VQA on a
fixed dataset, like the Sally-UWF dataset, is the CNN + Bag of Words. Starting on the CNN
side, and briefly going over what has been discussed earlier in this thesis, this methodology is
mainly used for the analysis and classification of an image. In layman’s terms, its goal is to look
at an image of automobiles and classify which ones are cars and which ones are motorcycles. In
a more in-depth view, CNNs can be considered neural networks with a set of filters, also known
as convolutional layers. Each layer consists of a set of filters that take an input image and pass it
through the filters, producing an output image. This process is the same as the one discussed in
the CNN chapter, in which the convolution involves using kernels. As seen before, these kernels
are multiplied by every area of an image’s pixel values sequentially to create an output matrix
called a “feature map.” These feature maps undergo additional processing, such as padding and
pooling, which aims to reduce the size of the representation while losing as little accuracy as
possible, and further iterations of convolution to strengthen the feature representation. One
example of the flow of a CNN model program can be observed in Figure 45. In this image, the
input image is passed through many different stages until the final weights are produced. For this
example, the additional processing stages discussed before, like padding and pooling, are also
included.
Figure 45. Example of a flow chart for the convolutional neural network Python model.
4.3.3 Bag of Words (BoW) Model
Next in this methodology is the usage of a model that will be able to process the question.
For this scenario, the BoW is used. The bag of words model is a simple and commonly used way
of representing text data when doing any type of machine learning experiment. This model is a
representation of text that describes the occurrence of words within a document. It involves two
main things: a vocabulary of known words and a measure of the presence of those known words.
It has this name because the order or structure of the words inside the document is unimportant;
therefore, it can be considered simply a “bag” of words [53]. Since this methodology can be as
simple or as complex as intended for an application, it is well suited to open-ended question
scenarios.
Now, looking specifically into this scenario, since this study uses a relatively
small dataset (the Sally-UWF dataset) compared to others, the BoW can be considered a great
asset. One of the many limitations of the BoW is the length of its vocabulary. However, since it
only deals with a small, fixed answer set, where one of the answers will always be the correct
one, this model can prove to be very effective.
Firstly, looking a bit more into the bag of words, this model is a representation that turns
arbitrary text into fixed-length vectors by counting how many times each word appears. This
process is also known as vectorization [54]. For example, a few specific sentences from the
dataset were vectorized, such as “Is there any damage on this image?” or “How much damage
can be seen?”. Looking at these specific questions, a vocabulary
can be determined. To create the vocabulary, each distinct word is listed by itself: (is, there,
any, damage, on, this, image, how, much, can, be, seen). After that, each question is vectorized:
every time a word from the vocabulary appears in a sentence, a 1 is placed at that word’s position
in the vector; otherwise, the position holds a 0. For the question “How much damage can be
seen?”, this yields the vector shown in Table 1.
Table 1. Example of the vectorization of a Bag of Words application.
[0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1].
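The vocabulary construction and vectorization just described can be reproduced in a few lines of Python (a minimal sketch; the helper names are illustrative), yielding exactly the vector in Table 1:

```python
def build_vocab(sentences):
    """Collect each distinct word, in order of first appearance."""
    vocab = []
    for s in sentences:
        for w in s.lower().replace("?", "").split():
            if w not in vocab:
                vocab.append(w)
    return vocab

def vectorize(sentence, vocab):
    """Mark a 1 at the position of every vocabulary word present."""
    words = sentence.lower().replace("?", "").split()
    return [1 if w in words else 0 for w in vocab]

questions = ["Is there any damage on this image?",
             "How much damage can be seen?"]
vocab = build_vocab(questions)
# vocab: [is, there, any, damage, on, this, image, how, much, can, be, seen]
vec = vectorize("How much damage can be seen?", vocab)
# -> [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1], matching Table 1
```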
After creating a fixed-length vector for each question, the vectors are used as input to
a feedforward neural network. A feedforward neural network takes the vectorized inputs,
multiplies them by weights, adds a bias, and applies an activation function. Basically, it takes
vector inputs v1 and v2 and multiplies each by a weight:

v1 -> v1 * w1 (2)
v2 -> v2 * w2 (3)

Next, it takes the weighted inputs and adds them together with a bias b, applying an activation
function f:

y = f(v1 * w1 + v2 * w2 + b) (4)
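Equations (2) - (4) can be sketched in a few lines, together with the softmax normalization discussed later in this section. The weights, bias, and scores here are illustrative, and ReLU is assumed as the activation f:

```python
import math

def dense(inputs, weights, bias):
    """One feedforward step: y = f(v1*w1 + v2*w2 + ... + b), with f = ReLU."""
    z = sum(v * w for v, w in zip(inputs, weights)) + bias
    return max(0.0, z)

def softmax(scores):
    """Turn raw output scores into probabilities in (0, 1) that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two BoW inputs through one weighted unit, then softmax over answer scores
y = dense([1.0, 0.0], weights=[0.5, -0.3], bias=0.1)  # 1*0.5 + 0*(-0.3) + 0.1
probs = softmax([2.0, 1.0, 0.1])  # highest score -> highest probability
```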
For this study, since a relatively simple question dataset is being used, the BoW vectors
obtained from the model are used as the input to this feedforward neural network (FNN) and
passed through two fully connected layers, which means that every node in one layer is
connected to every node in the next. After that, the results from the CNN model and the BoW
model are combined and merged together. To obtain the conclusive final results for this model,
the SoftMax function is used. This function turns the output values into probabilities, allowing
the study to quantify how much certainty there is in each answer. The softmax function is a type
of probability distribution that provides outputs between 0 and 1, adding up to 1.
4.3.4 Results
As with the previous methodologies, the main focus for this model is on the
detection accuracy aspect of the proposed pipeline. The same image dataset used in the previous
chapters is also used in this chapter. In total, this model used 1000 images and 30000 questions,
split into training and testing sets. The questions have 7 possible answers: Yes, No, One, Two,
Three, Four, and More. Figures 47 - 52 demonstrate the results for this methodology. The
analysis for this model, as well as the comparison of its results with those from the other models,
is discussed in Chapter 5.
Figure 48. Example 2 of VQA result image.
Figure 49. Example 3 of VQA result image.
Figure 50. Example 4 of VQA result image.
Figure 51. Example 5 of VQA result image.
Figure 52. Example 6 of VQA result image.
CHAPTER 5
Analyses and Comparison
In this chapter, the analyses and comparison of all the methodologies seen in the
previous chapters are conducted.
5.1 Cascade Classifier
Starting with the Cascade Classifier, which was the first methodology implemented in this
thesis, this study can conclude that the proposed Cascade
Classifier algorithm successfully identifies roof damages at both Martin Hall and Argo Hall,
which are the two UWF buildings analyzed in this study. As discussed previously, in this study
the main focus was on the detection accuracy aspect of the proposed pipeline. For the analysis,
the true positive, false positive and false negative rates for model predictions on sampled video
frames from two videos were recorded for this methodology. The frames were sampled using
evenly spaced time intervals. As a reminder of what was introduced during Chapter 2, a
prediction is considered to be a true positive if the bounded area contains at least 50% of the
ground truth area for that damage. If the bounded area covers less than 50% of the ground truth
area, or there is no relevant damage at the bounded location, the prediction becomes a false
positive. When a prediction is not made for a relevant damage instance, it is a false negative.
Table 2 summarizes the average damage detection rate and the average number of false
positive detections. The observed rates were then used to calculate the precision and recall values
of the model for those frames. Precision is calculated as the number of true positives over the
number of true positives plus false positives. It is a measure of how valid the predictions made
by the model are; false positives reduce the precision of the model. Recall is calculated as the
number of true positives over the number of true positives plus false negatives. It is a measure of
the completeness of the model predictions, so false negatives reduce the recall. In a regular
analysis, a precision-recall curve is created with different parameters (thresholds) for a single
scene, and it provides a common basis for comparing different algorithms. With a changing
threshold, precision is expected to decrease while recall increases. In this study, however, the
best parameters are kept fixed, and precision and recall are plotted for consecutive frames of the
video footage. The precision-recall plots are depicted in Figure 53 and Figure 54, for the Argo
Hall video and the Martin Hall video, respectively. The expected result from the precision-recall
plots is that the values should be close to 1 in both precision and recall in most of the frames.
According to the results, the proposed CC-based detection algorithm has an average
detection rate of around 41% for both buildings. However, it has a small false detection average
(0.13), which demonstrates the accuracy of the algorithm. The reason the average detection rate
is around 41% is the motion of the UAV and the gimbal. Each frame of the footage was
examined individually in this study, which revealed that rapid motion of the UAV and/or the
gimbal results in frames with motion blur, which in turn degrades the performance of the
damage detection algorithm. When the UAV hovers over the roof while the gimbal moves at a
constant rate, or the UAV moves at a constant rate while the gimbal is fixed, the detection
performance increases. To verify this, parts of the video with minimal to no motion blur were
identified, and the average detection rates for those parts are 48% and 55% for Argo Hall and
Martin Hall, respectively (Table 3). More importantly, in the results for Martin Hall there are no
false positive detections, and there is only one false positive detection in the entire series of
frames for Argo Hall, for those selected sections. The main reasons for the results to be in this
range are the parameters used for training as well as the amount of data collected for this specific
application. By using more data, it is likely that these results would improve further.
Figure 53. Precision-Recall for Argo Hall using the Cascade Classifier.
Figure 54. Precision-Recall for Martin Hall using the Cascade Classifier.
5.2 Convolutional Neural Network
Next, the results are analyzed for the Convolutional Neural Network. Following the same
procedure as the Cascade Classifier, the main focus for the CNN
analysis was also on the detection accuracy aspect of the proposed pipeline. For the analysis, the
true positive, false positive and false negative rates for model predictions on sampled video
frames from two videos were recorded for this model as well. According to the results observed
in Table 4, the proposed CNN-based detection algorithm successfully detects roof damages. The
small false detection average also shows that the proposed algorithm is accurate. Additionally,
the precision and recall variation plots, for both buildings, demonstrate high level of
performance. The reason the average detection rate is around 50% is the motion of the UAV and
the gimbal. As with the previous methodology, each frame of the footage was examined
individually, and the study found that rapid UAV motion results in frames with motion blur,
leading to performance degradation of the damage detection algorithm. As observed before,
when the UAV hovers over the roof while the gimbal moves at a constant rate, or the UAV
moves at a constant rate while the gimbal is fixed, the detection performance increases. To verify
this, parts of the video with minimal to no motion blur were identified, and the average detection
rates for those parts are 85% and 88% for Argo Hall and Martin Hall, respectively (Table 5).
Also, in the results for Argo Hall there are no false positive detections, and there is only one false
positive detection in the entire series of frames for Martin Hall, for those selected sections. The
precision-recall plots are depicted in Figure 55 and Figure 56, for the Martin
Hall video and the Argo Hall video, respectively. The expected result from the precision-recall
plots is that the values should be close to 1 in both precision and recall in most of the frames.
Figure 55. Precision-Recall for Martin Hall using the CNN.
5.3 Visual Question Answering
Lastly, the results are analyzed for the Visual Question Answering model. As discussed
previously, the same dataset was used, consisting of frames obtained from two different videos
that include footage of two buildings on the UWF campus with high volumes of damage
instances, Argo Hall and Martin Hall.
To evaluate the model’s ability to identify damage instances inside each frame, a few
different questions were posed to the model. The main questions asked were whether there are
any damage instances in the image being analyzed, as well as how many buildings were affected
and how many damage instances can be identified. The model reached 92% validation accuracy.
As can be observed in Table 6, the overall accuracy of this method when looking for any type of
damage at Argo Hall was 92%, while when looking for damage occurrences at Martin Hall, it
produced an overall accuracy of 93%. Because of that, this study can conclude that this
methodology is able to successfully understand and answer questions regarding the
identification of damage caused by Hurricane Sally to the roofs of UWF’s buildings in 2020.
5.4 Comparison
Looking at all the data obtained in this study, it is possible to observe that the
methodology that performed best when identifying damage caused by Hurricane Sally to
UWF’s buildings in 2020 was the Visual Question Answering methodology. Considering only
the overall accuracy of each method, the VQA had an accuracy of 92%, with the CNN a close
second at 88% and the CC at only 51%. With that, it is possible to conclude that the models
performed as expected. The main reason for the CC’s lower accuracy compared to the other
models was training. By acquiring a larger dataset of positive and negative images, as well as
increasing the number of training stages, the final accuracy of this methodology would likely
increase significantly.
On the other hand, the CNN and VQA methodologies had similar results. This is
to be expected, because the main reason the CNN model did not have a higher overall accuracy
was that it failed to identify specific instances of damage in blurred images. However, most of
this was mitigated in the VQA methodology because of the specific questions being asked.
Since the VQA model was asked mainly yes/no questions, or whether there were damaged
buildings, a single missed damage instance is less likely to change the overall answer.
Also, when comparing all the models, both the CC and the CNN provide an easier-to-follow
damage detection result, which can prove useful in stressful scenarios such as those
involving natural disasters. Both models simply point out all the damage they can identify while
the UAV is flying over the affected area. Because of that, when looking only for instances of
damage, these methodologies would be better. Between them, the CNN method performed better
at detecting damage caused by Hurricane Sally in 2020 than the Cascade Classifier method.
However, when looking for more detail inside an image, the VQA model can provide more
information than the other models. Because it pairs a language model with image classification,
this methodology has many more options when analyzing an image.
CHAPTER 6
Conclusion and Future Work
In this study, three damage detection methods, the Cascade Classifier (CC), the
Convolutional Neural Network (CNN) based method, and the Visual Question Answering
(VQA) model, were studied. The first model analyzed was the CC, in which an algorithm was
developed to detect and identify roof damage using aerial footage. Next, the CNN-based
detection model was analyzed, and similarly to the CC model, an algorithm to detect, identify,
and locate roof damage in aerial footage was developed. Finally, a VQA model was designed by
combining an image classification model (CNN) with a language model (BoW), with the goal of
not only observing damage but also being able to answer different questions that might arise in
post-disaster scenarios. To demonstrate their performance, the algorithms for the three
methodologies were tested on videos recorded from an Unmanned Aerial Vehicle (UAV) flying
over the University of West Florida (UWF) campus after Hurricane Sally in 2020. For all
methods, a table demonstrating the overall accuracy of each method in different scenarios was
presented. As a conclusion, it was possible to observe that the methodology that performed best
when identifying damage caused by Hurricane Sally to UWF’s buildings in 2020 was the Visual
Question Answering methodology.
For future work, in order to improve the accuracy of the CC methodology, more training
data from a larger dataset would be required. Also, the custom dataset created for this study, the
Sally-UWF dataset, was mainly focused on roof damage caused by a natural disaster. To expand
these methodologies to any type of post-disaster damage detection caused by a natural
phenomenon, for example instances of tree damage and other specific scenarios, a larger dataset
would be required.
REFERENCES
[1] “Record hurricane season and major wildfires – the natural disaster figures for 2020,”
MunichRE, URL: https://www.munichre.com/en/company/media-relations/media-information-
and-corporate-news/media-information/2021/2020-natural-disasters-balance.html (visited on
March 16, 2021).
[2] K. Dapena, “The rising costs of hurricanes,” Wall Street Journal, URL:
https://www.wsj.com/articles/the-rising-costs-of-hurricanes-1538222400 (visited on March 16,
2021).
[3] Fina, L., Mishra, B., and Sevil, H. E., “Design of a Nested Saturation Controller with
Improved Wind Disturbance Rejection for UAVs,” AIAA Scitech 2021 Forum, 2021, p. 1005.
[4] Sevil, H. E. and Dogan, A., “Fault Diagnosis in Air Data Sensors for Receiver
Aircraft in Aerial Refueling,” AIAA Journal of Guidance, Control and Dynamics, Vol. 38, No.
10, 2015, pp. 1959–1975. doi:10.2514/1.G000527.
[5] Lee, J. H., Sevil, H. E., Dogan, A., and Hullender, D., “Estimation of Receiver
Aircraft States and Wind Vectors in Aerial Refueling,” AIAA Journal of Guidance, Control and
Dynamics, Vol. 37, No. 1, 2014, pp. 265–276.
[6] Lee, J. H., Sevil, H. E., Dogan, A., and Hullender, D., “Estimation of Maneuvering
Aircraft States and Time-Varying Wind with Turbulence,” Aerospace Science and Technology,
Vol. 31, No. 1, 2013, pp. 87–98.
[7] Sevil, H. E. and Dogan, A., “Airdata-Sensor-based Relative Position Estimation for
Receiver Aircraft in Aerial Refueling,” Proc. of AIAA SciTech Forum and Exposition,
Grapevine, USA, 9-13 January 2017, AIAA 2017-1639. doi:10.2514/6.2017-1639.
[8] Sevil, H. E. and Dogan, A., “Airdata Sensor Fault Detection and Isolation for
Receiver Aircraft in Aerial Refueling,” Proc. of ASM 2013, AIAA Aerospace Sciences Meeting,
Grapevine, USA, 7-10 January 2013, AIAA 2013-0950.
[9] Lee, J. H., Sevil, H. E., Dogan, A., and Hullender, D., “Estimation of Receiver
Aircraft States and Wind Vectors in Aerial Refueling,” Proc. of GNC 2012, AIAA Guidance,
Navigation, and Control Conference, Minneapolis, USA, 13-16 August 2012, AIAA 2012-4533.
[10] Lee, J. H., Sevil, H. E., Dogan, A., and Hullender, D., “Estimation of Maneuvering
Aircraft States and Time-Varying Wind with Turbulence,” Proc. of GNC 2012, AIAA Guidance,
Navigation, and Control Conference, Minneapolis, USA, 13-16 August 2012, AIAA 2012-4532.
[11] Sevil, H. E. and Dogan, A., “False Fault Detection in Airdata Sensor due to Non-
uniform Wind in Aerial Refueling,” Proc. of AFM 2011, AIAA Atmospheric Flight Mechanics
Conference, Portland, USA, 08-11 August 2011, AIAA 2011-6446.
[12] Sevil, H. E., Airdata Sensor Based Position Estimation and Fault Diagnosis in Aerial
Refueling, Ph.D. dissertation, The University of Texas at Arlington, Arlington, TX, USA, 2013.
[13] Lundberg, C. L., Sevil, H. E., and Das, A., “A VisualSfM based Rapid 3-D
Modeling Framework using Swarm of UAVs,” 2018 International Conference on Unmanned
Aircraft Systems (ICUAS), IEEE, 2018, pp. 22–29.
[14] Youssef, T. A., Francia III, G. A., and Sevil, H. E., “Data Collection and Generation
for Radio Frequency Signal Security,” Advances in Security, Networks, and Internet of Things,
Springer, 2021, pp. 745–758.
[15] Lowande, R., Clevenger, A., Mahyari, A., and Sevil, H. E., “Analysis of Post-
Disaster Damage Detection using Aerial Footage from UWF Campus after Hurricane Sally,”
Proc. of International Conference on Image Processing, Computer Vision, Pattern Recognition
(IPCV’21), Las Vegas, USA, 26-29 July 2021.
[16] Das, A. N., Doelling, K., Lundberg, C., Sevil, H. E., and Lewis, F., “A Mixed
Reality Based Hybrid Swarm Control Architecture for Manned-Unmanned Teaming (MUM-T),”
Proc. of ASME 2017 International Mechanical Engineering Congress and Exposition
(IMECE2017), Tampa, USA, 3-9 November 2017, IMECE2017-72076.
[17] Sevil, H. E., “Anomaly Detection using Parity Space Approach in Team of UAVs
with Entropy based Distributed Behavior,” AIAA Scitech 2020 Forum, 2020, p. 1625.
[18] Das, A., Kol, P., Lundberg, C., Doelling, K., Sevil, H. E., and Lewis, F., “A Rapid
Situational Awareness Development Framework for Heterogeneous Manned-Unmanned Teams,”
NAECON 2018 - IEEE National Aerospace and Electronics Conference, 2018.
[19] Youssef, T. A., Francia III, G. A., and Sevil, H. E., “Data Collection and Generation for
Radio Frequency Signal Security,” Proc. of ESCS’20 - The 18th Int Conf on Embedded Systems,
Cyber-physical Systems, Las Vegas, USA, 27-30 July 2020.
[20] Haghshenas-Jaryani, M., Sevil, H. E., and Sun, L., “Navigation and Obstacle
Avoidance of Snake-Robot Guided by a Co-Robot UAV Visual Servoing,” Proc. of ASME 2020
Dynamic Systems and Control Conference (DSCC 2020), Pittsburgh, USA, 4-7 October 2020,
DSCC2020-3156.
[21] Sevil, H. E. and Dogan, A., “Investigation of Measurement Noise Effect on Wind
Field Estimation using Multiple UAVs,” Proc. of AIAA Scitech 2019 Forum, San Diego, USA,
7-11 January 2019, AIAA 2019-1601.
[22] Sevil, H. E., Dogan, A., Subbarao, K., and Huff, B., “Evaluation of Extant Computer
Vision Techniques for Detecting Intruder sUAS,” Proc. of 2017 International Conference on
Unmanned Aircraft Systems (ICUAS), Miami, USA, 13-16 June 2017, pp. 929–938.
doi:10.1109/ICUAS.2017.7991373.
[23] Ramani, A., Sevil, H. E., and Dogan, A., “Determining Intruder Aircraft Position
using Series of Stereoscopic 2-D Images,” Proc. of 2017 International Conference on Unmanned
Aircraft Systems (ICUAS), Miami, USA, 13-16 June 2017, pp. 902–911.
doi:10.1109/ICUAS.2017.7991384.
[24] Daskiran, O., Sevil, H. E., Dogan, A., and Huff, B., “UGV and UAV Cooperation
for Constructing Probabilistic Threat Exposure Map (PTEM),” Proc. of 15th AIAA Aviation
Technology, Integration, and Operations Conference, Dallas, USA, 22-26 June 2015, AIAA
2015-2740. doi:10.2514/6.2015-2740.
[25] A. Mohan, C. Papageorgiou, and T. Poggio, "Example-based object detection in images
by components," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no.
4, pp. 349-361, 2001.
[26] A. Zirakchi, C. L. Lundberg, and H. E. Sevil, “Omni directional moving object
detection and tracking with virtual reality feedback,” in Dynamic Systems and Control
Conference, vol. 58288. American Society of Mechanical Engineers, 2017, p. V002T21A012.
[27] Yin, Tianwei, et al. “Center-Based 3D Object Detection and Tracking.” Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
https://doi.org/10.1109/cvpr46437.2021.01161.
[28] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple
features,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition. CVPR 2001, vol. 1, 2001, pp. I–I.
[29] R. Lienhart, A. Kuranov, and V. Pisarevsky, “Empirical analysis of detection
cascades of boosted classifiers for rapid object detection,” in Joint Pattern Recognition
Symposium. Springer, 2003, pp. 297–304.
[30] X. Wang, M. Yang, S. Zhu, and Y. Lin, “Regionlets for generic object detection,” in
Proceedings of the IEEE international conference on computer vision, 2013, pp. 17–24.
[31] X. Zhu, J. Liang, and A. Hauptmann, “Msnet: A multilevel instance segmentation
network for natural disaster damage assessment in aerial videos,” in Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2023–2032.
[32] F. Nex, D. Duarte, A. Steenbeek, and N. Kerle, “Towards real-time building damage
mapping with low-cost uav solutions,” Remote sensing, vol. 11, no. 3, p. 287, 2019.
[33] Y. Pi, N. D. Nath, and A. H. Behzadan, “Disaster impact information retrieval using
deep learning object detection in crowdsourced drone footage,” in Proc., Int. Workshop on
Intelligent Computing in Engineering, 2020, pp. 134–143.
[34] Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality object
detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition,
2018, pp. 6154–6162.
[35] Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus,
“Simple baseline for visual question answering,” arXiv preprint arXiv:1512.02167, 2015.
[36] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C
Lawrence Zitnick, and Devi Parikh, “Visual question answering,” in ICCV, 2015.
[37] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram
Nevatia, “Abc-cnn: An attention based convolutional neural network for visual question
answering,” arXiv preprint arXiv:1511.05960, 2015.
[38] Kevin J Shih, Saurabh Singh, and Derek Hoiem, “Where to look: Focus regions for
visual question answering,” in Proceedings of the IEEE conference on computer vision and
pattern recognition, 2016, pp. 4613–4621.
[39] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola, “Stacked
attention networks for image question answering,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2016, pp. 21–29.
[40] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and
Marcus Rohrbach, “Multimodal compact bilinear pooling for visual question answering and
visual grounding,” in EMNLP, 2016.
[41] Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and
Byoung-Tak Zhang, “Hadamard product for low-rank bilinear pooling,” arXiv preprint
arXiv:1610.04325, 2016.
[42] Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao, “Multi-modal factorized bilinear
pooling with coattention learning for visual question answering,” in Proceedings of the IEEE
international conference on computer vision, 2017, pp. 1821–1830.
[43] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh, “Hierarchical question-
image co-attention for visual question answering,” in NeurIPS, 2016.
[44] “The World of Runescape.” The World of RuneScape, URL:
https://play.runescape.com/ (visited on March 27, 2021).
[45] Sarkar, Argho, and Maryam Rahnemoonfar. “VQA-Aid: Visual Question Answering
for Post-Disaster Damage Assessment and Analysis.” ArXiv.org, 19 June 2021,
https://arxiv.org/abs/2106.10548.
[46] Saha, Sumit. “A Comprehensive Guide to Convolutional Neural Networks - the eli5
Way.” Medium, Towards Data Science, 17 Dec. 2018, https://towardsdatascience.com/a-
comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53.
[47] Gad, Ahmed Fawzy. “Faster R-CNN Explained for Object Detection Tasks.”
Paperspace Blog, Paperspace Blog, 9 Apr. 2021, https://blog.paperspace.com/faster-r-cnn-
explained-object-detection/.
[48] Zhu, X., Liang, J., and Hauptmann, A., “Msnet: A multilevel instance segmentation
network for natural disaster damage assessment in aerial videos,” Proceedings of the IEEE/CVF
Winter Conference on Applications of Computer Vision, 2021, pp. 2023–2032.
[49] Agrawal, Aishwarya, et al. “VQA: Visual Question Answering.” ArXiv.org, 27 Oct.
2016, https://arxiv.org/abs/1505.00468.
[50] Goyal, Yash, et al. “Making the V in VQA Matter: Elevating the Role of Image
Understanding in Visual Question Answering.” ArXiv.org, 15 May 2017,
https://arxiv.org/abs/1612.00837.
[51] Kevin J. Shih, Saurabh Singh, and Derek Hoiem, “Where to Look: Focus Regions for
Visual Question Answering,” Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4613-4621.
[52] “VisualQA,” VQA: Visual Question Answering, URL:
https://visualqa.org/download.html (visited on January 17, 2022).
[53] Brownlee, Jason. “A Gentle Introduction to the Bag-of-Words Model.” Machine
Learning Mastery, 7 Aug. 2019, URL: https://machinelearningmastery.com/gentle-introduction-
bag-words-model/. (Visited on January 10th, 2022)
[54] Victor Zhou. “A Simple Explanation of the Bag-of-Words Model.” Victor Zhou,
Victor Zhou, 30 Nov. 2019, URL: https://victorzhou.com/blog/bag-of-words/. (Visited on January
10th, 2022)
[55] Victor Zhou. “Machine Learning for Beginners: An Introduction to Neural Networks.”
Victor Zhou, Victor Zhou, URL: https://victorzhou.com/blog/intro-to-neural-networks/ (Visited
on January 10th, 2022).
[56] “Feature Extraction Based on Haar Kernels: A Query Region,” URL:
https://www.researchgate.net/figure/Feature-extraction-based-on-Haar-kernels-a-query-region-
divided-in-sliding-window_fig4_348473306. (Visited on January 14th, 2022)