VISUAL QUESTION ANSWERING (VQA) ANALYSES FOR POST-DISASTER

DAMAGE DETECTION AND IDENTIFICATION USING AERIAL FOOTAGE

By

Rafael De Sa Lowande

B.S., The University of West Florida, 2020

A thesis submitted to the Department of Electrical and Computer Engineering


Hal Marcus College of Science and Engineering
The University of West Florida
In partial fulfillment of the Requirements for the degree of
Master of Science

May 2022
THESIS CERTIFICATION

Rafael de Sa Lowande defended this thesis on 25/03/2022. The members of the thesis

committee were:

Hakki Erhan Sevil, Ph.D., Committee Chair

Thomas Gilbar, Ph.D., Committee Member
Arash Mahyari, Ph.D., Committee Member
Mohammed Khabou, Ph.D., Committee Member

Accepted for the Department:

Thomas Gilbar, Ph.D., Chair, Department of Electrical and Computer Engineering

The University of West Florida Graduate School verifies the names of the committee members
and certifies that the thesis has been approved in accordance with university requirements.
Dr. Kuiyuan Li, Dean, Graduate School
Copyright © by RAFAEL DE SA LOWANDE 2022

All Rights Reserved


ACKNOWLEDGMENTS

I would like to thank my supervising professor Dr. Hakki Erhan Sevil for all the support
and mentoring he has provided me throughout this time. All of his invaluable advice during these
past few years provided me with the necessary knowledge I needed to be able to complete this
thesis. I started working with Dr. Sevil in 2019 doing a research assignment regarding damage
detection using computer vision. It was my first time being introduced to this possible form of
using computer vision and robotics in order to help other people in crucial times. Since then, we
have published a few papers together, and I have realized that this is the type of research I
would like to continue conducting going forward. I am deeply grateful for the guidance Dr. Sevil
provided during this time. He was always available to help me whenever I had questions and
helped me revise and write all my papers. I would also like to thank the knowledgeable
academic committee members, Dr. Gilbar, Dr. Mahyari, and Dr. Khabou, for their interest in my
research and for taking the time to serve on my thesis committee.
I am grateful to all the teachers who have taught me, from elementary school in Brazil to
graduate school in the United States. I am also grateful to the Department of Electrical and
Computer Engineering, at UWF, for all the support and attention they provided me in order to
help me complete my studies. I would like to thank my girlfriend Milena Ghtait for encouraging
and inspiring me to pursue graduate studies. Without her help and support, I would not have been
able to finish this thesis. I deeply appreciate all her help,
encouragement and motivation throughout this time.
Finally, I would like to express my deep gratitude to my parents and my sister, who have
encouraged, inspired, assisted and sponsored me through my entire life. I am also extremely
grateful to my family for their support and patience.
May 07, 2022

TABLE OF CONTENTS

Acknowledgments ……………………...………………………………………………..iv
List of Tables ………………………...………………………………………………….vii
List of Figures …………………………………………………………………………..viii
List of Abbreviations ……………………...……………………………………………..xi
Abstract ……………………………...…………………………………………………..xii
Chapter 1 Introduction ………...……………………………………………………….…1
1.1 Motivation and Problem Statement …………………………………...……………...1
1.2 Literature Survey ……………………………………………...…………...…. …..…2
1.2.1 Unmanned Aerial Vehicle ………………………………………………2
1.2.2 Cascade Classifier …………………………………………………...……2
1.2.3 Convolutional Neural Network ……………………………………….…. 3
1.2.4 Visual Question Answering……………………...………………………. 3
1.3 Organization of the Thesis ……………………………...…………………….………4
1.4 Original Contribution ……………………………………………….……….………..5
Chapter 2 Cascade Classifier ……………………………………………………………..6
2.1 Overview ……………………………………………………………………...6
2.2 Cascade Classifier Model Using Haar Features ………………………………7
2.2.1 Background Theory ……………………………………..………….7
2.2.2 Methodology ………………………………………………….…….8
2.3 Training …………………………………………………………………..….11
2.3.1 Positive Images ……………………………...…………………….12
2.3.2 Negative Images ………………………………………..………….15
2.3.3 Computation Stages ………………...……………………….…….18
2.4 Results ……………………………………………………………………….18
Chapter 3 Convolutional Neural Network …………………………………………….24
3.1 Overview …………………………………………………………………….24
3.2 Faster R-CNN model ……...…………………………………………..…….28
3.2.1 Background Theory ……………………...………………….…….28
3.2.2 Methodology ………………………………………………...…….29
3.3 Training ……………………………………………………………..……….30

3.4 Results ……………………………………………………………………….34
Chapter 4 Visual Question Answering ………………………………………………….39
4.1 Overview …….………………………………………………...…………….39
4.2 Annotations ………………………………………………………………….41
4.3 CNN + BoW ……………………………………………………..………….44
4.3.1 Methodology ………...…………………………………………….44
4.3.2 CNN model …………………………………………….………….44
4.3.3 BoW model ………………………………………………………..46
4.3.4 Results …………………………………………………………….49
Chapter 5 Analyses and Comparison ………………………………………………….55
5.1 Cascade Classifier ……………………………...………………...………….55
5.2 Convolutional Neural Network …………………………………..………….59
5.3 Visual Question Answering ………………………...……………………….62
5.4 Comparison ………………………………………………………………….63
Chapter 6 Conclusion and Future Work ……………………………………………….65
References …………………………………………………………………….….…….67

LIST OF TABLES

Table 1. Example of the vectorization of a Bag of Words application.......................................... 47

Table 2. Damage Detection Contents with Cascade Classifier. .................................................... 56

Table 3. Damage Detection Contents with Cascade Classifier – Selected Sections. .................... 57

Table 4. Damage Detection Contents with CNN. ……………...………………………….......... 60

Table 5. Damage Detection Contents with CNN – Selected Sections. ......................................... 60

Table 6. Validation of Results for VQA methodology. ................................................................ 62

LIST OF FIGURES

Figure 1. Object detection example obtained using a screenshot picture from the Runescape game……...6

Figure 2. Representation of Haar Features for the Cascade Classifier model …….....…………………… 7

Figure 3. Example of pixel value results for each specific pixel. .............................................................. 9

Figure 4. Full example of the usage of a Haar Cascade Classifier model. ................................................ 11

Figure 5. Example 1 for positive damage image using the CC model. .......................................................13

Figure 6. Example 2 for positive damage image using the CC model. .......................................................13

Figure 7. Example 3 for positive damage image using the CC model. .......................................................14

Figure 8. Example 4 for positive damage image using the CC model. .......................................................14

Figure 9. Example 5 for positive damage image using the CC model. .......................................................15

Figure 10. Example 1 for negative damage image using the CC model. ....................................................15

Figure 11. Example 2 for negative damage image using the CC model. ....................................................16

Figure 12. Example 3 for negative damage image using the CC model. ....................................................16

Figure 13. Example 4 for negative damage image using the CC model. ....................................................17

Figure 14. Example 5 for negative damage image using the CC model. ....................................................17

Figure 15. Example 1 of results obtained using the CC model. …………………………..…….…..…… 19

Figure 16. Example 2 of results obtained using the CC model. …………………………..………...…… 20

Figure 17. Example 3 of results obtained using the CC model. …………………………..………..….… 20

Figure 18. Example 4 of results obtained using the CC model. ……………………………..……..….… 21

Figure 19. Example 5 of results obtained using the CC model. ………..………………………..…….… 21

Figure 20. Example 6 of results obtained using the CC model. …………..……………………..….…… 22

Figure 21. Example 7 of results obtained using the CC model. ……………………..…………...……… 22

Figure 22. Example 8 of results obtained using the CC model. ………………………..…….…..……… 23

Figure 23. Convolutional Neural Network process example using kernels. .............................................. 25

Figure 24. Example of Convolved Feature. ............................................................................................... 26

Figure 25. Example 1 of CNN model. ....................................................................................................... 27

Figure 26. Example 2 of CNN model......................................................................................................... 27

Figure 27. Example of Faster R-CNN methodology. ................................................................................ 29

Figure 28. Example 1 of Dataset image. .................................................................................................... 31

Figure 29. Example 2 of Dataset image. .................................................................................................... 32

Figure 30. Example 3 of Dataset image. .................................................................................................... 32

Figure 31. Example 4 of Dataset image. .................................................................................................... 33

Figure 32. Example 5 of Dataset image. .................................................................................................... 33

Figure 33. Example 1 of results using the CNN methodology. ................................................................. 34

Figure 34. Example 2 of results using the CNN methodology. ................................................................. 35

Figure 35. Example 3 of results using the CNN methodology. ................................................................. 35

Figure 36. Example 4 of results using the CNN methodology. ................................................................. 36

Figure 37. Example 5 of results using the CNN methodology. ................................................................. 36

Figure 38. Example 6 of results using the CNN methodology. ................................................................. 37

Figure 39. Example 7 of results using the CNN methodology. ................................................................. 37

Figure 40. Example 8 of results using the CNN methodology. ................................................................. 38

Figure 41. VQA Example........................................................................................................................... 41

Figure 42. Simple block diagram demonstrating the VQA process. ......................................................... 42

Figure 43. Annotations for JSON file 1. .................................................................................................... 43

Figure 44. Annotations for JSON file 2...................................................................................................... 43

Figure 45. Example of convolutional neural network python model. ....................................................... 45

Figure 46. Example of Feedforward neural network. ................................................................................ 47

Figure 47. Example 1 of VQA result image............................................................................................... 49

Figure 48. Example 2 of VQA result image. ............................................................................................. 50

Figure 49. Example 3 of VQA result image. ............................................................................................. 51

Figure 50. Example 4 of VQA result image............................................................................................... 52

Figure 51. Example 5 of VQA result image. ............................................................................................. 53

Figure 52. Example 6 of VQA result image. ............................................................................................. 54

Figure 53. Precision-Recall for Argo Hall using the Cascade Classifier. .................................................. 58

Figure 54. Precision-Recall for Martin Hall using the Cascade Classifier. ............................................... 58

Figure 55. Precision-Recall for Martin Hall using the CNN. .................................................................... 61

Figure 56. Precision-Recall for Argo Hall using the CNN. ....................................................................... 61

LIST OF ABBREVIATIONS

BoW Bag-of-Words

CC Cascade Classifier

CNN Convolutional Neural Network

FNN Feedforward Neural Network

ID Identification Number

R-CNN Region-Based Convolutional Neural Network

RPN Region Proposal Network

SVM Support Vector Machine

UAV Unmanned Aerial Vehicle

UWF University of West Florida

VQA Visual Question Answering

ABSTRACT

Visual Question Answering (VQA) Analyses for Post-Disaster Damage Detection and

Identification Using Aerial Footage

Rafael de Sa Lowande

Natural disasters are a major source of significant damage and costly repairs around the world. After a natural disaster occurs, there is usually an overwhelming amount of damage, along with substantial costs for repairs and for aiding the people affected. In addition, the occurrence of these natural phenomena has increased significantly in the past decade.

Currently, post-disaster damage detection is usually performed manually by human operators. Considering all the areas one has to closely inspect, as well as the difficult terrain and places with hard access, it becomes easy to understand how incredibly difficult it is for a surveyor to identify and annotate every instance of damage. Because of that, it has become essential to find new, creative solutions for damage detection and classification in the case of natural disasters, especially hurricanes.

This thesis investigates the feasibility of using different computer vision techniques, with the help of a UAV, to conduct post-disaster damage detection and identification, and compares the results obtained from each model.

CHAPTER 1
Introduction

1.1 Motivation and Problem Statement

It has been clear in the past few decades how natural disasters are a major source of

significant damage and costly repairs around the world. In 2020 alone, more than $43 billion in damage resulted from the Atlantic hurricane season in North America [1]. For perspective, according to The Wall Street Journal, 31% of all hurricane damage from 1980 to 2018 occurred in 2017, totaling $268 billion in damages [2]. This shows how the impact of natural disasters is constantly increasing, especially in the last decade. Taking this information into consideration, the need for means to quickly identify and respond to these disasters has also significantly increased.

Post-disaster damage detection is normally performed manually by human surveyors. Taking into consideration all the areas one has to closely inspect, as well as the difficult terrain and places with hard access, it becomes easy to understand how incredibly difficult it is for a surveyor to identify and annotate every instance of damage. It is also reasonable to assume that the surveyor will miss some information in the field that could be essential. With that in mind, it is possible to conclude that this method of damage identification has become obsolete: it is slow, inefficient, and an inconsistent use of human resources.

Taking all of this into consideration, this thesis proposes to utilize Unmanned Aerial Vehicles with an attached standard color camera to capture footage of the post-storm condition of structures and to analyze it using computer vision techniques. In this study, three main methodologies are

customized to specifically aid in the efforts of damage recovery and identification. These

methods are the Cascade Classifier, the Convolutional Neural Network, and the Visual Question

Answering. The goal is to showcase the possible ways of using advanced computer vision technology to provide assistance to workers and first responders in extreme scenarios. Minutes can be the difference between aggravating and resolving a crisis, and using this technology would significantly accelerate the process and make it more efficient.

This study will analyze the positive and negative aspects of each methodology by comparing them with each other. It will also provide a detailed analysis of each model and the reasoning for using one model instead of another when performing detection in specific scenarios.

1.2 Literature Survey

1.2.1 Unmanned Aerial Vehicle

As discussed before, Unmanned Aerial Vehicles (UAVs) have frequently been used to aid damage detection and identification in the case of natural disasters, such as hurricanes and earthquakes, among others. As UAV technology rapidly becomes popular and widely accessible, its application spectrum has broadened, ranging from aerial refueling [4] – [12] to 3D reconstruction [13]. Besides the application field, the employed UAV platform type also varies in previous studies, e.g., quadrotor [3], [13] – [20], fixed wing [21] – [23], and even airship [24]. In this study, we focus on the use of a quadrotor UAV for the application of post-disaster damage detection and localization, as quadrotors possess higher mobility that allows capturing different points of view of the scene of interest. 1

1 Material regarding the Convolutional Neural Network and Cascade Classifier methodologies was also presented at the AIAA SCITECH 2022 Conference, held in San Diego, January 3 - 7, 2022.

1.2.2 Cascade Classifier

Image processing, computer vision, and pattern recognition-related research studies have

been getting a lot of attention in recent years due to advancements in algorithms and equipment used in their applications, which vary from object detection [25] and object tracking [26] to 3D

modeling [27]. The Cascade Classifier, the first method we analyzed for object detection, has

shown significant success for object detection tasks in the past. Viola and Jones [28] present a

comparison of several different cascade of classifier detectors with high detection rates for face

detection, with low false positive rates, which is typically a significant issue with cascade of

classifier models. A similar study performed by Lienhart et al. [29] also produced marked

success for different classifier boosting strategies applied to cascade of classifier type models

trained for face detection. Wang et al. [30] expand the usage of cascade classifiers to the general

case with the PASCAL VOC datasets (20 object classes) and the ImageNet dataset (200 object

classes).

1.2.3 Convolutional Neural Network

In this thesis, the second method analyzed for object detection is the Convolutional Neural Network (CNN). CNNs have become widely popular for carrying out object detection

tasks in recent years. A notable study conducted by Zhu et al. [31] for roof damage detection,

which also is the source of our training data, uses their own CNN model to perform the detection

with excellent accuracy. A more general building damage study by Nex et al. [32] utilizes CNNs

with a morphological filter method for damage candidate region proposals. Pi et. al [33] compare

a number of CNN architectures for post-hurricane damage detection with high mean average

precision. Some recent methods have combined the cascading strategy with CNNs, as presented

by Cai and Vasconcelos [34]. In their paper, the researchers use a cascading region proposal with sequentially higher intersection-over-union thresholds to filter out false positive samples.

1.2.4 Visual Question Answering

Looking into the Visual Question Answering (VQA) methodology now, there has

been considerable research conducted on this new model recently. Studies [35] – [42] made

significant efforts in order to develop and study the algorithm. These methodologies propose

different approaches for the union of semantic image and question features. However, in the

literature there are not many papers that address the usage of VQA paired with UAVs in order to

identify damage caused by natural disasters. At the time this thesis is being written, there is only one work addressing this issue [43]. In their study, the researchers propose a simple baseline and a Multimodal Factorized Bilinear baseline model paired with their own dataset in order to conduct their experiment and obtain results. In this thesis, a VQA model with a new

dataset will be used to conduct the experimentation.

1.3 Organization of The Thesis

This study aims to investigate the accuracy and precision when implementing the VQA

for post-disaster damage detection and identification using UAVs. A few different

methodologies are studied throughout this thesis in order to have a good basis for comparison when testing the actual feasibility of implementing the VQA model for damage detection. The first methodology, the cascade classifier, used for damage detection, is presented and analyzed in Chapter 2, together with its background theory, testing, and analysis. The Convolutional Neural Network, a different methodology used for detection, is presented in Chapter 3; similar to Chapter 2, its background theory, testing, and analysis are covered there. In Chapter 4, visual question answering is introduced: a VQA model is presented and studied, and the feasibility of its use in post-disaster damage scenarios is examined. Chapter 5 presents the analyses and comparison between all the

methodologies of this research thesis, their pros and cons, and explains why they should be used

for damage detection. In Chapter 6, the conclusions of this study are presented and the planned

improvements are listed.

1.4 Original Contribution

In this study, post-disaster damage detection analysis is conducted on aerial

footage that is unique and was gathered after Hurricane Sally. Besides the custom dataset (Sally-

UWF), custom annotations and images are created for roof damage, and three different classes

demonstrating low, medium and heavy damage, as well as the instance and frequency of the

presence of roof damage, are introduced. Lastly, the performances of three object detection

methods are compared. The first method is a Cascade Classifier. The strategy behind cascading

classifiers is to pass the output of one classifier to a following classifier as additional input. This

process is repeated as many times as necessary to improve detection results. The second method

is a Convolutional Neural Network (CNN) model. CNNs learn object feature maps produced by

the process of convolution. In doing so, these models can localize the learned features to provide

the object prediction. The third method is the Visual Question Answering (VQA). Firstly, the

VQA dataset for post-disaster damage assessment based on UAV imagery is introduced. Then,

a comprehensive study of the performances of a baseline VQA algorithm on our dataset is

conducted.

CHAPTER 2
Cascade Classifier
2.1 Overview

Object detection is one of the main applications of Computer Vision. We often see it in

everyday applications (e.g., self-driving cars, cellphone cameras, video games). There are many different software tools and methodologies nowadays that are used for object detection. The first methodology that will be analyzed in this thesis is the Cascade Classifier. The cascade

classifier is a machine learning methodology that uses positive and negative images to train its

model and obtain the final detection. Positive images are considered to be all the images

containing the object the model is searching for, while negative images are considered to be all

the images that do not contain the object that the model is searching for. Figure 1 demonstrates

an example of object detection and identification using the Cascade Classifier methodology. In

this example, a screenshot from the Runescape game [44] was taken, and the image was passed

through the CC model proposed by this thesis in order to provide an example of its functionality.

Figure 1. Object detection example obtained using a screenshot picture from the
Runescape game [44].

2.2 Cascade Classifier Model Using Haar Features

2.2.1 Background Theory

Analyzing this approach, the algorithm for the Cascade Classifier takes a lot of positive

images and negative images and separates them into two files. It then extracts features from each

file; for this paper, the Haar features are used in order to separate and extract features from

images. The Haar features work similarly to the convolutional kernel that we are going to

observe later. A picture representing the Haar features can be observed in Figure 2. Haar features

are a sequence of rescaled square shape functions first introduced in the literature by Alfred Haar

in 1909 [28]. Each feature is a single value obtained by subtracting the sum of

pixels under the white rectangle from the sum of pixels under the black rectangle.

Figure 2. Representation of Haar Features for the Cascade Classifier model.

2.2.2 Methodology

Looking more into what the Haar features are, as it can be observed in Figure 2, the first

and second squares represent edge feature detection while the third square represents line

features. Also, each white feature is assigned a pixel value of "0" and each black feature is assigned a pixel value of "1". With that in mind, Viola and Jones [28] introduced an algorithm with which it is

possible to identify the Haar feature in an image. According to them, the closer their algorithm

comes to a “1” result, the more likely it has identified the feature it was searching for. In Figure

3, the pixel results obtained from an image can be observed. The closer the pixel is to the white

color, the closer its value will be to 0, while the closer the pixel is to a darker color, the closer its

value will be to 1.

Δ = dark − white = (1/n) Σ_dark I(x) − (1/n) Σ_white I(x)    (1)
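For illustration only, Equation (1) can be sketched in a few lines of Python. The stand-in image patch, the rectangle coordinates, and the 0.6 threshold below are hypothetical values chosen for demonstration, not the configuration used in this thesis.

```python
import numpy as np

def haar_feature_value(patch, dark_rect, white_rect):
    """Equation (1): mean intensity under the dark rectangle minus the mean
    intensity under the white rectangle.  Rectangles are (x1, y1, x2, y2),
    and intensities are assumed normalized so dark pixels are close to 1."""
    x1, y1, x2, y2 = dark_rect
    dark = patch[y1:y2, x1:x2].mean()
    x1, y1, x2, y2 = white_rect
    white = patch[y1:y2, x1:x2].mean()
    return dark - white

# Hypothetical 24x24 window split into a left (dark) and right (white) half,
# mimicking the two-rectangle edge feature of Figure 2.
window = np.random.rand(24, 24)
delta = haar_feature_value(window, dark_rect=(0, 0, 12, 24), white_rect=(12, 0, 24, 24))

THRESHOLD = 0.6  # example threshold discussed in the text
feature_found = delta >= THRESHOLD
```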

Figure 3. Example of pixel value results for each specific pixel.

The next step in this methodology is to define a threshold. In a perfect scenario, when

identifying an image, the algorithm would return a “1” result. However, since it is very unlikely

that the model will find the final result with 100% certainty every time the object is in the image,

a threshold is determined. For example, setting the threshold to 0.6 (60%) would mean that every

time the equation proposed by Viola and Jones returns a value equal to or greater than 0.6, the model would consider that the feature has been identified. On the other hand, if it returns a value below 0.6, the model has not identified the feature it has been looking for. When setting the threshold, one has to be careful. Setting the threshold too low could lead to many false positives after the training of the model. On the other hand, setting the threshold too high could lead to false

negatives being observed. Due to this, it is important to keep these issues in mind when choosing

the number for the threshold.

After calculating all the features, the algorithm starts a process that applies the features to

all the collected images, with the goal of predicting which are the positive and which are the

negative images. The algorithm does so by sliding each feature across the image window. When a value above the threshold is obtained at a given location, the algorithm determines that the object the model is looking for is possibly present at that location in the window. The

algorithm then classifies each image into positive or negative, with the positive ones being the

ones that contain the “object” and negative ones being the ones that do not, as it was seen before.

The more data used, the better the prediction will be, and therefore, the final results will be more

accurate.

Figure 4. Full example of the usage of a Haar Cascade Classifier model [56].

2.3 Training

In our study, it is expected that the majority of areas in an image will not have visible

damage. Therefore, the cascade classifier is introduced to speed up the process of detection, as well

as to identify the location of the detected damage inside a positive image. Thus, instead of

applying all the features collected to an image, the features are grouped into different stages of

the classifier and applied individually. If an image fails the first “layer” of features, that image is

discarded. If an image passes the first stage, it is possible that the image being analyzed contains the damage the algorithm is trying to find. Therefore, the second

stage is applied to it, followed by the third stage, and so on, until the algorithm can detect, with

certainty, whether there is damage in that image or not. To train our model, it is necessary

to first separate the training data into positive and negative images.

2.3.1 Positive Images

The positive images, as discussed before, are all the images that contain any type

of damage caused by a natural disaster. The training data was obtained from the Hurricane Sally-UWF dataset, which was developed for the purpose of this research. In the proposed

approach, the UAV would fly over an affected area post-disaster where an onboard camera

would capture footage of any potential damages. Roof damage was the specific focus in this

study, but the approach can also be expanded to other damage types. Footage of the UWF

campus was recorded following the hurricane using a UAV with an attached camera. The

videos were captured at 3840x2160p with 24 frames per second. All frames have been

downsized to 1920x1080p for testing in this study. In total, 24 videos ranging from 15 seconds to

5 minutes were captured. From this collection, two videos with high damage instance counts

were selected and used for testing. Some examples of positive images obtained from the Sally-

UWF dataset can be observed in Figures 5 - 9.

Figure 5. Example 1 for positive damage image using the CC model.

Figure 6. Example 2 for positive damage image using the CC model.

Figure 7. Example 3 for positive damage image using the CC model.

Figure 8. Example 4 for positive damage image using the CC model.

Figure 9. Example 5 for positive damage image using the CC model.

2.3.2 Negative Images

Contrary to the positive images, negative images are all the images that do not

contain any type of damage. The same Sally-UWF dataset was used to separate and identify the

negative images. Some examples of negative images obtained from the Sally-UWF dataset can

be observed in Figures 10 - 14.

Figure 10. Example 1 for negative damage image using the CC model.

Figure 11. Example 2 for negative damage image using the CC model.

Figure 12. Example 3 for negative damage image using the CC model.

Figure 13. Example 4 for negative damage image using the CC model.

Figure 14. Example 5 for negative damage image using the CC model.

2.3.3 Computation Stages

In our study, 1000 negative images and 1100 positive images from the Hurricane Sally

Dataset, obtained using the footage from the DJI quadcopter at UWF, were used as the training data for this method. After that, all the acquired images are analyzed by the algorithm, which tries to determine, in a pre-set number of stages, whether each image it is observing is positive or negative, based on the Haar feature technique observed before. The greater the number of stages used for training, the more precise the results will be.

The data was trained in a total of twenty stages, which was the maximum number of

training stages possible without leading the algorithm to over-training. In the case of object

detection, the model prediction consists of two parts: the bounding boxes and the corresponding

class label. The bounding box is the area of interest for our model, with a range from X1 to X2

and Y1 to Y2 in pixel coordinates. Ideally this box perfectly surrounds the damage to be

detected. Any given image can yield anywhere from zero prediction boxes to as many as are detected. Along with each box comes the class label, which is the description of what kind of damage the bounding box represents, as well as the confidence rating. Another technique, called grouping rectangles, was also used in this part to facilitate the identification of damage. By grouping all the bounding boxes that are relatively near to each other, it becomes easier and

simpler to observe the damage being identified in an image.
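To make the detection and rectangle-grouping steps concrete, the following is a minimal OpenCV sketch of how a trained cascade could be applied to a sampled frame. The file names and detection parameters are hypothetical placeholders, not the exact values used in this study.

```python
import cv2

# Hypothetical file names; the XML file is the classifier produced by the
# twenty-stage training described above.
cascade = cv2.CascadeClassifier("damage_cascade.xml")
frame = cv2.imread("sampled_frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Slide the trained stages over the frame at multiple scales.
detections = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)

# Group bounding boxes that are relatively near each other, so overlapping
# detections of the same damage instance collapse into a single rectangle.
# The list is duplicated because groupRectangles drops rectangles that
# appear only once.
rects = [list(d) for d in detections] * 2
grouped = []
if rects:
    grouped, _ = cv2.groupRectangles(rects, 1, 0.5)

for (x, y, w, h) in grouped:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
cv2.imwrite("detections.png", frame)
```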

2.4 Results

For this study, the main focus when analyzing the results is on the detection accuracy

aspect of this methodology. In the aerial footage, as discussed before, two different videos

including footage of two different buildings at UWF, the Martin Hall and Argo Hall, were

studied. The reason behind choosing these specific footages is that both of the buildings filmed

have high volumes of damage instance and both roofs have similar types of damages. The model

prediction was performed and individual frames for analysis were saved. For the analysis, the

true positive, false positive and false negative rates for model predictions on sampled video

frames from two videos were recorded for this methodology. The frames were sampled using

evenly spaced time intervals. A prediction is deemed a true positive if the bounded area contains

at least 50% of the ground truth area for that damage, which means that, as discussed previously,

the threshold set for this model was 0.5. If the bounded area covers less than 50% of the

ground truth area or there is no relevant damage at the bounded location, the prediction becomes

a false positive. When a prediction is not made for a relevant damage instance, this is a false negative. Figures 15 - 22 demonstrate the results for this methodology. The

analysis for this model as well as the comparison of the results with results from other models

are discussed in Chapter 5.
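A sketch of the true positive / false positive / false negative bookkeeping described above is given below. It is an illustrative simplification of the stated 50% ground-truth-coverage rule; the box format and matching strategy are assumptions, not the exact evaluation script used here.

```python
def coverage(pred, gt):
    """Fraction of the ground-truth box covered by the prediction.
    Boxes are (x1, y1, x2, y2) in pixel coordinates."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / gt_area if gt_area > 0 else 0.0

def classify_frame(predictions, ground_truths, threshold=0.5):
    """Count true positives, false positives, and false negatives for one frame."""
    tp = fp = 0
    matched = set()
    for pred in predictions:
        hits = [i for i, gt in enumerate(ground_truths) if coverage(pred, gt) >= threshold]
        if hits:
            tp += 1          # prediction covers at least 50% of some ground-truth damage
            matched.update(hits)
        else:
            fp += 1          # no relevant damage at the bounded location
    fn = len(ground_truths) - len(matched)   # damage instances with no prediction
    return tp, fp, fn
```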

Figure 15. Example 1 of results obtained using the CC model.

Figure 16. Example 2 of results obtained using the CC model.

Figure 17. Example 3 of results obtained using the CC model.

Figure 18. Example 4 of results obtained using the CC model.

Figure 19. Example 5 of results obtained using the CC model.

Figure 20. Example 6 of results obtained using the CC model.

Figure 21. Example 7 of results obtained using the CC model.

Figure 22. Example 8 of results obtained using the CC model.

CHAPTER 3
Convolutional Neural Network

3.1 Overview

The next methodology analyzed in this thesis is the Convolutional Neural Network

(CNN). The CNN is a type of Deep Learning algorithm that takes an input image, assigns

importance to each aspect inside that image, and can differentiate each aspect from one another.

The architecture behind the CNN is analogous to that of the connectivity pattern of Neurons in

the human brain, and it was inspired by the Visual Cortex. The goal of a CNN model is to reduce

the images into a form in which it is easier to process without losing features which are critical

for getting a good prediction [45]. The basic process of using a neural network starts with

building a model, training the network, and finally, testing the network on validation or on real-

world data. The training process takes in some labelled training data and gives a prediction for

that data. When the model prediction does not match the expected output, internal model weights

are adjusted through the process of backpropagation, which is a backwards traversal through the

network, updating each layer of weights along the way with the goal of changing values to

provide a better prediction in the next iteration. When enough high quality, representative

training data is used, the network is likely to provide high quality outputs which match what is

desired.

Convolutional neural networks use all of these principles and apply them to image

frames. The process of convolution (Fig. 23 and Fig. 24) involves using kernels, which are small

square matrices with specific values. These kernels are multiplied by every area of an image’s

pixel value sequentially to create an output matrix called a “feature map.” Typically, many

different kernels are used to create many feature maps for each image frame. These feature maps

undergo additional processing such as padding and pooling, which aims to reduce the size of the

representation while losing as little accuracy as possible and further iterations of convolution to

strengthen the feature representation. The main goal of the convolution operation is to take the

input image and extract high-level features, such as edges, from it. In general, CNNs are not limited to only one convolutional layer. Usually, the first CNN layer is responsible for capturing low-level features like colors, gradient orientation, and edges, among others. By adding more layers, the general architecture adapts to high-level features as well, providing a full network with a holistic understanding of the images in the dataset, similar to human perception [45].
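As a minimal illustration of the convolution operation described above, the sketch below slides a small kernel across a grayscale image to produce a feature map. The kernel values and the random stand-in image are arbitrary examples.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and return
    the resulting feature map."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

# A simple vertical-edge kernel applied to a random 8x8 stand-in image.
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])
feature_map = convolve2d(np.random.rand(8, 8), edge_kernel)  # shape (6, 6)
```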

Figure 23. Convolutional Neural Network process example using kernels.

Figure 24. Example of Convolved Feature.

When the convolutional process is complete, the resulting feature maps are flattened and

inputted as vectors into what is essentially a standard neural network for classification purposes,

where the actual prediction is made (Fig. 25 and Fig. 26).

Figure 25. First example of Convolutional Neural Network model [46].

Figure 26. Second example of Convolutional Neural Network model [46].

3.2 Faster R-CNN model

3.2.1 Background Theory

First introduced in 2014 by a group of researchers at UC Berkeley, the region-based

convolutional neural network (R-CNN) was able to detect eighty different types of objects in

images. Compared to the general CNN model observed before, the main contribution of the R-CNN is extracting the features based on a CNN. However, this method had several

drawbacks. For example, it is a multi-stage model, where each stage is an independent

component. Thus, it cannot be trained end-to-end, and it caches the extracted features from the

pre-trained CNN on the disk to later train the SVMs. This process requires hundreds of gigabytes

of storage [46].

To fix these issues, the Fast R-CNN method was proposed. Developed by Ross Girshick,

this methodology solves several issues observed before. For example, compared to R-CNN,

which has multiple stages (region proposal generation, feature extraction, and classification

using SVM), Fast R-CNN builds a network that has only a single stage, and it also shares computations, such as convolutional layer calculations, across all proposals rather than doing the calculations for each proposal independently. This is done by using the new ROI Pooling layer, which makes Fast R-CNN faster than the original R-CNN [46].

Lastly, the Faster R-CNN, which is the model used in this thesis, was introduced as an extension of the Fast R-CNN model. As the name implies, this model is faster than both of the previous models. This speed comes from the region proposal network (RPN) introduced by this methodology, which is a fully convolutional network that creates proposals with various scales and aspect ratios. It introduces attention to the neural network, which means that it tells the model exactly where to look inside a frame.

3.2.2 Methodology

The Faster R-CNN methodology follows a set pattern. It first generates the region

proposals using the RPN. It then extracts a fixed-length feature vector from each region using

the ROI pooling layer. After that, all the extracted feature vectors are classified using the Fast R-

CNN approach. Lastly, the class scores of the detected objects and the bounding boxes around

each object are presented. An example of a Faster R-CNN model can be observed in Figure 27.

Figure 27. Example of Faster R-CNN methodology [47].

In this study, the TensorFlow Object Detection API for Python is

utilized. This API features a suite of convolutional neural network models designed for object

detection. All models in the TensorFlow model zoo come pre-trained on the COCO 2017 dataset

as a starting point but can simply be re-trained on appropriate data to fit any object detection

task. As discussed, the Faster R-CNN Inception ResNet V2 640x640p model is used and re-

trained on the ISBDA dataset. More discussion about the dataset and training data appears in the

following section. This model is one of the more accurate models in the model zoo, scoring a mean average precision of 37.7%. This accuracy comes at the cost of a fairly high

processing time of 206 milliseconds per frame. If this system required immediate analysis of

UAV footage, it could lead to a delay, but in a post-processing scenario, it is not

critical. Additional Python libraries used for this study include OpenCV for video reading and

writing as well as NumPy for image format manipulation.

The object detection setup is similar to the one used in the previous methodology (CC). The model prediction consists of three parts: the bounding boxes, the corresponding class

label, and the corresponding confidence ratings. The bounding box is the area of interest for our

model, a range from X1 to X2 and Y1 to Y2 in pixel coordinates. Ideally this box perfectly

surrounds the object to be detected. Any given image can yield anywhere from zero prediction

boxes to as many as necessary. Along with each box comes the class label, which is the

description of what object the bounding box represents as well as the confidence ratings. Each

bounding box will get a percentage rating for every possible object class. These classes are tied

together because the chosen class label is always the class with the highest confidence rating for

each bounding box.
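A hedged sketch of how these predictions could be obtained with the TensorFlow Object Detection API is shown below. The model path, file names, and the 0.5 score cutoff are hypothetical, and the exact output keys depend on the exported model.

```python
import cv2
import numpy as np
import tensorflow as tf

# Hypothetical path to the re-trained Faster R-CNN Inception ResNet V2 SavedModel.
detect_fn = tf.saved_model.load("exported_model/saved_model")

frame = cv2.imread("sampled_frame.png")
input_tensor = tf.convert_to_tensor(frame[np.newaxis, ...], dtype=tf.uint8)
detections = detect_fn(input_tensor)

boxes = detections["detection_boxes"][0].numpy()     # normalized [ymin, xmin, ymax, xmax]
classes = detections["detection_classes"][0].numpy().astype(int)
scores = detections["detection_scores"][0].numpy()

h, w = frame.shape[:2]
for box, cls, score in zip(boxes, classes, scores):
    if score < 0.5:          # illustrative confidence cutoff
        continue
    ymin, xmin, ymax, xmax = box
    cv2.rectangle(frame, (int(xmin * w), int(ymin * h)),
                  (int(xmax * w), int(ymax * h)), (0, 0, 255), 2)
cv2.imwrite("cnn_detections.png", frame)
```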

3.3 Training

For this model, the training data comes from the Instance Segmentation in

Building Damage Assessment (ISBDA) dataset [48]. This dataset consists of 1030 total images,

908 of which were selected to be used in this study. Segmentation annotations, damage bounding

box annotations, and house bounding box annotations are provided with the dataset, but these

annotations were not used in this study. Instead, these images were re-annotated to bound each

damage instance. Instances were labelled into three distinct classes: “Light,” “Medium,” and

"Heavy," corresponding to the level of damage present. "Light" damage may refer to single or patch shingle damage, small debris, water damage, major discoloration, or slight bending of metal

roofs. “Medium” damage may refer to exposed wooden portions or open areas of the structure

without collapse, major debris, or significant bending of metal roofs. “Heavy” damage may refer

to structural collapse of the roof, complete removal of the roof, or destruction of the building as a

whole. The trained model was then tested on the Hurricane Sally-UWF dataset, which was developed for the purpose of this research. The Sally-UWF dataset uses the same image frames discussed in the previous methodology. Examples of images obtained from the ISBDA dataset can be observed in Figures 28 - 32.
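For illustration, a simple mapping between predicted class IDs and the three damage classes could look like the sketch below. The specific ID numbers are an assumption, since the actual label map used during re-training is not reproduced here.

```python
# Hypothetical mapping between numeric class IDs and the three damage classes
# used when re-annotating the ISBDA images.
DAMAGE_CLASSES = {
    1: "Light",    # shingle or patch damage, small debris, discoloration
    2: "Medium",   # exposed wood or open areas, major debris
    3: "Heavy",    # collapse or complete removal of the roof
}

def label_for(class_id):
    """Return the damage class name for a predicted class ID."""
    return DAMAGE_CLASSES.get(int(class_id), "Unknown")
```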

Figure 28. Example 1 of Dataset image [48]. (Sample obtained from the ISBDA open source dataset [48]).

Figure 29. Example 2 of Dataset image [48]. (Sample obtained from the ISBDA open source dataset [48]).

Figure 30. Example 3 of Dataset image [48]. (Sample obtained from the ISBDA open source dataset [48]).

Figure 31. Example 4 of Dataset image [48]. (Sample obtained from the ISBDA open source dataset [48]).

Figure 32. Example 5 of Dataset image [48]. (Sample obtained from the ISBDA open source dataset [48]).

3.4 Results

As it was with the previous methodology, for this model, the main focus will be

on the detection accuracy aspect of the proposed pipeline. Two different videos including

footage of two different buildings on the UWF campus with high volumes of damage instances

were identified, the Argo Hall and the Martin Hall. Both roofs have similar types of damage.

As before, in the analysis, the true positive, false positive, and false negative rates for

model predictions on the sampled video frames from two videos were recorded for this

methodology. The frames were sampled using evenly spaced time intervals. A prediction is

deemed a true positive if the bounded area contains at least 50% of the ground truth area for that

damage. If the bounded area covers less than 50% of the ground truth area or there is no relevant

damage at the bounded location, the prediction becomes a false positive. When a prediction is not made for a relevant damage instance, the prediction is a false negative. Figures 33 - 40

demonstrate the results for this methodology. The analysis for this model as well as the

comparison of the results, with results from other models, are discussed in Chapter 5.

Figure 33. Example 1 of results using the CNN methodology.

Figure 34. Example 2 of results using the CNN methodology.

Figure 35. Example 3 of results using the CNN methodology.

Figure 36. Example 4 of results using the CNN methodology.

Figure 37. Example 5 of results using the CNN methodology.

Figure 38. Example 6 of results using the CNN methodology.

Figure 39. Example 7 of results using the CNN methodology.

Figure 40. Example 8 of results using the CNN methodology.

CHAPTER 4
Visual Question Answering

4.1 Overview

The next methodology to be analyzed in this study is the Visual Question

Answering (VQA). Visual Question Answering is a new concept in today's computer vision literature. Introduced in the past decade, VQA is a new area in which an AI is designed to answer questions asked by the user. Different from previous concepts, this system must demonstrate a more profound knowledge of images. It has to, after being questioned by the user in real time, run an analysis and answer completely different questions based on an image. With that, the user does not have to predetermine what the program will be looking for in an image. Using this algorithm, the user can simply ask questions about an image, and the program will run an analysis based on the user's question and build an answer upon it, finding the object or person referred to in the question.

A VQA methodology integrated with a UAV can be fundamental to the advancement of damage identification and assessment in the case of natural disasters, such as hurricanes. Since aiding damaged areas is an activity that is heavily dependent on real-time evaluation and estimation, the introduction of a VQA model can prove to be essential when dealing with high-risk situations. The VQA model is considered a complicated, multimodal research problem in which the aim is to answer an image-specific question [49] - [51]. Visual Question Answering can be considered a comprehension task that differentiates itself from other types of tasks, like the identification of images. Since the VQA model needs to have a high-level understanding of the attributes of an image, and it also has to be able to find all the relevant objects based on natural language questions, it can prove to be an important aspect in

the support for damage detection and identification after natural disasters. An example of the

VQA methodology can be observed in Figure 41.

When thinking about what is essential after a hurricane passes by, some of the

first questions that come to mind are "Is there anyone in the area?" or "How many houses were

destroyed,” among other questions. Being able to answer these questions in real-time is one of

the many benefits provided by this methodology. Since the success of this model is heavily

dependent on the data collection, pairing it with a UAV, which reduces the risk of unnecessary injuries by allowing first responders to avoid surveying the area themselves, can facilitate the damage assessment task.

For this purpose, as discussed previously, the Sally-UWF dataset is introduced,

obtained using footage of the UWF campus, which was recorded following Hurricane Sally using a UAV with an attached camera. The same videos used for the previous methodology are also

used in this one. Overall, two videos with high damage instance counts were selected and used

for testing. For each video, 1000 frames were selected to be part of the dataset. In total, 2000

images, 4000 training questions, and 1000 training annotations were used to compose this

dataset.

Figure 41. VQA Example.

4.2 Annotations

For the annotations, this study based its approach on the VQA API, introduced on the visualqa website [52]. For this approach, a few requirements need to be met in order to have a working model. First, there is the data collection, in which images need to be collected in order to train the model. All images were collected following Hurricane Sally using a UAV with an attached camera. Next, there is the question section, in which questions need to be inputted and paired with an image. Last comes the annotation section, in which an answer is introduced and put together with both an image and a question. To facilitate this process, identification numbers (IDs) are provided for all questions, answers, and images in the dataset. This process is

the standard approach for performing a visual question answering methodology. The image is processed first; then the question is processed. After that, both features, from the image and the question, are combined, and probabilities are assigned to each possible answer. A simple block

diagram demonstrating the process for a VQA model can be seen in Figure 42. The JSON file

format that the annotations were based on, which is required to be filled for this approach, can be

observed in Figures 43 and 44.
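A hedged sketch of what the question and annotation files for the Sally-UWF dataset could look like is given below. The field names follow the general layout of the VQA API [52], but the IDs, answers, and file names are hypothetical examples.

```python
import json

# Every image, question, and answer is linked through numeric IDs.
questions = {
    "questions": [
        {"image_id": 1, "question_id": 10, "question": "Is there any damage on this image?"},
        {"image_id": 1, "question_id": 11, "question": "How much damage can be seen?"},
    ]
}
annotations = {
    "annotations": [
        {"image_id": 1, "question_id": 10, "multiple_choice_answer": "yes",
         "answers": [{"answer": "yes", "answer_id": 1}]},
        {"image_id": 1, "question_id": 11, "multiple_choice_answer": "heavy",
         "answers": [{"answer": "heavy", "answer_id": 1}]},
    ]
}

with open("sally_uwf_questions.json", "w") as f:
    json.dump(questions, f, indent=2)
with open("sally_uwf_annotations.json", "w") as f:
    json.dump(annotations, f, indent=2)
```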

Figure 42. Simple block diagram demonstrating the VQA process.

Figure 43. Annotations for JSON file 1 [52]. Figure 44. Annotations for JSON file 2 [52].

4.3 CNN + Bag of Words (BoW)

4.3.1 Overview

There are many different ways to approach the Visual Question Answering model, as

seen before in this research. The first one focused on in this study is the one that uses the

Convolutional Neural Network methodology paired with the Bag of Words. To be able to answer

open-ended questions, it is necessary to combine both visual and language understanding. With

that in mind, the most common approach to this problem is to use two different types of

methodologies, one to do the understanding and analysis of images (which is the visual aspect),

and one to do the analysis and understanding of the language (which is the question and answering aspect). In summary, the VQA model needs to be able to observe and understand what

is being displayed in an image, in order to effectively give an appropriate answer to what is being

discussed.

4.3.2 CNN Model

Because of this, for this study, the approach that seems to be the best when trying to perform VQA on a fixed dataset, like the Sally-UWF dataset, is the CNN + Bag of Words. Starting on the CNN side, and briefly going over what has been discussed in this thesis before, this methodology is mainly used to do the analysis and classification of an image. In layman's terms, its goal is to look at an image with automobiles and classify which ones are cars and which ones are motorcycles. Taking a more in-depth look, CNNs can be considered neural networks with a set of filters, also known as convolutional layers. These layers consist of a set of filters responsible for producing an output image; they do so by taking an input image and passing it through the filters, producing the output. This process is the same as the one discussed in the

CNN chapter, in which the convolution involves using kernels. As seen before, these kernels are

multiplied by every area of an image’s pixel value sequentially to create an output matrix called

a “feature map.” These feature maps undergo additional processing such as padding and pooling,

which aims to reduce the size of the representation while losing as little accuracy as possible, and

further iterations of convolution to strengthen the feature representation. One example of how the

flow of a CNN model program is created and conducted can be observed in Figure 45. In this

image, we can see that the image is passed through many different stages until it produces its

final weights. For this example, the additional processing stages discussed before, such as padding, pooling, and dense layers, can be observed.

Figure 45. Example of flow chart for the convolutional neural network python model.

4.3.3 Bag of Words (BoW) Model

Next in this methodology is a model able to process the question. For this scenario, the BoW is used. The bag of words model is a simple and commonly used way of representing text data in machine learning experiments. It is a representation of text that describes the occurrence of words within a document. The main things it involves are a vocabulary of known words and a measure of the presence of those known words. It has this name because the order and structure of the words inside the document are unimportant; the text can therefore be treated simply as a "bag" of words [53]. Since this methodology can be as simple or as complex as an application requires, it is well suited to open-ended question scenarios.

Looking specifically at this scenario, since this study uses a relatively small dataset (the Sally-UWF dataset) compared to others, the BoW can be considered a great asset. One of the main limitations of the BoW is the length of its vocabulary; however, since this study deals only with a small, fixed answer set in which one of the answers is always the correct one, the model can prove to be very effective.

Looking a bit more closely at the bag of words, this model turns arbitrary text into fixed-length vectors by counting how many times each word appears. This process is also known as vectorization [54]. For example, a few specific sentences from our dataset were vectorized, such as "Is there any damage on this image?" or "How much damage can be seen?". From these questions, a vocabulary can be determined: to create it, each distinct word is simply listed by itself (is, there, any, damage, on, this, image, how, much, can, be, seen). After that, the dataset is vectorized by assigning a count to each word, meaning that every time a word appears in a sentence, its count goes up. Table 1 shows an example of how this works.

Table 1. Example of the vectorization of a Bag of Words application.

Thus, for each question, a fixed-length vector is created: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0] and [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1].
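A minimal sketch of this vectorization in plain Python, using the twelve-word vocabulary listed above, reproduces these two vectors; it is meant only as an illustration of the counting step, not as the exact code used in this study.

# Minimal bag-of-words vectorization sketch over the vocabulary listed above.
import re

vocabulary = ["is", "there", "any", "damage", "on", "this",
              "image", "how", "much", "can", "be", "seen"]

def vectorize(question):
    # Lower-case, strip punctuation, and count occurrences of each vocabulary word.
    words = re.findall(r"[a-z]+", question.lower())
    return [words.count(word) for word in vocabulary]

print(vectorize("Is there any damage on this image?"))  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(vectorize("How much damage can be seen?"))         # [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]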

After creating a fixed-length vector for each question, the vectors are used as input to a feedforward neural network. A feedforward neural network takes vectorized inputs, multiplies them by specific weights, and produces an output [54].

Figure 46. Example of a feedforward neural network.

Basically, it takes vector inputs v1 and v2 and multiplies each vector by a weight:

v1 -> v1 * w1,   v2 -> v2 * w2        (2)

Next, it adds the weighted inputs together with a bias b:

(v1 * w1) + (v2 * w2) + b        (3)

Lastly, the result of the addition is passed through an activation function f:

y = f(v1 * w1 + v2 * w2 + b)        (4)
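As a tiny worked example of Equations (2)-(4), the sketch below uses assumed inputs, weights, and bias, and chooses the sigmoid purely as an illustrative activation function; none of these values are taken from the thesis model.

# Worked example of equations (2)-(4) with assumed values and an assumed activation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v1, v2 = 2.0, 3.0      # example inputs
w1, w2 = 0.5, -0.25    # assumed weights
b = 0.1                # assumed bias

weighted_sum = v1 * w1 + v2 * w2 + b   # equation (3): 2*0.5 + 3*(-0.25) + 0.1 = 0.35
y = sigmoid(weighted_sum)              # equation (4): f(0.35) is roughly 0.587
print(y)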

For this study, since a relatively simple question dataset is being used, the BoW vectors obtained from our model are used as the input to this FNN and passed through two fully connected neural network layers, meaning that every node is connected to every output from the previous layer, to produce an output [54].

After that, the outputs of the CNN model and the BoW model are combined and merged together. To obtain the final results for this model, the softmax function is used. This function turns the output values into probabilities, allowing us to quantify how much certainty we have in each answer. The softmax function produces a probability distribution: its outputs lie between 0 and 1 and add up to 1. The formula for this function is given in Equation (5).


s(x_i) = e^{x_i} / \sum_{j=1}^{n} e^{x_j}        (5)

where x_i can be any of the values x_1 through x_n.
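A minimal numerical sketch of Equation (5) is shown below; the raw scores are assumed example values (one per answer class), included only to show that the outputs lie between 0 and 1 and sum to 1.

# Softmax of equation (5): turns raw scores into probabilities that sum to 1.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max improves numerical stability
    return e / e.sum()

scores = np.array([2.0, 0.5, 0.1, -1.0, -1.2, -2.0, 0.3])  # assumed raw outputs, one per answer
probs = softmax(scores)
print(probs, probs.sum())   # probabilities between 0 and 1, summing to 1.0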

4.3.4 Results

As with the previous methodologies, the main focus for this model is the detection accuracy aspect of the proposed pipeline. The same image dataset used in the previous chapters is also used in this chapter. In total, this model used 1000 images and 30000 questions, split into training and testing sets. The questions have 7 possible answers: Yes, No, One, Two, Three, Four, and More. Figures 47-52 show example results for this methodology. The analysis of this model, as well as the comparison of its results with those of the other models, is discussed in Chapter 5.
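For concreteness, a minimal sketch of how the CNN image branch and the BoW question branch can be merged and classified over the seven answers is given below. The layer sizes, input resolution, and the element-wise multiplication chosen as the merge operation are assumptions for illustration; the thesis only states that the two branches are combined before the softmax.

# Sketch of a CNN + BoW VQA model: image features and question features are
# merged and classified over the 7 possible answers (assumed layer sizes).
import tensorflow as tf
from tensorflow.keras import layers, Model

num_answers = 7    # Yes, No, One, Two, Three, Four, More
vocab_size = 12    # length of the BoW vocabulary in the earlier example

image_in = layers.Input(shape=(64, 64, 3))                         # assumed input resolution
x = layers.Conv2D(16, 3, activation="relu", padding="same")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
image_feat = layers.Dense(32, activation="relu")(x)                # image feature vector

question_in = layers.Input(shape=(vocab_size,))                    # BoW vector of the question
q = layers.Dense(32, activation="relu")(question_in)
question_feat = layers.Dense(32, activation="relu")(q)             # two fully connected layers

merged = layers.Multiply()([image_feat, question_feat])            # element-wise merge of both branches
answer_probs = layers.Dense(num_answers, activation="softmax")(merged)

model = Model(inputs=[image_in, question_in], outputs=answer_probs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])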

Figure 47. Example 1 of VQA result image.

Figure 48. Example 2 of VQA result image.

Figure 49. Example 3 of VQA result image.

Figure 50. Example 4 of VQA result image.

Figure 51. Example 5 of VQA result image.

Figure 52. Example 6 of VQA result image.
CHAPTER 5
Analyses and Comparison

5.1 Cascade Classifier

In this chapter, the analyses and comparison of all methodologies seen in the previous chapters are conducted. Starting with the Cascade Classifier, which was the first

methodology implemented in this thesis, this study can conclude that the proposed Cascade

Classifier algorithm successfully identifies roof damages at both Martin Hall and Argo Hall,

which are the two UWF buildings analyzed in this study. As discussed previously, in this study

the main focus was on the detection accuracy aspect of the proposed pipeline. For the analysis,

the true positive, false positive and false negative rates for model predictions on sampled video

frames from two videos were recorded for this methodology. The frames were sampled using

evenly spaced time intervals. As a reminder of what was introduced during Chapter 2, a

prediction is considered to be a true positive if the bounded area contains at least 50% of the

ground truth area for that damage. If the bounded area covers less than 50% of the ground truth

area, or there is no relevant damage at the bounded location, the prediction becomes a false

positive. When a prediction is not made for a relevant damage instance, this is a false negative.

Table 2 summarizes the average damage detection rate and the average number of false positive detections. The observed rates were then used to calculate the precision and recall values

of the model for those frames. Precision is calculated as the number of true positives over the

number of true positives and false positives. It is a measure of how valid the predictions made by

the model are. False positives reduce the precision of the model. Recall is calculated as the

number of true positives over the number of true positives and false negatives. It is a measure of

the completeness of the model predictions, so false negatives reduce the recall. In a typical analysis, a precision-recall curve is created with different parameter (threshold) values for a single scene, and it provides a common basis for comparing different algorithms. As the threshold changes, precision is expected to decrease while recall increases. In this study, however, the best parameters are kept fixed, and precision and recall are plotted for consecutive frames of the video footage. The precision-recall plots are depicted in Figure 53 and Figure 54, for the Argo Hall video and the Martin Hall video respectively. The expectation from the precision-recall plots is that both precision and recall should be close to 1 in most of the frames.
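As a small sanity check of these definitions, precision and recall for a frame can be computed directly from the true positive, false positive, and false negative counts. The counts in the sketch below are assumed, for illustration only.

# Precision and recall from per-frame detection counts (illustrative counts).
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0   # validity of the predictions
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0      # completeness of the predictions
    return precision, recall

# Example: 4 true positives, 1 false positive, 2 false negatives in one frame.
print(precision_recall(4, 1, 2))   # (0.8, ~0.667)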

Table 2. Damage Detection Contents with Cascade Classifier.

                Average detection rate    Average number of false positive detections
Argo Hall               41%                                 0.13
Martin Hall             41%                                 0.04

According to the results, the proposed CC-based detection algorithm has an average detection rate of around 41% for both buildings. However, it has a small false detection average (0.13), which demonstrates the accuracy of the algorithm. The reason the average detection rate is around 41% is the motion of the UAV and the gimbal. Each frame of the footage was examined individually in this study, and it was observed that rapid motion of the UAV and/or gimbal results in motion blur in the frame, which in turn degrades the performance of the damage detection algorithm. When the UAV hovers over the roof and the gimbal has a constant motion, or the UAV has a constant motion and the gimbal is fixed, the detection performance increases. To prove that, parts of the video with minimal to no motion blur were identified, and the average detection rates for those parts are 48% and 55% for Argo Hall and Martin Hall, respectively (Table 3). More importantly, in the results for Martin Hall there are no false positive detections, and there is only one false positive detection in the entire series of frames for Argo Hall, for those selected sections. The main reasons for the results being in this range are the parameters used for training as well as the amount of data collected for this specific application. By using more data, it is possible that the detection rate results would increase.¹¹

Table 3. Damage Detection Contents with Cascade Classifier – Selected Sections.

                Average detection rate    Average number of false positive detections
Argo Hall               48%                                 0.06
Martin Hall             55%                                 0.00

¹¹ Material regarding the Convolutional Neural Network and Cascade Classifier methodologies was also presented at the AIAA SCITECH 2022 Conference, held in San Diego, January 3-7, 2022.

Figure 53. Precision-Recall for Argo Hall using the Cascade Classifier.

Figure 54. Precision-Recall for Martin Hall using the Cascade Classifier.

5.2 Convolutional Neural Network

The next methodology to be analyzed in this chapter is the Convolutional Neural

Network. Following the same procedure as the Cascade Classifier, the main focus for the CNN

analysis was also on the detection accuracy aspect of the proposed pipeline. For the analysis, the

true positive, false positive and false negative rates for model predictions on sampled video

frames from two videos were recorded for this model as well. According to the results observed

in Table 4, the proposed CNN-based detection algorithm successfully detects roof damages. The small false detection average also shows that the proposed algorithm is accurate. Additionally, the precision and recall plots for both buildings demonstrate a high level of performance. The reason the average detection rate is around 50% is the motion of the UAV and the gimbal. As with the previous methodology, each frame of the footage was examined individually, and it was observed that rapid motion of the UAV results in motion blur in the frame, leading to performance degradation of the damage detection algorithm. As observed before, when the UAV hovers over the roof and the gimbal has a constant motion, or the UAV has a constant motion and the gimbal is fixed, the detection performance increases. To prove that, parts of the video with minimal to no motion blur were identified, and the average detection rates for those parts are 85% and 88% for Argo Hall and Martin Hall, respectively (Table 5). Also, in the results for Argo Hall there are no false positive detections, and there is only one false positive detection in the entire series of frames for Martin Hall, for those selected sections.¹¹ The precision-recall plots are depicted in Figure 55 and Figure 56, for the Martin Hall video and the Argo Hall video respectively. The expectation from the precision-recall plots is that both precision and recall should be close to 1 in most of the frames.

Table 4. Damage Detection Contents with CNN.

                Average detection rate    Average number of false positive detections
Argo Hall               55%                                 0.15
Martin Hall             48%                                 0.29

Table 5. Damage Detection Contents with CNN – Selected Sections.

                Average detection rate    Average number of false positive detections
Argo Hall               85%                                 0.00
Martin Hall             88%                                 0.08

Figure 55. Precision-Recall for Martin Hall using the CNN.

Figure 56. Precision-Recall for Argo Hall using the CNN.

5.3 Visual Question Answering

The last methodology to be analyzed in this paper is the Visual Question Answering. As discussed previously, the same dataset was used, consisting of frames obtained from two videos of two buildings on the UWF campus with high volumes of damage instances, Argo Hall and Martin Hall.

However, unlike the previous methodologies, a few different questions were posed to the model in order to identify the damage instances inside each frame. The main questions used in this study asked whether there are any damage instances in the image being analyzed, how many buildings were affected, and how many damage instances can be identified.

Overall, this VQA algorithm, consisting of a CNN + BoW model, obtained a 92% validation accuracy. As can be observed in Table 6, the overall accuracy of this method when looking for any type of damage at Argo Hall was 92%, while for damage occurrences at Martin Hall it produced an overall accuracy of 93%. Because of that, this study can conclude that this methodology is able to successfully understand and answer questions regarding the identification of damage caused by Hurricane Sally to the roofs of UWF's buildings in 2020.

Table 6. Validation of Results for VQA methodology.

                Overall Accuracy    Accuracy for Yes/No    Accuracy for Count
Argo Hall             92%                   94%                    90%
Martin Hall           93%                   96%                    89%

5.4 Comparison

Looking at all the data obtained in this study, it is possible to observe that the methodology with the best performance in identifying damage caused by Hurricane Sally to UWF's buildings in 2020 was the Visual Question Answering methodology. Considering only the overall accuracy of each method, the VQA had an accuracy of 92%, with the CNN a close second at 88% and the CC at only 51%. With that, it is possible to conclude that the models performed as expected. The main reason for the CC having a lower accuracy level compared to the other models was training. By acquiring a larger dataset of positive and negative images, as well as increasing the number of training stages, it is likely that the final accuracy of this methodology would also increase significantly.

On the other hand, the CNN and VQA methodologies had similar results. This is to be expected: the main reason the CNN model did not have a higher overall accuracy was that it failed to identify specific instances of damage when the image was blurred. Most of this was mitigated in the VQA methodology because of the specific questions being asked. Since the VQA model was mainly asked yes/no questions, or whether there were damaged buildings, it is harder for a single damage instance to be missed, because it would simply be counted as part of the whole building.

Also, when comparing all the models, both the CC and the CNN provide an easier-to-follow damage detection result, which can prove useful in stressful scenarios such as those involving natural disasters. Both models simply point out all the damage they can identify while the UAV is flying over the affected area. Because of that, when looking only for any instance of damage, these methodologies would be better. Between them, the CNN method performed better than the cascade classifier method at detecting damage caused by Hurricane Sally in 2020. However, when looking for more details inside an image, the VQA model can provide more information than the other models. Because a language model is paired with the image classification, this methodology has many more options when analyzing an image.

CHAPTER 6
Conclusion and Future Work

In this study, three damage detection methods were studied: the Cascade Classifier (CC), the Convolutional Neural Network (CNN) based method, and the Visual Question Answering model. The first model analyzed was the CC, for which an algorithm was developed to detect and identify roof damage using aerial footage. Next, the CNN-based detection model was analyzed, and, similarly to the CC model, an algorithm to detect, identify, and locate roof damage in aerial footage was developed. Finally, a VQA model was designed by combining an image classification model (CNN) with a language model (BoW), with the goal of not only observing damage but also being able to answer different questions that might arise in post-disaster scenarios. To demonstrate their performance, the algorithms for the three methodologies were tested on videos recorded from an Unmanned Aerial Vehicle (UAV) flying over the University of West Florida (UWF) campus after Hurricane Sally in 2020. For all methods, a table demonstrating the overall accuracy of each method in different scenarios was presented. As a conclusion of this study, it was possible to observe that the methodology with the best performance in identifying damage caused by Hurricane Sally to UWF's buildings in 2020 was the Visual Question Answering methodology.

For future work, in order to improve the accuracy of the CC methodology, more training data from a larger dataset would be required. Also, the custom dataset created for this study, the Sally-UWF dataset, was mainly focused on roof damage caused by a natural disaster. To expand these methodologies to any type of post-disaster damage detection caused by a natural phenomenon, for example instances of tree damage and other specific scenarios, a larger dataset containing images with more varied types of damage will be required.

REFERENCES
[1] “Record hurricane season and major wildfires – the natural disaster figures for 2020,”
MunichRE, URL: https://www.munichre.com/en/company/media-relations/media-information-
and-corporate-news/media-information/2021/2020-natural-disasters-balance.html (visited on
March 16, 2021).
[2] K. Dapena, “The rising costs of hurricanes,” Wall Street Journal, URL:
https://www.wsj.com/articles/the-rising-costs-of-hurricanes-1538222400 (visited on March 16,
2021).
[3] Fina, L., Mishra, B., and Sevil, H. E., “Design of a Nested Saturation Controller with
Improved Wind Disturbance Rejection for UAVs,” AIAA Scitech 2021 Forum, 2021, p. 1005.
[4] Sevil, H. E. and Dogan, A., “Fault Diagnosis in Air Data Sensors for Receiver
Aircraft in Aerial Refueling,” AIAA Journal of Guidance, Control and Dynamics, Vol. 38, No.
10, 2015, pp. 1959–1975.doi:10.2514/1.G000527.
[5] Lee, J. H., Sevil, H. E., Dogan, A., and Hullender, D., “Estimation of Receiver
Aircraft States and Wind Vectors in Aerial Refueling,” AIAA Journal of Guidance, Control and
Dynamics, Vol. 37, No. 1, 2014, pp. 265–276.
[6] Lee, J. H., Sevil, H. E., Dogan, A., and Hullender, D., “Estimation of Maneuvering
Aircraft States and Time-Varying Wind with Turbulence,” Aerospace Science and Technology,
Vol. 31, No. 1, 2013, pp. 87–98.
[7] Sevil, H. E. and Dogan, A., “Airdata-Sensor-based Relative Position Estimation for
Receiver Aircraft in Aerial Refueling,”Proc. of AIAA SciTech Forum and Exposition,
Grapevine, USA, 9-13 January 2017, AIAA 2017-1639.doi:10.2514/6.2017-1639.
[8] Sevil, H. E. and Dogan, A., “Airdata Sensor Fault Detection and Isolation for
Receiver Aircraft in Aerial Refueling,”Proc. of ASM 2013, AIAA Aerospace Sciences Meeting,
Grapevine, USA, 7-10 January 2013, AIAA 2013-0950.
[9] Lee, J. H., Sevil, H. E., Dogan, A., and Hullender, D., “Estimation of Receiver
Aircraft States and Wind Vectors in Aerial Refueling,” Proc. of GNC 2012, AIAA Guidance,
Navigation, and Control Conference, Minneapolis, USA, 13-16 August 2012, AIAA 2012-4533.
[10] Lee, J. H., Sevil, H. E., Dogan, A., and Hullender, D., “Estimation of Maneuvering
Aircraft States and Time-Varying Wind with Turbulence,” Proc. of GNC 2012, AIAA Guidance,
Navigation, and Control Conference, Minneapolis, USA, 13-16 August 2012, AIAA 2012-4532.

[11] Sevil, H. E. and Dogan, A., “False Fault Detection in Airdata Sensor due to Non
uniform Wind in Aerial Refueling,”Proc. of AFM 2011, AIAA Atmospheric Flight Mechanics
Conference, Portland, USA, 08-11 August 2011, AIAA 2011-6446.
[12] Sevil, H. E., Airdata Sensor Based Position Estimation and Fault Diagnosis in Aerial
Refueling , Phd dissertation, The University of Texas at Arlington, Arlington, TX, USA, 2013.
[13] Lundberg, C. L., Sevil, H. E., and Das, A., “A VisualSfM based Rapid 3-D
Modeling Framework using Swarm of UAVs,” 2018 International Conference on Unmanned
Aircraft Systems (ICUAS), IEEE, 2018, pp. 22–29.
[14] Youssef, T. A., Francia III, G. A., and Sevil, H. E., “Data Collection and Generation
for Radio Frequency Signal Security,”Advances in Security, Networks, and Internet of Things,
Springer, 2021, pp. 745–758.
[15] Lowande, R., Clevenger, A., Mahyari, A., and Sevil, H. E., “Analysis of Post-
Disaster Damage Detection using Aerial Footage from UWF Campus after Hurricane Sally,”
Proc. of International Conference on Image Processing, Computer Vision, Pattern Recognition
(IPCV’21), Las Vegas, USA, 26-29 July 2021.
[16] Das, A. N., Doelling, K., Lundberg, C., Sevil, H. E., and Lewis, F., “A Mixed
Reality Based Hybrid Swarm Control Architecture for Manned-Unmanned Teaming (MUM-T),”
Proc. of ASME 2017 International Mechanical Engineering Congress and Exposition
(IMECE2017), Tampa, USA, 3-9 November 2017, IMECE2017-72076.
[17] Sevil, H. E., “Anomaly Detection using Parity Space Approach in Team of UAVs
with Entropy based Distributed Behavior,” AIAA Scitech 2020 Forum, 2020, p. 1625.
[18] Das, A., Kol, P., Lundberg, C., Doelling, K., Sevil, H. E., and Lewis, F., “A Rapid
Situational Awareness Development Framework for Heterogeneous Manned-Unmanned Teams,”
NAECON 2018-IEEE National Aerospace and Electronics
[19] Youssef, T. A., III, G. F., and Sevil, H. E., “Data Collection and Generation for
Radio Frequency Signal Security,” Proc. of ESCS’20 - The 18th Int Conf on Embedded Systems,
Cyber-physical Systems, Las Vegas, USA, 27-30 July 2020.
[20] Haghshenas-Jaryani, M., Sevil, H. E., and Sun, L., “Navigation and Obstacle
Avoidance of Snake-Robot Guided by a Co-Robot UAV Visual Servoing,” Proc. of ASME 2020
Dynamic Systems and Control Conference (DSCC 2020), Pittsburgh, USA, 4-7 October 2020,
DSCC2020-3156.

[21] Sevil, H. E. and Dogan, A., “Investigation of Measurement Noise Effect on Wind
Field Estimation using Multiple UAVs,” Proc. of AIAA Scitech 2019 Forum, San Diego, USA,
7-11 January 2019, AIAA 2019-1601.
[22] Sevil, H. E., Dogan, A., Subbarao, K., and Huff, B., “Evaluation of Extant Computer
Vision Techniques for Detecting Intruder sUAS,” Proc. of 2017 International Conference on
Unmanned Aircraft Systems (ICUAS), Miami, USA, 13-16 June 2017, pp. 929–938.
doi:10.1109/ICUAS.2017.7991373.
[23] Ramani, A., Sevil, H. E., and Dogan, A., “Determining Intruder Aircraft Position
using Series of Stereoscopic 2-D Images,” Proc. of 2017 International Conference on Unmanned
Aircraft Systems (ICUAS), Miami, USA, 13-16 June 2017, pp. 902–911.
doi:10.1109/ICUAS.2017.7991384.
[24] Daskiran, O., Sevil, H. E., Dogan, A., and Huff, B., “UGV and UAV Cooperation
for Constructing Probabilistic Threat Exposure Map (PTEM),” Proc. of 15th AIAA Aviation
Technology, Integration, and Operations Conference, Dallas, USA, 22-26 June 2015, AIAA
2015-2740. doi:10.2514/6.2015-2740.
[25] A. Mohan, C. Papageorgiou, T. Poggio, "Example-based object detection in images
by components", Pattern Analysis and Machine Intelligence IEEE Transactions on, vol. 23, no.
4, pp. 349-361, 2001.
[26] A. Zirakchi, C. L. Lundberg, and H. E. Sevil, “Omni directional moving object
detection and tracking with virtual reality feedback,” in Dynamic Systems and Control
Conference, vol. 58288. American Society of Mechanical Engineers, 2017, p. V002T21A012.
[27] Yin, Tianwei, et al. “Center-Based 3D Object Detection and Tracking.” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, https://doi.org/10.1109/cvpr46437.2021.01161.

[28] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple
features,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition. CVPR 2001, vol. 1, 2001, pp. I–I.
[29] R. Lienhart, A. Kuranov, and V. Pisarevsky, “Empirical analysis of detection
cascades of boosted classifiers for rapid object detection,” in joint pattern recognition
symposium. Springer, 2003, pp. 297–304.

[30] X. Wang, M. Yang, S. Zhu, and Y. Lin, “Regionlets for generic object detection,” in
Proceedings of the IEEE international conference on computer vision, 2013, pp. 17–24.
[31] X. Zhu, J. Liang, and A. Hauptmann, “Msnet: A multilevel instance segmentation
network for natural disaster damage assessment in aerial videos,” in Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2023–2032.
[32] F. Nex, D. Duarte, A. Steenbeek, and N. Kerle, “Towards real-time building damage
mapping with low-cost uav solutions,” Remote sensing, vol. 11, no. 3, p. 287, 2019.
[33] Y. Pi, N. D. Nath, and A. H. Behzadan, “Disaster impact information retrieval using
deep learning object detection in crowdsourced drone footage,” in Proc., Int. Workshop on
Intelligent Computing in Engineering, 2020, pp. 134–143.
[34] Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality object
detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition,
2018, pp. 6154–6162.
[35] Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus,
“Simple baseline for visual question answering,” arXiv preprint arXiv:1512.02167, 2015.
[36] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C
Lawrence Zitnick, and Devi Parikh, “Visual question answering,” in ICCV, 2015.
[37] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram
Nevatia, “Abc-cnn: An attention based convolutional neural network for visual question
answering,” arXiv preprint arXiv:1511.05960, 2015.
[38] Kevin J Shih, Saurabh Singh, and Derek Hoiem, “Where to look: Focus regions for
visual question answering,” in Proceedings of the IEEE conference on computer vision and
pattern recognition, 2016, pp. 4613–4621.
[39] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola, “Stacked
attention networks for image question answering,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2016, pp. 21–29.
[40] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and
Marcus Rohrbach, “Multimodal compact bilinear pooling for visual question answering and
visual grounding,” in EMNLP, 2016.

[41] Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and
Byoung-Tak Zhang, “Hadamard product for low-rank bilinear pooling,” arXiv preprint
arXiv:1610.04325, 2016.
[42] Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao, “Multi-modal factorized bilinear
pooling with coattention learning for visual question answering,” in Proceedings of the IEEE
international conference on computer vision, 2017, pp. 1821–1830.
[43] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh, “Hierarchical question-
image co-attention for visual question answering,” in NeurIPS, 2016.
[44] “The World of Runescape.” The World of RuneScape, URL:
https://play.runescape.com/ (visited on March 27, 2021).
[45] Sarkar, Argho, and Maryam Rahnemoonfar. “VQA-Aid: Visual Question Answering
for Post-Disaster Damage Assessment and Analysis.” ArXiv.org, 19 June 2021,
https://arxiv.org/abs/2106.10548.
[46] Saha, Sumit. “A Comprehensive Guide to Convolutional Neural Networks - the eli5
Way.” Medium, Towards Data Science, 17 Dec. 2018, https://towardsdatascience.com/a-
comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53.
[47] Gad, Ahmed Fawzy. “Faster R-CNN Explained for Object Detection Tasks.”
Paperspace Blog, Paperspace Blog, 9 Apr. 2021, https://blog.paperspace.com/faster-r-cnn-
explained-object-detection/.
[48] Zhu, X., Liang, J., and Hauptmann, A., “Msnet: A multilevel instance segmentation network for natural disaster damage assessment in aerial videos,” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2023–2032.
[49] Agrawal, Aishwarya, et al. “VQA: Visual Question Answering.” ArXiv.org, 27 Oct.
2016, https://arxiv.org/abs/1505.00468.
[50] Goyal, Yash, et al. “Making the V in VQA Matter: Elevating the Role of Image
Understanding in Visual Question Answering.” ArXiv.org, 15 May 2017,
https://arxiv.org/abs/1612.00837.
[51] Kevin J. Shih, Saurabh Singh, Derek Hoiem; Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4613-4621.
[52] “Visualqa.” VQA: Visual Question Answering, URL:
https://visualqa.org/download.html (Visited on January 17th, 2022)

[53] Brownlee, Jason. “A Gentle Introduction to the Bag-of-Words Model.” Machine
Learning Mastery, 7 Aug. 2019, URL: https://machinelearningmastery.com/gentle-introduction-
bag-words-model/. (Visited on January 10th, 2022)
[54] Victor Zhou. “A Simple Explanation of the Bag-of-Words Model.” Victor Zhou,
Victor Zhou, 30 Nov. 2019, URL: https://victorzhou.com/blog/bag-of-words/. (Visited on January
10th, 2022)
[55] Victor Zhou. “Machine Learning for Beginners: An Introduction to Neural Networks.”
Victor Zhou, Victor Zhou, URL: https://victorzhou.com/blog/intro-to-neural-networks/ (Visited
on January 10th, 2022).
[56]“Feature Extraction Based on Haar Kernels: A Query Region.”, URL:
https://www.researchgate.net/figure/Feature-extraction-based-on-Haar-kernels-a-query-region-
divided-in-sliding-window_fig4_348473306. (Visited on January 14th, 2022)

