
Cloud Computed Machine Learning Based Real-Time Litter Detection using Micro-UAV Surveillance

Ashley Chung, Dong Young Kim, Ethan Kwok, Michael Ryan, Erika Tan
NJ Governor's School of Engineering and Technology
Piscataway, New Jersey

Ryan Gamadia, Corresponding Author
Rutgers School of Engineering
Piscataway, New Jersey

Abstract—Litter can remain undetected and uncollected for extended periods of time, leading to detrimental consequences for the environment. Current solutions to mitigating these effects focus on severe legal action directed towards offenders or on litter collection events, none of which are automated. Therefore, to reduce the amount of manual labor required by current solutions, this project aims to implement an automated micro-unmanned aerial vehicle (UAV) capable of real-time litter detection from UAV surveillance footage. The performances of five different algorithms (two classifiers and three detectors) were compared after training them on public images of litter on the Google Cloud Platform, to determine the strongest models to use in an ensemble method. Two ensemble models were tested, a custom-built ensemble and a bootstrap aggregating (bagging) ensemble; the bagging ensemble demonstrated a significant improvement in performance over any individual model.

I. INTRODUCTION

Undetected litter is a ubiquitous problem with negative implications for quality of life, the environment, and the economy. When municipal solid waste is improperly disposed of, it can become a hindrance not only in public areas, where it is most commonly found, but also in animal habitats. For example, litter containing harmful chemicals may contaminate the surrounding water and air, causing further habitat alteration by depleting levels of oxygen and light. In turn, this impedes the ability of habitats to support life, leading to a decline in species diversity [1]. Ingestion of and entanglement in improperly disposed contaminants also pose a considerable threat to wildlife. Unfortunately, unless these biohazards are identified and removed, they will take many years to decompose and will continue to disrupt the ecosystem [2]. In addition, up to 90% of forest fires are started by another commonly discarded item: cigarettes [3]. These fires not only damage the forest ecosystem but also cost millions of dollars in reparations.

Currently, most solutions to this problem are not automated; rather, they revolve around legislation or community efforts to manually remove litter. For instance, the punishment for minor infractions is a fine and an order to clean up litter or complete community service [4]. Additionally, many organizations, such as Keep America Beautiful, work to maintain public areas manually [5]. While these solutions have helped reduce the detrimental effects of littering, they are not feasible long-term solutions, as human surveillance is both time-consuming and dependent on manual labor. Thus, a more effective long-term solution would be an automated object detection system capable of detecting and locating litter solely from real-time transmission of remote UAV video.

II. EXPERIMENTAL PROCEDURE

A. Dataset Preparation

The images used to train and test the models were pooled from two main sources: the Trashnet Dataset from Gary Thung and Mindy Yang's final project for the Stanford University CS 229 Machine Learning course, and various litter images from Google Images. From the Trashnet Dataset, 2342 of 2527 litter images with a white background and a resolution of 512x384 were selected as useful for the final dataset. To further supplement the dataset, the tool Google Images Download was used to gather 566 additional litter images from various sources [6]. This ensured that the models would learn in a more versatile manner rather than only detecting litter on a white background with little noise. After the final dataset of 2825 images was formed, the images were separated into eight classes: bottle, can, cardboard, container, cup, paper, scrap, and wrapper. For training and testing, the classifiers could use only the Trashnet dataset, since they required images containing a single object, while the detectors were able to use the full dataset. To generate the annotations required by the detectors, the tool LabelImg was used to save annotations to XML files in PASCAL visual object class (VOC) format [7]. Both classifiers and detectors were trained and tested with an 80:20 training-to-testing split.
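As a concrete illustration of this pipeline, the split and the annotation parsing could be scripted as below. This is a sketch rather than the authors' actual tooling; the directory layout, file naming, and random seed are all assumptions.

```python
import glob
import random
import xml.etree.ElementTree as ET

random.seed(42)  # fixed seed is an assumption; the paper does not specify one

# Gather every image in the pooled dataset (directory layout assumed).
image_paths = sorted(glob.glob("dataset/*.jpg"))
random.shuffle(image_paths)

# 80:20 training-to-testing split, as used for all models in the paper.
split = int(0.8 * len(image_paths))
train_paths, test_paths = image_paths[:split], image_paths[split:]

def read_voc_boxes(xml_path):
    """Parse one LabelImg PASCAL VOC annotation file into (class, box) pairs."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text  # one of the eight litter classes
        bb = obj.find("bndbox")
        box = tuple(int(bb.find(k).text) for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, box))
    return boxes
```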
B. Convolutional Neural Network (CNN)

To implement the CNN model, transfer learning was used to further train the pre-trained VGG-16 model for Keras with additional layers. Due to its ability to apply kernels to the image vectors of multi-channeled images, the VGG-16 model was able to recognize important features in any image. Four more layers were added to adapt VGG-16 for litter classification: a Flatten layer, two Dense layers, and a Dropout layer. The Flatten layer converted the pooled feature maps into a continuous one-dimensional vector. The Dense layers applied a linear operation to the given input vector. The Dropout layer prevented the neural network from overfitting on the litter dataset by randomly eliminating units and the connections between layers. Furthermore, the learning rate was increased to 1 × 10⁻³ with the Adam optimizer for the first 60 iterations, and later decreased to 1 × 10⁻⁴ with the SGD optimizer for the next 180 iterations.
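The layer stack described above translates to only a few lines of Keras. The sketch below is a plausible reconstruction, not the authors' code: the VGG-16 base, the four added layers, and the two-stage Adam/SGD schedule come from the text, while the input size, Dense width, dropout rate, and the choice to freeze the base are assumptions.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD, Adam

# VGG-16 base without its classification head; the 224x224 input is an assumption.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freezing the base is one common choice; the paper does not say

model = Sequential([
    base,
    Flatten(),                       # pooled feature maps -> one continuous 1-D vector
    Dense(256, activation="relu"),   # hidden width of 256 is an assumption
    Dropout(0.5),                    # dropout rate of 0.5 is an assumption
    Dense(8, activation="softmax"),  # one output per litter class
])

# Two-stage schedule from the text: Adam at 1e-3 for the first 60 iterations...
model.compile(optimizer=Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
# ... model.fit(...) for the first 60 iterations would run here ...

# ...then recompile with SGD at 1e-4 for the next 180 iterations.
model.compile(optimizer=SGD(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# ... model.fit(...) for the remaining 180 iterations would run here ...
```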

C. Support Vector Machine (SVM)

The SVM with HOG model was developed mainly with Scikit-Learn and Scikit-Image. The training process started with feature extraction from the training images. The HOG function from Scikit-Image extracted and saved a histogram of oriented gradients from each image. An 8x8 cell, a 1x1 block, and 12 orientation bins were used as parameters for the feature descriptor, resulting in a vector of size 3600 per image. These values were chosen to retain the useful features while also saving computation time. After all of the feature descriptors were saved to separate files, the SVC module with a linear kernel from Scikit-Learn was used to train and save the classifier. The testing process followed a similar structure; images in the testing set were passed through the same feature extraction process as in the training step.
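With Scikit-Image and Scikit-Learn, the pipeline above might look like the following sketch. The HOG parameters and the linear kernel come from the text; the resize step is an inference, since a 160x120 input with these parameters produces exactly the reported 3600-element vector, and the file layout and label-from-filename helper are hypothetical.

```python
import glob
import os
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.io import imread
from skimage.transform import resize
from sklearn.svm import SVC

def extract_hog(path):
    """HOG descriptor with the parameters from the text:
    12 orientation bins, 8x8 cells, 1x1 blocks."""
    image = rgb2gray(imread(path))
    # Downscaling to 160x120 is an inference: with these parameters it yields
    # 20 x 15 cells x 12 bins = 3600 features, the vector size the text reports.
    image = resize(image, (120, 160))
    return hog(image, orientations=12, pixels_per_cell=(8, 8), cells_per_block=(1, 1))

def path_to_label(path):
    # Assumes files are named like "bottle_001.jpg"; the naming scheme is hypothetical.
    return os.path.basename(path).split("_")[0]

train_paths = sorted(glob.glob("train/*.jpg"))  # directory layout assumed
X_train = np.array([extract_hog(p) for p in train_paths])
y_train = np.array([path_to_label(p) for p in train_paths])

clf = SVC(kernel="linear")  # linear-kernel SVC, as described in the text
clf.fit(X_train, y_train)
```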
D. Single Shot Multibox Detector (SSD) and Region-based Fully Convolutional Network (R-FCN)

The SSD and R-FCN models were built using the TensorFlow Object Detection API [8]. Transfer learning was used to instantiate the new R-FCN model with some of the previously trained layers of the ResNet-101 network, and the SSD model with previously trained layers from Google's Inception V2. Both models had previously been trained on the Microsoft Common Objects in Context (MSCOCO) dataset, which contains over 220,000 annotated images of general objects ranging from people to planes. To use the TensorFlow Object Detection API, a simple script was written in Python to convert the images and image annotations from PASCAL VOC format to TFRecord files.

After preliminary testing, it was clear that the hyperparameters of both models needed tuning. During initial training, the loss value fluctuated instead of steadily decreasing. To mitigate this issue, the learning rate of both models was decreased to make training more gradual and consistent. The R-FCN learning rate was changed from 3 × 10⁻⁴ to 6 × 10⁻⁶, and the SSD learning rate from 4 × 10⁻⁴ to 1 × 10⁻⁴. The decay of the learning rate was also slightly increased, so that the models would learn quickly at the beginning of training but slow down as training progressed, making changes more gradually. For the R-FCN, this was achieved by dividing the learning rate by ten approximately every 80,000 steps; for the SSD, by changing the learning rate decay factor from 0.95 to 0.9. After these small adjustments, the models were ready for training on the final dataset.
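The conversion script mentioned above generally follows the pattern below. This is a hedged sketch of the standard TFRecord layout documented for the Object Detection API, not the authors' script, and the file names are assumptions.

```python
import tensorflow as tf

def voc_to_tf_example(jpeg_bytes, width, height, boxes, labels):
    """Pack one image and its VOC boxes into the tf.train.Example layout the
    Object Detection API expects (abridged: the full schema also includes
    image/height, image/width, image/source_id, and class/text keys)."""
    def floats(values):
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))
    feature = {
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
        "image/format": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[b"jpeg"])),
        # Box coordinates are normalized to [0, 1].
        "image/object/bbox/xmin": floats([b[0] / width for b in boxes]),
        "image/object/bbox/ymin": floats([b[1] / height for b in boxes]),
        "image/object/bbox/xmax": floats([b[2] / width for b in boxes]),
        "image/object/bbox/ymax": floats([b[3] / height for b in boxes]),
        "image/object/class/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=labels)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with open("dataset/bottle_001.jpg", "rb") as f:  # hypothetical image file
    jpeg = f.read()
example = voc_to_tf_example(jpeg, 512, 384, [(10, 20, 100, 200)], [1])
with tf.io.TFRecordWriter("litter_train.record") as writer:  # output name assumed
    writer.write(example.SerializeToString())
```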
E. You Only Look Once (YOLO)

The YOLO model was implemented mainly with DarkFlow, an open-source adaptation of the Darknet library for TensorFlow. The algorithm was generated with the architecture and weights of both YOLO and tiny YOLO, a compressed version of YOLO. To customize the algorithm for litter detection, the output layer was changed to have eight possible litter classifications and 65 filters. After initial testing, parameters including the number of epochs, the batch size, and the learning rate were tuned to trade off bias against variance. It was found that YOLO had a tendency to detect smaller objects poorly and generally needed a greater number of iterations. By tuning the learning rate and batch size hyperparameters, the YOLO model was able to converge. Ultimately, the model was finalized with a learning rate of 1 × 10⁻⁵ and a batch size of 16.
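The figure of 65 filters follows the YOLOv2 convention for the final convolutional layer: (number of classes + 5) × number of anchors, here (8 + 5) × 5 = 65. A DarkFlow setup consistent with this section might look like the sketch below; the config and checkpoint names are assumptions, and the hyperparameters shown are the final values reported in the text.

```python
from darkflow.net.build import TFNet
import cv2

# Training is normally launched through DarkFlow's command line, e.g. (paths assumed):
#   flow --model cfg/yolo-litter.cfg --load bin/yolo.weights --train \
#        --annotation annotations/ --dataset dataset/ --lr 1e-5 --batch 16
# where cfg/yolo-litter.cfg would be the custom config with 8 classes and 65 filters.

options = {
    "model": "cfg/yolo-litter.cfg",  # assumed name of the custom config file
    "load": -1,                      # -1 resumes from the most recent checkpoint
    "threshold": 0.4,                # detection confidence cutoff is an assumption
}
tfnet = TFNet(options)

frame = cv2.imread("sample_frame.jpg")    # hypothetical UAV frame
detections = tfnet.return_predict(frame)  # list of {label, confidence, topleft, bottomright}
```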

F. Custom Ensemble

A custom ensemble method was designed and implemented using the two best detectors and the best classifier to extrapolate a more accurate prediction. The custom ensemble first ran the first detector to propose predictions for possible locations of litter in the image. Afterwards, it ran the second detector on each subregion proposed by the first model. The second detector shrunk the predictions of the first detector and detected objects the first detector had missed. Finally, the custom ensemble ran the classifier model on each subregion proposed by the second detector model to determine whether the smallest predictions were accurate or whether the predictions of the second detector were too small. If all of the bounding boxes proposed by the second model matched the classes assigned by the classifier, then the predictions of the second model were selected in place of the predictions of the first model. Otherwise, the predictions of the first detector were accepted. In this way, the various models acted as a system of checks and balances to produce a more accurate final prediction.
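In outline, the cascade just described reduces to the control flow below. This is a sketch of the paper's logic under assumed interfaces, since the paper does not give code; every function name is a placeholder.

```python
def custom_ensemble(image, detector_a, detector_b, classifier, crop):
    """Cascade described above. detector_a and detector_b return lists of
    (box, label) pairs, classifier returns a label for an image crop, and
    crop(image, box) extracts a subregion; all five names are placeholders.
    (Offsetting detector_b's boxes back to full-image coordinates is omitted.)"""
    coarse = detector_a(image)

    # Run the second detector inside every region the first detector proposed.
    refined = []
    for box_a, _ in coarse:
        refined.extend(detector_b(crop(image, box_a)))

    # Keep the tighter boxes only if the classifier agrees with every one of
    # them; otherwise fall back on the first detector's predictions.
    if refined and all(classifier(crop(image, box_b)) == label_b
                       for box_b, label_b in refined):
        return refined
    return coarse
```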
G. Bagging Ensemble

The bagging ensemble was built with the two best-performing detectors. Predictions from each detector were averaged together to determine the final prediction. Because multiple objects may be detected in one image, the ensemble method needed to determine which bounding boxes corresponded to the same object. This step was critical to ensure that, in images with multiple objects, the ensemble would not average a box to a point in the image that likely contains no object at all. To match the boxes together, Intersection over Union (IoU) values were generated for every combination of boxes. Bounding boxes were considered correlated, and were averaged together, only if their IoU score exceeded 50%; otherwise the predictions were treated as separate entities. If a bounding box had no other boxes correlated with it, its prediction was added to the final prediction only if the original model had an exceedingly high confidence value [9].
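The matching-and-averaging rule can be made precise with a short sketch. The IoU threshold of 50% comes from the text; the exact confidence cutoff for unmatched boxes is not stated, so the value below is an assumption.

```python
def iou(a, b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def bag(preds_a, preds_b, solo_conf=0.9):
    """Average paired boxes (IoU > 0.5) from two detectors; keep an unmatched
    box only above a high confidence (the 0.9 threshold is an assumption).
    Each prediction is a (box, label, confidence) tuple."""
    merged, used_b = [], set()
    for box_a, label_a, conf_a in preds_a:
        match = next((j for j, (box_b, _, _) in enumerate(preds_b)
                      if j not in used_b and iou(box_a, box_b) > 0.5), None)
        if match is not None:
            used_b.add(match)
            box_b, _, conf_b = preds_b[match]
            avg_box = tuple((p + q) / 2 for p, q in zip(box_a, box_b))
            merged.append((avg_box, label_a, (conf_a + conf_b) / 2))
        elif conf_a >= solo_conf:  # box seen only by detector A
            merged.append((box_a, label_a, conf_a))
    merged += [p for j, p in enumerate(preds_b)  # boxes seen only by detector B
               if j not in used_b and p[2] >= solo_conf]
    return merged
```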
H. Micro-UAV Integration

To control and receive surveillance footage from the micro-UAV, a third-party, open-source project called Hack-a-Drone was used, due to the limited access to the drone's built-in software [10]. The buffered image files were piped from the project, which is written in Java, to a separate Python script through a localhost socket, for compatibility with the machine learning models developed in Python. Despite the slight speed loss due to the indirect connection, this approach saved the time of recreating the drone communication protocol from scratch. To minimize the delay in the video feed, a multithreading approach was used to capture drone footage in one thread while processing the footage in another.
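On the Python side, the receive-and-process split might be structured as follows. The socket port and the frame-framing convention are assumptions, since the paper does not specify the protocol between the Java and Python processes.

```python
import queue
import socket
import struct
import threading

frames = queue.Queue(maxsize=2)  # tiny buffer: favor freshness over throughput

def detect(jpeg_bytes):
    """Placeholder for decoding the frame and running the bagging ensemble."""
    pass

def receive(port=5000):
    """Read length-prefixed JPEG frames piped from the Java Hack-a-Drone process
    over localhost. The port and the 4-byte big-endian length prefix are assumed
    framing conventions."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        while True:
            header = sock.recv(4, socket.MSG_WAITALL)
            if len(header) < 4:
                break
            (length,) = struct.unpack(">I", header)
            data = sock.recv(length, socket.MSG_WAITALL)
            while frames.full():  # drop the stale frame to minimize feed delay
                try:
                    frames.get_nowait()
                except queue.Empty:
                    break
            frames.put(data)

# Capture in one thread while processing in another, as described above.
threading.Thread(target=receive, daemon=True).start()
while True:
    detect(frames.get())
```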
III. RESULTS AND ANALYSIS

Different evaluation metrics were used to determine the effectiveness of each classifier and detector. The F1 score was used to assess the classifiers because it balances the weights of precision and recall. The Average Precision (AP) score was used to evaluate the detectors because the IoU values needed to be taken into account.
TABLE 1. F1 SCORES OF CLASSIFIERS

            CNN      SVM
Bottle      0.681    0.647
Can         0.489    0.510
Cardboard   0.588    0.689
Container   0.442    0.419
Cup         0.000    0.889
Paper       0.000    0.746
Scrap       0.154    0.000
Wrapper     0.282    0.643
Average     0.330    0.623
A. CNN

While the CNN was able to achieve a low training loss of 0.1598, the average F1 score on the test set showed that the model had overfitted the training dataset. Two methods were introduced to relieve the overfitting: the creation of a validation subset and an increase of the learning rate. Hyperparameter tuning resulted in a model that demonstrated a noticeable improvement, with an average F1 score of 0.330. The F1 scores of 0.0 for the cup and scrap classes may have resulted from the unequal distribution of image data. There were fewer images of cups and scraps to train on compared to other types of litter, such as bottles; thus, there was less information for the model to learn from. As a result, the model rarely classified litter as a cup or a scrap.
B. SVM

As seen in Table 1, the SVM produced an average F1 score of 0.623, markedly higher than that of the CNN. Similar to the CNN, the SVM performed poorly when classifying scraps due to the lack of training data, returning an F1 score of 0. However, the SVM saw success with cups and papers, as both types of objects have a distinct shape, which was very advantageous for the SVM since it took HOG features as its descriptor. Unfortunately, the feature descriptor caused problems for the SVM as well: many bottles and containers have a similar shape, which led the SVM to frequently confuse the two classes, as shown by the F1 scores.
TABLE 2. AP SCORES OF DETECTORS AND ENSEMBLES

            SSD     R-FCN   YOLO    Custom  Bagging
Bottle      0.62    0.71    0.56    0.56    0.77
Can         0.70    0.55    0.48    0.67    0.72
Cardboard   0.52    0.73    0.79    0.49    0.78
Container   0.71    0.68    0.55    0.71    0.83
Cup         0.12    0.00    0.00    0.09    0.11
Paper       0.80    0.74    0.84    0.78    0.85
Scrap       0.67    0.56    0.00    0.67    0.67
Wrapper     0.32    0.22    0.00    0.20    0.41
Average     0.56    0.53    0.40    0.52    0.64

C. SSD

The SSD produced the best results of the three detector models, returning an mAP score of 0.56. The SSD's top results may be attributed to its having the longest training time of the three detector models. Moreover, bottles, cans, cardboard, containers, and papers had the most images, resulting in a thoroughly trained model for those classes. The lack of cup images also impacted the SSD, producing an AP score of merely 0.12.
D. R-FCN

The R-FCN performed nearly as well as the SSD, delivering an mAP score of 0.53. The R-FCN's best performances were seen with bottles, cardboard, and papers. This is consistent with the fact that there were more images for those three classes than for the others. The R-FCN received an AP score of 0.00 on cup images, largely due to the lack of available cup training images.
E. YOLO

The tiny YOLO algorithm performed poorly, with an mAP score of 0.08. As a result, a second YOLO model was implemented using a more powerful version of YOLO with more convolutional layers. The updated YOLO model saw significant improvement, returning an mAP score of 0.40. YOLO's lower performance was mostly attributed to its speed: a faster detection algorithm trades accuracy for speed. Moreover, the YOLO model had a considerable demand for data, as demonstrated by its poor performance with cups, scraps, and wrappers. The architecture of a YOLO model also caused some difficulty with small object detection. This issue was undoubtedly present in the same aforementioned classes, further weakening YOLO's mAP score.
F. Custom Ensemble

From the results of the classifiers and detectors, the custom ensemble was assembled from the SSD, R-FCN, and SVM models. Unexpectedly, the custom ensemble delivered an mAP score of 0.52, performing marginally worse than the two best detector models, the SSD and R-FCN. Observing individual results revealed the main reason for the unexpectedly poor performance: the second detector (R-FCN) focused too closely on an object, causing the classifier (SVM) to inaccurately recognize an object from a very minuscule bounding box. Furthermore, the SVM was not robust enough in some categories to be the final classifier for the ensemble, as demonstrated by its average F1 score of 0.623. In addition, the Trashnet dataset contained many images of only specific parts of litter, such as only the upper half of a water bottle. This may have confused the model when shrinking the image, as the second detector model often split an object into pieces and the classifier would still identify each piece as the correct object of interest. Ultimately, the checks-and-balances system between the models in the custom ensemble did not yield an overall improvement.
G. Bagging Ensemble

The bagging ensemble was built with the SSD and R-FCN and returned an mAP score of 0.64, an improvement of approximately 0.086 over the SSD, the best individual detector. The bagging ensemble succeeded where the custom ensemble did not because averaging the results balanced out inaccuracies in the predictions of the individual models. This method also made fewer false-positive predictions, because a box produced by only one model needed an extremely high confidence score to be kept when the models did not agree. Since the bagging ensemble demonstrated the strongest results of all the ensemble methods and individual models, it was selected for use with the micro-UAV.
IV. CONCLUSION

The use of robust machine learning algorithms alongside computer vision techniques has great potential to improve environmental monitoring and maintenance. This paper shows that combining individual models into an ensemble performs far better at litter detection than any single algorithm. While the complex custom ensemble did not yield any noticeable improvement over the individual algorithms, the simpler bagging ensemble was able to significantly improve on the accuracy of the combined strong models. Future directions include an expanded dataset and a higher-resolution camera to further train and enhance the models. Continuation of this research could lead to the deployment of drones in public areas. With further developments, such as the installation of a GPS and an automated piloting system, the micro-UAVs could report the precise location and type of litter to organizations, helping communities stay clean with minimal human supervision. The design presented in this paper demonstrates versatility and extensibility to larger areas of land.
ACKNOWLEDGEMENTS

The authors of this paper would like to acknowledge project mentor Ryan Gamadia and residential teaching assistant (RTA) Pragya Hooda for their guidance and involvement throughout the research process, in addition to Andrew Page and Google for their contribution of Google Cloud Platform credits. Moreover, special gratitude goes to program director Ilene Rosen, associate program director Jean Patrick Antoine, research coordinator Brian Lai, and head RTA Nicholas Ferraro for making it possible to conduct high-level research at the New Jersey Governor's School of Engineering and Technology (NJ GSET). Special thanks to Anthony Yang, Shantanu Laghate, and Jennifer He for supplying preliminary resources and aid. Finally, much gratitude to the sponsors of NJ GSET for their continued participation and support: Rutgers University, Rutgers School of Engineering, the State of New Jersey, Silverline Windows, Lockheed Martin, Rubik's, and the alumni of NJ GSET.

REFERENCES

[1] "Impacts of Mismanaged Trash," EPA, 23-May-2017. [Online]. Available: https://www.epa.gov/trash-free-waters/impacts-mismanaged-trash
[2] R. Ahmad, "Detrimental Effects of Littering," EcoMENA, 15-May-2018. [Online]. Available: https://www.ecomena.org/littering/
[3] "Wildfire Causes and Evaluations (U.S. National Park Service)," National Park Service. [Online]. Available: https://www.nps.gov/articles/wildfire-causes-and-evaluations.htm
[4] "States with Littering Penalties," National Conference of State Legislatures. [Online]. Available: http://www.ncsl.org/research/environment-and-natural-resources/states-with-littering-penalties.aspx
[5] "What We Do," Keep America Beautiful. [Online]. Available: https://www.kab.org/about-us/what-we-do
[6] H. Vasa, Google Images Download, 2015. GitHub repository. Available: https://github.com/hardikvasa/google-images-download
[7] D. Tzutalin, LabelImg, 2015. GitHub repository. Available: https://github.com/tzutalin/labelImg
[8] L. Kaiser, TensorFlow Models, 2016. GitHub repository. Available: https://github.com/tensorflow/models/
[9] M. Ahrnbom, Ensemble-ObjDet, 2017. GitHub repository. Available: https://github.com/ahrnbom/ensemble-objdet
[10] N. Berlin, J. Hutting, and R. Runge, Hack-a-Drone, 2017. GitHub repository. Available: https://github.com/Ordina-JTech/hack-a-drone/
