
Underwater Object Detection Using Deep Learning Techniques
¹Shaik Javed Jani Ahmad, Department of ECE, GMR Institute of Technology, Rajam, India
²Vallampati Gayathri, Department of ECE, GMR Institute of Technology, Rajam, India
³Voora Naga Venkata Satya Sai Kumar, Department of ECE, GMR Institute of Technology, Rajam, India
⁴Yelamanchili Praveen, Department of ECE, GMR Institute of Technology, Rajam, India
⁵Seela Nikhil Kumar, Department of ECE, GMR Institute of Technology, Rajam, India
¹jjahmad.sk@gmail.com

Abstract: This paper proposes a model based on a convolutional neural network architecture, developed using underwater photographs, that uses YOLO to detect objects underwater. An automatic underwater object detection system is required to lower the cost of underwater inspection. The primary goal of this project is to create a model that can identify and detect objects underwater using deep learning techniques. Object detection has two parts: Object Classification and Object Localization. Classifying the object into predefined classes comes under Object Classification, and distinguishing the object along with its location comes under Object Localization. Once the system has been trained, our aim is to test input images by matching their objects against the training dataset. To find objects within an image, we propose to use the YOLO (You Only Look Once) model, a neural-network method for real-time object detection whose popularity stems from its accuracy and speed. YOLO is implemented in Matlab, whose Deep Learning Toolbox provides a framework for creating deep neural networks with pre-trained models and algorithms.

Keywords - Object Detection, You Only Look Once (YOLO-V2), ResNet, Bounding Boxes

I. INTRODUCTION
Detecting objects in photos or videos taken in underwater surroundings is a very difficult task. Diverse issues arise in underwater object detection, particularly in fields that investigate the recovery of endangered fish species, other marine creatures, divers, and so on.
The primary objective of this study is to develop a model that can recognize and locate objects underwater using deep learning methods. These methods help the computer classify unlabelled data the way a human being would. Underwater object detection faces several difficulties, including low visibility, colour attenuation, and uneven lighting, which make objects appear duller and shifted from their true colours. Accurate and reliable detection of submerged objects allows different target objects to be identified correctly across varied experimental setups. Another major drawback is the small size of the dataset available for training.
Training on a large dataset gives the best and most accurate results, while a smaller dataset means insufficient training and lower accuracy. Since there are not many large datasets pertaining to the underwater environment, it is essential to have a variety of performance-improvement strategies that fully utilize the small datasets on which we train our models. The model we use is YOLO-V2, together with a transfer-learning-based pre-trained ResNet50V2 model using ImageNet weights.
There are two parts to object detection: Object Classification and Object Localization [1]. Classifying the object into predefined classes comes under Object Classification, and distinguishing the object along with its location comes under Object Localization. The object detection system takes the input image and applies pre-processing, resizing the image at this stage. Using deep learning techniques, the system then extracts features from the image, classifies the object into a finite set of classes, and localizes the object using bounding boxes.

II. LITERATURE SURVEY

In "Multi-scale ResNet for real-time underwater object detection," Tien-Szu Pan, Huang-Chu Huang, Jen-Chun Lee, and Chung-Hsien Chen [2] argue that an automatic underwater object identification framework is essential to reduce the costs of
submerged inspection, so they use Multi-scale ResNet to detect underwater objects. They trained on underwater video frames with a proposed convolutional neural network architecture, a modified ResNet named Multi-scale ResNet (M-ResNet), which improves efficiency by using multi-scale operations for the precise localization of objects of different sizes, particularly small objects. Their focus is mainly on fish and their species. They evaluated the suggested model using four distinct runs: in one run they used 40% of the dataset for training and 60% for testing, while in another they used 80% for training and 20% for testing. The accuracy of the proposed model was estimated by Average Precision (AP). In future work, they plan to build a bigger dataset based on Generative Adversarial Networks (GANs) and to further improve the algorithm to increase detection accuracy.

In "Lightweight Deep Neural Network for Joint Learning of Underwater Object Detection and Colour Conversion," Chia-Hung Yeh, Chih-Hsiang Huang, Chu-Han Lin, Li-Wei Kang, Min-Hui Lin, Chuan-Yu Chang, and Chua-Chin Wang [3] proposed a lightweight deep neural network to identify objects underwater. The proposed model consists mainly of an image colour-conversion network and an underwater object detection network. Because underwater images typically suffer colour distortion, the colour-conversion network module is developed to correct the colour information in underwater pictures and thereby improve object detection. They concentrated on objects such as fish, debris, and sea divers. The accuracy of the proposed model, in terms of Average Precision (AP), was estimated and compared with Faster R-CNN, SSD, and Tiny-YOLO; the proposed method gives a better accuracy of 89.59%. Experimental findings on a low-power Raspberry Pi device, contrasted with state-of-the-art methods, demonstrate the efficiency of the suggested lightweight jointly learned model for underwater object detection.

In "A Robotic Approach towards Quantifying Epipelagic Bound Plastic Using Deep Visual Models," Gautam Tata, Jay Lowe, Olivier Poirion, and Sarah-Jeanne Royer [4] proposed an approach for object detection. At present, the most popular method for evaluating marine plastic is manual sampling with manta trawls. In this paper, they used neural networks and computer-vision models as an autonomous method to eliminate the requirement for manual sampling, which reduces the cost. The dataset was created by gathering footage of marine debris in the field in California, with images sourced from several places to improve the representation of marine plastics across areas. These images present highly challenging object detection conditions such as occlusion, illumination changes, and noise. Various networks were used, including YOLOv5, Tiny-YOLOv4, Faster R-CNN, and SSD. Both YOLOv4-Tiny and YOLOv5-S produce strong debris-localization metrics, according to the assessed models' high average precision, mAP, and F1 scores relative to their inference speed. The top-performing model maintains near real-time processing speeds of 2 ms/img while achieving a Mean Average Precision of 85% and an F1-score of 0.89. They intend eventually to include more photos taken in a variety of maritime environments and locales.

III. RESEARCH METHODOLOGY


1 YOLO-V2 Architecture:
The architecture of YOLO-V2 is shown in Fig. 1. A YOLO v2 object detection network is composed of two subnetworks: a feature extraction network followed by a detection network. ResNet-50 is used for feature extraction in this work. The input image size is 224 x 224 x 3, and the network has 22 convolution layers and 5 max-pooling layers.

Fig1: YOLO-V2 Architecture
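
Since the implementation uses Matlab with a ResNet-50 feature extractor, the following is a minimal sketch of how such a detector can be assembled with the Deep Learning and Computer Vision Toolboxes. The anchor-box values and the 'activation_40_relu' feature-layer choice follow the MathWorks YOLO v2 example and are illustrative assumptions, not the exact configuration of this paper.

    % Minimal sketch (assumes Deep Learning and Computer Vision Toolboxes).
    net        = resnet50;                        % pre-trained feature extractor
    inputSize  = [224 224 3];
    numClasses = 7;        % bottle, cover, fighter fish, fish, round fish,
                           % sea diver, starfish (the classes of Section IV)
    anchors    = [43 59; 18 22; 23 29; 84 109];   % example anchor boxes (pixels)
    featureLayer = 'activation_40_relu';          % mid-network ReLU of ResNet-50
    lgraph = yolov2Layers(inputSize, numClasses, anchors, net, featureLayer);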


The proposed model in this paper is based on a unique neural network architecture and uses YOLO to find objects underwater. YOLOv2 is sometimes called YOLO9000 [9], a real-time object detection system that can detect over 9000 object categories. Using fully connected layers, YOLO directly predicts the coordinates of bounding boxes on top of the convolutional feature extractor. The primary objective of this study is to develop a model that can recognize and locate items underwater using deep learning methods.
Detecting objects involves two steps: object localization and object classification. Assigning an object to one of a set of predetermined classes is object classification, while separating an object from its environment is object localization. YOLO ("You Only Look Once") is an efficient algorithm for real-time object recognition.
Image classification is the process of assigning an image to one of several categories.

Object localization helps us find the required object within the image.

Fig2: Object detection for multiple objects


Tools for object detection allow for the identification of every object in an image and the creation of so-called bounding boxes around those objects. In object detection, an advanced form of image classification, a neural network recognizes the objects in an image and predicts bounding boxes around them. In some circumstances we need to determine the precise boundaries of our objects rather than just bounding boxes; that procedure is called instance segmentation. In YOLO, our goal is to identify an object's class and to represent the location of the object with a bounding box. Each bounding box can be described using four descriptors (a minimal encoding is sketched after Fig. 3):
• bounding box's centre (bx, by)
• width (bw)
• height (bh)
• value corresponding to the object's class (e.g., fish, covers, bottles, etc.)

Fig3: Descriptors of the bounding box
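
As a concrete illustration of these descriptors, here is a hypothetical YOLO-style prediction vector in Matlab, with the objectness score pc (introduced below) prepended and three example class scores appended; all numbers are made up.

    % Illustrative encoding of one prediction: [pc bx by bw bh c1 c2 c3],
    % where pc is the objectness score and c1..c3 are class scores
    % (say fish, cover, bottle).
    pred = [0.91 0.46 0.52 0.20 0.35 0.05 0.02 0.93];
    pc   = pred(1);                              % probability an object is present
    box  = pred(2:5);                            % centre (bx,by), width bw, height bh
    [classScore, classIdx] = max(pred(6:end));   % most likely class (here: bottle)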


The pc value must also be predicted, giving the probability that an object is present in the bounding box. The image is split into cells, and each cell is responsible for predicting 5 bounding boxes, so many bounding boxes are produced for one image. Many of these cells and bounding boxes will not contain any object. To remove boxes with low object probability, and among boxes sharing a large overlapping area to keep only the best one, a technique known as non-max suppression is used. YOLO grew into the most widely used algorithm for object detection due to its accuracy, speed, and learning ability.

Fig4: Non-Max Suppression
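
A minimal greedy non-max suppression sketch in Matlab, assuming boxes in [x y w h] form; it relies on a boxIoU helper like the one sketched under Intersection over Union below, and both thresholds are illustrative.

    function keep = simpleNMS(boxes, scores, iouThresh, scoreThresh)
        % Greedy NMS: drop low-probability boxes, then repeatedly keep the
        % highest-scoring box and suppress boxes that overlap it heavily.
        idx = find(scores >= scoreThresh);
        [~, order] = sort(scores(idx), 'descend');
        idx = idx(order);
        keep = [];
        while ~isempty(idx)
            best = idx(1);
            keep(end+1) = best;                  %#ok<AGROW>
            rest = idx(2:end);
            ious = arrayfun(@(k) boxIoU(boxes(best,:), boxes(k,:)), rest);
            idx  = rest(ious < iouThresh);       % suppress heavy overlaps
        end
    end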


2 Activation Function:
In this model we use the two kinds of activation functions common in deep learning: one for hidden layers and one for output layers. The ReLU (Rectified Linear Unit) activation function [6] is applied at the hidden layers. It is a non-linear activation function used in neural networks and deep neural networks with several layers, and its formula is f(x) = max(0, x). In the output layers we use the softmax activation function [7], the usual choice for multiclass classification problems.
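
The two activation functions can be written directly in Matlab; the max subtraction in softmax is a standard numerical-stability step.

    relu    = @(x) max(0, x);                    % f(x) = max(0, x)
    softmax = @(z) exp(z - max(z)) ./ sum(exp(z - max(z)));

    relu([-2 -0.5 0 1.5])    % -> [0 0 0 1.5]
    softmax([2; 1; 0.1])     % -> class probabilities summing to 1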

3 Pooling Layers:
Pooling layers [7] are a technique for downsampling feature maps by summarizing the features present in regions of the feature map. The two popular pooling strategies are average pooling and max pooling, which capture the average presence of a feature and its most activated presence, respectively. A pooling layer is introduced after the convolution layer.
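
A toy illustration of both strategies on a 4 x 4 feature map with non-overlapping 2 x 2 windows:

    A = [1 3 2 4;
         5 6 7 8;
         3 2 1 0;
         1 2 3 4];
    P = zeros(2);  Q = zeros(2);
    for i = 1:2
        for j = 1:2
            win = A(2*i-1:2*i, 2*j-1:2*j);       % one 2x2 pooling window
            P(i,j) = max(win(:));                % max pooling
            Q(i,j) = mean(win(:));               % average pooling
        end
    end
    % P = [6 8; 3 4]    Q = [3.75 5.25; 2 2]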

4 Fully Connected Layer: The fully connected layer is the outermost layer of the convolutional neural network. The combination of a linear function (y = Wx + b) and a non-linear function (such as sigmoid or ReLU) gives a fully connected layer, or one hidden layer. The fully connected layer first receives input from the flatten layer; each neuron then linearly transforms the input vector using the weights matrix. After the product has undergone a non-linear transformation, the output vector is created by computing the probability distribution across all of the classes in the final set.

Fig5: Fully Connected Layer
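
This layer reduces to a few lines of Matlab; the sizes below (8 flattened features, 3 classes) and the random weights are purely illustrative.

    x = rand(8, 1);               % output of the flatten layer
    W = randn(3, 8);  b = randn(3, 1);
    z = W*x + b;                  % linear transformation by the weights matrix
    softmax = @(v) exp(v - max(v)) ./ sum(exp(v - max(v)));
    probs = softmax(z);           % probability distribution over the classes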


5 Intersection over Union (IoU):
IoU is a common metric used in object detection models to judge localization accuracy and spot localization errors. To compute the IoU, take the predicted bounding box and the ground-truth bounding box of the same region, and find the area where the two boxes overlap (the intersection). The union is the combined area covered by the two bounding boxes. Dividing the intersection by the union gives the ratio of overlap to total area, a good measure of how closely the predicted bounding box matches the ground truth. If the IoU value is greater than 0.5, the prediction is treated as positive; if it is below, the prediction is negative.

Fig6: Intersection over union
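
A sketch of this computation for axis-aligned boxes in [x y w h] form (Matlab's Computer Vision Toolbox also offers bboxOverlapRatio for the same purpose):

    function iou = boxIoU(a, b)
        % Intersection over union of two [x y w h] boxes.
        ix = max(0, min(a(1)+a(3), b(1)+b(3)) - max(a(1), b(1)));  % overlap width
        iy = max(0, min(a(2)+a(4), b(2)+b(4)) - max(a(2), b(2)));  % overlap height
        inter = ix * iy;
        iou   = inter / (a(3)*a(4) + b(3)*b(4) - inter);
    end
    % boxIoU([0 0 10 10], [5 5 10 10]) -> 25/175, about 0.14 (below the 0.5 cut-off)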

The average precision is calculated as the area under a precision vs recall curve for a set of predictions. A precision-recall curve displays the relationship between precision and recall at each possible cut-off. The x-axis gives recall (= sensitivity = TP / (TP + FN)) and the y-axis gives precision (= positive predictive value = TP / (TP + FP)). Recall is the ratio of the model's correct detections of a class to the total number of existing labels for that class; precision, on the other hand, is the proportion of true positives among all of the model's predictions. By taking the area under the precision vs recall curve, we obtain the average precision for each class in the model; the average of this value across all classes is the Mean Average Precision (mAP). In object detection, precision and recall are used to assess the effectiveness of the detections rather than class predictions alone.
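
For example, average precision can be approximated as the trapezoidal area under a sampled precision-recall curve; the values below are made up for illustration.

    recall    = [0.0 0.2 0.4 0.6 0.8 1.0];
    precision = [1.0 1.0 0.9 0.8 0.6 0.5];
    AP = trapz(recall, precision);   % area under the P-R curve, about 0.81
    % mAP is the mean of the per-class AP values:
    % mAP = mean([AP_fish AP_bottle AP_cover]);   % hypothetical class APs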

6 Confusion Matrix: Count values are used to describe the number of right and wrong predictions for each class.
True Positive (TP): The model correctly identified the result as belonging to the positive class.
True Negative (TN): The model correctly identified the result as belonging to the negative class.
False Positive (FP): A type 1 error, where the model incorrectly predicts a result as positive when it is actually negative.
False Negative (FN): A type 2 error, where the model incorrectly predicts a result as negative when it is actually positive.

Fig7: Confusion matrix and formulae of different metrics
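
The formulae in Fig. 7 follow directly from these four counts; for example, with illustrative counts:

    TP = 90; TN = 40; FP = 6; FN = 10;
    accuracy  = (TP + TN) / (TP + TN + FP + FN);   % 0.890
    precision = TP / (TP + FP);                    % 0.938
    recall    = TP / (TP + FN);                    % 0.900 (sensitivity)
    f1        = 2 * precision * recall / (precision + recall);   % 0.918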

7 Residual Network:
ResNet50 [8] is a residual deep-learning neural network model with 50 layers; its architecture has 4 stages. ResNet is short for Residual Network, a standard neural network that serves as a backbone for computer vision applications. ResNet made it possible to successfully train very deep neural networks with more than 150 layers, and it introduced the skip connection. In the left-hand part of Fig. 8, convolution layers are stacked one on top of the other; in the right-hand part, after the stacked convolution layers, the original input is also added to the output of the convolution block. This phenomenon is the skip connection. Skip connections allow the model to learn an identity function, ensuring that a higher layer performs at least as well as the lower layer, if not better.

Fig8: Block Diagram of Resnet-50
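
The skip connection of Fig. 8 can be expressed with Deep Learning Toolbox layers as below. This is a single illustrative residual block, not ResNet-50 itself; the filter counts and input size are assumptions.

    layers = [
        imageInputLayer([56 56 64], 'Name', 'in')
        convolution2dLayer(3, 64, 'Padding', 'same', 'Name', 'conv1')
        batchNormalizationLayer('Name', 'bn1')
        reluLayer('Name', 'relu1')
        convolution2dLayer(3, 64, 'Padding', 'same', 'Name', 'conv2')
        batchNormalizationLayer('Name', 'bn2')
        additionLayer(2, 'Name', 'add')           % computes F(x) + x
        reluLayer('Name', 'reluOut')];
    lgraph = layerGraph(layers);
    % Route the block input around the stacked convolutions (the skip):
    lgraph = connectLayers(lgraph, 'in', 'add/in2');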

IV. RESULTS AND DISCUSSION

Fig9: Images with object detection using YOLO-V2


These are the results of the proposed model; the images above are outputs of the system trained with the following parameters: BatchSize 4, LearnRate 0.0001, and Epochs 100 and 50.
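
These parameters map onto Matlab training options roughly as below; the SGDM solver is an assumption, since only the batch size, learning rate, and epochs are specified.

    options = trainingOptions('sgdm', ...      % solver choice is assumed
        'MiniBatchSize',    4, ...
        'InitialLearnRate', 1e-4, ...
        'MaxEpochs',        100);
    % detector = trainYOLOv2ObjectDetector(trainingData, lgraph, options);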

Fig10: Performance with learning rate 0.0001 using YOLO-V2

Table 4.1: Tabular format of the different metric values for 0.0001 learn rate

Fig11: Images with object detection using YOLO-V2


These are the results of the proposed model; the images above are outputs of the system trained with the following parameters: BatchSize 4, LearnRate 0.001, and Epochs 100 and 50.

Fig12: Performance with learning rate 0.001 using YOLO-V2

Table 4.2: Tabular format of the different metric values for 0.001 learn rate

4.1 Comparison of Results:

In this project, we used two learning rates (0.001 and 0.0001) and various epoch values (50 and 100). From the comparison of Table 4.1 and Table 4.2, our interpretation is that the model with the lower learning rate (0.0001) gives higher accuracy irrespective of the epoch value, with constant batch size. Our model gives an accuracy of 0.9749 for a learning rate of 0.0001 and an accuracy of 0.945 for a learning rate of 0.001.

4.2 Confusion Matrix:

Fig13: Bottle class- Confusion matrix

Fig14: Cover class- Confusion matrix

Fig15: Fighter Fish class- Confusion matrix

Fig16: Fish class- Confusion matrix

Fig17: Round Fish class- Confusion matrix

Fig18: Sea Diver class- Confusion matrix

Fig19: Starfish class- Confusion matrix

Fig20: Confusion matrix for Fish and Fighter Fish class in the same image

Fig21: Confusion matrix for Fish and Starfish class in the same image
V. CONCLUSION

This model is proposed to recognize underwater objects; it can be used to identify divers, marine life, plastic, and several other objects. The YOLO-V2 model is used because of its excellent accuracy, fast execution, and efficient performance. Object classification assigns an object to specified classes, while object localization isolates it from other objects according to its location. To remove boxes with low object probability and, among boxes sharing the largest common area, keep only the best, we employed non-max suppression. The intersection-over-union metric is used in object detection models to evaluate localization accuracy and quantify localization errors. The Deep Learning Toolbox is a tool for constructing and deploying deep neural networks using algorithms and pre-trained models. We trained our model with different learning rates and epoch values, and among these runs our model achieves a best accuracy of 97.4%. In the future, we want to work with a larger dataset and a more advanced version of YOLO to enhance accuracy and the other metrics.

REFERENCES
[1] S. Sumahasan, Udaya Kumar Addanki, Navya Irlapati, and Amulya Jonnala, "Object Detection using Deep Learning Algorithm CNN," International Journal for Research in Applied Science & Engineering Technology (IJRASET), ISSN: 2321-9653, Volume 8, Issue VII, July 2020. Available at www.ijraset.com.
[2] Tien-Szu Pan, Huang-Chu Huang, Jen-Chun Lee, and Chung-Hsien Chen, "Multi-scale ResNet for real-time underwater object detection," Signal, Image and Video Processing, 15(5), 941–949. doi:10.1007/s11760-020-01818-w
[3] Chia-Hung Yeh, Chih-Hsiang Huang, Chu-Han Lin, Li-Wei Kang, Min-Hui Lin, Chuan-Yu Chang, and Chua-Chin Wang, "Lightweight Deep Neural Network for Joint Learning of Underwater Object Detection and Color Conversion," IEEE Transactions on Neural Networks and Learning Systems, 1–15. doi:10.1109/tnnls.2021.3072414
[4] Gautam Tata, Jay Lowe, Olivier Poirion, and Sarah-Jeanne Royer, "A Robotic Approach towards Quantifying Epipelagic Bound Plastic Using Deep Visual Models," arXiv:2105.01882v4 [cs.CV], 19 Oct 2021.
[5] Han, F., Yao, J., Zhu, H., and Wang, C. (2020). Underwater image processing and object detection based on the deep CNN method. Journal of Sensors, 2020.

[6] S. H. Shabbeer Basha, Shiv Ram Dubey, Viswanath Pulabaigari, and Snehasis Mukherjee, "Impact of Fully Connected Layers on Performance of Convolutional Neural Networks for Image Classification," Neurocomputing. doi:10.1016/j.neucom.2019.10.008
[7] Dingjun Yu, Hanli Wang, Peiqiu Chen, and Zhihua Wei, "Mixed Pooling for Convolutional Neural Networks," Lecture Notes in Computer Science, 364–375. doi:10.1007/978-3-319-11740-9_34
[8] Liu, P., Wang, G., Qi, H., Zhang, C., Zheng, H., and Yu, Z. (2019). Underwater image enhancement with a deep residual framework. IEEE Access, 7, 94614–94629.
[9] Redmon, J., and Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7263–7271).
[10] Mohammed, M. S., Khater, H. A., Hassan, Y. F., and Elsayed, A. Proposed Approach for Automatic Underwater Object Classification.
