
2020 IEEE REGION 10 CONFERENCE (TENCON)

Osaka, Japan, November 16-19, 2020

Water Level Detection from CCTV Cameras using a Deep Learning Approach
Punyanuch Borwarnginn∗, Jason H. Haga†, Worapan Kusakunniran∗
∗ Faculty of Information and Communication Technology, Mahidol University, Nakhon Pathom, Thailand
† Digital Architecture Promotion Center, National Institute of Advanced Industrial Science and Technology
Tsukuba, Ibaraki, Japan
Email: punyanuch.bor@mahidol.edu, jh.haga@aist.go.jp, worapan.kun@mahidol.edu

Abstract—Natural disasters are a global problem that causes widespread losses and damage. A system to provide timely information is required in order to help reduce losses. Flooding is one of the major natural disasters that requires a monitoring and detection system. Traditional flood detection systems use remote sensors such as river water levels and rainfall to provide information to both disaster management professionals and the general public. There have been attempts to use visual information such as CCTV cameras to detect extreme flooding events; however, this requires human experts and consistent attention to monitor any changes. In this paper, we introduce an approach to automatic river water level detection using deep learning to determine the water level from surveillance cameras. The model achieves 93% accuracy using a single camera location and 83% accuracy using multiple camera locations.

Index Terms—deep learning, water level detection, CCTV, CNN

I. INTRODUCTION

Over the past few years, many countries have experienced natural disasters such as extreme floods, tsunamis, storms, fires, and landslides. According to the Asia Disaster Reduction Center [1], Typhoon Hagibis hit Japan in October 2019. It caused over 90 deaths, 480 injuries, and over 50,000 damaged houses. The impact of natural disasters on human lives and property losses can potentially have widespread effects on the global economy, since natural disasters can pass through multiple countries. A disaster detection and monitoring system is one of the key technologies to help minimize damages and losses from natural disasters.

Traditionally, many disaster management systems are based on remote sensors specific to the type of natural disaster. For example, flood monitors [2] usually use information from numerical data sensors such as river water levels and rainfall levels. Ko et al. [3] discussed the common disadvantages of these remote sensors, including equipment costs and the reliability of the sensors. The sensors usually provide data limited to a single type, such as text-based data or numerical data. Ensuring accurate warnings would require additional information or combining multiple sources of information. However, data from different sources cannot be straightforwardly combined into a given system. It requires some analysis to find the relationship between these data, since remote sensors are typically deployed by different organizations.

Over the past several decades, many studies have applied surveillance cameras in disaster management, in particular water level detection for flood monitoring. Yu et al. [4] proposed a method using temporal motion changes in water levels. This approach is based on the difference in features between the previous image frame and the current image frame. The method uses a Gaussian filter and an averaging filter to reduce the noise, then the water level is calculated from the horizontal edge. Hiroi et al. [5] proposed similar techniques for finding water flow based on temporal changes using RGB values.

In contrast, Lin et al. [6] proposed automatic water level detection using a collinear equation and the Hough transform technique to find a water line and the location of water gauges. Lastly, deep learning techniques were introduced in [7]. The study compared a dictionary learning model with a convolutional neural network (CNN) for water level detection and demonstrated that the CNN achieves a better result. However, most of these studies used CCTV cameras that were focused at the wall or the water gauges for visual interpretation of a water line and used a combination of different feature extraction techniques to detect water levels. These approaches assume an ideal setting for the imaging.

Recently, there have been attempts to gather information from various types of sensors and make it available to the public to support the monitoring of natural disasters. In Thailand, a water crisis prevention center [8] attempts to create a portal website for river and flood monitoring that includes over 125 websites from different organizations and different sensors. For example, there are 3 websites related to live streams from CCTV cameras and 7 websites about river water levels. Also, in Japan, a public portal website [9] provides information from different sensors throughout the country related to rivers, such as river water levels, water qualities, rainfall radars, and snowfall levels. In addition to numerical data, it also provides live-stream images from CCTV cameras that are installed along the rivers. However, these images are mainly for visualizing the current status of the river. They do not include any meaningful information, which leads to difficulties in interpreting and extracting information. Users viewing the images need their own experience to judge whether the current images show a regular or an abnormal stage. So, it would be beneficial to both disaster management professionals and the public to make use of these data and provide meaningful information by combining different data sources in an automatic process.

As alluded to above, in the real world, the cameras might not be installed in the best position for capturing water levels, i.e., CCTV cameras cannot clearly capture water gauges in

978-1-7281-8455-5/20/$31.00 ©2020 IEEE 1283

Authorized licensed use limited to: UNIVERSITY OF WESTERN ONTARIO. Downloaded on May 25,2021 at 22:57:06 UTC from IEEE Xplore. Restrictions apply.
the images. It is possible to have several sensors around the same area, such as a CCTV camera and a water level sensor, but some locations might be limited to just one. In an emergency situation, a water sensor can malfunction, resulting in no river water level measurement unless there are other sensors nearby to compensate for this.

This research project aims to study the use of public data from several sensors to demonstrate the application of computer vision in disaster management, especially for river height information. It is a challenging problem because we can neither control the equipment settings nor ensure that the equipment is adequately maintained. This results in missing, imbalanced, and noisy data. Although water level detection can be intuitively viewed as a regression task, that would require an extensive, continuous dataset to create a model. Therefore, we propose to use a classification task to predict a river water level from a given CCTV image, to avoid insufficient data and to evaluate how visual information can be used to classify river water levels. This paper proposes a method based on a convolutional neural network (CNN) approach with an attention mechanism to detect the water level from an image.

The remainder of this paper is organized as follows. Section II describes the details of the proposed method. The experiments are described and discussed in Section III. Finally, the conclusion is drawn in Section IV.

II. PROPOSED METHOD

The framework of the proposed method is illustrated in Fig. 1. The overall framework consists of two main processes, which are data harmonization and the water level detection model. Data harmonization includes acquiring data from a public source and creating a ground truth. The water level detection model mainly focuses on network construction (MultiConvAttnNet) and model training to produce a water level detection model. Then we apply a test set to evaluate the classification performance of the trained models. Details of each process are explained in the following sub-sections.

Fig. 1. The overview of the proposed framework.

A. Data Harmonization

1) Data Acquisition: Since our purpose is to address the challenges of using publicly available data, the dataset is acquired from a Japanese website for river disaster prevention information [9]. The dataset contains information from over 20,000 remote sensors across Japan, mainly in the Kanto region, for one month from 22 November to 20 December 2016. There are 6 types of sensors that provide numerical data: rainfall levels, water levels, dam storage levels, snowfall levels, tide levels, and water qualities. River images from CCTV cameras are also provided. All sensors except the CCTV cameras were downloaded every 10 minutes; CCTV camera images were downloaded approximately every 2 minutes.

Fig. 2. A sample image with the corresponding water levels.

2) Data Preparation: In this study, we mainly focus on the water level sensors and CCTV cameras. The raw data are mapped using the geolocation of the sensors. We built the ground truth using a water level sensor near each CCTV camera. Fig. 2 shows an example of images with their corresponding time mapping. Since the collection intervals differ, we assume that within 10 minutes the water level usually remains consistent and does not change rapidly. We chose daytime images from 7 am to 7 pm to reduce the number of low-quality images, since images captured outside of these times were usually too dark to clearly see any features. The water level sensor data were used to define class labels, and each CCTV image was then mapped into one of these classes based on its timestamp. Finally, the dataset is split into training and testing sets.

B. Water level detection model

1) Model training: Recently, deep learning has become a widespread technique in computer vision. In deep learning [10], [11], convolutional neural networks (CNN) have shown promising results in image classification, image segmentation, and object recognition tasks. In the ImageNet large scale visual recognition challenge in 2015 [12], most of the top 5 best models used a CNN as part of their architecture. However, CNNs are known to be computationally expensive because they consist of several convolutional layers, pooling layers, and fully connected layers. Transfer learning [13], [14] was introduced to reduce computational time; the key idea is to reuse pre-trained weights from other datasets such as ImageNet. This allows users the flexibility to train the whole model or only some layers with the new training set. Many CNN models, such as InceptionV3 [15], AlexNet [16], and VGGNet [17], have existing pre-trained weights to use in the initial model training.

In this paper, we choose InceptionV3 as the base architecture because it has achieved a similar result with fewer trainable parameters compared to other CNN models. The base architecture is connected with a fully connected layer

Fig. 3. The overview of transfer learning using the InceptionV3 architecture. The fully connected layers are replaced based on the output classes. The initial weights come from an existing dataset, such as ImageNet. Then we can retrain the whole network, or some part of it, with the new dataset.
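The transfer-learning setup in Fig. 3 can be sketched as below. This is a minimal illustration, assuming TensorFlow/Keras (the Keras library is the one named in this paper); the paper specifies only InceptionV3, ImageNet weights, a replaced fully connected head, and the SGD optimizer, so the remaining choices (global average pooling, categorical cross-entropy loss) are assumptions.

```python
# Sketch of the Fig. 3 transfer-learning setup: InceptionV3 as the
# base, with its classifier head replaced by a fully connected layer
# sized to the water-level classes. Assumes TensorFlow/Keras.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

NUM_CLASSES = 27  # water-level classes in the single-camera dataset

# weights=None keeps this sketch runnable offline; weights="imagenet"
# reproduces the pre-trained initialization used in the paper.
base = InceptionV3(weights=None, include_top=False,
                   input_shape=(299, 299, 3), pooling="avg")
base.trainable = True  # retrain the whole network; set to False to
                       # train only the new fully connected head

model = models.Sequential([
    base,
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

With the whole network trainable, this model has roughly 21.8 million parameters, consistent with the paper's figure of over 21 million.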

(a) MultiConvAttnNet A

(b) MultiConvAttnNet B

Fig. 4. The overview of adding the attention mechanism to the network in different locations: (a) attention layers are added after Inception Module C; (b) attention layers replace Inception Module C.

Fig. 5. MultiConvAttnNet C: some of the convolutional layers in Inception Module B are replaced with attention layers.

to predict a water level, as in Fig. 3. The total number of trainable parameters in the model is over 21 million.

2) MultiConvAttnNet: In this work, CCTV images from different locations are not positioned in the same view, but they could have the same water levels. If the trained model can predict a water level from multiple cameras, it can be useful whenever nearby water sensors are broken. With multiple cameras and imbalanced classes, training the model can result in overfitting. We therefore looked for alternative solutions to prevent overfitting by reducing the number of parameters and layers without losing classification performance. Recently, the attention mechanism from machine translation [18] has been applied in computer vision. In [19], [20], a self-attention mechanism is applied as an extension or a replacement of convolutional layers to achieve a reduction in the number of model parameters. Similar to convolution, the output from self-attention layers is based on a small neighbourhood of pixels called a memory block; single-head attention is then computed in every memory block based on the method described in [19]. Therefore, we added the attention mechanism to our base model and evaluated its performance in various settings. It is advised to start adding attention layers at the high-level features [19]; however, we evaluated several different locations for attention and replaced some of the convolutional layers with attention layers. An example of connecting an attention layer is illustrated in Fig. 4. In the subsequent Fig. 5, some of the Inception Module B convolutional layers are replaced by attention layers. Lastly, our trained models are evaluated on a test set to obtain the prediction results.

III. EXPERIMENTS

Our experiments consisted of two scenarios: 1) a single camera and 2) multiple cameras from different locations. Each scenario is trained and evaluated using 5-fold cross-validation.
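The 5-fold protocol used for both scenarios can be sketched as follows; the sample indices, labels, and per-fold accuracy below are stand-ins, and scikit-learn's `KFold` is an assumed implementation choice (the paper does not name a library for the splits).

```python
# Sketch of the evaluation protocol: 5-fold cross-validation over the
# image set. Each fold holds out 20% of the images for testing, and
# the reported accuracy is averaged over the folds.
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(100)                     # placeholder image indices
labels = np.random.randint(0, 27, size=100)  # placeholder class labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, test_idx in kf.split(samples):
    # A real run would train on samples[train_idx] and evaluate on
    # samples[test_idx]; here we only record a stand-in accuracy.
    fold_accuracies.append(1.0)

mean_acc = float(np.mean(fold_accuracies))
print(f"folds: {len(fold_accuracies)}, mean accuracy: {mean_acc:.2f}")
```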

Fig. 6. Results from the single camera model: (a) accuracy and cross entropy; (b) confusion matrix of the highest-accuracy fold.

Fig. 7. Results from the MultiConvAttnNet C model on the multiple camera dataset: (a) accuracy and cross entropy; (b) confusion matrix of the highest-accuracy fold.

TABLE I
ACCURACY RESULTS OF A SINGLE CAMERA

Trained layer         | Accuracy (%) | Standard Deviation
Fully connected layer | 13.48        | 1.06
Whole network         | 93.03        | 0.65

A. Single Camera

In this scenario, the dataset contains 6,382 images from a single CCTV camera with 27 classes of water levels. We retrained the base model from Fig. 3 using the Keras library with the initial weights from ImageNet, for 50 epochs with 50 images per batch at 299 x 299 pixels, using the SGD optimizer. Table I reports the results from 5-fold cross-validation. The confusion matrix and training graph of the best fold are shown in Fig. 6. It is clearly evident that the model performs better in terms of accuracy and stability when the whole network is retrained with the single-camera data. The result from training only the fully connected layer is significantly low because the pre-trained weights from the ImageNet dataset may not contain features relevant to our CCTV images.

B. Multiple cameras

In this scenario, we combine two CCTV cameras from different locations that have water levels of the same magnitude. This was done to simulate a situation where a single model could detect a water level from multiple locations instead of training each camera separately. This approach would be beneficial in cases where there are missing data, for example, when water sensors are broken and the CCTV cameras become a single source of information. The combined dataset contains 6,652 images with 13 classes of water levels. Since our single-camera model achieves over 93% accuracy, we applied the same architecture to this dataset. However, this resulted in a 75.60% accuracy, which is much lower than the single camera. We then increased the dataset by using data augmentation techniques, because both cameras capture images in different positions and angles, which may result in overfitting to a single camera during training. This approach increased the accuracy to 82.80%. All the training settings were the same as with the single camera, except that the image size is reduced to 200 x 200 due to the available computational resources. Furthermore, the augmented dataset is trained with MultiConvAttnNet to reduce the complexity of the network, as shown in Fig. 4. Table II reports the results of each setting. MultiConvAttnNet A achieves slightly better accuracy, but it increases the number of parameters to 25.3 million because the attention mechanism adds extra layers. Therefore, we also removed some layers in the base model, such as Inception Module C, as in MultiConvAttnNet B. It achieves a comparable result and reduces the parameters by over half. Finally, MultiConvAttnNet C is similar to MultiConvAttnNet B, but we replace some convolutional layers inside Inception Module B with the attention layer, as in Fig. 5. The testing result is 81.7% accuracy with 7.82 million parameters. The performance is similar to the previous setting, but it significantly reduces the number of parameters. The confusion matrix and training graph of MultiConvAttnNet C are shown in Fig. 7.

C. Comparison with the actual water level sensors

In each scenario, the developed model is evaluated with test images to predict river water levels. Then, the results

Fig. 8. Predicted level vs. actual level of test images by timestamp: (a) single camera model; (b) multiple cameras, camera A; (c) multiple cameras, camera B.

are compared against their actual levels from a water level sensor. Fig. 8 shows the comparison between the predicted results from each model and the ground truth at the same timestamps. It clearly shows that the majority of the data overlap with the ground truth. The single-camera model has lower error rates than the multiple camera model, as shown in Table III. This is because the model is based on a single location, yielding less variation among training and testing images. Although the multiple camera model does not clearly distinguish between the 2 cameras, it could be useful for reducing the number of models to train and for better performance on unseen cameras.
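The comparison against the sensor readings reduces to computing MAE and RMSE between predicted and actual levels; a minimal sketch with illustrative values (the measured results appear in Table III):

```python
# Sketch of the Section III-C comparison: predicted water levels vs.
# readings from the nearby sensor, summarized as MAE and RMSE.
# The level values below are illustrative, not from the dataset.
import numpy as np

actual = np.array([1.20, 1.25, 1.30, 1.25, 1.20])     # sensor readings
predicted = np.array([1.20, 1.25, 1.25, 1.25, 1.20])  # model output

mae = np.mean(np.abs(predicted - actual))
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```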

TABLE II
COMPARISON OF DIFFERENT RESULTS

Settings                           | Accuracy (%) | Standard Deviation | Parameters (M)
InceptionV3 (without augmentation) | 75.6         | 2.44               | 21.79
InceptionV3 (Fig. 4)               | 82.8         | 1.53               | 21.79
MultiConvAttnNet A (Fig. 5a)       | 83.1         | 2.00               | 25.30
InceptionAB                        | 81.35        | 1.33               | 9.95
MultiConvAttnNet B (Fig. 5b)       | 81.70        | 1.75               | 9.27
MultiConvAttnNet C (Fig. 6)        | 81.70        | 1.59               | 7.82
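The augmentation step whose effect Table II quantifies (75.6% to 82.8% for the multi-camera InceptionV3) could look like the sketch below. The paper does not list the exact transforms, so the rotation, shift, and flip parameters here are assumptions; the 200 x 200 size matches the reduced input used for the multi-camera scenario.

```python
# Sketch of the data-augmentation step for the multi-camera dataset,
# using Keras' ImageDataGenerator. Transform parameters are assumed,
# chosen to mimic varied camera positions and angles.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=10,       # small rotations for varied camera angles
    width_shift_range=0.1,   # horizontal jitter
    height_shift_range=0.1,  # vertical jitter
    horizontal_flip=True,
    rescale=1.0 / 255,
)

# Placeholder batch: 8 random "images" with 13 water-level classes.
images = np.random.randint(0, 256, size=(8, 200, 200, 3)).astype("float32")
labels = np.eye(13)[np.random.randint(0, 13, size=8)]

batch_x, batch_y = next(augmenter.flow(images, labels, batch_size=4))
print(batch_x.shape, batch_y.shape)  # (4, 200, 200, 3) (4, 13)
```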

TABLE III
MEAN ABSOLUTE ERROR (MAE) AND ROOT MEAN SQUARE ERROR (RMSE) OF THE PREDICTIONS AGAINST THE ACTUAL VALUES

Test Data               | MAE   | RMSE
Single Camera           | 0.001 | 0.007
Multiple Camera (Cam A) | 0.003 | 0.012
Multiple Camera (Cam B) | 0.002 | 0.009

IV. CONCLUSION

Our goal was to make use of real-world data to detect river water levels using multiple sensor information, such as CCTV cameras and water level sensors. Traditionally, image-based water level detection systems are constrained to ideal environments for image capture. For example, if CCTV cameras are focused at water gauges, then any derived models can only be applied at that particular location. This paper proposes a deep learning approach to water level detection in uncontrolled settings, such as when there is no visible water gauge in an image. The proposed method introduces two approaches: a CNN using transfer learning from pre-trained InceptionV3, and a CNN with an integrated attention mechanism. From the experimental results, we achieve the following contributions: 1) a single-location model achieved a high accuracy of 93%; since we have no control over CCTV camera settings and many CCTV cameras are not positioned at water gauges, our result is better than current approaches in terms of practical use; 2) CNN architectures require a large number of images and balanced data in each class; however, we show that using a small number of images and imbalanced data from multiple cameras with different positions and angles can achieve a better result by applying data augmentation techniques; 3) we introduced the integration of the attention mechanism into the existing CNN architecture, called MultiConvAttnNet, to help capture features from multiple cameras, which reduced the computational costs and the number of model parameters by more than half compared to the base model.

ACKNOWLEDGMENT

This research project was partially supported by the Faculty of Information and Communication Technology, Mahidol University, and the ICT International Collaboration Fund of the National Institute of Advanced Industrial Science and Technology.

REFERENCES

[1] "Disaster information archive." [Online]. Available: https://www.adrc.asia/view_disaster_en.php?NationCode=&Lang=en&Key=2357
[2] M. A. Islam, T. Islam, M. A. Syrus, and N. Ahmed, "Implementation of flash flood monitoring system based on wireless sensor network in Bangladesh," in 2014 International Conference on Informatics, Electronics & Vision (ICIEV), May 2014, pp. 1–6.
[3] B. Ko and S. Kwak, "Survey of computer vision-based natural disaster warning systems," Optical Engineering, vol. 51, no. 7, p. 070901, 2012.
[4] J. Yu and H. Hahn, "Remote detection and monitoring of a water level using narrow band channel," Journal of Information Science and Engineering, vol. 26, no. 1, pp. 71–82, 2010.
[5] K. Hiroi and N. Kawaguchi, "FloodEye: Real-time flash flood prediction system for urban complex water flow," in 2016 IEEE SENSORS. IEEE, 2016, pp. 1–3.
[6] Y.-T. Lin, Y.-C. Lin, and J.-Y. Han, "Automatic water-level detection using single-camera images with varied poses," Measurement, vol. 127, pp. 167–174, 2018.
[7] J. Pan, Y. Yin, J. Xiong, W. Luo, G. Gui, and H. Sari, "Deep learning-based unmanned surveillance systems for observing water levels," IEEE Access, vol. 6, pp. 73561–73571, 2018.
[8] "Water Crisis Prevention Center." [Online]. Available: http://mekhala.dwr.go.th/weblinks.php
[9] "Disaster prevention information of river." [Online]. Available: http://www.river.go.jp/kawabou/ipTopGaikyo.do
[10] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[11] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[13] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[14] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
[15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[17] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations, 2015.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[19] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, "Attention augmented convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3286–3295.
[20] N. Parmar, P. Ramachandran, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens, "Stand-alone self-attention in vision models," in Advances in Neural Information Processing Systems, 2019, pp. 68–80.

