A R T I C L E  I N F O

Keywords: Projection detection, Deep learning, Deep regression networks, Camera modeling, Image formation, Overhead object

A B S T R A C T

Despite the availability of preventive and protective systems, accidents involving falling overhead objects, particularly load-bearing cranes, still occur and can lead to severe injuries or even fatalities. Therefore, it has become crucial to locate the projection of heavy overhead objects to alert those beneath and prevent such incidents. However, developing a generalized projection detector capable of handling various overhead objects with different sizes and shapes is a significant challenge. To tackle this challenge, we propose a novel approach called OverProjNet, which uses camera frames to visualize the overhead objects and the ground-level surface for projection detection. OverProjNet is designed to work with various overhead objects and cameras without any location or rotation constraints. To facilitate the design, development, and testing of OverProjNet, we provide two datasets: CraneIntenseye and OverheadSimIntenseye. CraneIntenseye comprises actual facility images, positional data of the overhead objects, and their corresponding predictions, while OverheadSimIntenseye contains simulation data with similar content but generated using our simulation tool. Overall, OverProjNet achieves high detection performance on both datasets. The proposed solution's source code and our novel simulation tool are available at https://github.com/intenseye/overhead_object_projector. For the dataset and model zoo, please send an email to the authors requesting access at https://drive.google.com/drive/folders/1to-5ND7xZaYojZs1aoahvu6BkLlYxRHP?usp=sharing.
1. Introduction

Overhead objects, such as cranes, represent crucial equipment used extensively across various industrial sectors for executing vertical and horizontal lifting operations. Despite their widespread utilization, crane-involved incidents resulting in serious injuries and fatalities have been documented due to the intricate nature of such lifting operations. The Census of Fatal Occupational Injuries (CFOI) reported 83 crane-involved fatalities on average in the United States from 1997 to 2005. In 2006, the number of incidents decreased to 72, with 42% of them resulting from being struck by falling objects, while 8% were caused by other types of object strikes (Bureau of Labor Statistics, 2008). From 2011 to 2017, CFOI reported a total of 297 crane-involved deaths. Being struck by a falling object or equipment caused 154 deaths, with 79 of them involving an object falling from or being put in motion by a crane (Bureau of Labor Statistics, 2019). Furthermore, crane-related accidents, whether fatal or not, lead to substantial monetary losses and decreased productivity, similar to other types of accidents. The National Safety Council (NSC) estimated the total cost of both fatal and nonfatal preventable injuries as $163.9 billion in the United States in 2020. This cost includes $44.8 billion in wage and productivity losses, $34.9 billion in medical expenses, and $61.0 billion in administrative expenses. NSC also estimated the total number of days lost due to work-related injuries as sixty-five million in 2020, excluding the days lost because of injuries that happened in previous years. The number rises to ninety-nine million if these days are also taken into account (National Safety Council, 2020).

To enhance the occupational safety and well-being of workers, as well as prevent any accident, operators controlling overhead objects such as cranes must exhibit a heightened awareness of the proximity of the object to other objects and individuals in the workplace. However, this presents a challenge as operators may occasionally carry out lifting operations without having a full view of the situation. A signal person can give instructions to the operator via either hand signals or electronic communication methods, such as radio, to increase workplace safety, but this operation is also prone to some failures because of its manual
✩ This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. All research and data acquisition activities were conducted using the resources of the Intenseye Research Department.
* Corresponding author.
E-mail addresses: poyraz@intenseye.com (P.U. Hatipoglu), ufuk@intenseye.com (A.U. Yaman), okan@intenseye.com (O. Ulusoy).
https://doi.org/10.1016/j.iswa.2023.200269
Received 23 April 2023; Received in revised form 22 July 2023; Accepted 11 August 2023
Available online 18 August 2023
2667-3053/© 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
P.U. Hatipoglu, A.U. Yaman and O. Ulusoy Intelligent Systems with Applications 20 (2023) 200269
Fig. 1. Flows of processing steps to estimate the projection point of an overhead object (while the red and dashed line flow demonstrates the traditional computer
vision approaches, the blue line flow shows the proposed solution).
model replaces several challenging processing steps needed to make the same prediction using the traditional computer vision approaches, as shown in Fig. 1.

The major contributions of this paper are the following:

• We develop a deep learning model named OverProjNet that estimates the overhead object projection in the 2D images without requiring a camera projection matrix and depth estimation of the objects. Our model infers the latent relationships between the 2D image plane and the 3D scene space.
• We design a simulation tool that generates images of overhead objects in 2D pixel coordinates by configuring the camera, lens, and object properties. To reproduce realistic object detection operations, we consider distortions and disturbances in the designed tool.
• We produce and quantitatively validate the effectiveness of this approach on two datasets: OverheadSimIntenseye and CraneIntenseye. The first dataset is collected from a simulation environment, whereas the second is collected from actual facility cameras.

Outline of the paper. Section 2 provides a comprehensive review of the related work concerning amodal object detection, instance segmentation methods, and projection detection solutions for overhead objects. In Section 3, the details of the simulation tool designed to produce simulation datasets are presented. Section 4 offers the architectural specifics of the proposed method, OverProjNet, along with the loss functions employed for its training. Moving to Section 5, we introduce the details of both the actual and simulation datasets provided for addressing the overhead object projection problem. In Section 6, we present the experimental procedures, setup, and performance results, supplemented by detailed discussions of the proposed solution. Lastly, the paper concludes in Section 7, summarizing the key findings, insights derived from the study, and the directions of future works.

2. Related works

In this section, we explore projection detection techniques across different domains. However, before delving into these techniques, we first examine the latest advancements in amodal object detection and segmentation models. These models are capable of detecting the positions of overhead objects and people in the 2D image plane. Given that the projection detection methodology is a crucial part of the overall warning mechanism, it is essential to first detect the location of the overhead object. While not the main focus of our study, amodal object detection or instance segmentation models serve as processing blocks for detecting the extent and position of overhead objects in the 2D image plane.

The field of computer vision has seen extensive research and proposals of amodal object detection models as well as instance segmentation models, which can accurately detect objects, including occluded parts. Therefore, our proposed approach can be complemented by incorporating these deep learning models for object detection and segmentation, providing the necessary information for the warning system.

Amodal object detection and instance segmentation models. The fields of object detection (Wang, Bochkovskiy, et al., 2022, Xu et al., 2022, Li et al., 2022, Wang, Dai, et al., 2022, Su et al., 2022, Fang et al., 2022, Wei et al., 2022) and instance segmentation (Fang et al., 2022, Wang, Bao, et al., 2022) have gained significant attention among researchers in recent years. Some researchers (Wang, Bochkovskiy, et al., 2022, Xu et al., 2022, Li et al., 2022) focus on developing real-time object detection models, while others (Wang, Dai, et al., 2022, Su et al., 2022, Fang et al., 2022, Wei et al., 2022) aim to achieve better accuracy by fine-tuning a pre-trained neural network for either object detection or instance segmentation.

Amodal object detection (Gählert et al., 2020) and instance segmentation (Ke et al., 2021, Reddy et al., 2022) models are more effective at detecting any parts of the objects, even when partially occluded. Therefore, to detect all the overhead objects' extent accurately, including corners and edges, these methods can be used. Gählert et al. (2020) introduced a visibility-guided non-maximum suppression algorithm to improve the amodal object detection performance of highly occluded objects. Ke et al. (2021) designed a two-stage instance segmentation network, separating the layers responsible for detecting the occluding and occluded objects. Similarly, Reddy et al. (2022) detect the occluding and occluded objects with different networks, but their work also focuses on generating a self-supervised amodal instance segmentation dataset.

Traditional computer vision methods. To determine the camera projection matrix, Liu et al. (2006) and Chen et al. (2009) rely on the correspondence between a soccer or basketball field and the image. They then use this matrix to compute the 3D world coordinates of the objects. However, unlike our proposed approach, their method requires an explicit mapping between the 2D image plane and the 3D world space. To construct this mapping, they use the standard dimensions of the fields and balls as prior knowledge, whereas our method does not depend on such information. Additionally, their work focuses on spherical objects, while we focus on objects having different shapes like a rectangular prism. Unlike spherical objects, which have a uniform appearance irrespective of the viewing angle, prism-shaped objects can have varied appearances based on the viewing angle. Thus, establishing a direct correlation between the size of the detection bounding box and the object's depth is not viable for prism-like objects, even though it is a valid technique for spherical objects.

Neural networks. Instead of relying on traditional computer vision techniques, Wu et al. (2016) utilizes a convolutional neural network to predict the world coordinates of pixels. Their model requires RGB-D images, whereas our model operates solely in the RGB domain. Another related approach is the work of Pedrazzini (2018), which uses several neural network models to predict the 3D world coordinates of a pixel from an RGB image. However, unlike these models, our approach does not require establishing correspondence between 2D image coordinates and 3D world coordinates.

Bird's eye view cameras. Several researchers, including Yang et al. (2015), Neuhausen et al. (2018), Price et al. (2021), have focused on improving workplace safety using bird's eye view cameras. For example, Yang et al. (2015) and Price et al. (2021) have developed systems that generate alerts when a camera mounted on an overhead crane detects a moving object, while Neuhausen et al. (2018) focuses on detecting and tracking workers in bird's-eye view images. However, these methods require at least one camera per overhead crane, resulting in additional installation and maintenance costs. Moreover, the images captured by bird's-eye view cameras mounted on cranes can be blurry due to vibration, reducing object detection accuracy. In contrast, our approach can utilize cameras positioned at any location and does not require specific camera setups or locations. Therefore, our method offers greater flexibility and applicability compared to bird's-eye view camera-based approaches.

Point cloud-based methods. Instead of using a 2D vision-based safety method, Cheng and Teizer (2014) employs three-dimensional terrestrial laser scanners (TLS) to generate point cloud data of the construction site. Additionally, an ultra-wideband (UWB) real-time location tracking sensing (RTLS) system is utilized to predict workers' positions and crane hooks. These systems increase crane operator awareness in blind spots. Similarly, Fang et al. (2016) uses TLS to generate point cloud data of a lifting site, which is used to identify potential collisions between the crane parts and obstructions on the site. While 3D point cloud-based systems provide more accurate distance measurements compared to 2D vision systems, their real-time processing ability is limited due to their higher computational resource requirements. In contrast, 2D cameras are generally more affordable and accessible than 3D sensors.

Radio-frequency (RF) methods. Instead of a vision-based method, Li et al. (2013) uses Radio Frequency Identification (RFID) and Global
Positioning System (GPS) to predict the worker and crane positions in construction sites. An alert is raised when a worker is detected inside a dangerous zone. Similarly, Hwang (2012) uses the UWB system to understand the collision risk between crane booms and prevent equipment-to-equipment collisions, while Park et al. (2017) focuses on developing a multi-modal tracking system composed of Bluetooth Low Energy (BLE) sensors, Inertial Measurement Unit (IMU) sensors, and a Building Information Model (BIM) to track the worker positions. RF signals are generally affected by radio-frequency interference, which reduces the signal-to-noise ratio and can result in information loss or complete data loss in some extreme cases (Sue, 1981). Compared to cameras, these solutions have a higher installation and maintenance cost.

3. Camera modelling & simulation tool

This section describes our simulation tool, which is developed to generate simulation datasets for use in this study.

Annotating images with object and projection point labels is a time-consuming and costly process. Moreover, acquiring data with various camera and object placements may not be feasible due to operational constraints. As we have limited access to annotated real-world data, we developed a simulation tool to generate simulated data with various camera and object properties to evaluate the effectiveness of our proposed approach.

To create the simulation tool, we first model the camera and lens distortion characteristics. We then model configurable-sized overhead objects to generate visuals and positional data. Additionally, we incorporate random rotations for the overhead objects to ensure a comprehensive study. We provide a brief overview of the camera model used in the simulation tool in Sec. 3.1, followed by a detailed explanation of the simulation tool and objects in Sec. 3.2.

3.1. Camera geometry & modeling

The simulation camera is modeled using the pinhole camera model, which assumes that the image coordinates are Euclidean coordinates and have equal scales in both 𝑥 and 𝑦 directions. In order to map a point 𝑿 in the world coordinate system to the image plane and obtain the corresponding pixel coordinate 𝒙, a series of projection operations are performed. The image plane is positioned on the principal axis such that 𝑍𝑐 = 𝑓 (focal length), and the principal point 𝒑𝒄 is located at the center of the image plane. The camera coordinate system and the world coordinate system are centered at 𝑪 and 𝑾, respectively. Fig. 3 illustrates the camera coordinate system, the world coordinate system, the pixel coordinates, and the mapping of an arbitrary point.

𝒙 = ℙ𝑿. (1)

The projection matrix ℙ, which maps 3D world points to 2D pixel space, is obtained by multiplying the intrinsic and extrinsic matrices. The intrinsic matrix, also known as the camera calibration matrix (Hartley & Zisserman, 2003), is used to map points from the camera coordinate system to the pixel coordinate system. On the other hand, the extrinsic matrix is also a transformation matrix that maps points from the world coordinate system to the camera coordinate system. The extrinsic matrix stores information about the camera's orientation and position, while the intrinsic matrix represents internal camera and lens properties such as focal length, principal point, and pixel pitch. The intrinsic matrix, denoted by 𝕂 in homogeneous coordinates space (Hartley & Zisserman, 2003), is defined as shown in Eq. (2).

    ⎡ 𝑓   0   𝑝𝑐,𝑥 ⎤
𝕂 = ⎢ 0   𝑓   𝑝𝑐,𝑦 ⎥        (2)
    ⎣ 0   0   1    ⎦

where the coordinates of the principal point are (𝑝𝑐,𝑥, 𝑝𝑐,𝑦)ᵀ, and 𝑓 symbolizes the focal length. The extrinsic matrix is responsible for mapping points from the world coordinate system to the camera coordinate system using rotation and translation operations. To form the rotation component, ℝ, of the extrinsic matrix, Euler angles (Slabaugh, 1999) around each axis must be defined. This paper uses the convention displayed in Fig. 4.

Fig. 4. Pitch, Yaw, and Roll angles (arrows demonstrate positive directions).

The Pitch, Yaw, and Roll angles define the rotation angles around the 𝑥, 𝑦, and 𝑧 axes, and the corresponding rotations are represented by ℝ𝑥(), ℝ𝑦(), and ℝ𝑧() in 3 × 3 matrix form. To form the rotation matrix, a sequence of multiplication operations using all three rotation matrices is performed. It is important to note that the order of multiplication affects the resulting rotation matrix due to the non-commutativity of matrix multiplication. In this simulation design, the camera is rotated first about the z-axis, then the y-axis, and finally the x-axis. Thus, the resulting rotation matrix is expressed as shown in Eq. (3). The open-form matrix representation of ℝ𝑥(), ℝ𝑦(), and ℝ𝑧() is presented in Appendix Sec. A.2.

ℝ = ℝ𝑥(𝛼)ℝ𝑦(𝛽)ℝ𝑧(𝛾) (3)

where 𝛼 (pitch), 𝛽 (yaw), and 𝛾 (roll) are the angle differences between the world and camera coordinate systems. The second component of the extrinsic matrix, the translation 𝒕, can be formulated as

𝒕 = −ℝ𝑪̃, (4)

where 𝑪̃ represents the camera center coordinates in the world coordinate frame (Hartley & Zisserman, 2003). After defining each component, the projection matrix can be represented as the multiplication of the intrinsic and extrinsic matrices, as given in Eq. (5).
ℙ = 𝕂[ℝ|𝒕]. (5)
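To make Eqs. (1)–(5) concrete, the following NumPy sketch assembles the intrinsic matrix, an Euler-angle rotation, and the translation, then projects a world point into pixel coordinates. This is an illustrative sketch rather than the paper's implementation: the sign conventions of the individual rotation matrices and the example focal length, principal point, and camera pose are our assumptions.

```python
import numpy as np

def rotation_matrix(pitch, yaw, roll):
    """R = Rx(pitch) @ Ry(yaw) @ Rz(roll) as in Eq. (3); angles in radians.
    The per-axis sign conventions here are assumed, not taken from the paper."""
    ca, cb, cg = np.cos([pitch, yaw, roll])
    sa, sb, sg = np.sin([pitch, yaw, roll])
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def projection_matrix(f, pc, R, C):
    """P = K [R | t] with t = -R C~ (Eqs. (2), (4), (5))."""
    K = np.array([[f, 0.0, pc[0]],
                  [0.0, f, pc[1]],
                  [0.0, 0.0, 1.0]])
    t = -R @ np.asarray(C, dtype=float)       # Eq. (4)
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(P, X_world):
    """Map a 3D world point to pixel coordinates via x = P X (Eq. (1))."""
    x = P @ np.append(X_world, 1.0)           # homogeneous image point
    return x[:2] / x[2]                       # dehomogenize

# Example: camera at the world origin, no rotation, f = 1000 px,
# principal point at the center of a 1280x720 image.
P = projection_matrix(1000.0, (640.0, 360.0), rotation_matrix(0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
print(project(P, np.array([0.0, 0.0, 10.0])))  # -> [640. 360.]
```

With an identity rotation and the camera at the world origin, a point on the optical axis maps to the principal point, which is a quick sanity check for whichever angle convention is chosen.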
In order to find the 3D world coordinate of a pixel, back-projection
operations are employed. However, as the back-projection operation of
a pixel point generates a ray that passes through the camera center
and the pixel point, it is necessary to trace the ray until it intersects
with an opaque object in the 3D world coordinate system. The ray that
passes through the given pixel coordinate and the camera center can be
represented as shown in Eq. (6).
where 𝑝𝑖,𝑥 and 𝑝𝑖,𝑦 denote the x-axis and y-axis locations of the arbitrary undistorted point. The same subscript notation is also employed for the principal point, 𝒑𝒄. The relation between the distorted location, 𝑑𝑖, and the undistorted arbitrary pixel point, 𝑝𝑖, can be seen in Eq. (9).

𝑑𝑖 = (𝑢𝑖, 𝑣𝑖),
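The back-projection procedure described above (a ray through the camera center and the pixel, traced until it meets an opaque surface) can be sketched as follows. In this minimal NumPy illustration, the opaque surface is replaced by a known plane, lens distortion is ignored, and the intrinsics, pose, and helper name `backproject_to_plane` are our assumptions.

```python
import numpy as np

def backproject_to_plane(K, R, C, pixel, plane_normal, plane_d):
    """Cast the ray through the camera center and `pixel`, and return its
    intersection with the plane n.X + d = 0, standing in for the opaque
    surface the ray must be traced to."""
    x_h = np.array([pixel[0], pixel[1], 1.0])     # homogeneous pixel
    direction = R.T @ np.linalg.inv(K) @ x_h      # ray direction in world frame
    C = np.asarray(C, dtype=float)
    n = np.asarray(plane_normal, dtype=float)
    lam = -(n @ C + plane_d) / (n @ direction)    # solve n.(C + lam*dir) + d = 0
    return C + lam * direction

K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
# Intersect the principal ray with the plane Z = 10 (normal (0,0,1), d = -10).
X = backproject_to_plane(K, R, (0.0, 0.0, 0.0), (640.0, 360.0), (0.0, 0.0, 1.0), -10.0)
print(X)  # -> [ 0.  0. 10.]
```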
Fig. 8. Illustrative training workflow of OverProjNet. Alternative ways of the annotation process are also provided to offer a comprehensive view of the complete
training process.
Fig. 9. Illustrative testing workflow of OverProjNet. In addition to the part that the paper focuses on, the amodal object detection/instance segmentation and the operations that need to be involved in the warning system are also provided to offer a comprehensive view of a possible end-to-end solution.
Fig. 10. Flowchart illustrating the key steps in the projection detection process, highlighting the distinctions between real and simulation data.
achieved through flexible architecture, allows users to tailor it to their specific resource and time requirements. All architecture alternatives use ReLU activations after each fully connected layer, except for the output layers. The ReLU activation function effectively replaces negative values with zero while leaving positive values unchanged (Agarap, 2018). By setting negative inputs to zero, ReLU introduces non-linearity into the network. This non-linear activation enables the network to learn complex non-linear relationships and represent more intricate decision boundaries. One key advantage of ReLU is its simplicity and computational efficiency compared to other common activation functions like sigmoid or tanh. Additionally, ReLU helps address the vanishing gradient problem, which can hinder the training of deep neural networks. Since ReLU does not saturate for positive inputs, it mitigates the gradient vanishing issue and facilitates more effective gradient flow during backpropagation. A detailed and comparative performance review of the network alternatives is provided in Sec. 6.

To validate the need for a solution with networks capable of modeling non-linearities, we also test and report the projection detection performances of the linear perceptron in Sec. 6.3.5. Additionally, we test OverProjNet by adding batch normalization layers before each ReLU activation and inserting dropout layers just after the last ReLU activations to examine the models' response to additional regularization operations. The detection performances of OverProjNet with these modifications are reported in Sec. 6.3.4. To train OverProjNet, two loss functions are used alternatively: mean squared error (MSE) loss (Eq. (11)) and mean high-order error (MHOE) loss (Eq. (12)), where the latter uses the 4th power of the error. In Sec. 6.3.3, we analyze the effect of these loss functions on the solution's performance. While MSE loss is a common choice for regression problems, MHOE loss is also tested in this work to evaluate its ability to minimize the maximum observation error. This is important as high errors in some observations may lead to false alerts in a warning system designed with OverProjNet. Limiting the maximum errors can enhance the system's robustness; hence we include MHOE loss in the experiments.

𝑀𝑆𝐸 = (𝑝𝑖,𝑥 − 𝑝̂𝑖,𝑥)² + (𝑝𝑖,𝑦 − 𝑝̂𝑖,𝑦)², (11)

𝑀𝐻𝑂𝐸 = (𝑝𝑖,𝑥 − 𝑝̂𝑖,𝑥)⁴ + (𝑝𝑖,𝑦 − 𝑝̂𝑖,𝑦)⁴, (12)

where 𝑝𝑖 and 𝑝̂𝑖 are the predicted and target projection locations for the 𝑖th overhead object sample.

Incorporating the fourth power within the MHOE loss function serves to amplify the influence of outliers or extreme errors, thereby assigning them greater significance during the optimization process. This can be particularly useful in scenarios where reducing high errors is of utmost importance, such as safety-critical applications or systems where false alerts must be minimized. Taking the fourth power instead of the third or fifth offers a significant advantage by ensuring that the resulting loss values are always non-negative. This non-negativity is highly desirable as it provides a clear and unambiguous indication of the error's magnitude.

5. Datasets

The proposed solution is intended to work on various datasets, including OverheadSimIntenseye (OSI) and CraneIntenseye (CRI), as well as any other datasets that offer the required inputs and target information. The OverheadSimIntenseye dataset is generated using the simulation tool that we design, as explained in detail in Sec. 3. In contrast, CraneIntenseye is composed of images captured from real cameras in different facilities and viewpoints. Both datasets comprise positional and visual data that indicate the pixel location of overhead objects as inputs and the center projection point of the overhead object as targets. While overhead objects are represented by bounding boxes, the center projection points of the overhead object are denoted by points. The data formats provided are as follows:

• Inputs: Pixel coordinates of bounding boxes (covering the objects in the tightest way):
  – Top-left of the bounding box in the x-axis
  – Top-left of the bounding box in the y-axis
  – Bottom-right of the bounding box in the x-axis
  – Bottom-right of the bounding box in the y-axis
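For illustration, the per-sample losses of Eqs. (11) and (12) can be written in a few lines of NumPy; the batch reduction (a plain mean) is our assumption, since the equations are stated per sample.

```python
import numpy as np

def mse_loss(pred, target):
    """Eq. (11): sum of squared per-axis projection-point errors,
    averaged over the batch (batch reduction assumed)."""
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

def mhoe_loss(pred, target):
    """Eq. (12): the same form with the 4th power, so that the largest
    deviations dominate the loss and are suppressed first."""
    return np.mean(np.sum((pred - target) ** 4, axis=-1))

pred = np.array([[0.52, 0.41]])    # normalized projection estimate
target = np.array([[0.50, 0.40]])
print(mse_loss(pred, target))   # ≈ 5.0e-4
print(mhoe_loss(pred, target))  # ≈ 1.7e-7
```

On normalized coordinates in [0, 1] the residuals are below one, so the fourth power shrinks small errors while relatively amplifying the largest ones, matching the motivation given for MHOE.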
Table 1
Parameters used to generate sets of OverheadSimIntenseye. HFOV: Horizontal field of view (FOV) angle of the camera (in degrees), CRAD: Radial distortion parameters
of the camera (unitless), CDIM: Pixel dimensions of the camera (in pixels), CPOS: 3D position of the camera w.r.t. the reference point in the world coordinate system
(in meters), CROT: Rotation angles of the camera (in degrees), ODIM: 3D dimensions of the overhead object (in meters), OPOS_X: 3D position range of the object
w.r.t. the reference point in the x-axis of the world coordinate system (in meters), OPOS_Y: 3D position range of the object w.r.t. the reference point in the y-axis of
the world coordinate system (in meters), OPOS_Z: 3D position range of the object w.r.t. the reference point in the z-axis of the world coordinate system (in meters),
OROT: Rotation range of the object about Y-axis (in degrees), OSIG: Sigma value used to deviate the edges of the overhead object bounding box (unitless), PSIG:
Sigma value used to deviate the projection point of the overhead object (unitless), NOBS: The number of observations produced (unitless).
Parameters/Sets Set01 Set02 Set03 Set04 Set05
HFOV 60.0 120.0 60.0 60.0 60.0
CRAD (𝑘1 ∕𝑘2 ) -0.05/0.0 -0.05/0.01 -0.05/0.0 -0.05/0.0 -0.05/0.0
CDIM (width/height) 1280/720 1920/1080 1280/720 1280/720 1280/720
CPOS (𝐶𝑋 ∕𝐶𝑌 ∕𝐶𝑍 ) -5.0/-8.0/0.0 -5.0/-8.0/0.0 10.0/-12.0/0.0 -5.0/-8.0/0.0 -5.0/-8.0/0.0
CROT (𝑅𝑜𝑙𝑙∕𝑃 𝑖𝑡𝑐ℎ∕𝑌 𝑎𝑤) 0.0/-10.0/-15.0 0.0/-10.0/-15.0 5.0/-15.0/20.0 0.0/-10.0/-15.0 0.0/-10.0/-15.0
ODIM (𝑤∕ℎ∕𝑙) 2.0/2.5/1.0 2.0/2.5/1.0 1.0/2.5/5.0 2.0/2.5/1.0 2.0/2.5/1.0
OPOS_X (min/max) -40.0/40.0 -40.0/40.0 -40.0/40.0 -80.0/80.0 -40.0/40.0
OPOS_Y (min/max) -10.0/-1.25 -10.0/-1.25 -10.0/-1.25 -20.0/-1.25 -10.0/-1.25
OPOS_Z (min/max) 20.0/50.0 20.0/50.0 20.0/50.0 40.0/70.0 20.0/50.0
OROT (min/max) -10.0/10.0 -10.0/10.0 -10.0/10.0 -30.0/30.0 -10.0/10.0
OSIG 1.0 1.0 1.0 1.0 2.0
PSIG 1.0 1.0 1.0 1.0 2.0
NOBS (train/val./test) 298/746/746 267/669/669 299/747/747 294/741/741 294/740/741
The utilization of simulation datasets in computer vision research has become a well-established technique across various fields, including object detection (Mittal et al., 2022, Boikov et al., 2021) and classification (Wong et al., 2019). By leveraging our simulation tool presented in Sec. 3.2, we enabled the generation of realistic scenarios for overhead object projection.

The OverheadSimIntenseye comprises five distinct sets, each encompassing different camera placements, camera rotations, and variable-sized objects, along with other camera and lens parameters. The overhead objects are positioned randomly in 3D space, and data is collected at these random locations for each dataset. During collection, the size of the overhead objects, the 3D camera location, rotational position, and internal camera and lens properties are fixed for each set. However, random orientations around the Y-axis are applied to the overhead objects to simulate rotational variation in each individual observation position. The fixed and variable parameters, and their allowed ranges used to generate each set of OverheadSimIntenseye, are presented in Table 1. Simulation images are also provided for visual demonstrations and investigations, along with positional data.

Bounding box boundaries and projection points are manipulated by adding random deviations within predefined limits, as mentioned in Sec. 3.2. To analyze the effect of these deviations, we generate sets with and without applied deviations to the edges of the bounding boxes and the projection points of the objects. The impact of deviations on performance is tested, and detailed results are presented in Sec. 6.3.1.

5.2. CraneIntenseye

The CraneIntenseye dataset consists of two sets, where the inputs are obtained from actual cameras on facilities. A laser pointer is mounted at the center bottom of the cranes to indicate the projection points in the images and assist with annotation operations. The images are captured from fixed-position cameras while the cranes are in operation. The cameras used for data collection have different lens properties, viewpoints, and rotational angles, and different overhead objects (cranes) are used during the collection of the CraneIntenseye dataset. Unlike the OverheadSimIntenseye dataset, manual annotation is required to label the overhead objects and projection points on the images collected for CraneIntenseye. The same annotation format used for OverheadSimIntenseye is also adopted for CraneIntenseye. Since most of the parameters used to generate the sets of OverheadSimIntenseye are unknown for CraneIntenseye, only the available ones are presented in Table 2 for CraneIntenseye.

5.3. Data preprocessing

Coordinate transformation. To ensure consistency in the data preprocessing for both OverheadSimIntenseye and CraneIntenseye, the same set of preprocessing procedures is applied to both datasets. Initially, the bounding box coordinates, which are in the format of (top-left x-axis, top-left y-axis, bottom-right x-axis, bottom-right y-axis), are converted to the format of (bottom-center x-axis, bottom-center y-axis, width, height). Since the overhead objects are captured with cameras that have low roll angles, the bottom-center points of the bounding boxes are selected as the input, along with the width and height of the bounding boxes, as they are considered to be most directly related to the projection point. The impact of this decision on performance is discussed in Sec. 6.3.2. The projection point of the overhead objects is represented by the relative displacement of the projection point with respect to the bottom-center point of the bounding box.

Normalization. Subsequently, all input and target pixel information is normalized based on the camera pixel dimensions to fit the data into the [0, 1] range. Finally, the normalized input and target information is utilized for training and testing operations in OverProjNet.

6. Experiments & results

In this section, we evaluate the performance of the proposed solution on different datasets and present the results and findings.
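The coordinate transformation and normalization of Sec. 5.3 can be sketched as follows; a minimal NumPy illustration in which the function name `preprocess` and the example box, projection point, and image size are hypothetical.

```python
import numpy as np

def preprocess(box, proj_point, img_w, img_h):
    """Convert an (x1, y1, x2, y2) bounding box to the paper's input format
    (bottom-center x, bottom-center y, width, height), express the target
    as the displacement of the projection point from the bottom-center,
    and normalize everything by the image dimensions."""
    x1, y1, x2, y2 = box
    bc_x, bc_y = (x1 + x2) / 2.0, y2            # bottom-center of the box
    w, h = x2 - x1, y2 - y1
    inputs = np.array([bc_x / img_w, bc_y / img_h, w / img_w, h / img_h])
    target = np.array([(proj_point[0] - bc_x) / img_w,
                       (proj_point[1] - bc_y) / img_h])
    return inputs, target

# A 100x100 px box whose projection point lies 160 px below the bottom-center.
inputs, target = preprocess((600.0, 300.0, 700.0, 400.0), (655.0, 560.0), 1280, 720)
```

Note that the relative-displacement target can be negative, so strictly speaking only the box inputs are confined to [0, 1] in this sketch.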
Table 3
Optimal hyperparameters for architecture alternatives of OverProjNet and search ranges. TBDS: Time-based decay scheduler, RPS: Reduce plateau scheduler, OCS:
One cycle scheduler.
Parameters / Architectures    XS        S         M         L         XL        Search Ranges
Batch Size                    16        64        8         8         8         8, 16, 32, 64, 128
Initial Learning Rate         5.81e-3   1.18e-2   1.00e-3   5.10e-3   3.86e-4   [4.54e-5, 1.73e-2]
Learning Rate Scheduler       OCS       OCS       OCS       OCS       OCS       TBDS, RPS, OCS
Epsilon of AdamW              1.43e-10  1.35e-08  1.30e-10  1.18e-09  7.14e-09  [1.83e-12, 6.91e-8]
Weight Decay of AdamW         1.52e-05  2.07e-05  3.26e-07  3.76e-05  2.66e-05  [3.06e-7, 4.54e-5]
6.1. Experimental details

First, we describe the setup used in the experiments, the performance criteria, and the details of the training and hyperparameter search operations.

6.1.1. Setup
All model developments are made using PyTorch (Paszke et al., 2019), and all hyperparameter tuning operations have been conducted using the sweeping ability of W&B (Biewald, 2020). The training and testing operations are performed on a computer with an RTX 3080 Laptop GPU, an Intel 12-core CPU, and 32 GB of RAM running Ubuntu 20.04.

6.1.2. Performance evaluation
The performance of the model is evaluated by measuring the distance between the estimated projection point and the actual projection point. To classify the predictions as true or false positives, a threshold value proportional to the image dimensions is set. Predictions where the distance between the predicted and the actual projection pixel location is less than or equal to the specified threshold value are considered true positives, while the remaining predictions are evaluated as false positives. The threshold value used to determine the state of the detections is expressed in Eq. (13):

    thr = c · √((2·p_c,x)² + (2·p_c,y)²),   (13)

where c is a constant value, selected as 0.005 in this study. In other words, with this selected c value, the threshold takes values of approximately 7.3 and 11.0 pixels for images with dimensions of 1280 × 720 and 1920 × 1080 pixels, respectively. Following the determination of the threshold value, the detection accuracy, which represents the ratio of true positive counts to all detections, is reported and computed using Eq. (14):

    acc = |TP| / (|TP| + |FP|),   (14)

where |·| denotes the cardinality of a set, TP denotes the set of true positives, and FP denotes the set of false positives. In addition to the detection accuracy, mean and maximum values of the detection errors are also reported for detailed investigation.

Since we know all the required information, the error distances between the actual and predicted projection points are also measured in the world coordinate system using the flat-surface assumption and the back-projection operation expressed in Eq. (6), but only for the OverheadSimIntenseye dataset. For the CraneIntenseye dataset, which lacks geographic information about its scenes, we cannot perform distance-based calculations.

6.1.3. Hyperparameter and training details
In order to achieve the best performance in projection detection, we employ hyperparameter tuning using the Bayesian search (Dewancker et al., 2016) capabilities of Weights & Biases (Biewald, 2020). Our aim is to obtain optimal validation performance by sweeping the search parameters within defined ranges and distributions. Specifically, we tune the batch size, initial learning rate, learning rate scheduler type, and the epsilon and weight decay parameters of the AdamW (Loshchilov & Hutter, 2018) optimizer. To this end, we use three types of learning rate schedulers: time-based decay (TBDS), reduce plateau (RPS), and one cycle (OCS) (Smith & Topin, 2019). For each of the alternative architectures of OverProjNet, we conduct independent sweeping operations to determine the optimal hyperparameters. We use the first set of OverheadSimIntenseye for all sweeping operations and compare the validation set performances to find the best hyperparameter values. The optimized parameters are then set for all training and testing operations, including those conducted using CraneIntenseye. The maximum epoch count is set to 5000 for all training experiments. The search ranges and optimal hyperparameter values are presented in Table 3. We also swept the hyperparameters one more time for the ablation study, where we used the mean high-order error (MHOE) loss instead of the MSE loss, to ensure fair comparisons.

6.2. Experiment 1: projection detection performance

This section presents a comparison of five different architectures of OverProjNet on the OverheadSimIntenseye and CraneIntenseye datasets. The training and validation subsets are used for training, while the test subset is utilized to evaluate the projection detection performance of the trained model. The detection accuracies obtained are reported in Table 4, along with the mean and maximum pixel error measures. Pixel errors represent the Euclidean distance between the estimated and target locations of the projection points in the pixel domain.

The accuracy values in Table 4 indicate that the detection performances of the different architectures can vary depending on the sets of the OverheadSimIntenseye (OSI) and CraneIntenseye (CRI) datasets. The highest detection accuracies are achieved for the second set (Set02) of OverheadSimIntenseye with various architecture options of OverProjNet, whereas the lowest accuracy values are obtained for the fifth set (Set05) of OverheadSimIntenseye. The best performance values obtained for the sets of CraneIntenseye (below the dashed line in Table 4) are relatively close compared to those achieved for OverheadSimIntenseye. When comparing the architectures of OverProjNet, it is observed that different architectures achieve the best and second-best detection values for different sets. In general, the shallow architecture options (XS and S versions of OverProjNet) achieve better or comparable accuracy scores on the sets of OverheadSimIntenseye than other architectures, whereas the deeper architectures (M, XL versions of OverProjNet) are more accurate in detecting projection points on the sets of CraneIntenseye.

Upon comparing the accuracy scores of Set01 and Set02 of OverheadSimIntenseye, we observe that higher accuracy values are achieved for Set02. This implies that projection points are detected more accurately when higher-resolution and wide-angle cameras are used, despite the challenging distortion behavior due to the non-zero k₂ value. In fact, the accuracies on Set02 are significantly high: projection points are detected for nearly all samples in the test split of Set02 using all architectures of OverProjNet. For a detailed description of the sets generated for OverheadSimIntenseye, please refer to Sec. 5.1.

We analyze the effects of camera rotation, camera placement, and overhead object dimensions on the model performance by comparing the scores achieved for Set01 and Set03. It is observed that the detection performance can significantly decrease when attempting to detect projection points of longer and thinner overhead objects from a camera mounted at higher points with steeper angles in each rotational axis.
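The evaluation protocol of Sec. 6.1.2 (Eqs. (13) and (14)) can be sketched as follows; the helper names are ours, and we assume the principal point sits near the image center, so that 2·p_c,x and 2·p_c,y reduce to the image width and height.

```python
import math

# Sketch of the evaluation in Sec. 6.1.2 (helper names are ours): a
# prediction is a true positive when its pixel distance to the ground truth
# is within a threshold proportional to the image diagonal (Eq. 13), and
# accuracy is the ratio of true positives to all detections (Eq. 14).

def detection_accuracy(preds, targets, img_w, img_h, c=0.005):
    # Eq. (13): thr = c * sqrt((2*p_cx)^2 + (2*p_cy)^2). With the principal
    # point near the image center, 2*p_cx ~ width and 2*p_cy ~ height, so
    # the threshold is c times the image diagonal.
    thr = c * math.hypot(img_w, img_h)
    tp = sum(1 for (px, py), (tx, ty) in zip(preds, targets)
             if math.hypot(px - tx, py - ty) <= thr)
    fp = len(preds) - tp
    # Eq. (14): acc = |TP| / (|TP| + |FP|)
    return tp / (tp + fp), thr
```

With c = 0.005, the threshold evaluates to roughly 7.3 pixels for a 1280 × 720 image, matching the values quoted in Sec. 6.1.2.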
Table 4
Performance results of the proposed architectures on the sets of OverheadSimIntenseye (OSI) and CraneIntenseye (CRI) datasets. The first columns of each architecture
are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors (MXE), respectively. The dashed line is used to split
the performance measures of OverheadSimIntenseye and CraneIntenseye datasets. Bold and underlined numbers are used to show the best and second-best accuracy
values in each set, respectively.
Arch.∖Sets   XS                S                 M                 L                 XL
             acc  ME   MXE     acc  ME   MXE     acc  ME   MXE     acc  ME   MXE     acc  ME   MXE
OSI Set01    .891 4.13 16.00   .870 4.36 28.05   .894 4.15 17.32   .878 4.21 18.23   .845 4.38 20.31
OSI Set02    .991 3.99 15.21   .991 3.81 41.27   .993 4.24 15.44   .981 4.17 16.74   .993 4.04 14.10
OSI Set03    .681 6.39 39.23   .665 6.45 29.51   .641 6.93 36.43   .621 7.23 45.01   .645 7.09 45.26
OSI Set04    .900 3.89 12.63   .875 4.04 31.24   .896 3.88 14.29   .883 4.12 14.37   .855 4.19 20.48
OSI Set05    .577 7.56 26.40   .587 7.56 28.30   .591 7.56 24.54   .575 7.73 33.00   .604 7.54 29.63
CRI Set01    .667 9.98 39.21   .803 8.52 47.97   .742 8.22 45.39   .727 9.13 40.35   .864 6.44 38.85
CRI Set02    .920 6.31 17.23   .880 5.89 17.66   .960 5.14 15.49   .840 6.37 21.85   .947 4.95 29.72
Table 5
The best-performed architectures (BPA) and corresponding mean pixel errors in pixels (ME) and mean metric distance errors in meters (MME) on the sets of the OverheadSimIntenseye (OSI) dataset.

Sets        BPA     ME          MME
OSI Set01   M       4.15        0.17
OSI Set02   M, XL   4.24, 4.04  0.07, 0.07
OSI Set03   XS      6.39        0.29
OSI Set04   XS      3.89        0.50
OSI Set05   XL      7.54        0.57

However, when monitoring the overhead object from a further point and letting the object visit higher altitudes and wider ranges, the detection performance of the projection points does not change much (Set01 vs. Set04 of OverheadSimIntenseye). The last set of OverheadSimIntenseye (Set05) demonstrates a decrease in detection performance as the dislocation amount in the bounding box and projection point positions increases.

As discussed in Sec. 6.1.2, we measure the Euclidean distance errors between the predicted and actual projection points in the world coordinate system for OverheadSimIntenseye. We assume a flat surface and use the back-projection operation to calculate these errors. Table 5 presents the mean metric errors for the best-performing models of each set in OverheadSimIntenseye. Upon inspection of the last column of the table, we note that the mean metric distance error ranges from a few centimeters to around half a meter on average, which is an acceptable range for designing a protection system with a warning mechanism using OverProjNet. Examining the mean pixel errors and mean metric distance errors together, we observe correlations between the two measurements, with some exceptions. For instance, Set04 has the lowest pixel error but relatively higher metric distance errors. This inconsistency can be explained by the fact that the object travels in areas relatively far from the camera, where the ground sampling distances are higher than those closer to the camera. As a result, the mean metric distance error naturally becomes larger.

The accuracy levels achieved in CraneIntenseye are remarkably high, with scores of .964 for Set01 and .960 for Set02, exceeding the best values achieved in certain sets of OverheadSimIntenseye (.681 for Set03 and .604 for Set05) by a significant margin. This is because the sets in OverheadSimIntenseye include challenging scenarios that cannot be easily covered in real-world data, such as different camera placements, rotations, variable object sizes, and various camera and lens parameters.

The mean pixel error measures (ME columns in Table 4) also reflect similar findings to the accuracy scores. Generally, the sets with higher accuracy scores achieve lower mean pixel errors. It is worth noting that the maximum pixel error (third column [MXE] in Table 4) can reach values of up to 50 pixels depending on the set and the architecture of OverProjNet, even when the mean pixel error values are less than 10 pixels. This is expected, as distance-based measurement techniques usually produce right-skewed errors. However, minimizing the maximal error can help reduce potential false alarms and missed detections of a warning system developed using OverProjNet. Therefore, we integrate the MHOE loss into the solution to evaluate its ability to minimize maximal errors, as discussed in Sec. 6.3.3.

In conclusion, since we report high-performance scores for both sets of CraneIntenseye, we can confidently state that the parameters tuned on Set01 of OverheadSimIntenseye work well on CraneIntenseye's sets too.

6.3. Ablation study

In this study, we conduct a comparative analysis to investigate the impact of several factors on the accuracy of object projection detection. Specifically, we examine the effects of deviations in the position of bounding boxes and projection points of objects, coordinate transformation, loss functions, and architectural regularization modifications. In addition, we evaluate the importance of designing a solution with networks that can model non-linearities by testing the performance of a linear perceptron, which is also reported in the scope of our ablation study.

6.3.1. Experiment 2: the effect of deviations
In Sec. 3.2, we described the manipulation of bounding box boundaries in OverheadSimIntenseye by introducing random deviations to simulate annotation and amodal object/segmentation defects. Additionally, we introduced annotation errors in the projection points of the simulated object's center by randomly dislocating these points. In this section, we investigate the effects of these deviations on the performance of OverProjNet. We ask two questions: "What if we do not apply these deviations?" and "How do these deviations affect the performance of OverProjNet?" To answer these questions, we cancel out the deviations in OverheadSimIntenseye and compare the projection detection results with the original dataset.

We find that when we eliminate the deviation operation on the edges of the bounding boxes covering the overhead objects, the accuracy on all sets of OverheadSimIntenseye increases dramatically for all architecture alternatives of OverProjNet (see the first columns [acc] for each architecture alternative in Table 4 and Table 6 (a)). In parallel, the mean pixel errors also decrease greatly (see the second columns [ME] for each architecture alternative in Table 4 and Table 6 (a)). However, when we eliminate the deviations in the projection points but keep the deviation for the edge locations of the overhead objects' bounding boxes at the same level, we do not achieve the same performance increase as observed for the case of no deviations in the edges of the bounding boxes (see the first columns [acc] for each architecture alternative in Table 4, Table 6 (a), and Table 6 (b)). Therefore, we conclude that the precision of the bounding box annotation is more critical than that of the projection annotations when training OverProjNet models for better estimations. When we cancel out all deviations, we observe significant increases in the accuracy values of Set05 of OverheadSimIntenseye compared to the case with no deviation on the edges of the bounding boxes (see the last row of the first columns for each architecture alternative in Table 6 (a) and Table 6 (c)). This is due to the higher deviation values used for this set.
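The flat-surface back-projection used for the metric errors above can be sketched as below. Eq. (6) itself is not reproduced in this excerpt, so the simple pinhole model and all variable names here are our assumptions: a pixel is lifted to a viewing ray in the world frame and intersected with the ground plane z = 0, and the metric error is the distance between the two intersection points.

```python
import numpy as np

# Hedged sketch of flat-surface back-projection (our assumed pinhole model,
# not the paper's exact Eq. (6)). K: 3x3 intrinsics, R: world-to-camera
# rotation, cam_pos: camera center in world coordinates.

def backproject_to_ground(pixel, K, R, cam_pos):
    """Intersect the viewing ray of `pixel` with the ground plane z = 0."""
    uv1 = np.array([pixel[0], pixel[1], 1.0])
    ray_cam = np.linalg.inv(K) @ uv1      # ray direction in the camera frame
    ray_world = R.T @ ray_cam             # rotate into the world frame
    t = -cam_pos[2] / ray_world[2]        # solve cam_pos + t * ray for z = 0
    return cam_pos + t * ray_world

def metric_error(pred_px, gt_px, K, R, cam_pos):
    """World-coordinate distance between two back-projected pixels."""
    p = backproject_to_ground(pred_px, K, R, cam_pos)
    g = backproject_to_ground(gt_px, K, R, cam_pos)
    return float(np.linalg.norm(p - g))
```

The ground-sampling-distance effect noted for Set04 falls out of this directly: the farther the intersection point is from the camera, the larger the metric displacement produced by a one-pixel error.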
Table 6
Performance results of proposed architectures without deviations on bounding box and/or projection points of OverheadSimIntenseye’s (OSI) samples. The first
columns of each architecture are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors (MXE), respectively.
Bold and underlined numbers are used to show the best and second-best accuracy values.
(a) No deviations on the edges of the bounding box (i.e., OSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .993 2.26 9.69 .983 2.37 30.20 .992 2.62 12.78 .988 2.32 12.80 .991 2.22 11.16
Set02 1.00 1.97 10.39 .999 3.46 16.76 1.00 2.63 8.69 1.00 2.24 10.51 .999 3.40 11.13
OSI Set03 .893 4.11 20.45 .870 4.29 17.57 .693 6.10 48.20 .656 6.57 38.25 .683 6.49 44.74
Set04 .997 2.12 7.47 .992 3.76 24.96 1.00 2.47 6.30 .997 2.40 7.68 .997 1.72 11.59
Set05 .972 3.21 9.89 .975 3.31 18.05 .992 2.94 8.73 .964 3.33 16.29 .966 3.20 14.38
(b) No deviations on the projection point of the overhead object (i.e., PSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .906 3.74 16.04 .887 4.03 30.76 .900 3.64 15.95 .900 3.67 15.71 .879 3.88 18.41
Set02 .988 3.80 13.49 .985 3.75 30.36 .990 3.45 19.03 .981 3.94 23.38 .984 3.68 14.98
OSI Set03 .696 6.54 40.62 .703 5.92 27.92 .625 7.04 38.86 .629 7.52 42.66 .657 6.69 40.71
Set04 .888 3.89 15.04 .882 4.17 28.62 .890 3.94 14.69 .856 4.21 23.83 .871 4.04 20.82
Set05 .622 7.04 24.23 .608 6.95 26.02 .608 7.07 27.45 .608 7.08 31.49 .593 7.06 26.33
(c) No deviations on any of the edges of the bounding box and the projection point of the overhead object (i.e., OSIG = 0.0 and PSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .999 2.26 13.59 .984 2.50 30.94 .997 1.42 11.73 .995 2.83 12.39 .989 2.05 18.52
Set02 .999 3.14 14.38 .999 3.32 13.46 .999 2.42 11.12 1.00 3.07 7.86 1.00 3.23 8.77
OSI Set03 .827 4.75 20.04 .869 4.23 20.30 .742 5.49 40.50 .687 6.43 38.23 .680 6.65 47.71
Set04 .999 3.78 7.48 .995 3.75 25.97 1.00 2.68 6.27 .993 2.70 8.32 .999 1.53 10.66
Set05 .988 1.81 9.81 .997 1.67 8.82 .997 1.89 8.54 .999 1.33 9.91 .999 1.27 7.61
Fig. 12. Overall effect of coordinate transformation on accuracy values for the OverheadSimIntenseye dataset.

Fig. 13. Effect of coordinate transformation on accuracy values for each alternative architecture of OverProjNet on the OverheadSimIntenseye dataset.
Table 7
Performance results of proposed architectures on the sets of OverheadSimIntenseye (OSI) and CraneIntenseye (CRI) datasets when coordinate transformation operation
is skipped. The first columns of each architecture are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors
(MXE), respectively. The dashed line is used to split the performance measures of OverheadSimIntenseye and CraneIntenseye datasets. Bold and underlined numbers
are used to show the best and second-best accuracy values.
Deviations are enabled for both the edges of the bounding boxes and projection points
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .873 4.39 15.98 .854 4.35 22.37 .885 4.07 14.20 .814 5.15 39.18 .853 4.63 24.17
Set02 .979 4.28 24.84 .991 3.97 31.26 .988 3.79 33.76 .975 4.45 25.05 .987 4.37 33.24
OSI Set03 .639 6.87 29.93 .633 7.07 32.78 .635 6.99 29.14 .613 7.51 48.86 .615 7.90 54.44
Set04 .890 3.99 14.19 .882 3.95 13.31 .872 3.95 16.05 .867 4.16 18.13 .866 4.25 17.15
Set05 .557 7.63 25.03 .553 7.89 50.16 .540 8.45 77.32 .549 8.80 71.53 .520 8.83 59.21
CRI Set01    .561 12.44 49.97   .652 10.12 51.44   .788 8.41 52.77   .758 8.71 45.53   .758 7.97 42.75
CRI Set02    .893 6.58 16.47    .880 5.95 17.99    .907 5.96 14.62   .813 7.51 24.54   .867 6.28 19.14
No deviations on the edges of the bounding boxes (i.e., OSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .988 2.79 16.71 .993 2.20 13.91 .996 2.42 11.23 .969 2.72 40.62 .985 2.82 20.82
Set02 .994 3.82 27.17 .990 3.49 26.36 .996 3.23 17.98 .997 3.31 28.55 .997 3.89 23.75
OSI Set03 .700 6.27 28.29 .679 6.45 28.92 .687 6.40 25.74 .681 6.87 37.42 .673 6.87 59.70
Set04 .997 2.19 8.92 .993 2.24 14.30 .999 2.59 7.48 .989 2.01 10.64 .992 2.46 13.91
Set05 .968 3.37 24.12 .952 3.68 43.44 .940 3.85 66.10 .911 4.72 54.65 .930 4.00 42.80
No deviations on the projection points of the overhead object (i.e., PSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .873 4.11 14.91 .885 4.04 16.08 .910 3.68 12.63 .863 4.12 23.87 .866 4.17 22.07
Set02 .979 4.34 24.88 .981 4.01 31.17 .984 4.40 31.78 .973 4.48 16.81 .984 3.75 35.63
OSI Set03 .639 6.77 31.71 .647 6.80 32.10 .633 6.74 26.32 .605 7.45 36.38 .655 7.08 31.90
Set04 .884 3.97 17.14 .887 3.89 15.87 .886 3.86 14.54 .880 4.25 17.58 .872 4.14 15.57
Set05 .605 7.09 24.99 .608 7.32 44.48 .560 8.34 81.16 .557 8.41 74.63 .565 7.93 48.45
No deviations on any of the edges of the bounding boxes and the projection points of the overhead object (i.e., OSIG = 0.0 and PSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .991 2.46 12.79 .979 2.82 21.98 .996 1.96 15.49 .973 2.98 16.80 .984 2.51 17.56
Set02 .999 4.26 17.38 .990 4.05 28.62 .994 3.36 24.33 .999 4.04 12.51 .994 3.79 13.25
OSI Set03 .708 6.16 30.55 .691 6.22 29.56 .707 6.03 39.45 .675 6.55 39.25 .665 6.85 51.61
Set04 1.00 2.83 6.47 .997 3.35 9.22 1.00 1.90 5.68 .999 2.55 7.62 .995 2.35 15.57
Set05 .973 2.54 42.69 .961 2.48 44.17 .964 2.99 59.82 .948 3.41 69.35 .960 2.20 37.06
Table 8
Performance results of OverProjNet (M) with (w/CT) and without (w/o CT) coordinate transformation using the MSE and MHOE loss functions on the CraneIntenseye (CRI) dataset. The first columns for each loss mode are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors (MXE), respectively. Bold numbers are used to show the best accuracies.

Arch.∖Sets   MSE (w/CT)         MSE (w/o CT)       MHOE (w/CT)         MHOE (w/o CT)
             acc  ME   MXE      acc  ME   MXE      acc  ME    MXE      acc  ME    MXE
CRI Set01    .742 8.22 45.39    .788 8.41 52.77    .576 11.57 47.40    .636 10.16 50.29
CRI Set02    .960 5.14 15.49    .907 5.96 14.62    .907 6.31  20.15    .747 7.61  20.09
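The two regression losses compared in Table 8 can be sketched as follows. MSE is standard; "mean high-order error" (MHOE) is not fully defined in this excerpt, so the error order p below (p > 2, penalizing large residuals more heavily) is our assumption, not the paper's exact formulation.

```python
import numpy as np

# MSE vs. an assumed mean high-order error (MHOE) loss. The order p = 4 is
# our illustrative choice; the paper's exact MHOE definition is not given
# in this excerpt.

def mse_loss(pred, target):
    return float(np.mean((pred - target) ** 2))

def mhoe_loss(pred, target, p=4):
    # Higher-order errors weight outliers more strongly, which is why the
    # text considers MHOE as a candidate for minimizing the *maximum*
    # pixel error rather than the mean.
    return float(np.mean(np.abs(pred - target) ** p))
```

Raising the order shifts the optimum toward suppressing the largest residuals at the expense of average-case error, which is consistent with the MSE/MHOE trade-off examined in Sec. 6.3.3.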
Fig. 15. Effect of coordinate transformation on accuracy values for each deviation mode (deviations are enabled for the Default mode).

6.3.3. Experiment 4: the effect of the loss function
In this experiment, we conduct a parameter tuning process for all hyperparameters using the mean high-order error (MHOE) loss function on Set01 of OverheadSimIntenseye. Subsequently, we train and evaluate OverProjNet (M) on the CraneIntenseye dataset using the optimized parameters. The testing results obtained from these trials are reported in Table 8.

Table 8 indicates that the performance of estimating projection points is better with the MSE loss than with the MHOE loss on the CraneIntenseye dataset. Furthermore, no significant reduction is observed in the maximum pixel error when we use the MHOE loss instead of the MSE loss.

6.3.4. Experiment 5: the effect of regularization modifications
In this experiment, we examine the impact of adding batch normalization or dropout layers to OverProjNet (M). After applying these modifications and training the model, we evalu-
13
P.U. Hatipoglu, A.U. Yaman and O. Ulusoy Intelligent Systems with Applications 20 (2023) 200269
Table 11
Performance results of the perceptron with (w/CT) and without (w/o CT) coordinate transformation on the CraneIntenseye (CRI) dataset. The first columns for each transformation mode are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors (MXE), respectively. Bold numbers are used to show the best accuracies.

Arch.∖Sets   Perceptron (w/CT)    Perceptron (w/o CT)
             acc   ME    MXE      acc   ME    MXE
CRI Set01    .333  19.04 74.02    .333  19.36 67.15
CRI Set02    .213  22.18 47.04    .213  38.55 108.78
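The linear perceptron baseline reported in Table 11 can be sketched as a single affine map from the normalized bounding box features to the 2-D projection displacement; this is our reading of the baseline (the paper does not spell out its training details here), solved in closed form with least squares instead of gradient descent, on synthetic placeholder data.

```python
import numpy as np

# Sketch of the linear perceptron baseline of Table 11 (our reading: one
# affine layer, no hidden non-linearity). Random placeholder data stands in
# for the normalized bbox inputs and displacement targets.

rng = np.random.default_rng(0)
X = rng.random((256, 4))                 # normalized bbox inputs (x1, y1, x2, y2)
Y = rng.random((256, 2))                 # normalized displacement targets
Xa = np.hstack([X, np.ones((256, 1))])   # append a bias column

# Closed-form least-squares fit of the affine weights.
W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
pred = Xa @ W                            # the perceptron's predictions
```

A purely linear map cannot represent the perspective-induced non-linearity between box position and ground projection, which is consistent with the perceptron's low accuracies in Table 11 and motivates the deeper OverProjNet variants.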
ate the modified versions of OverProjNet on CraneIntenseye. We also vary the drop probability levels (pr = 0.10, 0.25, 0.50) in the dropout layers to investigate the effect of the drop probability ratio on detection performance. The results of these experiments are presented in Tables 9 and 10.

Comparing the detection rates presented in Table 10, it can be observed that the best detection performance is obtained when the dropout layers are not used (i.e., drop ratio pr = 0.0) for both sets of CraneIntenseye. Similarly, the detection performance decreases significantly when batch normalization layers are added to OverProjNet. Thus, it can be concluded that the regularization modifications do not improve the detection performance of OverProjNet.

6.4. Experiment 7: throughput and elapsed time analysis

As the architectural complexity levels of the alternative architectures of OverProjNet differ, their computational performances also vary. To analyze this, we measure the average processing time during the inference operation and the throughput performances over 100 iterations and five different batch size levels. These results are presented in Table 12.

Table 12 presents the computational cost analysis of OverProjNet for different model sizes. Notably, deeper models incur higher computational costs and lower throughput counts compared to shallower ones,
Fig. 18. Qualitative results of different architectures of OverProjNet on samples from OverheadSimIntenseye and CraneIntenseye datasets.
which is expected due to their higher complexity. However, even the XL model has a very low processing time, as it can process 5721 samples in 1 millisecond on average when the batch size is set to 65,536, while the XS model can process up to 6,941,134 samples in the same time frame. We also observe that the batch processing time remains constant for all versions of OverProjNet except for the XL model, which experiences an increase when the batch size exceeds 256.

6.5. Experiment 8: qualitative analysis

We present visual results of OverProjNet on samples from the OverheadSimIntenseye and CraneIntenseye datasets in Fig. 18 to illustrate the performance of the proposed method. The actual projection points are indicated with red stars, while cyan circles represent the estimated projection points in all visual outputs. Additionally, we use green bounding
boxes to enclose the overhead objects in the images. More qualitative visuals from the remaining sets of both datasets are provided in Appendix Fig. 20. It is important to note that all printed samples displayed in Appendix Fig. 20 are selected randomly, without considering the detection performance. Additionally, Appendix Fig. 21 showcases the visual outputs of frames with maximal error cases from CraneIntenseye. The maximal error, i.e., the distance between the cyan circle and the red star, for Set02 is relatively small, indicating that OverProjNet performs well on this set. Although the maximal error observed for Set01 is relatively higher than that of Set02, the prediction performance is considered sufficient for constructing a warning system employing OverProjNet.
7. Conclusion
Fig. 20. Qualitative results of different architectures of OverProjNet on various set samples.
Fig. 21. Qualitative results of OverProjNet on samples with maximal errors from CraneIntenseye.
A.2. Matrix representations of rotations

Rotation about the x (pitch), y (yaw), and z (roll) axes can be represented in matrix form as demonstrated in Eq. (15), Eq. (16), and Eq. (17):

            ⎡ 1      0        0     ⎤
    ℝ𝑥(𝛼) = ⎢ 0    cos 𝛼   −sin 𝛼  ⎥      (15)
            ⎣ 0    sin 𝛼    cos 𝛼  ⎦

            ⎡ cos 𝛽   0   −sin 𝛽 ⎤
    ℝ𝑦(𝛽) = ⎢   0     1      0   ⎥      (16)
            ⎣ sin 𝛽   0    cos 𝛽 ⎦

            ⎡ cos 𝛾  −sin 𝛾   0 ⎤
    ℝ𝑧(𝛾) = ⎢ sin 𝛾   cos 𝛾   0 ⎥      (17)
            ⎣   0       0     1 ⎦

A.3. OverProjNet output visuals

See Figs. 20 and 21.

References

Agarap, A. F. (2018). Deep learning using rectified linear units (ReLU). arXiv preprint. Retrieved from arXiv:1803.08375.
Biewald, L. (2020). Experiment tracking with weights and biases. Retrieved from wandb.com.
Boikov, A., Payor, V., Savelev, R., & Kolesnikov, A. (2021). Synthetic data generation for steel defect detection and classification using deep learning. Symmetry, 13, 1176.
Bureau of Labor Statistics (2008). Crane-related occupational fatalities. Retrieved from https://www.bls.gov/iif/factsheets/archive/crane-related-occupational-fatalities-2006.pdf.
Bureau of Labor Statistics (2019). Fatal occupational injuries involving cranes. Online. Retrieved from https://www.bls.gov/iif/factsheets/fatal-occupational-injuries-cranes-2011-17.htm.
Chen, H. T., Tien, M. C., Chen, Y. W., Tsai, W. J., & Lee, S. Y. (2009). Physics-based ball tracking and 3d trajectory reconstruction with applications to shooting location estimation in basketball video. Journal of Visual Communication and Image Representation, 20, 204–216.
Cheng, T., & Teizer, J. (2014). Modeling tower crane operator visibility to minimize the risk of limited situational awareness. Journal of Computing in Civil Engineering, 28, Article 04014004.
Dewancker, I., McCourt, M., & Clark, S. (2016). Bayesian optimization for machine learning: A practical guidebook. arXiv preprint. Retrieved from arXiv:1612.04858.
Fang, Y., Cho, Y. K., & Chen, J. (2016). A framework for real-time pro-active safety assistance for mobile crane lifting operations. Automation in Construction, 72, 367–379.
Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., & Cao, Y. (2022). EVA: Exploring the limits of masked visual representation learning at scale. arXiv preprint. Retrieved from arXiv:2211.07636.
Gählert, N., Hanselmann, N., Franke, U., & Denzler, J. (2020). Visibility guided NMS: Efficient boosting of amodal object detection in crowded traffic scenes. arXiv preprint. Retrieved from arXiv:2006.08547.
Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge University Press.
Hwang, S. (2012). Ultra-wide band technology experiments for real-time prevention of tower crane collisions. Automation in Construction, 22, 545–553.
Ke, L., Tai, Y. W., & Tang, C. K. (2021). Deep occlusion-aware instance segmentation with overlapping bilayers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4019–4028).
Liu, Y., Liang, D., Huang, Q., & Gao, W. (2006). Extracting 3d information from broadcast soccer video. Image and Vision Computing, 24, 1146–1162.
Loshchilov, I., & Hutter, F. (2018). Fixing weight decay regularization in Adam. Retrieved from https://openreview.net/forum?id=rk6qdGgCZ.
Ma, L., Chen, Y., & Moore, K. L. (2003). A family of simplified geometric distortion models for camera calibration. arXiv preprint. Retrieved from arXiv:cs/0308003.
Mittal, P., Sharma, A., & Singh, R. (2022). A simulated dataset in aerial images using simulink for object detection and recognition. International Journal of Cognitive Computing in Engineering, 3, 144–151.
National Safety Council (2020). Work injury costs. Retrieved from https://injuryfacts.nsc.org/work/costs/work-injury-costs/.
Neuhausen, M., Teizer, J., & König, M. (2018). Construction worker detection and tracking in bird's-eye view camera images. In ISARC, proceedings of the international symposium on automation and robotics in construction. IAARC publications (pp. 1–8).
Park, J., Chen, J., & Cho, Y. K. (2017). Self-corrective knowledge-based hybrid tracking system using BIM and multimodal sensors. Advanced Engineering Informatics, 32, 126–138.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
Pedrazzini, F. (2018). 3D position estimation using deep learning. Ph.D. thesis. KTH Royal Institute of Technology.
Price, L. C., Chen, J., Park, J., & Cho, Y. K. (2021). Multisensor-driven real-time crane monitoring system for blind lift operations: Lessons learned from a case study. Automation in Construction, 124, Article 103552.
Reddy, N. D., Tamburo, R., & Narasimhan, S. G. (2022). WALT: Watch and learn 2d amodal representation from time-lapse imagery. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9356–9366).
Slabaugh, G. G. (1999). Computing Euler angles from a rotation matrix. Retrieved from http://eecs.qmul.ac.uk/~gslabaugh/publications/euler.pdf.
Smith, L. N., & Topin, N. (2019). Super-convergence: Very fast training of neural networks using large learning rates. In Artificial intelligence and machine learning for multi-domain operations applications (pp. 369–386). SPIE.
Su, W., Zhu, X., Tao, C., Lu, L., Li, B., Huang, G., Qiao, Y., Wang, X., Zhou, J., & Dai, J. (2022). Towards all-in-one pre-training via maximizing multi-modal mutual information. arXiv preprint. Retrieved from arXiv:2211.09807.
Sue, M. K. (1981). Radio frequency interference at the geostationary orbit. Final report. Pasadena: Jet Propulsion Lab., California Inst. of Tech.
Wang, C. Y., Bochkovskiy, A., & Liao, H. Y. M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint. Retrieved from arXiv:2207.02696.
Wang, J., Shi, F., Zhang, J., & Liu, Y. (2008). A new calibration model of camera lens distortion. Pattern Recognition, 41, 607–615.
Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., et al. (2022). Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint. Retrieved from arXiv:2208.10442.
Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., et al. (2022). InternImage: Exploring large-scale vision foundation models with deformable convolutions. arXiv preprint. Retrieved from arXiv:2211.05778.
Wei, Y., Hu, H., Xie, Z., Zhang, Z., Cao, Y., Bao, J., Chen, D., & Guo, B. (2022). Contrastive learning rivals masked image modeling in fine-tuning via feature distillation. arXiv preprint. Retrieved from arXiv:2205.14141.
Wong, M. Z., Kunii, K., Baylis, M., Ong, W. H., Kroupa, P., & Koller, S. (2019). Synthetic dataset generation for object-to-model deep learning in industrial applications. PeerJ Computer Science, 5, Article e222.
Wu, J., Ma, L., & Hu, X. (2016). Predicting world coordinates of pixels in rgb images using convolutional neural network for camera relocalization. In 2016 seventh international conference on intelligent control and information processing (pp. 161–166). IEEE.
Xu, S., Wang, X., Lv, W., Chang, Q., Cui, C., Deng, K., Wang, G., Dang, Q., Wei, S., Du, Y., et al. (2022). PP-YOLOE: An evolved version of YOLO. arXiv preprint. Retrieved from arXiv:2203.16250.
Yang, J., Huang, M., Chien, W., & Tsai, M. (2015). Application of machine vision to
Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., Ke, Z., Li, Q., Cheng, M., Nie, W., et collision avoidance control of the overhead crane. In 2015 international conference on
al. (2022). YOLOv6: A single-stage object detection framework for industrial applica- electrical, automation and mechanical engineering (pp. 361–364). Atlantis Press.
tions. arXiv preprint. Retrieved from arXiv:2209.02976. Zhou, X., Zhuo, J., & Krahenbuhl, P. (2019). Bottom-up object detection by grouping
Li, H., Chan, G., & Skitmore, M. (2013). Integrating real time positioning systems to extreme and center points. In Proceedings of the IEEE/CVF conference on computer vision
improve blind lifting and loading crane operations. Construction Management and Eco- and pattern recognition (pp. 850–859).
nomics, 31, 596–605.