
Intelligent Systems with Applications 20 (2023) 200269


Overhead object projector: OverProjNet ✩


Poyraz Umut Hatipoglu ∗ , Ali Ufuk Yaman, Okan Ulusoy
Research Department, Intenseye, 34340 Nispetiye Cd. No:24, Istanbul, Turkey

Keywords: Projection detection; Deep learning; Deep regression networks; Camera modeling; Image formation; Overhead object

Abstract: Despite the availability of preventive and protective systems, accidents involving falling overhead objects, particularly load-bearing cranes, still occur and can lead to severe injuries or even fatalities. Therefore, it has become crucial to locate the projection of heavy overhead objects to alert those beneath and prevent such incidents. However, developing a generalized projection detector capable of handling various overhead objects with different sizes and shapes is a significant challenge. To tackle this challenge, we propose a novel approach called OverProjNet, which uses camera frames to visualize the overhead objects and the ground-level surface for projection detection. OverProjNet is designed to work with various overhead objects and cameras without any location or rotation constraints. To facilitate the design, development, and testing of OverProjNet, we provide two datasets: CraneIntenseye and OverheadSimIntenseye. CraneIntenseye comprises actual facility images, positional data of the overhead objects, and their corresponding predictions, while OverheadSimIntenseye contains simulation data with similar content but generated using our simulation tool. Overall, OverProjNet achieves high detection performance on both datasets. The proposed solution's source code and our novel simulation tool are available at https://github.com/intenseye/overhead_object_projector. For the dataset and model zoo, please send an email to the authors requesting access at https://drive.google.com/drive/folders/1to-5ND7xZaYojZs1aoahvu6BkLlYxRHP?usp=sharing.

1. Introduction

Overhead objects, such as cranes, represent crucial equipment used extensively across various industrial sectors for executing vertical and horizontal lifting operations. Despite their widespread utilization, crane-involved incidents resulting in serious injuries and fatalities have been documented due to the intricate nature of such lifting operations. The Census of Fatal Occupational Injuries (CFOI) reported 83 crane-involved fatalities on average in the United States from 1997 to 2005. In 2006, the number of incidents decreased to 72, with 42% of them resulting from being struck by falling objects, while 8% were caused by other types of object strikes (Bureau of Labor Statistics, 2008). From 2011 to 2017, CFOI reported a total of 297 crane-involved deaths. Being struck by a falling object or equipment caused 154 deaths, with 79 of them involving an object falling from or being put in motion by a crane (Bureau of Labor Statistics, 2019). Furthermore, crane-related accidents, whether fatal or not, lead to substantial monetary losses and decreased productivity, similar to other types of accidents. The National Safety Council (NSC) estimated the total cost of both fatal and nonfatal preventable injuries at $163.9 billion in the United States in 2020. This cost includes $44.8 billion in wage and productivity losses, $34.9 billion in medical expenses, and $61.0 billion in administrative expenses. The NSC also estimated the total number of days lost due to work-related injuries at sixty-five million in 2020, excluding the days lost because of injuries that happened in previous years. The number rises to ninety-nine million if these days are also taken into account (National Safety Council, 2020).

To enhance the occupational safety and well-being of workers, as well as to prevent any accident, operators controlling overhead objects such as cranes must exhibit a heightened awareness of the proximity of the object to other objects and individuals in the workplace. However, this presents a challenge, as operators may occasionally carry out lifting operations without having a full view of the situation. A signal person can give instructions to the operator via either hand signals or electronic communication methods, such as radio, to increase workplace safety, but this operation is also prone to failures because of its manual nature.


This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. All research and data acquisition activities were conducted using the resources of the Intenseye Research Department.
* Corresponding author.
E-mail addresses: poyraz@intenseye.com (P.U. Hatipoglu), ufuk@intenseye.com (A.U. Yaman), okan@intenseye.com (O. Ulusoy).

https://doi.org/10.1016/j.iswa.2023.200269
Received 23 April 2023; Received in revised form 22 July 2023; Accepted 11 August 2023
Available online 18 August 2023

Fig. 1. Flows of processing steps to estimate the projection point of an overhead object (while the red and dashed line flow demonstrates the traditional computer
vision approaches, the blue line flow shows the proposed solution).

Fig. 2. 3D illustration of ray points addressing the corners of an arbitrary overhead object and the center projection point of the object.

There is a need to automatically predict the projection location of overhead objects, as illustrated in Fig. 2, from cameras or sensors and to warn the workers in case of unsafe situations.

In a nutshell, the motivations driving the need for a warning system that can automatically predict the projection location of overhead objects can be listed as:

• Enhancing the occupational safety and well-being of workers by preventing struck-by-falling incidents that can result in serious injuries and fatalities.
• Mitigating the substantial monetary losses and decreased productivity associated with crane-related accidents.
• Increasing workplace efficiency and productivity by proactively alerting workers to unsafe situations.
• Addressing the limitations of manual communication methods, such as hand signals or radio, which are prone to failures.
• Providing valuable insights into the most likely areas where accidents may occur, enabling targeted preventive actions.

However, it is challenging to create such a warning system using cameras for two main reasons:

1. The coordinates of the overhead object and the worker need to be predicted in the 2D image plane.
2. A projection relationship must be established to convert the representation of the worker and the overhead object from the 2D image plane to the 3D world coordinate space.

Overcoming these challenges and automatically alerting workers in dangerous situations can proactively prevent struck-by-falling accidents. These alerts provide valuable insights into where accidents are most likely to occur, allowing targeted preventive actions to be taken. In turn, this can increase overall productivity, profitability, and efficiency in workplaces.

The first challenge can be overcome using increasingly popular detection techniques such as object detection (Wang, Bochkovskiy, et al., 2022, Xu et al., 2022, Li et al., 2022, Wang, Dai, et al., 2022, Su et al., 2022, Fang et al., 2022) and instance segmentation (Fang et al., 2022, Wei et al., 2022, Wang, Bao, et al., 2022). In particular, amodal object detection (Gählert et al., 2020) and amodal instance segmentation (Ke et al., 2021, Reddy et al., 2022) are more suitable for tackling this challenge, since even occluded parts of the objects can be detected with these approaches, which dramatically affects later processing steps such as 3D mapping.

The second part of this challenge, however, has not been an active area of work for the computer vision and deep learning community. To solve this problem, two main approaches have been proposed in the literature so far. The first approach involves several steps, such as calculating the camera projection matrix with classical computer vision methods, estimating the depth of the intended objects, and calculating the 3D object coordinates (Liu et al., 2006, Chen et al., 2009), as shown by the traditional computer vision flow in Fig. 1. However, this approach is not always feasible, since obtaining all the required information, such as camera rotation, distortion parameters, and the size of the overhead objects, may not always be possible. The second approach is based on a neural network solution that predicts the 3D world coordinates from 2D image coordinates (Wu et al., 2016, Pedrazzini, 2018).

A number of solutions have also been proposed to increase workplace safety using bird's-eye view cameras (Yang et al., 2015, Neuhausen et al., 2018, Price et al., 2021), 3D point clouds (Cheng & Teizer, 2014, Fang et al., 2016), and radio frequencies (RF) (Li et al., 2013, Hwang, 2012, Park et al., 2017). These approaches aim to mitigate the second challenge by leveraging either the camera placement or the nature of the sensors used.

In this paper, we propose a deep-learning model (OverProjNet) to predict the projection point of an overhead object onto the ground, thereby overcoming the second challenge, i.e., the need to establish a projection relation from the 2D image plane to the 3D world. We show that our model replaces several challenging processing steps needed to make the same prediction using the traditional computer vision approaches, as shown in Fig. 1.


The major contributions of this paper are the following:

• We develop a deep learning model named OverProjNet that estimates the overhead object projection in the 2D images without requiring a camera projection matrix and depth estimation of the objects. Our model infers the latent relationships between the 2D image plane and the 3D scene space.
• We design a simulation tool that generates images of overhead objects in 2D pixel coordinates by configuring the camera, lens, and object properties. To reproduce realistic object detection operations, we consider distortions and disturbances in the designed tool.
• We produce and quantitatively validate the effectiveness of this approach on two datasets: OverheadSimIntenseye and CraneIntenseye. The first dataset is collected from a simulation environment, whereas the second is collected from actual facility cameras.

Outline of the paper. Section 2 provides a comprehensive review of the related work concerning amodal object detection, instance segmentation methods, and projection detection solutions for overhead objects. In Section 3, the details of the simulation tool designed to produce simulation datasets are presented. Section 4 offers the architectural specifics of the proposed method, OverProjNet, along with the loss functions employed for its training. Moving to Section 5, we introduce the details of both the actual and simulation datasets provided for addressing the overhead object projection problem. In Section 6, we present the experimental procedures, setup, and performance results, supplemented by detailed discussions of the proposed solution. Lastly, the paper concludes in Section 7, summarizing the key findings, insights derived from the study, and the directions of future works.

2. Related works

In this section, we explore projection detection techniques across different domains. However, before delving into these techniques, we first examine the latest advancements in amodal object detection and segmentation models. These models are capable of detecting the positions of overhead objects and people in the 2D image plane. Given that the projection detection methodology is a crucial part of the overall warning mechanism, it is essential to first detect the location of the overhead object. While not the main focus of our study, amodal object detection or instance segmentation models serve as processing blocks for detecting the extent and position of overhead objects in the 2D image plane.

The field of computer vision has seen extensive research and proposals of amodal object detection models as well as instance segmentation models, which can accurately detect objects, including occluded parts. Therefore, our proposed approach can be complemented by incorporating these deep learning models for object detection and segmentation, providing the necessary information for the warning system.

Amodal object detection and instance segmentation models. The fields of object detection (Wang, Bochkovskiy, et al., 2022, Xu et al., 2022, Li et al., 2022, Wang, Dai, et al., 2022, Su et al., 2022, Fang et al., 2022, Wei et al., 2022) and instance segmentation (Fang et al., 2022, Wang, Bao, et al., 2022) have gained significant attention among researchers in recent years. Some researchers (Wang, Bochkovskiy, et al., 2022, Xu et al., 2022, Li et al., 2022) focus on developing real-time object detection models, while others (Wang, Dai, et al., 2022, Su et al., 2022, Fang et al., 2022, Wei et al., 2022) aim to achieve better accuracy by fine-tuning a pre-trained neural network for either object detection or instance segmentation.

Amodal object detection (Gählert et al., 2020) and instance segmentation (Ke et al., 2021, Reddy et al., 2022) models are more effective at detecting any parts of the objects, even when partially occluded. Therefore, to detect all the overhead objects' extent accurately, including corners and edges, these methods can be used. Gählert et al. (2020) introduced a visibility-guided non-maximum suppression algorithm to improve the amodal object detection performance of highly occluded objects. Ke et al. (2021) designed a two-stage instance segmentation network, separating the layers responsible for detecting the occluding and occluded objects. Similarly, Reddy et al. (2022) detect the occluding and occluded objects with different networks, but their work also focuses on generating a self-supervised amodal instance segmentation dataset.

Traditional computer vision methods. To determine the camera projection matrix, Liu et al. (2006) and Chen et al. (2009) rely on the correspondence between a soccer or basketball field and the image. They then use this matrix to compute the 3D world coordinates of the objects. However, unlike our proposed approach, their method requires an explicit mapping between the 2D image plane and the 3D world space. To construct this mapping, they use the standard dimensions of the fields and balls as prior knowledge, whereas our method does not depend on such information. Additionally, their work focuses on spherical objects, while we focus on objects having different shapes, like a rectangular prism. Unlike spherical objects, which have a uniform appearance irrespective of the viewing angle, prism-shaped objects can have varied appearances based on the viewing angle. Thus, establishing a direct correlation between the size of the detection bounding box and the object's depth is not viable for prism-like objects, even though it is a valid technique for spherical objects.

Neural networks. Instead of relying on traditional computer vision techniques, Wu et al. (2016) utilizes a convolutional neural network to predict the world coordinates of pixels. Their model requires RGB-D images, whereas our model operates solely in the RGB domain. Another related approach is the work of Pedrazzini (2018), which uses several neural network models to predict the 3D world coordinates of a pixel from an RGB image. However, unlike these models, our approach does not require establishing correspondence between 2D image coordinates and 3D world coordinates.

Bird's eye view cameras. Several researchers, including Yang et al. (2015), Neuhausen et al. (2018), Price et al. (2021), have focused on improving workplace safety using bird's eye view cameras. For example, Yang et al. (2015) and Price et al. (2021) have developed systems that generate alerts when a camera mounted on an overhead crane detects a moving object, while Neuhausen et al. (2018) focuses on detecting and tracking workers in bird's-eye view images. However, these methods require at least one camera per overhead crane, resulting in additional installation and maintenance costs. Moreover, the images captured by bird's-eye view cameras mounted on cranes can be blurry due to vibration, reducing object detection accuracy. In contrast, our approach can utilize cameras positioned at any location and does not require specific camera setups or locations. Therefore, our method offers greater flexibility and applicability compared to bird's-eye view camera-based approaches.

Point cloud-based methods. Instead of using a 2D vision-based safety method, Cheng and Teizer (2014) employs three-dimensional terrestrial laser scanners (TLS) to generate point cloud data of the construction site. Additionally, an ultra-wideband (UWB) real-time location tracking sensing (RTLS) system is utilized to predict workers' positions and crane hooks. These systems increase crane operator awareness in blind spots. Similarly, Fang et al. (2016) uses TLS to generate point cloud data of a lifting site, which is used to identify potential collisions between the crane parts and obstructions on the site. While 3D point cloud-based systems provide more accurate distance measurements compared to 2D vision systems, their real-time processing ability is limited due to their higher computational resource requirements. In contrast, 2D cameras are generally more affordable and accessible than 3D sensors.

Radio-frequency (RF) methods. Instead of a vision-based method, Li et al. (2013) uses Radio Frequency Identification (RFID) and the Global Positioning System (GPS) to predict the worker and crane positions in construction sites.


Fig. 3. Coordinate systems and projection from 3D to 2D.

Fig. 4. Pitch, Yaw, and Roll angles (arrows demonstrate positive directions).

An alert is raised when a worker is detected inside a dangerous zone. Similarly, Hwang (2012) uses the UWB system to understand the collision risk between crane booms and prevent equipment-to-equipment collisions, while Park et al. (2017) focuses on developing a multi-modal tracking system composed of Bluetooth Low Energy (BLE) sensors, Inertial Measurement Unit (IMU) sensors, and a Building Information Model (BIM) to track the worker positions. RF signals are generally affected by radio-frequency interference, which reduces the signal-to-noise ratio and can result in information loss or complete data loss in some extreme cases (Sue, 1981). Compared to cameras, these solutions have a higher installation and maintenance cost.

3. Camera modelling & simulation tool

This section describes our simulation tool, which is developed to generate simulation datasets for use in this study.

Annotating images with object and projection point labels is a time-consuming and costly process. Moreover, acquiring data with various camera and object placements may not be feasible due to operational constraints. As we have limited access to annotated real-world data, we developed a simulation tool to generate simulated data with various camera and object properties to evaluate the effectiveness of our proposed approach.

To create the simulation tool, we first model the camera and lens distortion characteristics. We then model configurable-sized overhead objects to generate visuals and positional data. Additionally, we incorporate random rotations for the overhead objects to ensure a comprehensive study. We provide a brief overview of the camera model used in the simulation tool in Sec. 3.1, followed by a detailed explanation of the simulation tool and objects in Sec. 3.2.

3.1. Camera geometry & modeling

The simulation camera is modeled using the pinhole camera model, which assumes that the image coordinates are Euclidean coordinates and have equal scales in both the x and y directions. In order to map a point 𝑿 in the world coordinate system to the image plane and obtain the corresponding pixel coordinate 𝒙, a series of projection operations are performed. The image plane is positioned on the principal axis such that Z_c = f (the focal length), and the principal point 𝒑_𝒄 is located at the center of the image plane. The camera coordinate system and the world coordinate system are centered at 𝑪 and 𝑾, respectively. Fig. 3 illustrates the camera coordinate system, the world coordinate system, the pixel coordinates, and the mapping of an arbitrary point.

3.1.1. Projection matrix and ray tracing

The process of mapping a 3D world point to a 2D pixel space can be achieved through matrix multiplication with the camera projection matrix. This matrix, denoted as ℙ, can be expressed as:

𝒙 = ℙ𝑿. (1)

The projection matrix, which maps 3D world points to 2D pixel space, can be obtained by multiplying the intrinsic and extrinsic matrices. The intrinsic matrix, also known as the camera calibration matrix (Hartley & Zisserman, 2003), is used to map points from the camera coordinate system to the pixel coordinate system. The extrinsic matrix, on the other hand, is a transformation matrix that maps points from the world coordinate system to the camera coordinate system. The extrinsic matrix stores information about the camera's orientation and position, while the intrinsic matrix represents internal camera and lens properties such as focal length, principal point, and pixel pitch. The intrinsic matrix, denoted by 𝕂 in homogeneous coordinate space (Hartley & Zisserman, 2003), is defined as shown in Eq. (2).

𝕂 = [ f    0    p_{c,x}
      0    f    p_{c,y}      (2)
      0    0    1       ]

where the coordinates of the principal point are (p_{c,x}, p_{c,y})^T, and f symbolizes the focal length. The extrinsic matrix is responsible for mapping points from the world coordinate system to the camera coordinate system using rotation and translation operations. To form the rotation component, ℝ, of the extrinsic matrix, Euler angles (Slabaugh, 1999) around each axis must be defined. This paper uses the convention displayed in Fig. 4.

The Pitch, Yaw, and Roll angles define the rotation angles around the x, y, and z axes, and the corresponding rotations are represented by ℝ_x(), ℝ_y(), and ℝ_z() in 3 × 3 matrix form. To form the rotation matrix, a sequence of multiplication operations using all three rotation matrices is performed. It is important to note that the order of multiplication affects the resulting rotation matrix due to the non-commutativity of matrix multiplication. In this simulation design, the camera is rotated first about the z-axis, then the y-axis, and finally the x-axis. Thus, the resulting rotation matrix is expressed as shown in Eq. (3). The open-form matrix representations of ℝ_x(), ℝ_y(), and ℝ_z() are presented in Appendix Sec. A.2.

ℝ = ℝ_x(α) ℝ_y(β) ℝ_z(γ)   (3)

where α (pitch), β (yaw), and γ (roll) are the angle differences between the world and camera coordinate systems. The second component of the extrinsic matrix, the translation 𝒕, can be formulated as

𝒕 = −ℝ C̃,   (4)

where C̃ represents the camera center coordinates in the world coordinate frame (Hartley & Zisserman, 2003). After defining each component, the projection matrix can be represented as the multiplication of the intrinsic and extrinsic matrices, as given in Eq. (5).

4
P.U. Hatipoglu, A.U. Yaman and O. Ulusoy Intelligent Systems with Applications 20 (2023) 200269

ℙ = 𝕂[ℝ|𝒕]. (5)
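As a concrete illustration of Eqs. (1)–(5), the sketch below builds 𝕂, ℝ, 𝒕, and ℙ with NumPy and projects a world point. The rotation-matrix sign conventions, the focal length, and all numeric values are assumptions made for illustration only; they are not taken from the released simulation code.

```python
import numpy as np

def rot_x(a):  # pitch
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(b):  # yaw
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(g):  # roll
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Intrinsic matrix (Eq. (2)): focal length f and principal point (p_cx, p_cy) in pixels.
f, p_cx, p_cy = 1108.5, 640.0, 360.0                 # assumed values (~60 deg HFOV, 1280x720)
K = np.array([[f, 0.0, p_cx], [0.0, f, p_cy], [0.0, 0.0, 1.0]])

# Extrinsic components: R = Rx(alpha) Ry(beta) Rz(gamma) (Eq. (3)), t = -R C_tilde (Eq. (4)).
alpha, beta, gamma = np.radians([-10.0, -15.0, 0.0])  # pitch, yaw, roll (assumed)
C_tilde = np.array([-5.0, -8.0, 0.0])                 # camera centre in world coordinates
R = rot_x(alpha) @ rot_y(beta) @ rot_z(gamma)
t = -R @ C_tilde

# Projection matrix (Eq. (5)) and the mapping of a homogeneous world point (Eq. (1)).
P = K @ np.hstack([R, t.reshape(3, 1)])
X = np.array([2.0, -5.0, 30.0, 1.0])                  # arbitrary 3D point, homogeneous
x = P @ X
u, v = x[:2] / x[2]                                   # undistorted pixel coordinates
```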
In order to find the 3D world coordinate of a pixel, back-projection
operations are employed. However, as the back-projection operation of
a pixel point generates a ray that passes through the camera center
and the pixel point, it is necessary to trace the ray until it intersects
with an opaque object in the 3D world coordinate system. The ray that
passes through the given pixel coordinate and the camera center can be
represented as shown in Eq. (6).

𝑿(𝜆) = ℙ+ 𝒙 + 𝜆𝑪, (6)


where ℙ+ is the pseudo-inverse of ℙ and can be represented as ℙ+ = ℙ^T (ℙℙ^T)^−1, for which ℙℙ+ = 𝕀 (the identity matrix). The variable 𝜆 is a
parameter that can take values in the range of [0, ∞). Adjusting the
value of 𝜆 makes it possible to visit any point along the ray.
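One way to realize Eq. (6) in code is sketched below: the ray through a pixel is recovered with the pseudo-inverse and then traced until it meets a flat ground plane. Treating y = 0 as the ground surface, as well as the function name and argument layout, are assumptions made for illustration.

```python
import numpy as np

def backproject_to_ground(P, C_tilde, u, v, ground_y=0.0):
    """Trace the back-projected ray of pixel (u, v) (Eq. (6)) to a flat ground plane.

    P is the 3x4 projection matrix and C_tilde the camera centre in world coordinates;
    intersecting the y = ground_y plane is an illustrative assumption."""
    x_h = np.array([u, v, 1.0])              # pixel in homogeneous coordinates
    P_pinv = P.T @ np.linalg.inv(P @ P.T)    # P+ = P^T (P P^T)^-1, so that P P+ = I
    X0 = P_pinv @ x_h
    X0 = X0[:3] / X0[3]                      # a finite point on the ray (assumes X0[3] != 0)
    d = X0 - C_tilde                         # ray direction through the camera centre
    lam = (ground_y - C_tilde[1]) / d[1]     # solve C_y + lam * d_y = ground_y
    return C_tilde + lam * d
```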

3.1.2. Radial distortion


The imaging system’s non-linearity is influenced by lens distortions.
Fig. 5. Side view of the 3D object and the camera.
Depending on the physical characteristics of the lenses used in cameras,
the images they capture may exhibit different types of distortion. Mod-
eling these distortions is important for accurate mapping from 2D to
3D or vice versa since distortions deflect the mapped point’s location.
Therefore, these distortions must be considered when mapping a 3D
world coordinate point to a pixel plane. Fortunately, the most common
type of distortion, radial distortion, has a reliable and simple polyno-
mial or rational function-based representation to model its effects.
Although there are several approaches for modeling radial distor-
tion (Ma et al., 2003), we prefer to use one of the most common and
validated representations to model it. The representation is revealed in
Eq. (7).

f(r_i, 𝒌) = 1 + k_1 ∗ r_i^2 + k_2 ∗ r_i^4 + ...   (7)


where 𝒌 = [𝑘1 , 𝑘2 , ...] in vector form and 𝑟𝑖 is the normalized radius for
the point 𝑖. The formulation of the normalized radius is given in Eq. (8).

r_i = sqrt( ((p_{i,x} − p_{c,x}) / p_{c,x})^2 + ((p_{i,y} − p_{c,y}) / p_{c,y})^2 ),   (8)

where 𝑝𝑖,𝑥 and 𝑝𝑖,𝑦 are used to denote the x-axis and y-axis locations
of the arbitrary undistorted point. The same subscript notation is also
employed for the principal point, 𝒑𝒄 , too. The relation between the dis-
torted location, 𝑑𝑖 , of an undistorted arbitrary pixel point, 𝑝𝑖 , can be
seen in Eq. (9).

𝑑𝑖 = (𝑢𝑖 , 𝑣𝑖 ),

where 𝑢𝑖 = 𝑝𝑐,𝑥 + (𝑝𝑖,𝑥 − 𝑝𝑐,𝑥 )𝑓 (𝑟𝑖 , 𝒌), (9)

𝑣𝑖 = 𝑝𝑐,𝑦 + (𝑝𝑖,𝑦 − 𝑝𝑐,𝑦 )𝑓 (𝑟𝑖 , 𝒌).
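A short sketch of the radial model, following Eqs. (7)–(9) with only the k1 and k2 terms; the sample pixel, principal point, and coefficient values are illustrative assumptions:

```python
import math

def distort_point(p, pc, k1=-0.05, k2=0.0):
    """Map an undistorted pixel p to its distorted location d_i (Eqs. (7)-(9))."""
    # Normalized radius of Eq. (8), using the principal point pc = (p_cx, p_cy).
    r = math.sqrt(((p[0] - pc[0]) / pc[0]) ** 2 + ((p[1] - pc[1]) / pc[1]) ** 2)
    factor = 1.0 + k1 * r ** 2 + k2 * r ** 4          # f(r_i, k) of Eq. (7)
    u = pc[0] + (p[0] - pc[0]) * factor               # Eq. (9)
    v = pc[1] + (p[1] - pc[1]) * factor
    return u, v

# Example: distort a pixel near the corner of a 1280x720 frame with the k1 = -0.05
# used in the simulated sets.
print(distort_point((1200.0, 700.0), pc=(640.0, 360.0)))
```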

3.2. Simulation tool


Fig. 6. Top view of the 3D object and the camera.
To map a 3D overhead object to 2D pixel coordinates and define the bounding box that tightly encloses the object, the initial step is to model the 3D object. First, an arbitrary reference point is chosen as the origin of the world coordinate system, and a rectangular prism with dimensions (𝑤, ℎ, 𝑙) is placed at another arbitrary point (X_o, Y_o, Z_o)^T. It is assumed that the 3D object can rotate around the Y-axis within the defined rotation angle limits to achieve a more generalized solution. Since it is impractical to rotate a heavy overhead object around the X-axis and Z-axis during routine operations, modeling around these axes is skipped in the simulation tool. For an arbitrary rotation angle around the Y-axis, denoted as 𝜉, the side and top views of the object in 3D space are shown in Fig. 5 and Fig. 6, respectively. Appendix Sec. A.1 provides details on the calculations for an edge of the object from the top view. The edge and corner locations of the simulated object, which depend on the rotation angle 𝜉 and the projection point of the overhead object's center, are also given in the same figures. The notations used to define the rotated locations of the edges and the corners shown in Fig. 5 and Fig. 6 are presented in Eq. (10). It should be noted that a fixed and flat elevation model is used to define the projection surface.

𝑙′ = 𝑙 ∗ cos 𝜉 + 𝑤 ∗ sin 𝜉,
𝑙′′ = 𝑙 ∗ cos 𝜉 − 𝑤 ∗ sin 𝜉,
𝑤′ = 𝑤 ∗ cos 𝜉 + 𝑙 ∗ sin 𝜉,      (10)
𝑤′′ = 𝑤 ∗ cos 𝜉 − 𝑙 ∗ sin 𝜉.

In the next step, it is necessary to position the camera center at an arbitrary point (X_c, Y_c, Z_c)^T and set the rotation angles, namely roll, pitch, and yaw. Once the focal length and pixel dimensions of the camera are determined, the mapping operation can be performed. Notably, for simulation purposes, the principal point is selected as the exact center of the camera plane.
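For instance, the rotated extents of Eq. (10) can be evaluated directly; this brief sketch uses the prism dimensions of Set01 purely as an example:

```python
import math

def rotated_extents(w, l, xi_deg):
    """Eq. (10): footprint extents of a (w x l) prism rotated by xi about the Y-axis."""
    xi = math.radians(xi_deg)
    l1 = l * math.cos(xi) + w * math.sin(xi)    # l'
    l2 = l * math.cos(xi) - w * math.sin(xi)    # l''
    w1 = w * math.cos(xi) + l * math.sin(xi)    # w'
    w2 = w * math.cos(xi) - l * math.sin(xi)    # w''
    return l1, l2, w1, w2

print(rotated_extents(w=2.0, l=1.0, xi_deg=10.0))   # e.g. the Set01 object at xi = 10 degrees
```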


Fig. 7. Directions of deviations applied to both the bounding box of the overhead object and the center projection point of the object. While the thin and green arrows around the edges of the object show possible deviation directions for the edges, the thick and red arrows around the projection point demonstrate the possible deviation directions for the projection point.

To obtain the 2D undistorted pixel position corresponding to any 3D point in the simulated environment, the projection matrix is used to map the 3D points. The distorted location of each point is subsequently calculated based on the determined distortion model and its associated parameters. Since Wang et al. (2008) show that k1 and k2 are the predominant terms for estimating the radial distortion of most camera lenses, higher-order terms of Eq. (7) are neglected in our simulation studies.

After determining the camera and lens parameters and the 3D placements and rotations of both the object and the simulated camera, the simulated object and the projection point are printed on the simulation image. The tightest bounding box is placed on the simulated object image to simulate amodal object detection or instance segmentation results. Since amodal detectors may suffer from alignment and boundary estimation issues, the boundaries of the bounding box drawn on the simulated object image are manipulated by adding random deviations. Furthermore, annotation errors are also considered to replicate the real solution accurately, and the projection points of the simulated object's center are randomly dislocated. The random deviations are applied to both bounding boxes and projection points in the directions shown in Fig. 7.

4. OverProjNet

This paper presents OverProjNet, a neural network-based method for detecting the projection points of overhead objects' centers using regression principles. During the training stage, the network is fed with inputs indicating the locations of overhead objects and the corresponding projection points in the image. As revealed in Fig. 8, the position of the overhead object can be obtained either through amodal object detection/instance segmentation methods or via a manual annotation process. On the other hand, the manual annotation process is necessary to obtain the locations of the projection points of the overhead object. Subsequently, all of this positional information is pre-processed via coordinate transformation and normalization operations before being fed into OverProjNet. The details regarding the pre-processing operations applied to the data are provided in Sec. 5.3. Then, the network predicts the relative locations of the projection points with respect to the objects, and the network parameters are optimized by minimizing the distance between the predicted and ground truth points.

In the inference stage, the object's location in the image is utilized to estimate the projection point. Even though this paper primarily focuses on predicting projection points based on the locations of overhead objects, Fig. 9 presents a comprehensive end-to-end inference solution that encompasses the detection of overhead objects and people, as well as a proximity-based warning. This diagram provides an overview of the role of OverProjNet within this potential end-to-end solution. Notably, the preprocessing steps are identical to those employed in the training flow (Fig. 8) and are further elaborated in Sec. 5.3.

The regression-based approach provides a theoretical foundation for accurately estimating relative projection locations, benefiting from the neural network's ability to capture complex non-linear relationships. Another potential approach to detecting projection points is to label the projection surface as a heatmap centered on the actual projection pixel point. By designing a model that generates a comparable heatmap from the bounding box information, we can then treat the problem as a pixel-level heatmap classification task. This alternative approach draws parallels to techniques utilized in various fields, including keypoint detection (Zhou et al., 2019), where heatmaps are commonly employed for accurate localization. However, it is important to note that adopting this solution necessitates the storage- and memory-intensive task of retaining all projection labels as heatmaps. Furthermore, increasing the node count in the output layer can negatively impact the model's inference speed. Careful consideration should be given to these factors when implementing this alternative approach.

Fig. 10 illustrates the workflow of the projection detection process for both real and simulation data. However, it is important to note that there are differences in the dynamics of obtaining data and extracting the position of overhead objects between the simulation environment and real data. As a result, certain processing blocks in the real data flow need to be replaced or modified to accommodate the simulation environment. In the real data flow, OverProjNet is coupled with a data acquisition system to feed data and an amodal object detection/segmentation method to detect the position of the objects in the acquired frames. In contrast, image formation and bounding box assignment operations are used instead of data acquisition and amodal object detection/segmentation methodologies in the simulation flow.

We propose using an amodal object detection/segmentation method to detect the position of the objects in the real data flow. This choice is motivated by the method's ability to detect the complete extent of objects, even when certain parts are occluded. In contrast, if a conventional object detection method is employed, which focuses solely on the visible portions of objects, relying on this partial bounding box information can lead to inaccurate projection point detection. This limitation arises due to the possible narrowing down of bounding boxes from any side during occlusion, resulting in the loss of the spatial relationship between the unique bounding box position and the corresponding projection point.

After considering several criteria, we have decided to use the bounding box-based representation to indicate the object's position. Annotating targets with bounding boxes is relatively easier compared to other annotation styles, such as semantic segmentation or keypoint annotations. Since our proposed solution is designed to work with both amodal object detection and amodal instance segmentation results, the output format of amodal detectors also played a critical role in our decision. We believe that converting segmentation maps to bounding boxes is relatively easy, but the reverse is not true. Therefore, for the sake of generalizability, we have used bounding box representations in this work.

The primary goal of this paper is to determine the projection point of the object centers. However, the proposed solution can also estimate the projection of the objects' corner points or edge points by modifying the target values used to train the network.

OverProjNet replaces several processing steps in the image pixel space while estimating the projection point, as shown in Fig. 1. These steps involve several non-linear operations and transformations, as explained in Sec. 3. Therefore, the proposed solution is designed to model complex non-linear relationships for successful coordinate estimation.

Since the computational time and resource demands, as well as the detection performance of the projection detection model, are taken into account, we design five versions of OverProjNet with varying layer depths. These networks are named OverProjNet (XS), OverProjNet (S), OverProjNet (M), OverProjNet (L), and OverProjNet (XL), from the shallowest to the deepest. The diagrams of these architectural alternatives of OverProjNet are presented in Fig. 11.
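A minimal PyTorch sketch of the idea is given below: the bounding box is converted to the normalized bottom-center/width/height input described in Sec. 5.3, and a small fully connected network regresses the relative projection offset. The layer widths, depth, and helper names are our assumptions and only illustrate the approach; the actual architecture variants are those shown in Fig. 11.

```python
import torch
import torch.nn as nn

def preprocess(box, proj, img_w, img_h):
    """Turn a (x1, y1, x2, y2) box and a projection point into an input/target pair.

    Mirrors the coordinate transformation and normalization of Sec. 5.3: normalized
    bottom-center x/y, width, and height as inputs; the target is the projection
    point's displacement relative to the bottom-center of the box."""
    x1, y1, x2, y2 = box
    bc_x, bc_y, w, h = (x1 + x2) / 2.0, y2, x2 - x1, y2 - y1
    inp = torch.tensor([bc_x / img_w, bc_y / img_h, w / img_w, h / img_h])
    tgt = torch.tensor([(proj[0] - bc_x) / img_w, (proj[1] - bc_y) / img_h])
    return inp, tgt

class TinyOverProjNet(nn.Module):
    """Illustrative shallow variant: 4 inputs -> 2 relative projection offsets."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),        # no activation on the output layer
        )

    def forward(self, x):
        return self.net(x)

inp, tgt = preprocess(box=(500.0, 120.0, 620.0, 260.0), proj=(560.0, 480.0),
                      img_w=1280, img_h=720)
pred = TinyOverProjNet()(inp.unsqueeze(0))   # shape (1, 2)
```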


Fig. 8. Illustrative training workflow of OverProjNet. Alternative ways of the annotation process are also provided to offer a comprehensive view of the complete
training process.

Fig. 9. Illustrative testing workflow of OverProjNet. In addition to the part that this paper focuses on, the amodal object detection/instance segmentation step and the operations that need to be involved in the warning system are also provided to offer a comprehensive view of a possible end-to-end solution.

Fig. 10. Flowchart illustrating the key steps in the projection detection process, highlighting the distinctions between real and simulation data.


Fig. 11. Diagrams of the proposed architectures for OverProjNet.

The method's adaptability, achieved through its flexible architecture, allows users to tailor it to their specific resource and time requirements. All architecture alternatives use ReLU activations after each fully connected layer, except for the output layers. The ReLU activation function effectively replaces negative values with zero while leaving positive values unchanged (Agarap, 2018). By setting negative inputs to zero, ReLU introduces non-linearity into the network. This non-linear activation enables the network to learn complex non-linear relationships and represent more intricate decision boundaries. One key advantage of ReLU is its simplicity and computational efficiency compared to other common activation functions like sigmoid or tanh. Additionally, ReLU helps address the vanishing gradient problem, which can hinder the training of deep neural networks. Since ReLU only saturates at positive values and does not saturate for negative inputs, it mitigates the gradient vanishing issue and facilitates more effective gradient flow during backpropagation. A detailed and comparative performance review of the network alternatives is provided in Sec. 6.

To validate the need for a solution with networks capable of modeling non-linearities, we also test and report the projection detection performance of a linear perceptron in Sec. 6.3.5. Additionally, we test OverProjNet by adding batch normalization layers before each ReLU activation and inserting dropout layers just after the last ReLU activations to examine the models' response to additional regularization operations. The detection performances of OverProjNet with these modifications are reported in Sec. 6.3.4. To train OverProjNet, two loss functions are used alternatively: the mean squared error (MSE) loss (Eq. (11)) and the mean high-order error (MHOE) loss (Eq. (12)), where the latter uses the 4th power of the error. In Sec. 6.3.3, we analyze the effect of these loss functions on the solution's performance. While MSE loss is a common choice for regression problems, MHOE loss is also tested in this work to evaluate its ability to minimize the maximum observation error. This is important, as high errors in some observations may lead to false alerts in a warning system designed with OverProjNet. Limiting the maximum errors can enhance the system's robustness; hence we include MHOE loss in the experiments.

MSE = (p_{i,x} − p̂_{i,x})^2 + (p_{i,y} − p̂_{i,y})^2,   (11)

MHOE = (p_{i,x} − p̂_{i,x})^4 + (p_{i,y} − p̂_{i,y})^4,   (12)

where p_i and p̂_i are the predicted and target projection locations for the i-th overhead object sample.

Incorporating the fourth power within the MHOE loss function serves to amplify the influence of outliers or extreme errors, thereby assigning them greater significance during the optimization process. This can be particularly useful in scenarios where reducing high errors is of utmost importance, such as safety-critical applications or systems where false alerts must be minimized. Taking the fourth power instead of the third or fifth offers a significant advantage by ensuring that the resulting loss values are always non-negative. This non-negativity is highly desirable, as it provides a clear and unambiguous indication of the error's magnitude.

5. Datasets

The proposed solution is intended to work on various datasets, including OverheadSimIntenseye (OSI) and CraneIntenseye (CRI), as well as any other datasets that offer the required inputs and target information. The OverheadSimIntenseye dataset is generated using the simulation tool that we design, as explained in detail in Sec. 3. In contrast, CraneIntenseye is composed of images captured from real cameras in different facilities and viewpoints. Both datasets comprise positional and visual data that indicate the pixel location of overhead objects as inputs and the center projection point of the overhead object as targets. While overhead objects are represented by bounding boxes, the center projection points of the overhead object are denoted by points. The data formats provided are as follows:

• Inputs: Pixel coordinates of bounding boxes (covering the objects in the tightest way):
  – Top-left of the bounding box in the x-axis
  – Top-left of the bounding box in the y-axis
  – Bottom-right of the bounding box in the x-axis
  – Bottom-right of the bounding box in the y-axis


Table 1
Parameters used to generate sets of OverheadSimIntenseye. HFOV: Horizontal field of view (FOV) angle of the camera (in degrees), CRAD: Radial distortion parameters
of the camera (unitless), CDIM: Pixel dimensions of the camera (in pixels), CPOS: 3D position of the camera w.r.t. the reference point in the world coordinate system
(in meters), CROT: Rotation angles of the camera (in degrees), ODIM: 3D dimensions of the overhead object (in meters), OPOS_X: 3D position range of the object
w.r.t. the reference point in the x-axis of the world coordinate system (in meters), OPOS_Y: 3D position range of the object w.r.t. the reference point in the y-axis of
the world coordinate system (in meters), OPOS_Z: 3D position range of the object w.r.t. the reference point in the z-axis of the world coordinate system (in meters),
OROT: Rotation range of the object about Y-axis (in degrees), OSIG: Sigma value used to deviate the edges of the overhead object bounding box (unitless), PSIG:
Sigma value used to deviate the projection point of the overhead object (unitless), NOBS: The number of observations produced (unitless).
Parameters/Sets Set01 Set02 Set03 Set04 Set05
HFOV 60.0 120.0 60.0 60.0 60.0
CRAD (𝑘1 ∕𝑘2 ) -0.05/0.0 -0.05/0.01 -0.05/0.0 -0.05/0.0 -0.05/0.0
CDIM (width/height) 1280/720 1920/1080 1280/720 1280/720 1280/720
CPOS (𝐶𝑋 ∕𝐶𝑌 ∕𝐶𝑍 ) -5.0/-8.0/0.0 -5.0/-8.0/0.0 10.0/-12.0/0.0 -5.0/-8.0/0.0 -5.0/-8.0/0.0
CROT (𝑅𝑜𝑙𝑙∕𝑃 𝑖𝑡𝑐ℎ∕𝑌 𝑎𝑤) 0.0/-10.0/-15.0 0.0/-10.0/-15.0 5.0/-15.0/20.0 0.0/-10.0/-15.0 0.0/-10.0/-15.0
ODIM (𝑤∕ℎ∕𝑙) 2.0/2.5/1.0 2.0/2.5/1.0 1.0/2.5/5.0 2.0/2.5/1.0 2.0/2.5/1.0
OPOS_X (min/max) -40.0/40.0 -40.0/40.0 -40.0/40.0 -80.0/80.0 -40.0/40.0
OPOS_Y (min/max) -10.0/-1.25 -10.0/-1.25 -10.0/-1.25 -20.0/-1.25 -10.0/-1.25
OPOS_Z (min/max) 20.0/50.0 20.0/50.0 20.0/50.0 40.0/70.0 20.0/50.0
OROT (min/max) -10.0/10.0 -10.0/10.0 -10.0/10.0 -30.0/30.0 -10.0/10.0
OSIG 1.0 1.0 1.0 1.0 2.0
PSIG 1.0 1.0 1.0 1.0 2.0
NOBS (train/val./test) 298/746/746 267/669/669 299/747/747 294/741/741 294/740/741

• Targets: The projection coordinates of object centers:
  – Projection position in the x-axis
  – Projection position in the y-axis

Note that all sets of the OverheadSimIntenseye and CraneIntenseye datasets are already split into training, validation, and test portions using random sampling.

5.1. OverheadSimIntenseye

The utilization of simulation datasets in computer vision research has become a well-established technique across various fields, including object detection (Mittal et al., 2022, Boikov et al., 2021) and classification (Wong et al., 2019). By leveraging our simulation tool presented in Sec. 3.2, we enable the generation of realistic scenarios for overhead object projection.

OverheadSimIntenseye comprises five distinct sets, each encompassing different camera placements, camera rotations, and variable-sized objects, along with other camera and lens parameters. The overhead objects are positioned randomly in 3D space, and data is collected at these random locations for each dataset. During collection, the size of the overhead objects, the 3D camera location, the rotational position, and the internal camera and lens properties are fixed for each set. However, random orientations around the Y-axis are applied to the overhead objects to simulate rotational variation in each individual observation position. The fixed and variable parameters and their allowed ranges used to generate each set of OverheadSimIntenseye are presented in Table 1. Simulation images are also provided for visual demonstrations and investigations, along with the positional data.

Bounding box boundaries and projection points are manipulated by adding random deviations within predefined limits, as mentioned in Sec. 3.2. To analyze the effect of these deviations, we generate sets with and without applied deviations to the edges of the bounding boxes and the projection points of the objects. The impact of deviations on performance is tested, and detailed results are presented in Sec. 6.3.1.

5.2. CraneIntenseye

The CraneIntenseye dataset consists of two sets, where the inputs are obtained from actual cameras in facilities. A laser pointer is mounted at the center bottom of the cranes to indicate the projection points in the images and assist with annotation operations. The images are captured from fixed-position cameras while the cranes are in operation. The cameras used for data collection have different lens properties, viewpoints, and rotational angles, and different overhead objects (cranes) are used during the collection of the CraneIntenseye dataset. Unlike the OverheadSimIntenseye dataset, manual annotation is required to label the overhead objects and projection points on the images collected for CraneIntenseye. The same annotation format used for OverheadSimIntenseye is also adopted for CraneIntenseye. Since most of the parameters used to generate the sets of OverheadSimIntenseye are unknown for CraneIntenseye, only the available ones are presented in Table 2 for CraneIntenseye.

Table 2
Accessible and known properties of cameras and observation counts of CraneIntenseye. HFOV: Horizontal FOV angle of the camera (in degrees), CDIM: Pixel dimensions of the camera (in pixels), NOBS: The number of observations produced (unitless).
Sets                      Set01        Set02
HFOV                      85.0         82.2
CDIM (width/height)       1920/1080    1920/1080
NOBS (train/val./test)    313/66/66    351/75/75

5.3. Data preprocessing

Coordinate transformation. To ensure consistency in the data preprocessing for both OverheadSimIntenseye and CraneIntenseye, the same set of preprocessing procedures is applied to both datasets. Initially, the bounding box coordinates, which are in the format (top-left x-axis, top-left y-axis, bottom-right x-axis, bottom-right y-axis), are converted to the format (bottom-center x-axis, bottom-center y-axis, width, height). Since the overhead objects are captured with cameras that have low roll angles, the bottom-center points of the bounding boxes are selected as the input, along with the width and height of the bounding boxes, as they are considered to be most directly related to the projection point. The impact of this decision on performance is discussed in Sec. 6.3.2. The projection point of the overhead objects is represented by the relative displacement of the projection point with respect to the bottom-center point of the bounding box.

Normalization. Subsequently, all input and target pixel information is normalized based on the camera pixel dimensions to fit the data into the [0, 1] range. Finally, the normalized input and target information is utilized for training and testing operations in OverProjNet.

6. Experiments & results

In this section, we evaluate the performance of the proposed solution on different datasets and present the results and findings.


Table 3
Optimal hyperparameters for architecture alternatives of OverProjNet and search ranges. TBDS: Time-based decay scheduler, RPS: Reduce plateau scheduler, OCS:
One cycle scheduler.
Parameters                 Architectures of OverProjNet                                       Search Ranges
                           XS         S          M          L          XL
Batch Size                 16         64         8          8          8           8, 16, 32, 64, 128
Initial Learning Rate      5.81e-3    1.18e-2    1.00e-3    5.10e-3    3.86e-4     [4.54e-5, 1.73e-2]
Learning Rate Scheduler    OCS        OCS        OCS        OCS        OCS         TBDS, RPS, OCS
Epsilon of AdamW           1.43e-10   1.35e-08   1.30e-10   1.18e-09   7.14e-09    [1.83e-12, 6.91e-8]
Weight Decay of AdamW      1.52e-05   2.07e-05   3.26e-07   3.76e-05   2.66e-05    [3.06e-7, 4.54e-5]
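As a hedged illustration of how the tuned values map onto PyTorch objects, the snippet below uses the OverProjNet (XS) column of Table 3; the `model` placeholder, the assumption that the one-cycle peak learning rate equals the tuned initial rate, and the per-epoch step count are ours.

```python
import torch

model = torch.nn.Linear(4, 2)   # placeholder for an OverProjNet variant
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=5.81e-3,             # initial learning rate (XS)
                              eps=1.43e-10,           # epsilon of AdamW (XS)
                              weight_decay=1.52e-05)  # weight decay of AdamW (XS)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer,
                                                max_lr=5.81e-3,
                                                epochs=5000,         # maximum epoch count
                                                steps_per_epoch=19)  # e.g. 298 samples / batch 16
```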

6.1. Experimental details

First, we describe the setup used in the experiments, the performance criteria, and the details of the training and hyperparameter search operations.

6.1.1. Setup

All model developments are made using PyTorch (Paszke et al., 2019), and all hyperparameter tuning operations have been conducted using the sweeping ability of W&B (Biewald, 2020). The training and testing operations are performed on a computer with an RTX 3080 Laptop GPU, an Intel 12-core CPU, and 32 GB of RAM on Ubuntu 20.04 OS.

6.1.2. Performance evaluation

The performance of the model is evaluated by measuring the distance between the estimated projection point and the actual projection point. To classify the predictions as true or false positives, a threshold value proportional to the image dimensions is set. Predictions where the distance between the predicted and the actual projection pixel location is less than or equal to the specified threshold value are considered true positives, while the remaining predictions are evaluated as false positives. The threshold value used to determine the state of the detections is expressed in Eq. (13).

thr = c ∗ sqrt( (2 ∗ p_{c,x})^2 + (2 ∗ p_{c,y})^2 ),   (13)

where c is a constant value and is selected as 0.005 in this study. In other words, with this selected c value, the threshold takes values of approximately 7.3 and 11.0 pixels for images with dimensions of 1280 × 720 and 1920 × 1080 pixels, respectively. Following the determination of the threshold value, the detection accuracy, which represents the ratio of true positive counts to all detections, is reported and computed using Eq. (14).

acc = |TP| / (|TP| + |FP|),   (14)

where |·| denotes the cardinality of a set, TP denotes the set of true positives, and FP is used to abbreviate the set of false positives. In addition to the detection accuracy, the mean and maximum values of the detection errors are also reported for detailed investigation.

Since we know all the required information, the error distances between the actual and predicted projection points are also measured in the world coordinate system using a flat-surface assumption and the back-projection operation expressed in Eq. (6), but only for the OverheadSimIntenseye dataset. For the CraneIntenseye dataset, which lacks geographic information about its scenes, we cannot perform distance-based calculations.

6.1.3. Hyperparameter and training details

In order to achieve the best performance in projection detection, we employ hyperparameter tuning techniques using the Bayesian Searching (Dewancker et al., 2016) capabilities of Weights & Biases (Biewald, 2020). Our aim is to obtain optimal validation performance by sweeping the search parameters within defined ranges and distributions. Specifically, we tune the batch size, the initial learning rate, the learning rate scheduler type, and the epsilon and weight decay parameters of the AdamW (Loshchilov & Hutter, 2018) optimizer. To this end, we use three types of learning rate schedulers: time-based decay (TBDS), reduce plateau (RPS), and one cycle (OCS) (Smith & Topin, 2019). For each of the alternative architectures of OverProjNet, we conduct independent sweeping operations to determine the optimal hyperparameters. We use the first set of OverheadSimIntenseye for all sweeping operations and compare the validation set performances to find the best hyperparameter values. The optimized parameters are then set for all training and testing operations, including those conducted using CraneIntenseye. The maximum epoch count is set to 5000 for all training experiments. The search ranges and optimal hyperparameter values are presented in Table 3. We also swept the hyperparameters one more time for the ablation study, where we used the mean high-order error (MHOE) loss instead of MSE loss to ensure fair comparisons.

6.2. Experiment 1: projection detection performance

This section presents a comparison of the five different architectures of OverProjNet on the OverheadSimIntenseye and CraneIntenseye datasets. The training and validation subsets are used for training, while the test subset is utilized to evaluate the projection detection performance of the trained model. The detection accuracies obtained are reported in Table 4, along with the mean pixel error and maximum pixel error measures. Pixel errors represent the Euclidean distance between the estimated and target locations of the projection points in the pixel domain.

The accuracy values in Table 4 indicate that the detection performances of the different architectures can vary depending on the sets of the OverheadSimIntenseye (OSI) and CraneIntenseye (CRI) datasets. The highest detection accuracies are achieved for the second set (Set02) of OverheadSimIntenseye with various architecture options of OverProjNet, whereas the lowest accuracy values are obtained for the fifth set (Set05) of OverheadSimIntenseye. The best performance values obtained for the sets of CraneIntenseye (below the dashed line in Table 4) are relatively close compared to those achieved for OverheadSimIntenseye. When comparing the architectures of OverProjNet, it is observed that different architectures achieve the best and second-best detection values for different sets. In general, the shallow architecture options (the XS and S versions of OverProjNet) achieve better or comparable accuracy scores on the sets of OverheadSimIntenseye than the other architectures, whereas the deeper architectures (the M and XL versions of OverProjNet) are more accurate in detecting projection points on the sets of CraneIntenseye.

Upon comparing the accuracy scores of Set01 and Set02 of OverheadSimIntenseye, we observe that higher accuracy values are achieved for Set02. This implies that projection points are detected more accurately when higher resolution and wide-angle cameras are used, despite the challenging distortion behavior due to the non-zero k2 value. In fact, the accuracies on Set02 are significantly high, where projection points are detected for nearly all samples in the test split of Set02 using all architectures of OverProjNet. For a detailed description of the sets generated for OverheadSimIntenseye, please refer to Sec. 5.1.

We analyze the effects of camera rotation, camera placement, and overhead object dimensions on the model performance by comparing the scores achieved for Set01 and Set03. It is observed that the detection performance can significantly decrease when attempting to detect projection points of longer and thinner overhead objects from a camera mounted at higher points with steeper angles in each rotational axis.


Table 4
Performance results of the proposed architectures on the sets of OverheadSimIntenseye (OSI) and CraneIntenseye (CRI) datasets. The first columns of each architecture
are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors (MXE), respectively. The dashed line is used to split
the performance measures of OverheadSimIntenseye and CraneIntenseye datasets. Bold and underlined numbers are used to show the best and second-best accuracy
values in each set, respectively.
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .891 4.13 16.00 .870 4.36 28.05 .894 4.15 17.32 .878 4.21 18.23 .845 4.38 20.31
Set02 .991 3.99 15.21 .991 3.81 41.27 .993 4.24 15.44 .981 4.17 16.74 .993 4.04 14.10
OSI Set03 .681 6.39 39.23 .665 6.45 29.51 .641 6.93 36.43 .621 7.23 45.01 .645 7.09 45.26
Set04 .900 3.89 12.63 .875 4.04 31.24 .896 3.88 14.29 .883 4.12 14.37 .855 4.19 20.48
Set05 .577 7.56 26.40 .587 7.56 28.30 .591 7.56 24.54 .575 7.73 33.00 .604 7.54 29.63
CRI Set01  .667 9.98 39.21   .803 8.52 47.97   .742 8.22 45.39   .727 9.13 40.35   .864 6.44 38.85
CRI Set02  .920 6.31 17.23   .880 5.89 17.66   .960 5.14 15.49   .840 6.37 21.85   .947 4.95 29.72

Table 5
The best-performed architectures (BPA) and corresponding mean pixel errors in pixels (ME) and mean metric distance errors in meters (MME) on the sets of the OverheadSimIntenseye (OSI) dataset.
Sets        BPA     ME           MME
OSI Set01   M       4.15         0.17
OSI Set02   M, XL   4.24, 4.04   0.07, 0.07
OSI Set03   XS      6.39         0.29
OSI Set04   XS      3.89         0.50
OSI Set05   XL      7.54         0.57

However, when monitoring the overhead object from a further point and letting the object visit higher altitudes and wider ranges, the detection performance of the projection points does not change much (Set01 vs. Set04 of OverheadSimIntenseye). The last set of OverheadSimIntenseye (Set05) demonstrates a decrease in detection performance as the dislocation amount in the bounding box and projection point positions increases.

As discussed in Sec. 6.1.2, we measure the Euclidean distance errors between the predicted and actual projection points in the world coordinate system for OverheadSimIntenseye. We assume a flat surface and use the back-projection operation to calculate these errors. Table 5 presents the mean metric errors for the best-performing models of each set in OverheadSimIntenseye. Upon inspection of the last column of the table, we note that the mean metric distance error ranges from a few centimeters to around half a meter on average, which is an acceptable range for designing a protection system with a warning mechanism using OverProjNet. Examining the mean pixel errors and mean metric distance errors together, we observe correlations between the two measurements, with some exceptions. For instance, Set04 has the lowest pixel error but relatively higher metric distance errors. This inconsistency can be explained by the fact that the object travels in areas relatively far from the camera, where the ground sampling distances are higher than those closer to the camera. As a result, the mean metric distance error naturally becomes larger.

The accuracy levels achieved in CraneIntenseye are remarkably high, with scores of .964 for Set01 and .960 for Set02, exceeding the best values achieved in certain sets of OverheadSimIntenseye (.681 for Set03 and .604 for Set05) by a significant margin. This is because the sets in OverheadSimIntenseye include challenging scenarios that cannot be easily covered in real-world data, such as different camera placements, rotations, variable object sizes, and various camera and lens parameters. The mean pixel error measures (ME columns in Table 4) also reflect similar findings to the accuracy scores. Generally, the sets with higher accuracy scores achieve lower mean pixel errors. It is worth noting that the maximum pixel error (third column [MXE] in Table 4) can reach values of up to 50 pixels depending on the set and the architecture of OverProjNet, even when the mean pixel error values are less than 10 pixels. This is expected as distance-based measurement techniques usually produce right-skewed errors. However, minimizing the maximal error can help reduce potential false alarms and missed detections of a warning system developed using OverProjNet. Therefore, we integrate the MHOE loss into the solution to evaluate its ability to minimize maximal errors, as discussed in Sec. 6.3.3.

In conclusion, since we report high-performance scores for both sets of CraneIntenseye, we can confidently state that the parameters tuned on Set01 of OverheadSimIntenseye work well on CraneIntenseye's sets too.

6.3. Ablation study

In this study, we conduct a comparative analysis to investigate the impact of several factors on the accuracy of object projection detection. Specifically, we examine the effects of deviations in the position of bounding boxes and projection points of objects, coordinate transformation, loss functions, and architectural regularization modifications. In addition, we evaluate the importance of designing a solution with networks that can model non-linearities by testing the performance of a linear perceptron, which is also reported in the scope of our ablation study.

6.3.1. Experiment 2: the effect of deviations

In Sec. 3.2, we have described the manipulation of bounding box boundaries in OverheadSimIntenseye by introducing random deviations to simulate annotation and amodal object/segmentation defects. Additionally, we have introduced annotation errors in the projection points of the simulated object's center by randomly dislocating these points. In this section, we investigate the effects of these deviations on the performance of OverProjNet. We ask two questions: "What if we do not apply these deviations?" and "How do these deviations affect the performance of OverProjNet?" To answer these questions, we cancel out the deviations in OverheadSimIntenseye and compare the projection detection results with the original dataset.

We find that when we eliminate the deviation operation on the edges of the bounding boxes covering the overhead objects, the accuracy of all sets of OverheadSimIntenseye increases dramatically for all architecture alternatives of OverProjNet (see the first columns [acc] for each architecture alternative in Table 4 and Table 6 (a)). In parallel, the mean pixel errors are also decreased greatly (see the second columns [ME] for each architecture alternative in Table 4 and Table 6 (a)). However, when we eliminate the deviations in the projection points but keep the deviation for the edge locations of the overhead objects' bounding boxes at the same level, we do not achieve the same amount of performance increase as observed for the case of no deviations in the edges of the bounding boxes (see the first columns [acc] for each architecture alternative in Table 4, Table 6 (a) and Table 6 (b)). Therefore, we conclude that the precision of the bounding box annotation is more critical compared to the projection annotations while training and producing OverProjNet models for better estimations. When we cancel out all deviations, we observe significant increases in the accuracy values of Set05 of OverheadSimIntenseye compared to the case with no deviation on the edges of the bounding boxes (see the last row of the first columns for each architecture alternative in Table 6 (a) and Table 6 (c)). This is due to the higher deviation values used for this set.
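To make the perturbation concrete, a minimal sketch is given below; the zero-mean Gaussian noise model and the parameter names follow the OSIG/PSIG values quoted in Tables 6 and 7, but the exact procedure of Sec. 3.2 may differ.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def deviate_sample(box, proj_point, osig, psig):
    """Randomly dislocate bounding-box edges (std = OSIG) and the projection point (std = PSIG).

    Sketch of the perturbation applied to OverheadSimIntenseye samples."""
    box = np.asarray(box, dtype=float) + rng.normal(0.0, osig, size=4)                 # (x1, y1, x2, y2)
    proj_point = np.asarray(proj_point, dtype=float) + rng.normal(0.0, psig, size=2)   # (u, v)
    return box, proj_point
```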


Table 6
Performance results of proposed architectures without deviations on bounding box and/or projection points of OverheadSimIntenseye’s (OSI) samples. The first
columns of each architecture are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors (MXE), respectively.
Bold and underlined numbers are used to show the best and second-best accuracy values.
(a) No deviations on the edges of the bounding box (i.e., OSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .993 2.26 9.69 .983 2.37 30.20 .992 2.62 12.78 .988 2.32 12.80 .991 2.22 11.16
Set02 1.00 1.97 10.39 .999 3.46 16.76 1.00 2.63 8.69 1.00 2.24 10.51 .999 3.40 11.13
OSI Set03 .893 4.11 20.45 .870 4.29 17.57 .693 6.10 48.20 .656 6.57 38.25 .683 6.49 44.74
Set04 .997 2.12 7.47 .992 3.76 24.96 1.00 2.47 6.30 .997 2.40 7.68 .997 1.72 11.59
Set05 .972 3.21 9.89 .975 3.31 18.05 .992 2.94 8.73 .964 3.33 16.29 .966 3.20 14.38
(b) No deviations on the projection point of the overhead object (i.e., PSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .906 3.74 16.04 .887 4.03 30.76 .900 3.64 15.95 .900 3.67 15.71 .879 3.88 18.41
Set02 .988 3.80 13.49 .985 3.75 30.36 .990 3.45 19.03 .981 3.94 23.38 .984 3.68 14.98
OSI Set03 .696 6.54 40.62 .703 5.92 27.92 .625 7.04 38.86 .629 7.52 42.66 .657 6.69 40.71
Set04 .888 3.89 15.04 .882 4.17 28.62 .890 3.94 14.69 .856 4.21 23.83 .871 4.04 20.82
Set05 .622 7.04 24.23 .608 6.95 26.02 .608 7.07 27.45 .608 7.08 31.49 .593 7.06 26.33
(c) No deviations on any of the edges of the bounding box and the projection point of the overhead object (i.e., OSIG = 0.0 and PSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .999 2.26 13.59 .984 2.50 30.94 .997 1.42 11.73 .995 2.83 12.39 .989 2.05 18.52
Set02 .999 3.14 14.38 .999 3.32 13.46 .999 2.42 11.12 1.00 3.07 7.86 1.00 3.23 8.77
OSI Set03 .827 4.75 20.04 .869 4.23 20.30 .742 5.49 40.50 .687 6.43 38.23 .680 6.65 47.71
Set04 .999 3.78 7.48 .995 3.75 25.97 1.00 2.68 6.27 .993 2.70 8.32 .999 1.53 10.66
Set05 .988 1.81 9.81 .997 1.67 8.82 .997 1.89 8.54 .999 1.33 9.91 .999 1.27 7.61

Fig. 12. Overall effect of coordinate transformation on accuracy values for OverheadSimIntenseye dataset.

Fig. 13. Effect of coordinate transformation on accuracy values for each alternative of architectures of OverProjNet for OverheadSimIntenseye dataset.

6.3.2. Experiment 3: the effect of coordinate transformation

As described in Sec. 5.3, we convert the pixel position representation of bounding boxes from the conventional format of (top-left x-axis, top-left y-axis, bottom-right x-axis, bottom-right y-axis) to the format of (bottom-center x-axis, bottom-center y-axis, width, height). This section examines the impact of these transformations on projection detection performance. We repeat the experiments performed in Sec. 6.3.1 and Sec. 6.2 without applying the coordinate transformation and report the outcomes in Table 7. Generally, omitting the coordinate transformation operation used for input and target locations reduces the detection performance. In other words, the coordinate transformation operation enhances the projection detection ability of OverProjNet models in all deviation modes.
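A minimal sketch of this conversion is shown below; the tuple ordering mirrors the description above, while the function name is ours.

```python
def to_bottom_center(box):
    """(top-left x, top-left y, bottom-right x, bottom-right y) ->
    (bottom-center x, bottom-center y, width, height)."""
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    return (x1 + width / 2.0, y2, width, height)  # the bottom-center point lies on the lower edge

# Example: to_bottom_center((50, 40, 150, 240)) -> (100.0, 240, 100, 200)
```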
To visualize the difference between accuracies with and without the coordinate transformation, we use box plots. Fig. 12 reflects pairwise accuracy differences for all sets of OverheadSimIntenseye for all architecture levels of OverProjNet. We observe that even the first quantile value lies in the positive region of the plot, and the maximum accuracy differences reached up to 0.20, which is a significant difference in accuracy values. When we plot the accuracy differences per architecture of OverProjNet in Fig. 13, we notice that the coordinate transformation operation provides a significant advantage in estimating projection points for shallow structures (XS, S) more than the other architectures. This is because shallow architectures have a limited ability to model non-linearities, and the coordinate transformation operations may not be modeled and represented accurately by these architectures. The coordinate transformation operation increases the performance of the projection detection regardless of the deviation modes, as shown in Fig. 15. The per-set plot (Fig. 14) for OverheadSimIntenseye shows that the differences between the accuracies are more visible for Set03 and Set05, although we report the advantage of the coordinate transformation for Sets 01 and 02 as well. Similarly, significant performance drops are observed in Fig. 16 and Fig. 17 for the sets of CraneIntenseye when we do not apply the coordinate transformation operation.

Fig. 14. Effect of coordinate transformation on accuracy values for each set of OverheadSimIntenseye.


Table 7
Performance results of proposed architectures on the sets of OverheadSimIntenseye (OSI) and CraneIntenseye (CRI) datasets when coordinate transformation operation
is skipped. The first columns of each architecture are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors
(MXE), respectively. The dashed line is used to split the performance measures of OverheadSimIntenseye and CraneIntenseye datasets. Bold and underlined numbers
are used to show the best and second-best accuracy values.
Deviations are enabled for both the edges of the bounding boxes and projection points
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .873 4.39 15.98 .854 4.35 22.37 .885 4.07 14.20 .814 5.15 39.18 .853 4.63 24.17
Set02 .979 4.28 24.84 .991 3.97 31.26 .988 3.79 33.76 .975 4.45 25.05 .987 4.37 33.24
OSI Set03 .639 6.87 29.93 .633 7.07 32.78 .635 6.99 29.14 .613 7.51 48.86 .615 7.90 54.44
Set04 .890 3.99 14.19 .882 3.95 13.31 .872 3.95 16.05 .867 4.16 18.13 .866 4.25 17.15
Set05 .557 7.63 25.03 .553 7.89 50.16 .540 8.45 77.32 .549 8.80 71.53 .520 8.83 59.21
CRI Set01  .561 12.44 49.97   .652 10.12 51.44   .788 8.41 52.77   .758 8.71 45.53   .758 7.97 42.75
CRI Set02  .893 6.58 16.47    .880 5.95 17.99    .907 5.96 14.62   .813 7.51 24.54   .867 6.28 19.14
No deviations on the edges of the bounding boxes (i.e., OSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .988 2.79 16.71 .993 2.20 13.91 .996 2.42 11.23 .969 2.72 40.62 .985 2.82 20.82
Set02 .994 3.82 27.17 .990 3.49 26.36 .996 3.23 17.98 .997 3.31 28.55 .997 3.89 23.75
OSI Set03 .700 6.27 28.29 .679 6.45 28.92 .687 6.40 25.74 .681 6.87 37.42 .673 6.87 59.70
Set04 .997 2.19 8.92 .993 2.24 14.30 .999 2.59 7.48 .989 2.01 10.64 .992 2.46 13.91
Set05 .968 3.37 24.12 .952 3.68 43.44 .940 3.85 66.10 .911 4.72 54.65 .930 4.00 42.80
No deviations on the projection points of the overhead object (i.e., PSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .873 4.11 14.91 .885 4.04 16.08 .910 3.68 12.63 .863 4.12 23.87 .866 4.17 22.07
Set02 .979 4.34 24.88 .981 4.01 31.17 .984 4.40 31.78 .973 4.48 16.81 .984 3.75 35.63
OSI Set03 .639 6.77 31.71 .647 6.80 32.10 .633 6.74 26.32 .605 7.45 36.38 .655 7.08 31.90
Set04 .884 3.97 17.14 .887 3.89 15.87 .886 3.86 14.54 .880 4.25 17.58 .872 4.14 15.57
Set05 .605 7.09 24.99 .608 7.32 44.48 .560 8.34 81.16 .557 8.41 74.63 .565 7.93 48.45
No deviations on any of the edges of the bounding boxes and the projection points of the overhead object (i.e., OSIG = 0.0 and PSIG = 0.0)
Arch.∖Sets XS S M L XL
acc ME MXE acc ME MXE acc ME MXE acc ME MXE acc ME MXE
Set01 .991 2.46 12.79 .979 2.82 21.98 .996 1.96 15.49 .973 2.98 16.80 .984 2.51 17.56
Set02 .999 4.26 17.38 .990 4.05 28.62 .994 3.36 24.33 .999 4.04 12.51 .994 3.79 13.25
OSI Set03 .708 6.16 30.55 .691 6.22 29.56 .707 6.03 39.45 .675 6.55 39.25 .665 6.85 51.61
Set04 1.00 2.83 6.47 .997 3.35 9.22 1.00 1.90 5.68 .999 2.55 7.62 .995 2.35 15.57
Set05 .973 2.54 42.69 .961 2.48 44.17 .964 2.99 59.82 .948 3.41 69.35 .960 2.20 37.06

Table 8
Performance results of OverProjNet (M) with (w/CT) and without (w/o CT) coordinate transformation using MSE and MHOE loss functions on CraneIntenseye (CRI)
dataset. The first columns for each loss mode are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors (MXE),
respectively. Bold numbers are used to show the best accuracies.
Arch.∖Sets MSE (w/CT) MSE (w/o CT) MHOE (w/CT) MHOE (w/o CT)
acc ME MXE acc ME MXE acc ME MXE acc ME MXE
CRI Set01  .742 8.22 45.39   .788 8.41 52.77   .576 11.57 47.40   .636 10.16 50.29
CRI Set02  .960 5.14 15.49   .907 5.96 14.62   .907 6.31 20.15    .747 7.61 20.09

Fig. 15. Effect of coordinate transformation on accuracy values for each deviation mode (deviations are enabled for the Default mode).

Fig. 16. Overall effect of coordinate transformation on accuracy values for CraneIntenseye dataset.

6.3.3. Experiment 4: the effect of the loss function

In this experiment, we conduct a parameter tuning process for all hyperparameters using the mean high-order error (MHOE) loss function on the Set01 of OverheadSimIntenseye. Subsequently, we train and evaluate OverProjNet (M) on the CraneIntenseye dataset using the optimized parameters. The obtained testing results from these trials are reported in Table 8.

Table 8 indicates that the performance of estimating projection points is better with MSE loss compared to the MHOE loss on the CraneIntenseye dataset. Furthermore, no significant reduction is observed in the maximum pixel error when we use the MHOE loss instead of the MSE loss.
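For reference, one way to realize a mean high-order error is sketched below in PyTorch; the exponent is an assumption (the exact order is not restated here), chosen even so that large residuals are penalized much more heavily than in MSE.

```python
import torch

def mhoe_loss(pred, target, order=4):
    """Mean high-order error: an MSE-like loss with a larger (assumed) even exponent,
    which emphasizes the largest residuals."""
    return torch.mean((pred - target) ** order)

# Baseline for comparison (Table 8):
# mse = torch.nn.functional.mse_loss(pred, target)
```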
6.3.4. Experiment 5: the effect of regularization modifications

In this experiment, we examine the impact of adding batch normalization or dropout layers to OverProjNet (M), as outlined in Sec. 6.3.4. After applying these modifications and training the model, we evaluate the modified versions of OverProjNet on the CraneIntenseye. We also vary the drop probability levels (pr = 0.10, 0.25, 0.50) in the dropout layers to investigate the effect of the drop probability ratio on detection performance. The results of these experiments are presented in Tables 9 and 10.
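A sketch of the modification is given below; the layer width, activation, and block structure are illustrative assumptions rather than the exact OverProjNet (M) definition.

```python
import torch.nn as nn

def hidden_block(in_dim, out_dim, use_batchnorm=False, drop_prob=0.0):
    """One hidden block of an OverProjNet-style MLP with optional BatchNorm/Dropout."""
    layers = [nn.Linear(in_dim, out_dim)]
    if use_batchnorm:
        layers.append(nn.BatchNorm1d(out_dim))  # batch-normalization variant (Table 9)
    layers.append(nn.ReLU())
    if drop_prob > 0.0:
        layers.append(nn.Dropout(p=drop_prob))  # dropout variant with ratio pr (Table 10)
    return nn.Sequential(*layers)
```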


Comparing the detection rates presented in Table 10, it can be observed that the best detection performance is obtained when the dropout layers are not used (i.e., drop ratio pr = 0.0) for both sets of CraneIntenseye. Similarly, the detection performance decreases significantly when batch normalization layers are added to modify OverProjNet. Thus, it can be concluded that the regularization modifications do not improve the detection performance, and there is no need to increase the complexity of the architectures with additional layers.

Table 9
Performance results of OverProjNet (M) on CraneIntenseye (CRI) dataset with and without batch normalization layers. The first columns are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors (MXE), respectively. Bold numbers are used to show the best accuracies. (w/CT): Coordinate transformation enabled. (w/o CT): Coordinate transformation disabled.
without batch normalization layers
Arch.∖Sets    M (w/CT)              M (w/o CT)
              acc    ME     MXE     acc    ME     MXE
CRI Set01     .742   8.22   45.39   .788   8.41   52.77
CRI Set02     .960   5.14   15.49   .907   5.96   14.62
with batch normalization layers
Arch.∖Sets    M (w/CT)              M (w/o CT)
              acc    ME     MXE     acc    ME     MXE
CRI Set01     .379   14.69  34.52   .182   29.05  115.42
CRI Set02     .560   12.10  33.89   .227   24.61  61.35

Table 10
Performance results of OverProjNet (M) on CraneIntenseye (CRI) dataset after the dropout modification. The first columns are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors (MXE), respectively. Bold numbers are used to show the best accuracies. (w/CT): Coordinate transformation enabled. (w/o CT): Coordinate transformation disabled.
Drop ratio (i.e., pr) = 0.0
Arch.∖Sets    M (w/CT)              M (w/o CT)
              acc    ME     MXE     acc    ME     MXE
CRI Set01     .742   8.22   45.39   .788   8.41   52.77
CRI Set02     .960   5.14   15.49   .907   5.96   14.62
Drop ratio (i.e., pr) = 0.10
CRI Set01     .742   8.42   40.52   .667   11.33  52.19
CRI Set02     .907   6.43   18.56   .733   7.94   20.55
Drop ratio (i.e., pr) = 0.25
CRI Set01     .636   10.38  35.89   .258   17.07  46.40
CRI Set02     .707   9.43   23.21   .240   20.85  70.55
Drop ratio (i.e., pr) = 0.50
CRI Set01     .439   16.81  52.69   .076   56.69  114.89
CRI Set02     .213   17.46  41.67   .053   65.28  183.88

6.3.5. Experiment 6: comparison with linear perceptron

To confirm the importance of designing networks capable of modeling non-linearities, we test the projection detection performance of a linear perceptron. As anticipated, the single-layer linear perceptron results in poor accuracy performance on the sets of CraneIntenseye, as presented in Table 11. Alongside the poor accuracies, the mean pixel errors are nearly four times higher than those reported in Table 4 for the best detection rate cases.

Table 11
Performance results of perceptron with (w/CT) and without (w/o CT) coordinate transformation on CraneIntenseye dataset (CRI). The first columns for each transformation mode are for accuracy values (acc). The second and third columns are for the mean (ME) and maximum pixel errors (MXE), respectively. Bold numbers are used to show the best accuracies.
Arch.∖Sets    Perceptron (w/CT)      Perceptron (w/o CT)
              acc    ME     MXE      acc    ME     MXE
CRI Set01     .333   19.04  74.02    .333   19.36  67.15
CRI Set02     .213   22.18  47.04    .213   38.55  108.78

6.4. Experiment 7: throughput and elapsed time analysis

As the architectural complexity levels of the alternative architectures of OverProjNet differ, their computational performances also vary. To analyze this, we measure the average processing time during the inference operation and the throughput performances over 100 iterations and five different batch size levels. These results are presented in Table 12.

Fig. 17. Overall effect of coordinate transformation on accuracy values for CraneIntenseye dataset.

Table 12
Average processing time (APT) per batch and sample (in msec.) and throughput for architecture alternatives of OverProjNet.
Architectures of OverProjNet (columns): XS, S, M, L, XL
Batch size = 1
APT-batch     0.055    0.078    0.108    0.155    0.215
APT-sample    5.49e-2  7.79e-2  1.08e-1  1.55e-1  2.15e-1
Throughput    18       13       9        6        5
Batch size = 16
APT-batch     0.053    0.078    0.102    0.148    0.235
APT-sample    3.32e-3  4.89e-3  6.40e-3  9.27e-3  1.47e-2
Throughput    301      205      156      108      68
Batch size = 256
APT-batch     0.059    0.082    0.103    0.167    0.247
APT-sample    2.32e-4  3.18e-4  4.02e-4  6.51e-4  9.63e-4
Throughput    4309     3140     2489     1537     1038
Batch size = 4096
APT-batch     0.055    0.076    0.113    0.176    0.792
APT-sample    1.33e-5  1.86e-5  2.76e-5  4.29e-5  1.93e-4
Throughput    75084    53706    36214    23333    5172
Batch size = 65536
APT-batch     0.094    0.158    0.422    1.22     11.4
APT-sample    1.44e-6  2.41e-6  6.44e-6  1.86e-5  1.75e-4
Throughput    694134   414283   155233   53728    5721


Fig. 18. Qualitative results of different architectures of OverProjNet on samples from OverheadSimIntenseye and CraneIntenseye datasets.

Table 12 presents the computational cost analysis of OverProjNet for different model sizes. Notably, deeper models incur higher computational costs and lower throughput counts compared to shallower ones, which is expected due to their higher complexity. However, even the XL model has a very low processing time, as it can process 5721 samples in 1 millisecond on average when the batch size is set to 65,536, while the XS model can process up to 694,134 samples in the same time frame. We also observe that the batch processing time remains constant for all versions of OverProjNet except for the XL model, which experiences an increase when the batch size exceeds 256.
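For completeness, the timing procedure can be sketched as follows; the input dimensionality, device, and warm-up handling are assumptions and not the exact benchmarking script behind Table 12.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, batch_size, in_dim=6, iters=100, device="cuda"):
    """Average per-batch latency (ms) and throughput (samples per ms) over `iters` runs."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, in_dim, device=device)
    model(x)  # warm-up run
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    batch_ms = (time.perf_counter() - start) * 1000.0 / iters
    return batch_ms, batch_size / batch_ms
```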
6.5. Experiment 8: qualitative analysis

We present visual results of OverProjNet on samples from OverheadSimIntenseye and CraneIntenseye datasets in Fig. 18 to illustrate the performance of the proposed method. The actual projection points are indicated with red stars, while cyan circles represent the estimated projection points in all visual outputs.


Additionally, we use green bounding boxes to enclose the overhead objects in the images. More qualitative visuals from the remaining sets of both datasets are provided in Appendix Fig. 20. It is important to note that all printed samples displayed in Appendix Fig. 20 are selected randomly without considering the detection performance. Additionally, Appendix Fig. 21 showcases the visual outputs of frames with maximal error cases from CraneIntenseye. The maximal error, the distance between the cyan circle and the red star, for Set02 is relatively small, indicating that OverProjNet performs well on this set. Although the maximal error observed for Set01 is relatively higher than that of Set02, the prediction performance is considered sufficient for constructing a warning system employing OverProjNet.

7. Conclusion

This study focuses on detecting the projections of overhead objects from camera images, which poses a significant challenge due to the need for a robust and generalized solution that can accommodate a wide range of object sizes and shapes. Additionally, various factors, such as camera and lens parameters, 3D placements, and rotations of both the object and camera, make the problem even more challenging. We propose OverProjNet using regression principles and deep layers to address this challenge. To validate the performance of OverProjNet, we provide the CraneIntenseye and OverheadSimIntenseye datasets. CraneIntenseye is collected using actual cameras from real facilities, while OverheadSimIntenseye is generated using image formation and ray tracing techniques with the designed simulation tool to enhance data diversity by covering various factors. The experimental results and comprehensive analysis demonstrate that OverProjNet can reliably and efficiently detect the projection points of overhead objects despite the challenging conditions and disturbances. Therefore, it can be used with confidence to warn people and prevent struck-by-falling accidents.

7.1. Limitations and future work

To detect the projection point of overhead objects, it is necessary first to identify the objects. If an object is located outside the camera's field of view or completely blocked by other objects, it becomes impossible to detect the objects and their corresponding projection points. Additionally, when considering the implementation of a warning solution using OverProjNet, it is important to note that individuals in close proximity to the projection points must be within the camera's view. If people are blocked by other objects or positioned outside the camera's field of view, they cannot be detected, and the warning system cannot issue alerts.

We assessed the average processing time during the inference operation and evaluated the throughput performances on our experimental setup. Since the computational performance is closely tied to the resources available in the setup, future studies may consider measuring the computational performance on alternative devices such as embedded devices and high-powered server GPUs. Furthermore, in our simulation sets, we assumed that the ground level has no variations in altitude. However, in future studies, it would be beneficial to relax this assumption and incorporate the ability of the simulation tool to generate spatial surfaces. This would allow for evaluating the detection algorithm's accuracy under different surface conditions and analyzing the impact of varying surface behaviors.

Our work can also be extended to predict the projections of not only the center points but also the corner and/or edge points of overhead objects. Furthermore, the detection performance of OverProjNet can be evaluated using cameras equipped with different lenses to assess the impact of other types of distortions on the model's performance.

Another potential avenue for future research is the design of a novel loss function that considers the ground sampling distance of pixels. Since the distance per pixel varies based on the camera's perspective characteristics, incorporating this information into the loss function could improve convergence.

CRediT authorship contribution statement

Poyraz Umut Hatipoglu: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft. Ali Ufuk Yaman: Conceptualization, Data curation, Resources, Writing – review & editing. Okan Ulusoy: Validation, Writing – original draft, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work, the authors used ChatGPT (based on the GPT-3.5 architecture) to improve readability and language. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Acknowledgements

We wish to express our sincere gratitude to Matay Otomotiv Yan Sanayi A.Ş. and Segezha Ambalaj San. ve Tic. A.Ş. for their kind support and guidance during the data collection. Moreover, we would like to extend our appreciation to FS Plastik San. ve Tic. A.Ş. and Mr. Hüseyin Arslan for their help in the tests and trials on the equipment used in the data collection.

Appendix A

A.1. Distance calculations from top view

See Fig. 19.

Fig. 19. Distance calculations for the overhead object from the top view.


Fig. 20. Qualitative results of different architectures of OverProjNet on various set samples.


Fig. 20. (continued)


Fig. 21. Qualitative results of OverProjNet on samples with maximal errors from CraneIntenseye.


A.2. Matrix representations of rotations

Rotations about the x (pitch), y (yaw), and z (roll) axes can be represented in matrix form as shown in Eq. (15), Eq. (16), and Eq. (17).

$$\mathbb{R}_x(\alpha) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix} \tag{15}$$

$$\mathbb{R}_y(\beta) = \begin{bmatrix} \cos\beta & 0 & -\sin\beta \\ 0 & 1 & 0 \\ \sin\beta & 0 & \cos\beta \end{bmatrix} \tag{16}$$

$$\mathbb{R}_z(\gamma) = \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{17}$$
performance deep learning library. Advances in Neural Information Processing Systems,
A.3. OverProjNet output visuals 32.
Pedrazzini, F. (2018). 3D position estimation using deep learning. Ph.D. thesis. KTH Royal
See Figs. 20 and 21. Institute of Technology.
Price, L. C., Chen, J., Park, J., & Cho, Y. K. (2021). Multisensor-driven real-time crane
monitoring system for blind lift operations: Lessons learned from a case study. Au-
References tomation in Construction, 124, Article 103552.
Reddy, N. D., Tamburo, R., & Narasimhan, S. G. (2022). Walt: Watch and learn 2d amodal
Agarap, A. F. (2018). Deep learning using rectified linear units (ReLU). arXiv preprint. representation from time-lapse imagery. In Proceedings of the IEEE/CVF conference on
Retrieved from arXiv:1803.08375. computer vision and pattern recognition (pp. 9356–9366).
Biewald, L. (2020). Experiment tracking with weights and biases. Retrieved from wandb. Slabaugh, G. G. (1999). Computing Euler angles from a rotation matrix. Retrieved from
com. http://eecs.qmul.ac.uk/~gslabaugh/publications/euler.pdf.
Boikov, A., Payor, V., Savelev, R., & Kolesnikov, A. (2021). Synthetic data generation for Smith, L. N., & Topin, N. (2019). Super-convergence: Very fast training of neural networks
steel defect detection and classification using deep learning. Symmetry, 13, 1176. using large learning rates. In Artificial intelligence and machine learning for multi-domain
Bureau of Labor Statistics (2008). Crane-related occupational fatalities. Retrieved from operations applications (pp. 369–386). SPIE.
https://www.bls.gov/iif/factsheets/archive/crane-related-occupational-fatalities- Su, W., Zhu, X., Tao, C., Lu, L., Li, B., Huang, G., Qiao, Y., Wang, X., Zhou, J., & Dai, J.
2006.pdf. (2022). Towards all-in-one pre-training via maximizing multi-modal mutual informa-
Bureau of Labor Statistics (2019). Fatal occupational injuries involving cranes. On- tion. arXiv preprint. Retrieved from arXiv:2211.09807.
line. Retrieved from https://www.bls.gov/iif/factsheets/fatal-occupational-injuries- Sue, M. K. (1981). Radio frequency interference at the geostationary orbit. Final report.
cranes-2011-17.htm. Pasadena: Jet Propulsion Lab., California Inst. of Tech.
Chen, H. T., Tien, M. C., Chen, Y. W., Tsai, W. J., & Lee, S. Y. (2009). Physics-based ball Wang, C. Y., Bochkovskiy, A., & Liao, H. Y. M. (2022). YOLOv7: Trainable bag-of-freebies
tracking and 3d trajectory reconstruction with applications to shooting location esti- sets new state-of-the-art for real-time object detectors. arXiv preprint. Retrieved from
mation in basketball video. Journal of Visual Communication and Image Representation, arXiv:2207.02696.
20, 204–216. Wang, J., Shi, F., Zhang, J., & Liu, Y. (2008). A new calibration model of camera lens
Cheng, T., & Teizer, J. (2014). Modeling tower crane operator visibility to minimize the distortion. Pattern Recognition, 41, 607–615.
risk of limited situational awareness. Journal of Computing in Civil Engineering, 28, Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K.,
Article 04014004. Singhal, S., Som, S., et al. (2022). Image as a foreign language: Beit pretraining for all
Dewancker, I., McCourt, M., & Clark, S. (2016). Bayesian optimization for machine learn- vision and vision-language tasks. arXiv preprint. Retrieved from arXiv:2208.10442.
ing: A practical guidebook. arXiv preprint. Retrieved from arXiv:1612.04858. Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., et al.
Fang, Y., Cho, Y. K., & Chen, J. (2016). A framework for real-time pro-active safety assis- (2022). Internimage: Exploring large-scale vision foundation models with deformable
tance for mobile crane lifting operations. Automation in Construction, 72, 367–379. convolutions. arXiv preprint. Retrieved from arXiv:2211.05778.
Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., & Cao, Y. Wei, Y., Hu, H., Xie, Z., Zhang, Z., Cao, Y., Bao, J., Chen, D., & Guo, B. (2022). Contrastive
(2022). EVA: Exploring the limits of masked visual representation learning at scale. learning rivals masked image modeling in fine-tuning via feature distillation. arXiv
arXiv preprint. Retrieved from arXiv:2211.07636. preprint. Retrieved from arXiv:2205.14141.
Gählert, N., Hanselmann, N., Franke, U., & Denzler, J. (2020). Visibility guided NMS: Wong, M. Z., Kunii, K., Baylis, M., Ong, W. H., Kroupa, P., & Koller, S. (2019). Synthetic
Efficient boosting of amodal object detection in crowded traffic scenes. arXiv preprint. dataset generation for object-to-model deep learning in industrial applications. PeerJ
Retrieved from, arXiv:2006.08547. Computer Science, 5, Article e222.
Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge Wu, J., Ma, L., & Hu, X. (2016). Predicting world coordinates of pixels in rgb images using
University Press. convolutional neural network for camera relocalization. In 2016 seventh international
Hwang, S. (2012). Ultra-wide band technology experiments for real-time prevention of conference on intelligent control and information processing (pp. 161–166). IEEE.
tower crane collisions. Automation in Construction, 22, 545–553. Xu, S., Wang, X., Lv, W., Chang, Q., Cui, C., Deng, K., Wang, G., Dang, Q., Wei, S., Du,
Ke, L., Tai, Y. W., & Tang, C. K. (2021). Deep occlusion-aware instance segmentation with Y., et al. (2022). PP-YOLOE: An evolved version of YOLO. arXiv preprint. Retrieved
overlapping bilayers. In Proceedings of the IEEE/CVF conference on computer vision and from arXiv:2203.16250.
pattern recognition (pp. 4019–4028). Yang, J., Huang, M., Chien, W., & Tsai, M. (2015). Application of machine vision to
Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., Ke, Z., Li, Q., Cheng, M., Nie, W., et collision avoidance control of the overhead crane. In 2015 international conference on
al. (2022). YOLOv6: A single-stage object detection framework for industrial applica- electrical, automation and mechanical engineering (pp. 361–364). Atlantis Press.
tions. arXiv preprint. Retrieved from arXiv:2209.02976. Zhou, X., Zhuo, J., & Krahenbuhl, P. (2019). Bottom-up object detection by grouping
Li, H., Chan, G., & Skitmore, M. (2013). Integrating real time positioning systems to extreme and center points. In Proceedings of the IEEE/CVF conference on computer vision
improve blind lifting and loading crane operations. Construction Management and Eco- and pattern recognition (pp. 850–859).
nomics, 31, 596–605.
