
Journal of Computational Design and Engineering, 2023, 10(3), 1158–1175
DOI: 10.1093/jcde/qwad042
Advance access publication date: 23 May 2023
Research Article

An improved YOLOX approach for low-light and small object detection: PPE on tunnel construction sites

Zijian Wang 1,2, Zixiang Cai 3 and Yimin Wu 1,*

1 Department of Civil Engineering, Central South University, Changsha 410075, China
2 Faculty of Civil and Environmental Engineering, Israel Institute of Technology, Haifa 3200003, Israel
3 Alibaba Cloud Intelligence, Alibaba Group, Hangzhou 311100, China

* Corresponding author. E-mail: wuyimin531@csu.edu.cn



Abstract
Tunnel construction sites pose a significant safety risk to workers due to the low-light conditions that can affect visibility and lead to accidents. Therefore, identifying personal protective equipment (PPE) is critical to prevent injuries and fatalities. Few studies have addressed the challenges posed by tunnel construction sites, where lighting conditions are poor and images are captured from a distance. In this study, we proposed an improved YOLOX approach and a new dataset for detecting low-light and small PPE. We modified the YOLOX architecture by adding ConvNeXt modules to the backbone for deep feature extraction and introducing a fourth YOLOX head to enhance multiscale prediction. Additionally, we adopted the CLAHE algorithm for augmenting low-light images after comparing it with eight other methods. Consequently, the improved YOLOX approach achieves a mean average precision of 86.94%, which is 4.23% higher than the original model and outperforms selected state-of-the-art models. It also improves the average precision of small object classes by 7.17% on average and attains a real-time processing speed of 22 FPS (frames per second). Furthermore, we constructed a novel dataset with 8285 low-light instances and 6814 small ones. The improved YOLOX approach offers accurate and efficient detection performance, which can reduce safety incidents on tunnel construction sites.

Keywords: computer vision, machine learning, deep learning, tunnel management, construction safety

1. Introduction

The construction industry is labor-intensive, characterized by complex working environments and frequent safety accidents. According to statistics from the Ministry of Housing and Urban-Rural Development of China, in 2019, there were 904 construction-related fatalities, with 54% resulting from falls and 16% caused by collisions. Furthermore, the trend of safety accidents continues to rise each year (Ministry of Housing and Urban-Rural Development, 2020). Similarly, the UK Health and Safety Executive (HSE) reported 147 fatal injuries to workers in 2018–2019, primarily due to falls from heights without adequate protective equipment (HSE, 2017).

Compared to outdoor construction sites, tunnel environments present more complex lighting conditions, with little to no natural light (Fig. 1). This can affect vision and consequently lead to accidents. Visibility is further compromised by heavy dust generated during tunnel excavation. Additionally, the situation worsens when large construction equipment, such as excavators and transport trucks, operate in the narrow tunnel space, obstructing workers' view and making it harder to identify safety hazards. As a result, tunnel construction sites are more likely to face safety accidents, resulting in injuries and even deaths, along with project delays.

The majority of injuries and fatalities can be prevented by properly wearing personal protective equipment (PPE), such as helmets, safety glasses, and gloves (OSHA, 2005). Previous studies have demonstrated that wearing helmets effectively reduces the likelihood of skull fractures, neck sprains, and concussions (Hume & Mills, 1995), including a 95% decrease in head injuries (Suderman et al., 2014). In many countries, properly worn PPE is required in construction environments and serves as a crucial evaluation criterion for construction projects. With the installation of cameras on construction sites and advancements in computer vision, efficient and accurate PPE detection based on deep learning techniques offers a new approach to improve construction safety. By capturing images through cameras, researchers utilize pre-trained deep learning models to detect PPE in real time and notify workers or managers to take appropriate action.

Tunnel constructions are underground with limited natural light, and surveillance cameras are typically positioned far from work areas. Consequently, many low-light and small PPE objects appear in the collected images. This study presents an improved YOLOX approach to address the critical low-light and small object detection challenges on tunnel construction sites in Section 3. Specifically, to tackle the small object issue, we modify the original YOLOX architecture by adding neural network modules for feature extraction and small object prediction. To enhance detection performance in low-light environments, we select a data augmentation algorithm after reviewing a series of traditional computer vision and deep learning augmentation methods. Furthermore, we construct a novel dataset by collecting real tunnel construction images and define the criteria for multiscale and different light condition objects in Section 4. Experiments and results are listed in Section 5, and the discussion and conclusions follow in Sections 6 and 7.

Received: December 27, 2022. Revised: May 7, 2023. Accepted: May 11, 2023
© The Author(s) 2023. Published by Oxford University Press on behalf of the Society for Computational Design and Engineering. This is an Open Access article
distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits
non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact
journals.permissions@oup.com

Figure 1: Examples of (a) tunnel and (b) outdoor construction sites.

2. Related Works

Traditional approaches for detecting PPE usually adopt sensors such as global position systems (GPSs; Zhang et al., 2015) and radio frequency identification (Kelm et al., 2013; Zhang et al., 2019). Sensor-based PPE detection is not affected by many external factors, such as weather, but it usually requires users to wear additional equipment for data communication (e.g., sending GPS signals), and long-term maintenance is needed. Another direction is the vision-based approach, which uses cameras to capture images on construction sites and applies image-processing techniques to detect PPE objects. Many studies use traditional computer vision algorithms. For example, Rubaiyat et al. (2016) adopted the Circle Hough Transform algorithm to detect helmets, and Shrestha et al. (2015) used edge detection for the head, face, and helmet. Instead of recognizing shapes, Du et al. (2011) presented a color-based system that sets color thresholds for different PPE classes. Rather than using computer vision algorithms, many deep learning projects are described in Section 2.1. These experiments typically focus on the outdoor environment without considering light conditions, like underground tunnels. Therefore, Section 2.2 reviews the techniques for enhancing low-light images. Finally, research gaps and objectives are listed in Section 2.3.

2.1. Deep learning for PPE detection

Deep learning techniques for detecting PPE have been implemented in various industries and scenarios. For instance, Choo et al. (2023) developed an automated system to identify electronic workers at heights to prevent falls from heights, while Ke et al. (2022) and Karlsson et al. (2022) proposed lightweight PPE algorithms for detecting PPE in industrial environments. Similarly, these techniques have also been applied to university laboratories to enhance student safety (Ali et al., 2022). There are also numerous studies on using deep learning to detect PPE on construction sites, which is a common scenario where accidents can easily happen, and successful implementation could replace manual monitoring and improve construction site safety (Lee et al., 2023; Wang et al., 2021).

The early-stage research directly applies deep learning models to construction sites without network architecture improvements. Faster R-CNN, for example, was a popular model for detecting helmets (Akbarzadeh et al., 2020; Fang et al., 2018a; Saudi et al., 2020). In a study by Fang et al. (2018a), more than 100 000 images of construction workers were selected from far-field surveillance videos of 25 construction sites, and a faster R-CNN model was trained on the collected images to achieve a precision of 95.7%. However, the dataset is not publicly available. Another study used two faster R-CNN models to predict safety issues, in which one model detected the presence of workers, while the other predicted helmets and vests (Akbarzadeh et al., 2020). Additionally, a single shot multibox detector (SSD) was trained to predict the colors of helmets (Wu et al., 2019). Nath et al. (2020) compared different labeling approaches, where the first approach treated workers and PPE classes separately, while the second regarded a worker with PPE as a class (e.g., a worker with a helmet, a worker with a vest). The experiments found that the second approach had better accuracy, but its scalability was limited when implemented in other scenarios. Wang et al. (2021) designed experiments to compare the performance of three YOLO models – YOLOv3, YOLOv4, and YOLOv5 – for detecting four helmet colors, vests, and workers. They found that YOLOv5x had the highest accuracy, while YOLOv5s had the fastest speed. Additionally, Ferdous and Ahsan (2022) compared the latest YOLOX anchor-free model with previous YOLO models and found that the mean average precision (mAP) increased by three percentage points to 90% when compared with Wang et al. (2021) on the same datasets. Another research study compared the performances of different deep learning models for detecting PPE using the same datasets. Here, the YOLO series models reported the best mAP (53.8%) and the fastest speed (10 FPS) when compared to SSD and faster R-CNN (Wu et al., 2019).

These previously mentioned studies validate the feasibility of applying deep learning for PPE detection, but researchers soon realized that these original models cannot meet various requirements arising from the specific situations of construction sites. Therefore, customized models and techniques are developed for tackling problems such as reducing model size and speeding up processing. To increase efficiency, Fang et al. (2018c) presented an improved faster R-CNN, which shortened the processing time for each image to 0.101 s. In similar research, Xiong & Tang (2021) proposed a pose-guided anchoring framework to detect multiclass PPE in working spaces. They adopted an OpenPose model to predict the skeleton and inputted its results into the part attention regions for PPE classification, showing high accuracy and speed in small images. Another direction is to reduce the size and parameters of models. For instance, Chen et al. (2023a) improved the YOLOv4 algorithm by using a lightweight network called PP-LCNet as the backbone network and applying depthwise separable convolution to reduce the model parameters. Similarly, Chen et al. (2023b) introduced a lightweight architecture based on YOLOv5s with a Bi-FPN network that has fewer parameters and a smaller weight size, while still maintaining similar performance.

To add more functionality to PPE detection, researchers have explored the use of spatial relationships to determine if workers are properly wearing their PPE. For example, Tang et al. (2020) developed a novel human–object interaction recognition method that uses potential worker–PPE box pairs to determine compliance. Meanwhile, Chen and Demachi (2020) used the Euclidean distance between the bounding boxes of hard hats and the neck to identify the existence of helmets on workers. Cheng et al. (2022) designed a mechanism for monitoring worker PPE statuses across multiple cameras with consistent identities. Lastly, Lee et al. (2023) presented a post-processing algorithm that classifies the correlation between workers and PPE into four statuses to help prevent potential accidents. Another approach is to combine PPE detection with trade recognition output to check if workers meet certification constraints. Fang et al. (2018b) demonstrated this by merging PPE detection results with trade recognition output. These techniques and models have the potential to improve safety in various work environments by ensuring workers are properly equipped with PPE.

It is worth noting that despite the availability of open datasets for construction sites, none of them are specifically collected from tunnels. For instance, the SHWD dataset includes over 7000 images, with 3200 of them collected from outdoor construction sites that have sufficient lighting (njvisionpower, 2019). Other datasets, such as GDUT (njvisionpower, 2019), pictor (Wu et al., 2019), and CHV (Wang et al., 2021), are also collected from outdoor construction sites where instances are under natural lighting conditions. Developing low-light datasets will not only facilitate the training and evaluation of PPE detection models for tunnel workers, but it will also help to improve the safety and well-being of workers by providing better tools for monitoring PPE compliance in tunnels.

2.2. Low-light image enhancement

Images captured under poor illumination often exhibit characteristics such as low brightness, low contrast, a narrow gray range, color distortion, and noise (Kim et al., 2014; Wang et al., 2020). This is especially problematic for underground construction sites, where images are captured without natural light and with limited illumination spots, resulting in low-light and uneven illumination. These low-light images can negatively impact human understanding and the performance of image processing algorithms (Wang et al., 2020). To address this issue, a group of algorithms has been developed to improve image quality for better visualization and the performance of subsequent processing. These algorithms are known as low-light data enhancement algorithms and can be divided into two categories: computer vision and deep learning. Computer vision algorithms are model-based, while deep learning techniques are data-driven and use historical data to find latent patterns for predicting unknown samples.

Histogram equalization (HE) algorithms are commonly used computer vision techniques that adjust the output gray levels to spread out the most frequent intensity values and increase the image contrast, making darker details in the image more visible (Acharya & Ray, 2005; Lee et al., 2015). However, this approach uses global HE, which can result in some local areas being too dark or too bright. To address this limitation, researchers developed adaptive histogram equalization (AHE; Kim et al., 1998), which uses local contrast to improve the technique. However, this approach can create noise problems. To overcome this issue, contrast-limited AHE (CLAHE) was developed to avoid the excessive enhancement of image contrast that AHE could produce by limiting local contrast enhancement (Reza, 2004).

The retinex theory mimics the way human eyes perceive color and lightness (Land & McCann, 1971). The visual system processes information during the transmission of visual information, and retinex algorithms aim to reveal essential characteristics of objects by accounting for the intensity and unevenness of light (Wang et al., 2020). These algorithms include single-scale retinex (Jobson et al., 1997), the retinex algorithm with color restoration (MSRCR; Rahman et al., 1997), and MSRCP (Petro et al., 2014). Another approach to low-light image enhancement is the use of Dehaze algorithms (He et al., 2010; Yang & Sun, 2018). These algorithms work by removing the effects of haze, and researchers found that low-light image enhancement is similar to the haze removal process because inverted low-light images can resemble images in hazy lighting conditions (Dong et al., 2011). As a result, algorithms developed for hazy conditions have been adapted and applied to low-light images with promising results.

Machine learning algorithms are gaining attention for low-light image enhancement. For instance, zero-reference deep curve estimation (Zero-DCE) is a lightweight deep network that estimates pixel-wise and high-order curves to adjust the dynamic range of an image. Zero-DCE is versatile and trains and predicts quickly on diverse lighting conditions (Guo et al., 2020). An SNR-aware framework employs short- and long-range operations to enhance images dynamically at the pixel level (Xu et al., 2022b). Moreover, EnlightenGAN is an unsupervised generative adversarial network that trains without paired low- and normal-light image sets. It uses extracted input information to regularize unpaired training and performs well in terms of visual quality and generalization on different real-world images (Jiang et al., 2021). Similarly, an unsupervised decomposition and correction network that does not depend on paired data for training was proposed by Jiang et al. (2022). The network has three sub-networks that handle image decomposition, illumination correction, and noise removal, respectively. In addition, a self-calibrated illumination learning framework was presented to balance efficiency and accuracy for low-light data enhancement while considering flexibility and robustness (Ma et al., 2022).

Although existing algorithms and methodologies have been analyzed and compared for their performance on benchmark datasets by Kim (2022), techniques for enhancing low-light images in construction environments like tunnels are rare. One study adopted traditional computer vision algorithms to enhance tunnel low-light images for the detection of cracks (Biswas et al., 2021) but not PPE. Another study observed the low-light situation on outdoor construction sites at night, but no further technique was applied to ease the illumination effects (Fang et al., 2018a). Additionally, Zeng et al. (2021) applied CLAHE for data enhancement for detecting outdoor equipment, but it lacked comprehensive experiment analysis to explain the selection of this augmentation algorithm. Moreover, the study's outdoor environment scenario differed from the underground tunnel construction site.



Figure 2: Improved YOLOX approach. The CLAHE algorithm is used for enhancing low-light images. The YOLOX architecture is improved with extra
ConvNeXt modules and a fourth YOLOX head for multiscale PPE object detection.

2.3. Research gaps and objectives

Previous studies on using deep learning for PPE detection have shown great performance on outdoor construction sites. However, there is still a lack of suitable models and tools that can be applied to tunnel construction sites, where lighting conditions are different from outdoor construction sites and many small-sized instances are captured in the images. Specifically, there are three gaps that need to be addressed: (i) there are few studies that address the difficulty of detecting low-light and small PPE objects on underground construction sites; (ii) there are no suitable datasets available for low-light multiscale PPE objects; and (iii) there is a lack of a comprehensive and quantitative comparison of low-light data enhancement algorithms under tunnel construction conditions.

Therefore, the objectives of this research are to (i) propose a deep learning pipeline designed for detecting low-light and multiscale PPE instances on tunnel construction sites, (ii) construct a novel dataset for the tunnel low-light multiscale instance scenario, and (iii) comprehensively compare low-light algorithms and select a suitable one for enhancing tunnel construction images.

3. Improved YOLOX Approach

PPE detection in underground construction sites faces challenges due to low-light conditions and multiscale PPE instances. In this study, we adopt a data enhancement algorithm to improve low-light images. We conduct a series of experiments to analyze different augmentation algorithms from both computer vision and deep learning approaches in Section 5.1, ultimately selecting CLAHE for enhancing tunnel images. Additionally, to improve small instance detection performance, we modify the original YOLOX architecture by adding a ConvNeXt block into the backbone and a fourth YOLO prediction head (Section 3.1). In summary, the improved YOLOX approach first uses CLAHE to enhance low-light images and then applies the modified YOLOX architecture to predict multiscale instances, including person, helmet, and vest classes, as depicted in Fig. 2.
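For illustration, the enhancement step can be sketched in a few lines of OpenCV. The clip limit, tile size, the choice of equalizing the LAB luminance channel, and the file names are assumptions made for this sketch, not the exact settings used in this study.

```python
import cv2

def enhance_low_light(image_bgr, clip_limit=2.0, tile_grid_size=(8, 8)):
    """Apply CLAHE to the L channel of a BGR image in LAB colour space."""
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    l_eq = clahe.apply(l)  # contrast amplification is limited per local tile
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)

# Example usage: enhance a tunnel frame before passing it to the detector.
frame = cv2.imread("tunnel_frame.jpg")          # placeholder file name
cv2.imwrite("tunnel_frame_clahe.jpg", enhance_low_light(frame))
```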

3.1. Modified YOLOX architecture

3.1.1. Overview of YOLOX
As an early anchor-free detector, YOLOv1 employs fully connected neural network layers to directly predict bounding boxes on top of the feature extractor. It boasts fast speeds but has a considerable number of localization errors and a relatively low recall rate (Redmon et al., 2016; Tan et al., 2021b). To address these issues, anchor mechanisms from models like faster R-CNN were introduced to YOLOv2 (Redmon & Farhadi, 2017). Anchors are predefined boxes with common object shapes that guide the prediction process. Anchor-based models learn to adjust anchor shapes to fit predicted objects. Experiments in YOLOv2 demonstrated that introducing the anchor mechanism improved the mAP by 4.8% (table 2 in Redmon & Farhadi, 2017). However, using anchors necessitates determining a set of optimal parameters, which are less generalized and increase detection head complexity (Ge et al., 2021).

As a new member of the YOLO family, YOLOX adopts an anchor-free approach to predict bounding boxes directly from the head (Ge et al., 2021). YOLOX reduces the predictions for each location from three to one and predicts position values directly. In comparison to previous anchor-based YOLO models, this anchor-free approach reduces the number of required parameters for anchors that need heuristic tuning and training tricks. As a result, the training and decoding phases are significantly simpler (Ge et al., 2021). Experiments showed that the performance of average precision (AP) increased by 0.9% with the new anchor-free mechanism (table 2 in Ge et al., 2021). Consequently, detection head complexity in YOLOX is substantially reduced, and the model is more generalized with better performance.

Although both YOLOv1 and YOLOX are anchor-free approaches, YOLOX outperforms YOLOv1 after adopting many other practical training techniques. Firstly, YOLOX adopts feature pyramid networks (Lin et al., 2017), which extract features from input images at different scales, similar to a pyramid. The multiscale feature extraction guides the anchor-free approach to predict instances of varying sizes. Another technique, focal loss, addresses the training imbalance of foreground–background classes in one-stage detectors by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples (Lin et al., 2017c).

Another enhancement in YOLOX is the decoupled head. From YOLOv3 to YOLOv5, the detection head had only one branch to classify the class and regress the locations simultaneously, leading to conflicts and negatively impacting the training process (Song et al., 2020). Moreover, analytical experiments demonstrated that using a decoupled head, which separates a coupled head into different components, could improve convergence speed and accuracy (Ge et al., 2021). Based on these findings, YOLOX was designed with a decoupled head that includes a 1 × 1 convolutional layer to reduce the feature channels and two 3 × 3 convolutional layers for the classification and regression tasks, respectively. Additionally, YOLOX incorporates Mosaic (Bochkovskiy et al., 2020) and MixUp (Zhang et al., 2017) into its augmentation strategies. Other techniques, such as multiple positives and SimOTA, were also adopted to enhance performance.

3.1.2. ConvNeXt for deep feature extraction
Recently, vision transformers (ViTs) have outperformed traditional ConvNets as state-of-the-art image classification models (Dosovitskiy et al., 2020), particularly when large models and datasets are considered (Liu et al., 2022). ConvNeXt was proposed to explore the architectural differences between ConvNets and ViTs and to test the limits of a pure ConvNet. It incorporates inverted bottlenecks and large convolution kernels (Sandler et al., 2018), replaces ReLU with GELU (Glorot et al., 2011; Hendrycks & Gimpel, 2016), minimizes the use of activation functions and normalization layers, and substitutes BN (Ioffe & Szegedy, 2015) with LN (Ba et al., 2016). Additionally, it employs a grouped convolution for the 3 × 3 convolutional layer in a bottleneck block to reduce the FLOPs. Despite being lighter, the network expands its width to achieve higher capacity, and experiments have demonstrated that ConvNeXt outperforms ConvNet in terms of performance.

Moreover, researchers have improved their models by incorporating ConvNeXt into the backbone to enhance feature extraction. Such designs have been implemented in various fields, including medicine (Hassanien et al., 2022; Li et al., 2022), for tasks such as image segmentation (Karacan & Yücebas, 2022), classification, object detection (Zhang et al., 2022a), and visual tracking (Xu et al., 2022a). In summary, ConvNeXt's architecture has proven efficient in feature extraction, but it has not yet been applied to PPE detection.

In the original YOLOX network, DarkNet is used as the backbone. To enhance visual feature extraction, we opt for the lightweight and efficient ConvNeXt as an addition to the original network. In the improved YOLOX, we incorporate three ConvNeXt blocks immediately after the basic backbone to boost the model's feature extraction capacity. Each ConvNeXt block replaces ReLU with GELU and employs LN instead of BN. The complete architecture is depicted in Fig. 2. Although the addition of ConvNeXt may slightly increase storage requirements and inference time, it can contribute to more effective feature extraction, which, in turn, aids multiscale object prediction.

3.1.3. Fourth YOLOX head for small object prediction
With the decoupled head, YOLOX improves the capacity for predicting bounding boxes and classes separately with better performance. However, it does not directly address the small object detection problem, which is crucial for improving PPE detection in tunnel construction sites. In the current YOLOX network, down-sampling layers used for feature extraction may inadvertently eliminate small object features. For example, an 8 × 8 pixel object becomes a single point in the feature map after down-sampling three times, and even smaller objects simply vanish during the process.

To enhance our model's performance for small objects, we introduce an extra prediction head connected to the CspLayer (160, 160, 128), as illustrated in Fig. 2. This additional prediction head results in a larger feature map and retains more small object features that might be overlooked in the previous network for subsequent processing. However, adding more prediction heads (e.g., a fifth one) to the backbone's shallow part may increase the input size, requiring more computational power. Moreover, it would contain little semantic information due to insufficient feature extraction. Thus, the addition of a fourth prediction head represents a balanced trade-off decision (Zeng et al., 2021; Zhu et al., 2021). The modified YOLOX architecture is presented in Fig. 3.
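For reference, the following is a minimal PyTorch sketch of a ConvNeXt-style block of the kind described above (7 × 7 depthwise convolution, LayerNorm, inverted bottleneck with GELU, residual connection). The channel width, expansion ratio, and toy input shape are illustrative assumptions and do not reproduce the exact blocks inserted into the improved YOLOX.

```python
import torch
from torch import nn

class ConvNeXtBlock(nn.Module):
    """ConvNeXt-style block: depthwise 7x7 conv -> LayerNorm -> 1x1 expand
    -> GELU -> 1x1 project, wrapped in a residual connection."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)              # normalises over channels
        self.pwconv1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                  # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                  # back to (N, C, H, W)
        return shortcut + x

# Three blocks stacked after a backbone stage; the 128-channel, 160x160
# feature map is an assumed example shape.
features = torch.randn(1, 128, 160, 160)
blocks = nn.Sequential(*[ConvNeXtBlock(128) for _ in range(3)])
print(blocks(features).shape)                      # torch.Size([1, 128, 160, 160])
```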

3.2. Metrics for performance evaluation

In order to evaluate the correctness of our model for general classes, we selected the following metrics: precision, recall, F1, AP, and mAP. Precision measures the model's ability to identify only relevant objects. It is the ratio of true positives (correctly identified objects) to the sum of true positives and false positives (irrelevant objects identified as relevant):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (1)$$



Figure 3: Modified YOLOX architecture.

Recall is the ratio of true positives to the sum of true positives and false negatives:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (2)$$

The F1 score takes into account both precision and recall, providing a single metric that represents the model's balanced ability for prediction:

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (3)$$

AP measures the overall correctness performance for a specific class. It is calculated as the area under the precision–recall (PR) curve and provides an average value of the model's precision over the entire range of recall levels. Higher AP values indicate better performance:

$$AP = \int_{0}^{1} P(R)\, dR \quad (4)$$

where true positive TP is the number of correct predictions, false positive FP is the number of wrong predictions, and false negative FN is the number of undetected but ground-truth instances. PR curves are plotted by calculating the precision and recall values of the accumulated TP or FP.

mAP is the average of the AP values for all classes in the dataset. It gives an overall indication of the model's performance across multiple object classes. A higher mAP value indicates better detection performance across all classes:

$$mAP = \frac{\sum AP(\mathrm{class})}{n} \quad (5)$$

where AP(class) is the AP when considering classes (e.g., AP for helmet) and n is the number of classes.

In this study, the effects of the network adjustment and data enhancement algorithms need to be analyzed quantitatively, helping us understand the performance of the selected strategies. However, the standard performance evaluation criteria usually focus on general classes, and they cannot be used directly to evaluate model performance in different light conditions and object sizes. Therefore, we adjust the mAP calculation equation as follows:

$$mAP_{light} = \frac{\sum AP(\mathrm{class}, \mathrm{light})}{n} \quad (6)$$

where AP(class, light) is the AP for a specific class under a specific lighting condition, such as the low-light helmet class, and mAP_light is the mAP that considers both classes and lighting conditions.

$$mAP_{size} = \frac{\sum AP(\mathrm{class}, \mathrm{size})}{n} \quad (7)$$

where AP(class, size) is the AP for a specific class with a specific size, like the small helmet class, and mAP_size is the mAP that considers both classes and instance sizes.

For efficiency, we adopt FPS as shown in equation (8):

$$FPS = \frac{N_{image}}{T_{total}} \quad (8)$$

where N_image is the number of images and T_total is the processing time of the model in seconds.
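To make equations (4) and (5) concrete, the sketch below approximates AP as the area under an accumulated PR curve and averages per-class APs into mAP; the same averaging restricted to one lighting condition or size gives mAP_light and mAP_size. The IoU matching and interpolation details used by standard evaluators are omitted, and the toy numbers are purely illustrative.

```python
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """Approximate AP (equation 4) as the area under the PR curve, whose points
    come from sweeping the confidence threshold and accumulating TP/FP."""
    order = np.argsort(recall)
    return float(np.trapz(precision[order], recall[order]))

def mean_average_precision(ap_per_class: dict) -> float:
    """mAP (equation 5): the mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Toy three-point PR curve for one class, plus assumed APs for the others.
precision = np.array([1.00, 0.80, 0.60])
recall = np.array([0.20, 0.50, 0.90])
ap_helmet = average_precision(precision, recall)
print(mean_average_precision({"helmet": ap_helmet, "person": 0.85, "vest": 0.75}))
```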
4. PPE Dark Dataset

While the majority of public PPE datasets focus on outdoor construction sites, there are limited datasets that cover tunnel construction sites. A novel dataset, PPE dark, was constructed by collecting real tunnel construction background images with multiscale PPE instances. PPE dark contains 2371 images, of which 1041 are from tunnel construction sites (Fig. 1a) captured by mobile devices, and 1330 are regular outdoor images (Fig. 1b) taken from the CHV dataset (Wang et al., 2021). The PPE dark dataset contains not only dark images but also those taken in normal lighting conditions. This variety aligns with our objective of developing a model that is applicable in both low- and normal-light conditions.

The PPE dark dataset includes three categories: helmet, vest, and person. The vest and helmet represent essential protective equipment for workers, while the person category can be combined with vest and helmet detection results to determine whether workers are wearing the necessary PPE. All collected images were annotated using the graphical image annotation software LabelImg (Tzutalin, 2015). To minimize manual errors, an additional researcher verified each labeled image after the initial annotation. The labeling process adhered to the VOC2011 annotation guidelines (VOC, 2016). A random 80% of the images were selected as the training dataset (1904 images), 10% as the validation dataset (238 images), and the remaining 10% as the test dataset (238 images). The test dataset was exclusively used for evaluating model performance after training.

Figure 4: Diagrams of different size objects.

Table 1: Statistics of the sizes of PPE instances.

Class    Size     Train   Validation   Test   Total
Helmet   Small    3503    353          424    4280
         Medium   1459    172          168    1799
         Large    406     49           53     508
Person   Small    1037    134          158    1329
         Medium   2411    232          280    2923
         Large    2705    333          315    3353
Vest     Small    962     88           155    1205
         Medium   979     99           111    1189
         Large    672     86           77     835

4.1. Multiscale definition and statistics

In the computer vision domain, a common definition of instance scales is derived from the COCO dataset (Lin et al., 2014). According to this definition, a large instance has an area larger than 96² pixels after resizing the image to 640 × 480 pixels, while a small instance's area is less than 32² pixels. Consequently, a medium instance's area falls between 32² and 96² pixels. Although the COCO dataset provides a generally accepted definition, the image size in our PPE dark dataset is larger, at 640 × 640 pixels. Therefore, we cannot directly apply the scale threshold values from the COCO dataset.

The key to dividing sizes is the area percentage of the instance relative to the whole image. With this in mind, we recalculated the scale threshold values for the PPE dark dataset. Specifically, the occupation percentage of a small instance in the COCO dataset is less than 32² ÷ (640 × 480) = 0.0033333. When applied to the PPE dark dataset, the threshold for small instances should be less than 0.0033333 × (640 × 640) ≈ 37 × 37 pixels. Similarly, a large instance in the PPE dark dataset should be larger than 111² pixels (96² ÷ (640 × 480) × (640 × 640)). Ultimately, a medium instance falls between these two thresholds. Figure 4 displays examples of different scale instances in the PPE dark dataset.

Table 1 provides statistics for different sizes of instances in the PPE dark dataset. For the helmet class, small instances account for 65%, constituting the majority. Only 8% of the helmets are considered large, with the remaining 27% categorized as medium. This demonstrates that most helmet instances are small, and the distribution is uneven. For the person class, both medium and large instances comprise around 40% each, while only 17% of person instances are small. Additionally, the vest class consists of 40% small and 40% medium instances, with only a few large ones. In general, a total of 6814 small objects make up 40% of the multiscale instances.
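The threshold rescaling used in Section 4.1 can be written as a small helper. The COCO reference resolution (640 × 480), the PPE dark resolution (640 × 640), and the 32²/96² area limits come from the text above, while the function itself is only an illustrative sketch.

```python
def rescale_size_thresholds(target=(640, 640), coco=(640, 480),
                            small_area=32 ** 2, large_area=96 ** 2):
    """Keep the same area fraction of the whole image when moving the COCO
    small/large thresholds to a different image resolution (Section 4.1)."""
    coco_area = coco[0] * coco[1]
    target_area = target[0] * target[1]
    small_side = (small_area / coco_area * target_area) ** 0.5
    large_side = (large_area / coco_area * target_area) ** 0.5
    return small_side, large_side

print(rescale_size_thresholds())  # about (36.95, 110.85): 37- and 111-pixel sides
```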

Table 2: Statistics of PPE instances under different light conditions.

Class    Light condition   Train   Validation   Test   Total
Helmet   Low               2463    253          333    3049
         Normal            2905    321          312    3538
Person   Low               2952    353          413    3718
         Normal            3201    346          340    3887
Vest     Low               1191    120          207    1518
         Normal            1422    153          136    1711

Figure 5: Statistics of the PPE dark dataset.

4.2. Low-light instances

From a human perspective, there is a clear visual difference between tunnel construction sites and outdoor normal-light construction environments, as illustrated in Fig. 1. With the exception of entrances, tunnel construction sites receive minimal natural light, relying primarily on artificial illumination, which results in low-light conditions. Therefore, in the PPE dark dataset, all images collected from tunnel construction sites are classified as low-light, while all images from outdoor construction environments are considered normal-light. Table 5 reveals a 15% to 20% AP difference between low- and normal-light detection. This significant gap further supports the validity of the rule for determining light conditions.

Table 2 calculates the number of instances under different light conditions. For all three classes, the instances from the two lighting conditions are nearly equal. In total, the PPE dark dataset comprises 8285 low-light instances captured from tunnels and 9136 instances in normal conditions. Instances from different lighting conditions are evenly distributed in the PPE dark dataset, ensuring that the trained model is robust enough for application across a wide range of lighting environments.

4.3. Statistics summary

The statistics, taking into account classes, sizes, and lighting conditions, are displayed in Fig. 5. The numbers of helmets and people are 6587 and 7605, respectively, while the number of vests is 3229, roughly half that of helmets or people. This result highlights that most workers wear helmets, but only some wear safety vests. When considering lighting conditions and sizes together, most large instances are collected from outdoors, and most small instances come from low-light tunnel construction sites. In summary, the dataset includes 6814 small instances, representing 40% of the multiscale instances, and 8285 low-light instances, accounting for 48% of the light conditions. The statistical results emphasize the difficulties in detecting small and low-light objects in tunnel construction sites.

5. Experiment and Results

Thirteen experiments were conducted to evaluate the improved performance in detecting multiscale and low-light objects, as detailed in Table 3. Experiments 2–9 tested low-light enhancement algorithms with different mechanisms. Experiments 2, 3, and 4 utilized pure computer vision algorithms, while 5 and 6 were deep learning-based. Experiments 7, 8, and 9 employed combinations of computer vision and deep learning algorithms. Furthermore, Experiments 10–12 assessed the modifications to the YOLOX architecture, including ConvNeXt, the fourth YOLOX head, and both. Experiment 1 served as a baseline, testing the original YOLOX model, and Experiment 13 showcased the performance when implementing all strategies.

Experiments were conducted on a cloud server equipped with two RTX A4000 graphics cards (16 GB of memory per card) and four AMD 7502 processors. PyTorch (version 1.11) with CUDA (version 11.3) was selected as the training platform.

Table 3: Experiment plan.

No.   Model   Data augmentation      Architecture modification   Goals
1     YOLOX   –                      –                           Baseline
2     YOLOX   CLAHE                  –                           Low-light enhancement
3     YOLOX   Dehaze                 –
4     YOLOX   MSRCP                  –
5     YOLOX   Zero-DCE               –
6     YOLOX   EnlightenGAN           –
7     YOLOX   Dehaze + CLAHE         –
8     YOLOX   Zero-DCE + CLAHE       –
9     YOLOX   EnlightenGAN + CLAHE   –
10    YOLOX   –                      ConvNeXt                    Multiscale detection
11    YOLOX   –                      Fourth head
12    YOLOX   –                      ConvNeXt + Fourth head
13    YOLOX   CLAHE                  ConvNeXt + Fourth head      All strategies



Table 4: Hyperparameters for training experiments.

Parameter       Search values                  Freeze training   Unfreeze training
Batch size      2, 4, 8, 16, 32                8                 4
Learning rate   10^-x, x ∈ [1, 2, 3, 4, 5]     10^-3             10^-4

As mentioned, 80% of the data was used for training, 10% for validation, and 10% for testing. The test dataset was solely employed for performance evaluation. Each training session consisted of a freezing training phase followed by an unfreezing phase. The freezing training loaded a pre-trained weight file without changing parameters in the backbone. During the unfreezing phase, all parameters across the entire model could be adjusted according to the loss. Except for training epochs, all experiments adopted the same hyperparameters, with the search process outlined in Table 4.

Loading a weight that has been trained on large public datasets is also known as transfer learning, which can enable the model to quickly obtain the proper training direction. During training, the weight file with the lowest loss value on the validation dataset was retained and used to test the model performance. Experiments involving YOLOX architecture modifications had larger freezing and unfreezing epochs because the adjusted parts lacked pre-trained weights, necessitating a longer training period to achieve the desired result. Consequently, the first nine experiments had 75 epochs each for the freezing and unfreezing phases, while Experiments 10–13 had 100 epochs for each phase.

Figure 6 illustrates the freezing–unfreezing training process by displaying the training loss, validation loss, and validation mAP. In both experiments, the loss curves for the training and validation datasets decrease as the number of epochs increases. A noticeable fluctuation occurs when transitioning from the freezing to the unfreezing phase, but the trend resumes a downward and smooth direction. The loss and mAP curves provide insights into the training stages.
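A minimal PyTorch-style sketch of the freeze–unfreeze schedule described above is shown below. The toy detector, the attribute name `backbone`, and the 10^-3/10^-4 learning rates (our reading of Table 4) are assumptions; the actual training code of this study is not reproduced here.

```python
import torch
from torch import nn

def set_backbone_trainable(model: nn.Module, trainable: bool) -> None:
    """Freeze (or unfreeze) every parameter of the backbone submodule."""
    for param in model.backbone.parameters():
        param.requires_grad = trainable

class TinyDetector(nn.Module):
    """Toy stand-in for a detector with a pre-trained backbone and a head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.GELU())
        self.head = nn.Conv2d(16, 8, 1)

    def forward(self, x):
        return self.head(self.backbone(x))

model = TinyDetector()

# Freeze phase: backbone weights stay fixed, only the remaining layers train.
set_backbone_trainable(model, False)
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)

# ... after the freeze epochs, unfreeze everything and lower the learning rate.
set_backbone_trainable(model, True)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
```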
5.1. Results of low-light enhancement algorithms

In this section, the performance of low-light data enhancement algorithms is compared both qualitatively and quantitatively. Three traditional computer vision algorithms (CLAHE, Dehaze, and MSRCP), two deep learning algorithms (EnlightenGAN and Zero-DCE), and three combinations (CLAHE with Dehaze, Zero-DCE, and EnlightenGAN, respectively) are selected for comparison. The enhanced images are presented in Fig. 7.

According to qualitative observations, the images after Dehaze were similar to the originals. Images from CLAHE were brighter and showed more details than the originals, e.g., the pipes in the second-column images. In addition, the images from MSRCP were more luminous when compared with the other two computer vision algorithms. For example, the ground in the fourth-column images was fully illuminated without any shadows, while their colors were distorted from subjective observations. Images from the two deep learning-based algorithms were much brighter than the originals. However, the Zero-DCE images became blurrier and grayer. Images from EnlightenGAN were sharper than the others, with visible shadow edges. The last three combinations with CLAHE performed similarly to the single ones.

Table 5 lists the quantitative correctness of the algorithms. For a single algorithm, CLAHE had the highest mAP (84.38%), 1.13% higher than the benchmark YOLOX model, and it also had the highest AP for the low-light helmet class (80.98%). The mAP dropped slightly when adopting MSRCP (−0.22%) and EnlightenGAN (−0.34%). For the combinations, EnlightenGAN + CLAHE yielded the best mAP (84.57%) and had the three highest APs. On the other hand, the AP for low-light classes was between 65% and 82%, while the normal-light classes' APs were much higher at around 90%. Even after adopting the augmentation techniques, the gap between different lighting conditions cannot be ignored. The results illustrate that the applied augmentation can raise the detection performance in low-light conditions, while room for further improvements remains.

Figure 6: Training loss curves.




Figure 7: Enhanced images by different algorithms and their combinations. The green ones are original images. Orange ones are computer vision
algorithms. Blue ones are deep learning algorithms. Purple ones are combinations.

Table 5: Correctness of low-light data enhancement algorithms.

Category          Algorithm              mAP_light   Helmet, LL   Helmet, NL   Person, LL   Person, NL   Vest, LL   Vest, NL
Benchmark         None                   83.42%      78.77%       92.35%       80.14%       92.58%       66.25%     90.43%
Computer vision   CLAHE                  84.38%      80.98%       94.90%       81.28%       92.18%       70.28%     86.68%
                  Dehaze                 84.28%      77.98%       94.93%       81.28%       91.94%       69.95%     89.59%
                  MSRCP                  83.20%      78.24%       94.29%       80.38%       91.65%       66.31%     88.31%
Deep learning     Zero-DCE               83.33%      77.63%       93.13%       81.41%       92.52%       68.46%     86.81%
                  EnlightenGAN           83.08%      80.08%       94.33%       80.31%       90.90%       63.44%     89.40%
Combination       Dehaze + CLAHE         83.83%      77.91%       94.57%       81.62%       93.34%       64.66%     90.88%
                  Zero-DCE + CLAHE       83.95%      79.55%       94.44%       81.78%       91.63%       67.15%     89.12%
                  EnlightenGAN + CLAHE   84.57%      79.20%       93.81%       81.96%       93.34%       67.19%     91.90%

Note. The class columns give AP values; LL means low-light; NL means normal-light.

Table 6: Speed of low-light data enhancement algorithms.

Algorithm              Process time (s)*   FPS     Platform
CLAHE                  7.21                33.0    CPU
Dehaze                 1242.40             0.2     CPU
MSRCP                  3948.33             0.1     CPU
Zero-DCE               2.10                113.5   GPU
EnlightenGAN           73.27               3.2     GPU
Dehaze + CLAHE         1249.65             0.2     CPU
Zero-DCE + CLAHE       9.31                25.6    GPU + CPU
EnlightenGAN + CLAHE   80.53               3.0     GPU + CPU

Note. *Processed on the test dataset, containing 238 images.

The processing speed for each algorithm is presented in Table 6. The three computer vision algorithms only used CPUs, and the deep learning algorithms were placed on the GPU platform. Among the CPU-based algorithms, CLAHE had the fastest speed at 33 FPS, indicating that it could process 33 images per second. The other two computer vision algorithms were slower, at less than 1 FPS. For the deep learning algorithms, Zero-DCE held a faster speed (113.5 FPS) because of its light network architecture (Guo et al., 2020). The combination of EnlightenGAN and CLAHE achieved an average speed of 3 FPS, which was 11 times slower than adopting the single CLAHE.

The single CLAHE had a more balanced performance. Comparing it to the other single algorithms, CLAHE had the best mAP and the second-best performance when compared with all combinations, only 0.19 points less than that of the best-performing combination (EnlightenGAN + CLAHE). This difference can be ignored, especially given that CLAHE is 11 times faster than the combination while using only CPU computation. Based on its high correctness and efficient processing speed, we selected the CLAHE algorithm as the low-light data enhancement technique for this study.

5.2. Results of architecture modification for multiscale detection

The results of all explored architecture modifications are presented in Table 7. Specifically, the No. 1 and No. 2 modifications demonstrate that the ConvNeXt and the fourth head can improve the mAP_size to 79.29% (+1.86%) and 78.43% (+1.00%), respectively. The adoption of both strategies achieved the best mAP_size at 79.84%, which is 2.41% higher than that of the original YOLOX model. This combination also exhibited the highest APs for medium helmets (92.99%), small helmets (81.87%), and small vests (61.96%).

Additionally, we explored three other strategies that did not contribute positively to the performance. The No. 4 modification inserted the ConvNeXt module into the beginning of each YOLO head to extract more features. However, the results showed that this modification decreased the mAP_size to 76.00%, a decline of 1.43%. Moreover, inspired by Wu et al. (2021), we split the YOLO head into three branches for predicting the type of objects, locations, and certainties, respectively. The final performance decreased by 1.84%. Another modification, No. 6, replaced the entire YOLOX backbone with ConvNeXt modules, resulting in an mAP_size of 65.82%, a decrease of 11.61%.
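The small-object gains in Table 7 follow from the higher-resolution feature map of the added head (Section 3.1.3): an 8 × 8 pixel object collapses to a single cell at stride 8 but still spans two cells per side at stride 4. A rough sketch of that arithmetic is given below; the stride values are inferred from the feature-map sizes in Fig. 2 and should be read as assumptions.

```python
def cells_spanned(object_px: int, stride: int) -> int:
    """Feature-map cells (per side) covered by an object of object_px pixels."""
    return max(1, object_px // stride)

# Original YOLOX heads predict at strides 8/16/32; the added head works at stride 4.
for stride in (4, 8, 16, 32):
    print(f"stride {stride:2d}: an 8x8-pixel object spans "
          f"{cells_spanned(8, stride)} cell(s) per side")
```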

Table 7: Correctness of architecture modifications.

No.  Modification                       mAP_size           Helmet, L  Helmet, M  Helmet, S  Person, L  Person, M  Person, S  Vest, L  Vest, M  Vest, S
0    YOLOX                              77.43%             86.87%     91.04%     75.12%     89.57%     82.00%     53.94%     89.62%   81.27%   47.45%
1    +ConvNeXt                          79.29% (+1.86%)    90.73%     91.93%     79.69%     90.25%     84.27%     56.67%     86.54%   83.59%   49.98%
2    +Fourth head                       78.43% (+1.00%)    86.38%     91.18%     80.29%     88.41%     81.40%     54.36%     90.22%   77.92%   55.67%
3    +ConvNeXt + Fourth head            79.84% (+2.41%)    88.27%     92.99%     81.87%     88.83%     80.64%     54.20%     87.67%   82.09%   61.96%
4    +Add ConvNeXt to YOLO head         76.00% (−1.43%)    90.71%     89.96%     72.93%     91.52%     82.87%     50.69%     86.98%   78.84%   39.51%
5    +Split YOLO head                   75.59% (−1.84%)    82.49%     88.32%     75.45%     91.37%     82.65%     50.93%     85.63%   80.28%   43.17%
6    +Replace backbone with ConvNeXt    65.82% (−11.61%)   84.11%     88.09%     60.71%     87.43%     65.14%     31.86%     80.76%   62.68%   31.56%

Note. The class columns give AP values. L: Large; M: Medium; S: Small.



By adopting the effective strategies (ConvNeXt and the fourth gest that a detector with an FPS greater than 13 can be considered
head), the performance of small object detection was also en- real-time (Redmon & Angelova, 2015). Other researchers propose
hanced. Its AP for small helmets increased by 6.75% to 81.87%, that FPS above 30 should be regarded as real-time detection (Red-
and its AP for small vests rose by 14.51%. The AP for small ob- mon et al., 2016; Tan et al., 2021a). FPS is also affected by hard-
jects increased by an average of 7.17%. On the other hand, the ware, and the value could theoretically improve with more pow-
modifications slightly improved the performance for medium erful equipment. Although the improved YOLOX model is slower
and large-sized instances. This also indicates that the origi- than the original, its processing speed still meets the general re-
nal YOLOX already possesses the ability to detect medium and quirements for real-time detection. Furthermore, the speed could
large instances, and the modifications adopted in this study can increase with more advanced hardware if necessary.
further improve the performance on small objects, providing Considering the trade-off between accuracy and speed, we be-
a more balanced and accurate capability for multiscale object lieve that prioritizing correctness is more crucial, especially when
detection. the model’s application is related to safety concerns. The higher
the model’s ability to predict objects accurately, the greater the
5.3. Results of the improved YOLOX approach chances of preventing worker injuries or even fatalities. Currently,
the improved YOLOX has the highest correctness and achieves the
The results of the previous experiments revealed two key find-

Downloaded from https://academic.oup.com/jcde/article/10/3/1158/7177527 by guest on 16 July 2023


general real-time performance. And if faster speed is necessary,
ings. Firstly, CLAHE is a well-balanced algorithm for enhancing
more advanced hardware can be employed during implementa-
images in low-light conditions. Secondly, using ConvNeXt and a
tion to solve this problem. Therefore, we think the strategies of
fourth head can significantly enhance detection performance for
CLAHE data enhancement, ConvNeXt feature extraction, and the
multiscale objects. Based on these findings, we have developed an
fourth head for multiscale object prediction are practical and nec-
improved YOLOX approach by incorporating all these strategies.
essary.
Figure 8 showcases detection examples from the improved YOLOX approach. We have selected six examples that highlight various detection challenges, such as multiscale objects in samples (a, c, and e), small objects in sample (b), and different gestures in samples (c, e, and f). All of these images were captured from a relatively far distance and under challenging lighting conditions. Nevertheless, the results show that the improved YOLOX is able to detect PPE instances with multiple scales, low-light instances, and different gestures and locations.

While the improved YOLOX approach is robust, there are certain scenarios where it may struggle to detect objects. To illustrate this, we have included three error samples in Fig. 8. For example, in Fig. 8g, the worker's helmet is not detected. This is likely because the worker is positioned far away from the camera and the helmet is relatively small compared to the rest of the image. Moreover, due to poor lighting conditions, the yellow helmet appears grey like the background, making it difficult for the model to detect. Similarly, in Fig. 8i, a worker is bending on the ground wearing a black shirt, which blends into the background, making it challenging for the model to detect. These examples demonstrate that YOLOX may struggle in very complex scenarios where the objects are very small or their color is similar to the background.

Table 8 presents the quantitative results for each strategy. CLAHE improves overall performance by 0.87%, while ConvNeXt and the fourth head increase the mAP by 1.34% and 2.01%, respectively. Combining these two modifications raises the mAP by 3.34%. Furthermore, the improved YOLOX model, which incorporates CLAHE, ConvNeXt, and the fourth head, reaches an mAP of 86.49%, an increase of 4.23% compared to the original YOLOX model. The detection performance of the improved YOLOX for each class is also relatively balanced. For instance, it achieves the highest AP for helmets at 90.33%, while maintaining performance above 82% for the person and vest classes. Overall, the employed strategies effectively improve the detection performance for PPE objects on tunnel construction sites.
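As a quick reminder of how the headline figure is obtained, the mAP reported here is the mean of the three per-class APs. The small check below uses the improved-YOLOX row of Table 8 and is purely illustrative:

```python
ap_per_class = {"helmet": 90.33, "person": 87.03, "vest": 82.12}  # improved YOLOX row of Table 8
map_value = sum(ap_per_class.values()) / len(ap_per_class)
print(f"mAP = {map_value:.2f}%")                                   # prints "mAP = 86.49%"
```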
Regarding speed, the original YOLOX has the fastest inference speed at 42.65 FPS. With the addition of data augmentation and architecture improvements, the improved YOLOX model's processing speed drops to 22.13 FPS, slower than the original YOLOX. Real-time detection is, however, a relatively loosely defined concept, and researchers suggest different thresholds for it; if a higher frame rate is required, more advanced hardware can be employed during implementation to solve this problem. Therefore, we think the strategies of CLAHE data enhancement, ConvNeXt feature extraction, and the fourth head for multiscale object prediction are practical and necessary.

Table 9 presents another performance comparison between the original model and the improved YOLOX. As mentioned earlier, precision is the fraction of relevant instances among all predictions, and recall is the fraction of relevant instances that were retrieved. F1 can evaluate the model's robustness by considering both recall and precision. From Table 9, the improved YOLOX had better performance under all of these criteria. Therefore, the improved YOLOX was more accurate and robust in predicting instances than the original YOLOX model.
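For reference, the metrics in Table 9 follow the standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\]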


Figure 8: Detection examples from the improved YOLOX approach.



Table 8: Effects of adopted strategies.

Models                                            mAP       AP (Helmet)   AP (Person)   AP (Vest)   FPS
YOLOX                                             82.26%    85.31%        85.82%        75.65%      42.65
+CLAHE                                            83.13%    87.74%        86.23%        75.41%      33.00
+ConvNeXt                                         83.60%    87.89%        87.57%        75.35%      32.19
+4 heads                                          84.27%    88.61%        85.23%        78.98%      30.72
+ConvNeXt + 4 heads                               85.60%    89.39%        86.01%        81.40%      27.61
+CLAHE + ConvNeXt + 4 heads (improved YOLOX)      86.49%    90.33%        87.03%        82.12%      22.13

Table 9: Performance comparison between YOLOX and the improved YOLOX.

Metric       Model             Helmet    Person    Vest
F1           YOLOX             0.83      0.83      0.72
             Improved YOLOX    0.86      0.84      0.80
Recall       YOLOX             74.57%    75.43%    60.93%
             Improved YOLOX    78.14%    79.28%    72.30%
Precision    YOLOX             94.13%    93.42%    89.32%
             Improved YOLOX    94.92%    89.64%    90.18%
AP           YOLOX             85.31%    85.82%    75.65%
             Improved YOLOX    90.33%    87.03%    82.12%

5.4. Comparison with the state-of-the-art object detection models

In order to evaluate the contributions of the improved YOLOX model, we conducted a comprehensive comparison with state-of-the-art object detection models, including YOLOv3x, YOLOv4x, YOLOv7x, SSD, faster R-CNN, and the original YOLOX with an X size version. For the YOLOv3x and YOLOv4x models, the image size of the dataset was adjusted to 416 × 416 and 640 × 640 pixels, respectively. For the faster R-CNN models, we replaced the backbones with ResNet and VGG. The training process for the selected comparison models consisted of two stages: a freezing stage with 75 epochs and an unfreezing stage with 75 epochs. Due to the more complex architecture of the improved YOLOX, we increased the training process to 100 freezing and 100 unfreezing epochs.

The comparison results are displayed in Table 10. Overall, the enhanced YOLOX model achieved the highest individual AP for each class and the highest mAP of 86.49% in general, which further validates its detection capabilities. Additionally, the detection performance of the YOLO series models was enhanced by increasing the size of the training images. For instance, when the training image size was increased from 416 × 416 to 640 × 640 pixels, the mAP of YOLOv4 improved from 62% to 72%. We also included YOLOv7x, whose mAP was 1.9% lower than that of the enhanced YOLOX. For the faster R-CNN series models, altering the backbone did not significantly impact performance. Furthermore, the SSD model performed the worst, achieving only an mAP of 50.21%.

In terms of efficiency, the YOLOv7x model reached the fastest speed with 55 FPS. The improved YOLOX operated at a speed of 22 FPS, which was slower than both the original YOLOX (X) and the YOLOv7x but faster than all other detectors. Still, the processing speed of the improved YOLOX met the requirements for real-time processing.
The speed was affected by the more complex architecture and the inclusion of additional low-light data enhancement algorithms. Moreover, the enhanced YOLOX has relatively more parameters than other YOLO models due to the extra modifications made to the architecture.
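For context, the FPS figures quoted in this comparison correspond to the usual throughput measurement over a set of test images. The schematic timing loop below illustrates the idea; the generic detector callable and the warm-up count are assumptions for illustration, not the actual benchmarking code of this study.

```python
import time

def measure_fps(detector, images, warmup=10):
    """Return the average frames per second of `detector` over `images`."""
    for img in images[:warmup]:
        detector(img)                        # warm-up runs are excluded from timing
    start = time.perf_counter()
    for img in images[warmup:]:
        detector(img)
    elapsed = time.perf_counter() - start
    return (len(images) - warmup) / elapsed
```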
When considering the trade-off between accuracy and speed, the priority for detecting PPE on tunnel construction sites is correctness. Greater prediction accuracy leads to a higher likelihood of preventing worker injuries. Currently, the improved YOLOX outperforms all other selected state-of-the-art models in terms of overall accuracy and has a significantly stronger ability to detect small objects compared to the original YOLOX. Although it is not the fastest detection model, its processing speed meets the real-time threshold. Consequently, we maintain the position that our improved YOLOX is a suitable choice for PPE detection on tunnel construction sites, as it offers the highest correctness together with real-time processing speed.

6. Discussion

6.1. Strengths and weaknesses

This study has several strengths. Firstly, the improved YOLOX approach demonstrates the ability to accurately detect low-light and multiscale objects on tunnel construction sites. Experiments show that it achieves an mAP of 86.94%, which is 4.23% higher than the original YOLOX and outperforms all selected state-of-the-art deep learning models. Secondly, the improved YOLOX approach is a real-time solution, processing 22 images per second while maintaining a balanced performance in terms of both accuracy and efficiency. Thirdly, this study introduces the first low-light and small PPE dataset, consisting of 1041 low-light real tunnel construction images and 1330 normal ones, with small objects accounting for 40% (6814 instances) and low-light objects constituting 48% (8285 instances).

However, this study also faces limitations. Firstly, there is potential for improvement in detecting the small person and small vest classes, which could be caused by the scarcity of corresponding instances in the dataset. Collecting more relevant images for the dataset could alleviate this issue. Secondly, this study is conducted in a laboratory setting. Although the experiments demonstrate the performance of the proposed approach, numerous engineering challenges may arise when implementing it on real construction sites. Possible implementation factors are discussed in Section 6.2.

6.2. Implementation in practice

The improved YOLOX approach demonstrates accurate detection capabilities for low-light and multiscale objects with real-time processing speed. However, the experiments were conducted in laboratories, and several practical factors should be considered when implementing the approach in real-world settings.

First, we should consider the number and location of light sources. Although the lighting conditions in tunnel environments cannot be compared to outdoor construction sites with natural light, it is essential to deploy light sources evenly, ensuring that all areas are well lit and that the illumination is neither too strong nor too weak at any location. Proper lighting is critical to improving the quality of images and contributes to more accurate detection performance.

Second, camera modules should be placed strategically to minimize their impact on nearby operations. The strategic placement of cameras is crucial for maximizing coverage while minimizing any potential obstructions or disruptions to construction activities. Furthermore, integrating the camera modules with existing infrastructure can reduce installation costs and streamline system deployment.

Additionally, establishing a monitoring system should be cost-effective. Factors affecting the final costs include the choice between cloud and local platforms, the number of camera modules, and the GPU equipment. If the entire system is hosted on the cloud, extra processes and equipment are required to upload images captured by the cameras and to receive results from the cloud. Alternatively, an on-site processing module can be deployed to connect directly to the camera modules. The inference speed should be adjusted based on the actual requirement; usually, a faster speed requires more powerful hardware, potentially increasing costs.
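To illustrate the on-site (edge) option, the following sketch reads frames directly from a camera module and runs a detector locally. The detector and result-handling callables, as well as the camera index, are hypothetical placeholders rather than components of any deployed system described in this study.

```python
import cv2

def run_onsite_monitoring(detector, handle_results, camera_index=0):
    """Minimal edge-processing loop: grab frames, run the detector, forward the results."""
    cap = cv2.VideoCapture(camera_index)      # connect directly to the on-site camera module
    try:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            detections = detector(frame)      # e.g. a list of (class_name, confidence, box)
            handle_results(detections)        # e.g. logging, alarms, or a dashboard update
    finally:
        cap.release()
```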
6.3. Future research directions

In the future, the proposed approach can be further expanded by incorporating additional techniques to analyze activities, determine safety statuses, and explore effective alarm methods. Currently, the model can output detection results for on-site workers. We can extend this by tracking workers' movements and analyzing their ongoing activities (Nath et al., 2018). The type of activity could be related to the danger level, allowing the system to enhance monitoring of workers engaged in specific activities. The activity information can also be used for management purposes, such as analyzing working hours and providing further data for managers to adjust tasks accordingly.

Table 10: Comparison with the state-of-the-art models.

Model                     mAP       AP (Helmet)   AP (Person)   AP (Vest)   FPS      Params (M)
YOLOv3x (416)a            71.06%    72.24%        83.44%        57.50%      9.71     61.5
YOLOv3x (640)             76.53%    82.18%        85.28%        62.13%      7.96     61.5
YOLOv4x (416)             62.06%    68.37%        80.70%        37.10%      8.58     63.9
YOLOv4x (640)             71.59%    78.69%        86.01%        50.06%      7.59     63.9
YOLOv7x                   84.50%    88.90%        86.50%        78.10%      55.00    71.3
SSD                       50.21%    43.85%        67.05%        39.72%      7.09     23.9
Faster R-CNN (ResNet)     60.79%    51.03%        77.34%        53.99%      6.68     28.3
Faster R-CNN (VGG)        61.51%    51.69%        77.90%        54.94%      7.72     136.7
YOLOX (x)                 82.26%    85.31%        85.82%        75.65%      42.65    99.1
Improved YOLOX            86.49%    90.33%        87.03%        82.12%      22.13    118.8

Note. a The size of the training images is 416 × 416 pixels.

Moreover, we can organize prediction results as a scene graph and apply automated hazard inference methods to alert workers if they are in danger (Zhang et al., 2022b). This can also be achieved by establishing constraint relationships within the scene graph, such as distance constraints between workers and the tunnel face. A separate mechanism can be employed to continuously collect the latest information on workers' positions and assess whether their status violates these constraints. If any violations are detected, a warning signal is sent correspondingly, helping to maintain worker safety and prevent accidents.
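A minimal sketch of such a distance-constraint check is given below; the worker coordinates, the tunnel-face position, and the 5 m threshold are hypothetical values used only to illustrate the mechanism.

```python
from math import dist

MIN_DISTANCE_M = 5.0    # hypothetical safety threshold, not a value from this study

def violating_workers(worker_positions, tunnel_face_xy, min_distance=MIN_DISTANCE_M):
    """Return the ids of workers standing closer to the tunnel face than allowed."""
    return [worker_id for worker_id, xy in worker_positions.items()
            if dist(xy, tunnel_face_xy) < min_distance]

# Example: positions in metres within the site coordinate system (illustrative numbers).
alerts = violating_workers({"worker_1": (2.0, 1.5), "worker_2": (12.0, 3.0)}, (0.0, 0.0))
print(alerts)           # ['worker_1'] -> send a warning signal for this worker
```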
It is also essential to explore suitable alarm methods for workers. Compared to outdoor construction sites, tunnel environments are enclosed spaces filled with noise from large machines, making it difficult to hear warnings clearly. Consequently, relying on auditory alerts may not be effective. On the other hand, developing a device that notifies workers through vibrations would require them to wear the device, potentially adding an extra burden. Investigating appropriate alarm methods for tunnel workers is a practical future research direction to ensure their safety and well-being.

7. Conclusions

This paper proposes a solution for detecting PPE on underground construction sites, which is challenging due to low-light conditions and the presence of small objects. The proposed solution includes a modified deep learning model and a novel dataset. The main contributions and findings are listed as follows.

(i) We propose an improved YOLOX approach for detecting low-light multiscale objects. We adjusted the YOLOX architecture for deep feature extraction and multiscale object prediction and adopted a data augmentation method to enhance the image light conditions. The improved YOLOX approach achieves the highest correctness compared to the state-of-the-art, together with a real-time processing speed.
(ii) We validated that inserting the ConvNeXt module into the YOLOX backbone and adding an extra prediction head can improve the detection ability for small objects. Experiments show that the AP of the small classes increased by 7.17% on average.
(iii) We provided a performance reference for low-light augmentation, covering eight different algorithms and combinations. CLAHE shows a balanced performance with high correctness and fast speed.
(iv) We constructed a novel low-light multiscale dataset, PPE dark, with 8285 low-light instances and 6814 small instances.

Beyond these contributions, we also proposed a definition of multiscale objects for specific image sizes and constructed criteria for evaluating detection correctness under different light conditions and object sizes. We also discussed the factors involved in industrial implementation as well as future research directions.

The most important contribution of this paper is the improved YOLOX approach for detecting low-light and multiscale objects with high accuracy and real-time processing speed. Additionally, PPE dark is the first dataset in the construction domain focusing on low-light and small objects. This study can inspire further research on how to improve detection performance in similar situations, not limited to the construction industry.

Acknowledgments

This research received financial support from the Hubei Provincial Department of Transportation Science and Technology Project (2020-186-2-5).

Conflict of interest statement

None declared.

References

Acharya T., & Ray A. K. (2005). Image processing: Principles and applications. John Wiley & Sons.
Akbarzadeh M., Zhu Z., & Hammad A. (2020). Nested network for detecting PPE on large construction sites based on frame segmentation. In Proceedings of the Creative Construction e-Conference 2020 (pp. 33–38). Budapest University of Technology and Economics.
Ali L., Alnajjar F., Parambil M. M. A., Younes M. I., Abdelhalim Z. I., & Aljassmi H. (2022). Development of YOLOv5-based real-time smart monitoring system for increasing lab safety awareness in educational institutions. Sensors, 22(22), 8820. https://doi.org/10.3390/s22228820.
Ba J. L., Kiros J. R., & Hinton G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

Biswas D., Nayak I., Choudhury S., Acharjee T., & Mishra M. (2021). Crack detection on inner tunnel surface using image processing. In Progress in Advanced Computing and Intelligent Engineering (pp. 3–12). Springer.
Bochkovskiy A., Wang C.-Y., & Liao H.-Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
Chen S., & Demachi K. (2020). A vision-based approach for ensuring proper use of personal protective equipment (PPE) in decommissioning of Fukushima Daiichi nuclear power station. Applied Sciences, 10, 5129. https://doi.org/10.3390/APP10155129.
Chen J., Deng S., Wang P., Huang X., & Liu Y. (2023a). Lightweight helmet detection algorithm using an improved YOLOv4. Sensors, 23(3), 1256. https://doi.org/10.3390/s23031256.
Chen W., Li C., & Guo H. (2023b). A lightweight face-assisted object detection model for welding helmet use. Expert Systems with Applications, 221, 119764. https://doi.org/10.1016/j.eswa.2023.119764.
Cheng J. P., Wong P. K.-Y., Luo H., Wang M., & Leung P. H. (2022). Vision-based monitoring of site safety compliance based on worker re-identification and personal protective equipment classification. Automation in Construction, 139, 104312. https://doi.org/10.1016/j.autcon.2022.104312.
Choo H., Lee B., Kim H., & Choi B. (2023). Automated detection of construction work at heights and deployment of safety hooks using IMU with a barometer. Automation in Construction, 147, 104714. https://doi.org/10.1016/j.autcon.2022.104714.
Dong X., Wang G., Pang Y., Li W., Wen J., Meng W., & Lu Y. (2011). Fast efficient algorithm for enhancement of low lighting video. In Proceedings of the 2011 IEEE International Conference on Multimedia and Expo (pp. 1–6). IEEE.
Dosovitskiy A., Beyer L., Kolesnikov A., Weissenborn D., Zhai X., Unterthiner T., Dehghani M., Minderer M., Heigold G., Gelly S., Uszkoreit J., & Houlsby N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Du S., Shehata M., & Badawy W. (2011). Hard hat detection in video sequences based on face features, motion and color information. In Proceedings of the 2011 3rd International Conference on Computer Research and Development (Vol. 4, pp. 25–29). IEEE.
Fang W., Ding L., Zhong B., Love P. E., & Luo H. (2018c). Automated detection of workers and heavy equipment on construction sites: A convolutional neural network approach. Advanced Engineering Informatics, 37, 139–149. https://doi.org/10.1016/j.aei.2018.05.003.
Fang Q., Li H., Luo X., Ding L., Luo H., Rose T. M., & An W. (2018a). Detecting non-hardhat-use by a deep learning method from far-field surveillance videos. Automation in Construction, 85, 1–9. https://doi.org/10.1016/j.autcon.2017.09.018.
Fang Q., Li H., Luo X., Ding L., Rose T. M., An W., & Yu Y. (2018b). A deep learning-based method for detecting non-certified work on construction sites. Advanced Engineering Informatics, 35, 56–68. https://doi.org/10.1016/j.aei.2018.01.001.
Ferdous M., & Ahsan S. M. M. (2022). PPE detector: A YOLO-based architecture to detect personal protective equipment (PPE) for construction sites. PeerJ Computer Science, 8, e999. https://doi.org/10.7717/PEERJ-CS.999.
Ge Z., Liu S., Wang F., Li Z., & Sun J. (2021). YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430.
Glorot X., Bordes A., & Bengio Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (Vol. 15, pp. 315–323).
Guo C., Li C., Guo J., Loy C. C., Hou J., Kwong S., & Cong R. (2020). Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1780–1789). IEEE.
Hassanien M. A., Singh V. K., Puig D., & Abdel-Nasser M. (2022). Predicting breast tumor malignancy using deep ConvNeXt radiomics and quality-based score pooling in ultrasound sequences. Diagnostics, 12(5), 1053. https://doi.org/10.3390/diagnostics12051053.
He K., Sun J., & Tang X. (2010). Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12), 2341–2353. https://doi.org/10.1109/CVPR.2009.5206515.
Hendrycks D., & Gimpel K. (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
HSE. (2017). Fatal injuries in Great Britain. Technical report. Health and Safety Executive.
Hume A., & Mills N. J. (1995). Industrial head injuries and the performance of the helmets. In Proceedings of the 1995 International IRCOBI Conference on the Biomechanics of Impact (pp. 217–231). http://www.ircobi.org/wordpress/downloads/irc1995/pdf_files/1995_15.pdf.
Ioffe S., & Szegedy C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (pp. 448–456).
Jiang Y., Gong X., Liu D., Cheng Y., Fang C., Shen X., Yang J., Zhou P., & Wang Z. (2021). EnlightenGAN: Deep light enhancement without paired supervision. IEEE Transactions on Image Processing, 30, 2340–2349. https://doi.org/10.1109/TIP.2021.3051462.
Jiang Q., Mao Y., Cong R., Ren W., Huang C., & Shao F. (2022). Unsupervised decomposition and correction network for low-light image enhancement. IEEE Transactions on Intelligent Transportation Systems, 23(10), 19440–19455. https://doi.org/10.1109/TITS.2022.3165176.
Jobson D. J., Rahman Z.-u., & Woodell G. A. (1997). Properties and performance of a center/surround retinex. IEEE Transactions on Image Processing, 6(3), 451–462. https://doi.org/10.1109/83.557356.
Karacan M. H., & Yücebas S. C. (2022). A deep learning model with attention mechanism for dental image segmentation. In Proceedings of the 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) (pp. 1–4). IEEE.
Karlsson J., Strand F., Bigun J., Alonso-Fernandez F., Hernandez-Diaz K., & Nilsson F. (2022). Visual detection of personal protective equipment and safety gear on industry workers. arXiv preprint arXiv:2212.04794.
Ke X., Chen W., & Guo W. (2022). 100+ FPS detector of personal protective equipment for worker safety: A deep learning approach for green edge computing. Peer-to-Peer Networking and Applications, 15, 950–972. https://doi.org/10.1007/s12083-021-01258-4.
Kelm A., Laußat L., Meins-Becker A., Platz D., Khazaee M. J., Costin A. M., Helmus M., & Teizer J. (2013). Mobile passive radio frequency identification (RFID) portal for automated and rapid control of personal protective equipment (PPE) on construction sites. Automation in Construction, 36, 38–52. https://doi.org/10.1016/j.autcon.2013.08.009.
Kim W. (2022). Low-light image enhancement: A comparative review and prospects. IEEE Access, 10, 84535–84557. https://doi.org/10.1109/ACCESS.2022.3197629.
Kim T. K., Paik J. K., & Kang B. S. (1998). Contrast enhancement system using spatially adaptive histogram equalization with temporal filtering. IEEE Transactions on Consumer Electronics, 44(1), 82–87. https://doi.org/10.1109/30.663733.

Kim M., Park D., Han D. K., & Ko H. (2014). A novel framework for extremely low-light video enhancement. In Proceedings of the 2014 IEEE International Conference on Consumer Electronics (ICCE) (pp. 91–92). IEEE.
Land E. H., & McCann J. J. (1971). Lightness and retinex theory. Josa, 61(1), 1–11. https://doi.org/10.1364/JOSA.61.000001.
Lee Y.-R., Jung S.-H., Kang K.-S., Ryu H.-C., & Ryu H.-G. (2023). Deep learning-based framework for monitoring wearing personal protective equipment on construction sites. Journal of Computational Design and Engineering, qwad019. https://doi.org/10.1093/jcde/qwad019.
Lee S., Kim N., & Paik J. (2015). Adaptively partitioned block-based contrast enhancement and its application to low light-level video surveillance. SpringerPlus, 4(1), 1–11. https://doi.org/10.1186/s40064-015-1226-x.
Li J., Wang C., Huang B., & Zhou Z. (2022). ConvNeXt-backbone HoVerNet for nuclei segmentation and classification. arXiv preprint arXiv:2202.13560.
Lin T.-Y., Dollár P., Girshick R., He K., Hariharan B., & Belongie S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117–2125).
Lin T.-Y., Goyal P., Girshick R., He K., & Dollár P. (2017c). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2980–2988). IEEE.
Lin T.-Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., Dollár P., & Zitnick C. L. (2014). Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (pp. 740–755). Springer.
Liu Z., Mao H., Wu C.-Y., Feichtenhofer C., Darrell T., & Xie S. (2022). A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11976–11986). IEEE.
Ma L., Ma T., Liu R., Fan X., & Luo Z. (2022). Toward fast, flexible, and robust low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5637–5646). IEEE.
Ministry of Housing and Urban-Rural Development. (2020). Announcement on the production safety accidents of housing and municipal engineering in 2019 (in Chinese). https://www.mohurd.gov.cn/gongkai/zhengce/zhengcefilelib/202006/20200624_246031.html.
Nath N. D., Behzadan A. H., & Paal S. G. (2020). Deep learning for site safety: Real-time detection of personal protective equipment. Automation in Construction, 112, 103085. https://doi.org/10.1016/j.autcon.2020.103085.
Nath N. D., Chaspari T., & Behzadan A. H. (2018). Automated ergonomic risk monitoring using body-mounted sensors and machine learning. Advanced Engineering Informatics, 38, 514–526. https://doi.org/10.1016/j.aei.2018.08.020.
njvisionpower. (2019). Safety-helmet-wearing-dataset. https://github.com/njvisionpower/Safety-Helmet-Wearing-Dataset.
OSHA. (2005). Worker safety series construction. CreateSpace Independent Publishing Platform.
Petro A. B., Sbert C., & Morel J.-M. (2014). Multiscale retinex. Image Processing On Line, 71–88. https://doi.org/10.5201/ipol.2014.107.
Rahman Z.-U., Woodell G. A., & Jobson D. J. (1997). A comparison of the multiscale retinex with other image enhancement techniques. http://ntrs.nasa.gov/citations/20040110657.
Redmon J., & Angelova A. (2015). Real-time grasp detection using convolutional neural networks. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1316–1322). IEEE.
Redmon J., & Farhadi A. (2017). YOLO9000: Better, faster, stronger. Technical report. arXiv:1612.08242.
Redmon J., Divvala S., Girshick R., & Farhadi A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 2016-Decem, pp. 779–788). IEEE.
Reza A. M. (2004). Realization of the contrast limited adaptive histogram equalization (CLAHE) for real-time image enhancement. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 38(1), 35–44. https://doi.org/10.1023/B:VLSI.0000028532.53893.82.
Rubaiyat A. H., Toma T. T., Kalantari-Khandani M., Rahman S. A., Chen L., Ye Y., & Pan C. S. (2016). Automatic detection of helmet uses for construction safety. In Proceedings of the 2016 IEEE International Conference on Web Intelligence Workshops (pp. 135–142). IEEE.
Sandler M., Howard A., Zhu M., Zhmoginov A., & Chen L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4510–4520). IEEE.
Saudi M. M., Ma'arof A. H., Ahmad A., Saudi A. S. M., Ali M. H., Narzullaev A., & Ghazali M. I. M. (2020). Image detection model for construction worker safety conditions using faster R-CNN. International Journal of Advanced Computer Science and Applications, 11, 246–250. https://doi.org/10.14569/IJACSA.2020.0110632.
Shrestha K., Shrestha P. P., Bajracharya D., & Yfantis E. A. (2015). Hard-hat detection for construction safety visualization. Journal of Construction Engineering, 2015, 1–8. https://doi.org/10.1155/2015/721380.
Song G., Liu Y., & Wang X. (2020). Revisiting the sibling head in object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11563–11572). IEEE.
Suderman B. L., Hoover R. W., Ching R. P., & Scher I. S. (2014). The effect of hardhats on head and neck response to vertical impacts from large construction objects. Accident Analysis and Prevention, 73, 116–124. https://doi.org/10.1016/j.aap.2014.08.011.
Tan Y., Cai R., Li J., Chen P., & Wang M. (2021b). Automatic detection of sewer defects based on improved you only look once algorithm. Automation in Construction, 131, 103912. https://doi.org/10.1016/j.autcon.2021.103912.
Tan L., Huangfu T., Wu L., & Chen W. (2021a). Comparison of RetinaNet, SSD, and YOLO v3 for real-time pill identification. BMC Medical Informatics and Decision Making, 21, 1–11. https://doi.org/10.1186/s12911-021-01691-8.
Tang S., Roberts D., & Golparvar-Fard M. (2020). Human–object interaction recognition for automatic construction site safety inspection. Automation in Construction, 120, 103356. https://doi.org/10.1016/J.AUTCON.2020.103356.
Tzutalin. (2015). LabelImg. https://github.com/heartexlabs/labelImg.
VOC. (2016). VOC2011 annotation guidelines. http://host.robots.ox.ac.uk/pascal/VOC/voc2011/guidelines.html.
Wang Z., Wu Y., Yang L., Thirunavukarasu A., Evison C., & Zhao Y. (2021). Fast personal protective equipment detection for real construction sites using deep learning approaches. Sensors, 21(10), 3478. https://doi.org/10.3390/s21103478.
Wang W., Wu X., Yuan X., & Gao Z. (2020). An experiment-based review of low-light image enhancement methods. IEEE Access, 8, 87884–87917. https://doi.org/10.1109/ACCESS.2020.2992749.

Wu J., Cai N., Chen W., Wang H., & Wang G. (2019). Automatic detection of hardhats worn by construction personnel: A deep learning approach and benchmark dataset. Automation in Construction, 106, 102894. https://doi.org/10.1016/j.autcon.2019.102894.
Wu Y., Feng S., Huang X., & Wu Z. (2021). L4Net: An anchor-free generic object detector with attention mechanism for autonomous driving. IET Computer Vision, 15(1), 36–46. https://doi.org/10.1049/cvi2.12015.
Xiong R., & Tang P. (2021). Pose guided anchoring for detecting proper use of personal protective equipment. Automation in Construction, 130, 103828. https://doi.org/10.1016/J.AUTCON.2021.103828.
Xu Q., Deng H., Zhang Z., Liu Y., Ruan X., & Liu G. (2022a). A ConvNeXt-based and feature enhancement anchor-free Siamese network for visual tracking. Electronics, 11(15), 2381. https://doi.org/10.3390/electronics11152381.
Xu X., Wang R., Fu C.-W., & Jia J. (2022b). SNR-aware low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 17714–17724). IEEE.
Yang D., & Sun J. (2018). Proximal Dehaze-Net: A prior learning-based deep network for single image dehazing. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 702–717).
Zeng T., Wang J., Cui B., Wang X., Wang D., & Zhang Y. (2021). The equipment detection and localization of large-scale construction jobsite by far-field construction surveillance video based on improving YOLOv3 and grey wolf optimizer improving extreme learning machine. Construction and Building Materials, 291, 123268. https://doi.org/10.1016/j.conbuildmat.2021.123268.
Zhang H., Cisse M., Dauphin Y. N., & Lopez-Paz D. (2017). MixUp: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
Zhang H., Liu C., Ho J., & Zhang Z. (2022a). Crack detection based on ConvNeXt and normalization. Journal of Physics: Conference Series, 2289(1), 012022. https://doi.org/10.1088/1742-6596/2289/1/012022.
Zhang S., Teizer J., Pradhananga N., & Eastman C. M. (2015). Workforce location tracking to model, visualize and analyze workspace requirements in building information models for construction safety planning. Automation in Construction, 60, 74–86. https://doi.org/10.1016/j.autcon.2015.09.009.
Zhang L., Wang J., Wang Y., Sun H., & Zhao X. (2022b). Automatic construction site hazard identification integrating construction scene graphs with BERT based domain knowledge. Automation in Construction, 142, 104535. https://doi.org/10.1016/j.autcon.2022.104535.
Zhang H., Yan X., Li H., Jin R., & Fu H. F. (2019). Real-time alarming, monitoring, and locating for non-hard-hat use in construction. Journal of Construction Engineering and Management, 145(3), 1–13. https://doi.org/10.1061/(ASCE)CO.1943-7862.0001629.
Zhu X., Lyu S., Wang X., & Zhao Q. (2021). TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2778–2788). IEEE.

Received: December 27, 2022. Revised: May 7, 2023. Accepted: May 11, 2023
© The Author(s) 2023. Published by Oxford University Press on behalf of the Society for Computational Design and Engineering. This is an Open Access article distributed
under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use,
distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
