DOI: 10.1093/jcde/qwad042
Advance access publication date: 23 May 2023
Research Article
Keywords: computer vision, machine learning, deep learning, tunnel management, construction safety
Received: December 27, 2022. Revised: May 7, 2023. Accepted: May 11, 2023
© The Author(s) 2023. Published by Oxford University Press on behalf of the Society for Computational Design and Engineering. This is an Open Access article
distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits
non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact
journals.permissions@oup.com
Journal of Computational Design and Engineering, 2023, 10(3), 1158–1175 | 1159
[…] light condition objects in Section 4, and experiments and results […]

[…] of 25 construction sites, and a faster R-CNN model was trained […]

[…] while deep learning techniques are data-driven and use historical data to find latent patterns for predicting unknown samples.

[…] more functionality to PPE detection, researchers have explored the use of spatial relationships to determine if workers are properly wearing their PPE. For example, Tang et al. (2020) developed a novel human–object interaction recognition method that uses potential worker–PPE box pairs to determine compliance. Meanwhile, Chen and Demachi (2020) used the Euclidean distance between the bounding boxes of hard hats and the neck to identify the existence of helmets on workers. Cheng et al. (2022) designed a mechanism for monitoring worker PPE statuses across multiple cameras with consistent identities. Lastly, Lee et al. (2023) presented a post-processing algorithm that classifies the correlation between workers and PPE into four statuses to help prevent potential accidents. Another approach is to combine it with trade recognition output to check if workers meet certification constraints. Fang et al. (2018b) demonstrated this by merging PPE detection results […]

Histogram equalization (HE) algorithms are commonly used computer vision techniques that adjust the output gray levels to spread out the most frequent intensity values and increase the image contrast, making darker details in the image more visible (Acharya & Ray, 2005; Lee et al., 2015). However, this approach uses global HE, which can result in some local areas being too dark or too bright. To address this limitation, researchers developed adaptive histogram equalization (AHE; Kim et al., 1998), which uses local contrast to improve the technique. However, this approach can create noise problems. To overcome this issue, contrast-limited AHE (CLAHE) was developed, which limits local contrast enhancement to avoid the excessive contrast amplification that AHE could produce (Reza, 2004).

[…] for detecting outdoor equipment, but it lacked comprehensive experiment analysis to explain the selection of this augmentation algorithm. Moreover, the study's outdoor environment scenario differed from the underground tunnel construction site.

[…] multiscale PPE instances on tunnel construction sites, (ii) construct a novel dataset for the tunnel low-light multiscale instance scenario, and (iii) comprehensively compare low-light algorithms and select a suitable one for enhancing tunnel construction images.
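To make the HE-versus-CLAHE distinction concrete, the global HE step that the text contrasts with CLAHE can be sketched in a few lines. This is a minimal illustration, not the paper's implementation (in practice CLAHE itself is available off the shelf, e.g., OpenCV's `cv2.createCLAHE`); the sketch applies a single global lookup table to the whole image, which is exactly why local regions can end up too dark or too bright.

```python
def equalize_hist(gray, levels=256):
    """Global histogram equalization: map intensities through the
    normalized cumulative histogram so frequent gray levels spread out."""
    flat = [p for row in gray for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)  # first nonzero CDF value
    n = len(flat)
    # One global lookup table for the entire image -- the limitation
    # that AHE/CLAHE address with per-tile (local) histograms.
    lut = [round((c - cdf_min) / (n - cdf_min) * (levels - 1)) if n > cdf_min else 0
           for c in cdf]
    return [[lut[p] for p in row] for row in gray]
```

Running this on a dark image stretches its intensity range to the full [0, 255] interval; CLAHE performs the same stretch per tile, with the clip limit bounding how much any local histogram can amplify contrast.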
3.1. Modified YOLOX architecture

3.1.1. Overview of YOLOX

As an early anchor-free detector, YOLOv1 employs fully connected neural network layers to directly predict bounding boxes on top of the feature extractor. It boasts fast speeds but has a considerable number of localization errors and a relatively low recall rate (Redmon et al., 2016; Tan et al., 2021b). To address these issues, anchor mechanisms from models like faster R-CNN were introduced to YOLOv2 (Redmon & Farhadi, 2017). Anchors are predefined boxes with common object shapes that guide the prediction process. Anchor-based models learn to adjust anchor shapes to fit predicted objects. Experiments in YOLOv2 demonstrated that introducing the anchor mechanism improved the mAP by 4.8% (table 2 in Redmon & Farhadi, 2017). However, using anchors necessitates determining a set of optimal parameters, which are […]

[…] to explore the architectural differences between ConvNets and ViTs and to test the limits of a pure ConvNet. It incorporates inverted bottlenecks and large convolution kernels (Sandler et al., 2018), replaces ReLU with GELU (Glorot et al., 2011; Hendrycks & Gimpel, 2016), minimizes the use of activation functions and normalization layers, and substitutes BN (Ioffe & Szegedy, 2015) with LN (Ba et al., 2016). Additionally, it employs a grouped convolution for the 3 × 3 convolutional layer in a bottleneck block to reduce the FLOPs. Despite being lighter, the network expands its width to achieve higher capacity, and experiments have demonstrated that ConvNeXt outperforms ConvNet in terms of performance.

Moreover, researchers have improved their models by incorporating ConvNeXt into the backbone to enhance feature extraction. Such designs have been implemented in various fields, including medicine (Hassanien et al., 2022; Li et al., 2022), for tasks such as […]
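Returning to the anchor mechanism described in the YOLOv2 overview above: the way predefined anchors guide prediction can be sketched with the YOLOv2-style box decoding, where a sigmoid keeps the predicted center inside its grid cell and an exponential rescales the anchor's width and height. The function and variable names here are illustrative, not taken from the paper.

```python
import math

def decode_anchor_box(tx, ty, tw, th, anchor_w, anchor_h, cell_x, cell_y):
    """YOLOv2-style decoding: the network predicts offsets (tx, ty, tw, th)
    relative to a predefined anchor box anchored at grid cell (cell_x, cell_y)."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cell_x + sigmoid(tx)          # center stays within the grid cell
    by = cell_y + sigmoid(ty)
    bw = anchor_w * math.exp(tw)       # anchor shape is rescaled, not replaced
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh
```

With zero offsets the decoded box is simply the anchor itself centered in its cell, which is why choosing anchor shapes that match common object shapes (the "optimal parameters" the text mentions) matters so much for anchor-based detectors.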
[…] objects identified as relevant):

    Precision = TP / (TP + FP).    (1)

Recall is the ratio of true positives to the sum of true positives and false negatives:

    Recall = TP / (TP + FN).    (2)

The F1 score takes into account both precision and recall, providing a single metric that represents the model's balanced ability for prediction:

    F1 = 2 × (Precision × Recall) / (Precision + Recall).    (3)

AP measures the overall correctness performance for a specific class. It is calculated as the area under the precision–recall (PR) curve and provides an average value of the model's precision over the entire range of recall levels. Higher AP values indicate better performance:

    AP = ∫₀¹ P(R) dR,    (4)

where true positive TP is the number of correct predictions, false positive FP is the number of wrong predictions, and false negative FN is the number of undetected but ground-truth instances. PR curves are plotted by calculating the precision and recall values of the accumulated TP or FP.

1164 | Improved YOLOX for low-light and small PPE detection

mAP is the average of the AP values for all classes in the dataset. It gives an overall indication of the model's performance across all classes:

    mAP = (Σ AP(class)) / n,    (5)

where AP(class) is the AP for a single class (e.g., AP for helmet) and n is the number of classes.
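The metrics in equations (1)–(5) translate directly into code. The sketch below uses a simple all-point accumulation over the ranked detections (the exact interpolation scheme used by the paper is not specified in this excerpt); `detections` pairs each prediction's confidence with whether it matched a ground-truth box.

```python
def precision_recall_f1(tp, fp, fn):
    # Equations (1)-(3): precision, recall, and their harmonic mean.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def average_precision(detections, n_ground_truth):
    """Equation (4): area under the PR curve built from accumulated TP/FP.

    detections: list of (confidence, is_true_positive) pairs.
    """
    tp = fp = 0
    prev_recall, ap = 0.0, 0.0
    # Rank detections by confidence, then accumulate TP/FP down the list.
    for _, is_tp in sorted(detections, key=lambda d: -d[0]):
        tp += is_tp
        fp += 1 - is_tp
        precision = tp / (tp + fp)
        recall = tp / n_ground_truth
        ap += (recall - prev_recall) * precision  # rectangle under P(R)
        prev_recall = recall
    return ap

def mean_average_precision(ap_per_class):
    # Equation (5): average the per-class AP values.
    return sum(ap_per_class) / len(ap_per_class)
```

For example, two correct detections ranked above one false positive, against two ground-truth boxes, yield an AP of 1.0, since precision is still perfect at every recall level where recall increases.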
In this study, it is necessary to quantitatively analyze the effects of the network adjustments and data enhancement algorithms, which helps us understand the performance of the selected strategies. However, standard performance evaluation criteria usually focus on general classes and cannot be used directly to evaluate model performance under different light conditions and object sizes. Therefore, we adjust the mAP calculation equation […]
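The adjusted equation itself is cut off in this excerpt, but one straightforward way to realize the idea, consistent with the mAP_light and mAP_size columns reported later, is to average AP over (class, condition) pairs instead of over classes alone: the CLAHE row of the light-condition table averages to exactly its reported mAP_light of 84.38%. The sketch below uses those table values; the dictionary layout and function names are illustrative.

```python
from collections import defaultdict

# AP values from the CLAHE row of the light-condition results table,
# in percent (LL = low light, NL = normal light).
ap = {
    ("helmet", "LL"): 80.98, ("helmet", "NL"): 94.90,
    ("person", "LL"): 81.28, ("person", "NL"): 92.18,
    ("vest",   "LL"): 70.28, ("vest",   "NL"): 86.68,
}

def map_over_groups(ap_table):
    """mAP over every (class, light condition) pair."""
    return sum(ap_table.values()) / len(ap_table)

def map_per_condition(ap_table):
    """mAP restricted to each light condition separately."""
    groups = defaultdict(list)
    for (_, cond), value in ap_table.items():
        groups[cond].append(value)
    return {cond: sum(v) / len(v) for cond, v in groups.items()}

print(round(map_over_groups(ap), 2))  # prints 84.38
```

The per-condition variant additionally exposes the gap the paper is after: averaged this way, the low-light APs sit well below the normal-light APs for the same model.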
[Table: dataset composition — instances per class and light condition across the train/validation/test splits; data rows not recovered.]
[Table: experiment configurations; caption not recovered.]

No.  Model  Algorithm             Category
1    YOLOX  –                     Baseline
2    YOLOX  CLAHE                 Low-light enhancement
3    YOLOX  Dehaze                Low-light enhancement
4    YOLOX  MSRCP                 Low-light enhancement
5    YOLOX  Zero-DCE              Low-light enhancement
6    YOLOX  EnlightenGAN          Low-light enhancement
7    YOLOX  Dehaze + CLAHE        Low-light enhancement
8    YOLOX  Zero-DCE + CLAHE      Low-light enhancement
9    YOLOX  EnlightenGAN + CLAHE  Low-light enhancement
Three computer vision algorithms (CLAHE, Dehaze, and MSRCP), two deep learning algorithms (EnlightenGAN and Zero-DCE), and three combinations (CLAHE with Dehaze, Zero-DCE, and EnlightenGAN, respectively) are selected for comparison. The enhanced images are presented in Fig. 7.

Table 4: Hyperparameters for training experiments.

Parameter      Search values               Freeze training  Unfreeze training
Batch size     2, 4, 8, 16, 32             8                4
Learning rate  10^-x, x ∈ [1, 2, 3, 4, 5]  10^-3            10^-4
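The search space in Table 4 can be enumerated programmatically; a small sketch, assuming the garbled exponents in the extracted table indeed denote negative powers of ten (the training loop itself is omitted):

```python
from itertools import product

batch_sizes = [2, 4, 8, 16, 32]
learning_rates = [10 ** -x for x in (1, 2, 3, 4, 5)]

# All 25 candidate (batch size, learning rate) pairs from Table 4.
grid = list(product(batch_sizes, learning_rates))
```

The settings ultimately used (batch size 8 with learning rate 10^-3 for freeze training; 4 with 10^-4 for unfreeze training) are individual points from this grid.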
80% of the data was used for training, 10% for validation, and 10% for testing. The test dataset was solely employed for performance evaluation. Each training session consisted of a freezing training phase followed by an unfreezing phase. The freezing training loaded a pre-trained weight file without changing parameters in the backbone. During the unfreezing phase, all parameters across the entire model could be adjusted according to the loss. Except […]

According to qualitative observations, the images after Dehaze were similar to the originals. Images from CLAHE were brighter and showed more details than the originals, e.g., the pipes in the second-column images. In addition, the images from MSRCP were more luminous when compared with the other two computer vision algorithms. For example, the ground in the fourth-column images was fully illuminated without any shadows, while their colors were distorted from subjective observations. Images from the two deep learning-based algorithms were much brighter than the originals. However, the Zero-DCE images became blurrier and grayer. Images from EnlightenGAN were sharper than the […]
Figure 7: Enhanced images by different algorithms and their combinations. Green: original images; orange: computer vision algorithms; blue: deep learning algorithms; purple: combinations.
[Table: AP of low-light enhancement algorithms under different light conditions (LL: low light; NL: normal light); caption not recovered.]

Category         Algorithm             mAP_light  Helmet, LL  Helmet, NL  Person, LL  Person, NL  Vest, LL  Vest, NL
Computer vision  CLAHE                 84.38%     80.98%      94.90%      81.28%      92.18%      70.28%    86.68%
Computer vision  Dehaze                84.28%     77.98%      94.93%      81.28%      91.94%      69.95%    89.59%
Computer vision  MSRCP                 83.20%     78.24%      94.29%      80.38%      91.65%      66.31%    88.31%
Deep learning    Zero-DCE              83.33%     77.63%      93.13%      81.41%      92.52%      68.46%    86.81%
Deep learning    EnlightenGAN          83.08%     80.08%      94.33%      80.31%      90.90%      63.44%    89.40%
Combination      Dehaze + CLAHE        83.83%     77.91%      94.57%      81.62%      93.34%      64.66%    90.88%
Combination      Zero-DCE + CLAHE      83.95%     79.55%      94.44%      81.78%      91.63%      67.15%    89.12%
Combination      EnlightenGAN + CLAHE  84.57%     79.20%      93.81%      81.96%      93.34%      67.19%    91.90%
Table 6: Speed of low-light data enhancement algorithms.

Algorithm     Process time (s)  FPS    Platform
CLAHE         7.21              33.0   CPU
Dehaze        1242.40           0.2    CPU
MSRCP         3948.33           0.1    CPU
Zero-DCE      2.10              113.5  GPU
EnlightenGAN  73.27             3.2    GPU

[…] the second-best performance when compared with all combinations, only 0.19 points less than that of the best-performing combination (EnlightenGAN + CLAHE). This difference can be ignored, especially given that CLAHE would be 11 times faster than the combination with only CPU computation. Based on its high correctness and efficient processing speed, we selected the CLAHE algorithm as the low-light data enhancement technique for this study.
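Throughput figures like those in Table 6 can be gathered with a simple wall-clock harness; a sketch in which the `enhance` callable stands in for any of the algorithms above:

```python
import time

def benchmark(enhance, images):
    """Return (total process time in seconds, frames per second)
    for running `enhance` once over every image."""
    start = time.perf_counter()
    for img in images:
        enhance(img)
    elapsed = time.perf_counter() - start
    return elapsed, len(images) / elapsed
```

As the surrounding discussion notes, such numbers are hardware-dependent, and Table 6 mixes CPU- and GPU-hosted algorithms, so the FPS column compares deployment costs rather than algorithmic complexity alone.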
[Table: AP by object size (L: large; M: medium; S: small) for each architecture modification; caption not recovered.]

No.  Modification                    mAP_size          Helmet, L  Helmet, M  Helmet, S  Person, L  Person, M  Person, S  Vest, L  Vest, M  Vest, S
0    YOLOX                           77.43%            86.87%     91.04%     75.12%     89.57%     82.00%     53.94%     89.62%   81.27%   47.45%
1    +ConvNeXt                       79.29% (+1.86%)   90.73%     91.93%     79.69%     90.25%     84.27%     56.67%     86.54%   83.59%   49.98%
2    +Fourth head                    78.43% (+1.00%)   86.38%     91.18%     80.29%     88.41%     81.40%     54.36%     90.22%   77.92%   55.67%
3    +ConvNeXt + fourth head         79.84% (+2.41%)   88.27%     92.99%     81.87%     88.83%     80.64%     54.20%     87.67%   82.09%   61.96%
4    +Add ConvNeXt to YOLO head      76.00% (−1.43%)   90.71%     89.96%     72.93%     91.52%     82.87%     50.69%     86.98%   78.84%   39.51%
5    +Split YOLO head                75.59% (−1.84%)   82.49%     88.32%     75.45%     91.37%     82.65%     50.93%     85.63%   80.28%   43.17%
6    +Replace backbone with ConvNeXt 65.82% (−11.61%)  84.11%     88.09%     60.71%     87.43%     65.14%     31.86%     80.76%   62.68%   31.56%
By adopting the effective strategies (ConvNeXt and the fourth head), the performance of small object detection was also enhanced. Its AP for small helmets increased by 6.75% to 81.87%, and its AP for small vests rose by 14.51%. The AP for small objects increased by an average of 7.17%. On the other hand, the modifications slightly improved the performance for medium and large-sized instances. This also indicates that the original YOLOX already possesses the ability to detect medium and large instances, and the modifications adopted in this study can further improve the performance on small objects, providing a more balanced and accurate capability for multiscale object detection.

5.3. Results of the improved YOLOX approach

The results of the previous experiments revealed two key findings […]

[…] suggest that a detector with an FPS greater than 13 can be considered real-time (Redmon & Angelova, 2015). Other researchers propose that FPS above 30 should be regarded as real-time detection (Redmon et al., 2016; Tan et al., 2021a). FPS is also affected by hardware, and the value could theoretically improve with more powerful equipment. Although the improved YOLOX model is slower than the original, its processing speed still meets the general requirements for real-time detection. Furthermore, the speed could increase with more advanced hardware if necessary.

Considering the trade-off between accuracy and speed, we believe that prioritizing correctness is more crucial, especially when the model's application is related to safety concerns. The higher the model's ability to predict objects accurately, the greater the chances of preventing worker injuries or even fatalities. Currently, the improved YOLOX has the highest correctness and achieves the […]
[Table: model comparison — columns: Models, AP (Helmet, Person, Vest), mAP, FPS; data rows not recovered.]

Table 9: Performance comparison between YOLOX and the improved YOLOX.

[Table 9 columns: Model, AP (Helmet, Person, Vest), mAP, FPS, Params (M); data rows and table note not recovered.]

However, this study also faces limitations. There is potential for improvement in detecting small person and small vest classes, […]
[…] could be related to the danger level, allowing the system to enhance monitoring of workers engaged in specific activities. The activity information can also be used for management purposes, such as analyzing working hours and providing further data for managers to adjust tasks accordingly.

Moreover, we can organize prediction results as a scene graph and apply automated hazard inference methods to alert workers if they are in danger (Zhang et al., 2022b). This can also be achieved by establishing constraint relationships within the scene graph, such as distance constraints between workers and the tunnel face. A separate mechanism can be employed to continuously collect the latest information on workers' positions and assess whether their status violates these constraints. If any violations are detected, a warning signal is sent correspondingly, helping to maintain worker safety and prevent accidents.

It is also essential to explore suitable alarm methods for workers. Compared to outdoor construction sites, tunnel environments are enclosed spaces filled with noise from large machines, making it difficult to hear warnings clearly. Consequently, relying on auditory alerts may not be effective. On the other hand, developing a device that notifies workers through vibrations would require them to wear the device, potentially adding an extra burden. Investigating appropriate alarm methods for tunnel workers is a practical future research direction to ensure their safety and well-being.

7. Conclusions

This paper proposes a solution for detecting PPE on underground construction sites, which is challenging due to low-light conditions and the presence of small objects. The proposed solution includes a modified deep learning model and a novel dataset. The main contributions and findings are listed as follows.

(i) We propose an improved YOLOX approach for detecting low-light multiscale objects. We adjusted the YOLOX architecture for deep feature extraction and multiscale object prediction and adopted a data augmentation method to enhance the image light conditions. The improved YOLOX approach owns the highest correctness compared to the state of the art, along with a real-time processing speed.

(ii) We validated that inserting the ConvNeXt module into the YOLOX backbone and adding an extra prediction head can improve the detection ability for small objects. Experiments show that the AP of small classes has been increased by 7.17% on average.

(iii) We provided a performance reference for low-light augmentation, including eight different algorithms and combinations. CLAHE has a balanced performance with high correctness and fast speed.

(iv) We constructed a novel low-light multiscale dataset, PPE dark, with 8285 low-light instances and 6814 small instances.

Beyond that, we also proposed a definition of multiscale objects for specific image sizes and constructed criteria for evaluating correctness performance under different light conditions and object sizes. We discussed the possible factors when implementing the approach in industry, as well as future research directions.

The most important contribution of this paper is the improved YOLOX approach for detecting low-light and multiscale objects with high accuracy and real-time processing speed. Additionally, PPE dark is the first dataset in the construction domain focusing on low-light and small objects. This study can inspire further research on how to improve detection performance in similar situations, not limited to the construction industry.

Acknowledgments

This research received financial support from the Hubei Provincial Department of Transportation Science and Technology Project (2020-186-2-5).

Conflict of interest statement

None declared.

References

Acharya T., & Ray A. K. (2005). Image processing: Principles and applications. John Wiley & Sons.

Akbarzadeh M., Zhu Z., & Hammad A. (2020). Nested network for detecting PPE on large construction sites based on frame segmentation. In Proceedings of the Creative Construction e-Conference 2020 (pp. 33–38). Budapest University of Technology and Economics.

Ali L., Alnajjar F., Parambil M. M. A., Younes M. I., Abdelhalim Z. I., & Aljassmi H. (2022). Development of YOLOv5-based real-time smart monitoring system for increasing lab safety awareness in educational institutions. Sensors, 22(22), 8820. https://doi.org/10.3390/s22228820.

Ba J. L., Kiros J. R., & Hinton G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
Biswas D., Nayak I., Choudhury S., Acharjee T., & Mishra M. (2021). Crack detection on inner tunnel surface using image processing. In Progress in Advanced Computing and Intelligent Engineering (pp. 3–12). Springer.

Bochkovskiy A., Wang C.-Y., & Liao H.-Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.

Chen S., & Demachi K. (2020). A vision-based approach for ensuring proper use of personal protective equipment (PPE) in decommissioning of Fukushima Daiichi nuclear power station. Applied Sciences, 10, 5129. https://doi.org/10.3390/APP10155129.

Chen J., Deng S., Wang P., Huang X., & Liu Y. (2023a). Lightweight helmet detection algorithm using an improved YOLOv4. Sensors, 23(3), 1256. https://doi.org/10.3390/s23031256.

Chen W., Li C., & Guo H. (2023b). A lightweight face-assisted ob- […]

Guo C., Li C., Guo J., Loy C. C., Hou J., Kwong S., & Cong R. (2020). Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1780–1789). IEEE.

Hassanien M. A., Singh V. K., Puig D., & Abdel-Nasser M. (2022). Predicting breast tumor malignancy using deep ConvNeXt radiomics and quality-based score pooling in ultrasound sequences. Diagnostics, 12(5), 1053. https://doi.org/10.3390/diagnostics12051053.

He K., Sun J., & Tang X. (2010). Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12), 2341–2353. https://doi.org/10.1109/CVPR.2009.5206515.

Hendrycks D., & Gimpel K. (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

HSE. (2017). Fatal injuries in Great Britain. Technical report. Health […]

[…]

Kim M., Park D., Han D. K., & Ko H. (2014). A novel framework for extremely low-light video enhancement. In Proceedings of the 2014 IEEE International Conference on Consumer Electronics (ICCE) (pp. 91–92). IEEE.

Land E. H., & McCann J. J. (1971). Lightness and retinex theory. JOSA, 61(1), 1–11. https://doi.org/10.1364/JOSA.61.000001.

Lee Y.-R., Jung S.-H., Kang K.-S., Ryu H.-C., & Ryu H.-G. (2023). Deep learning-based framework for monitoring wearing personal protective equipment on construction sites. Journal of Computational Design and Engineering, qwad019. https://doi.org/10.1093/jcde/qwad019.

Lee S., Kim N., & Paik J. (2015). Adaptively partitioned block-based contrast enhancement and its application to low light-level video surveillance. SpringerPlus, 4(1), 1–11. https://doi.org/10.1186/s40064-015-1226-x.

[…] International Conference on Robotics and Automation (ICRA) (pp. 1316–1322). IEEE.

Redmon J., & Farhadi A. (2017). YOLO9000: Better, faster, stronger. Technical report. arXiv:1612.08242.

Redmon J., Divvala S., Girshick R., & Farhadi A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 2016-Decem, pp. 779–788). IEEE.

Reza A. M. (2004). Realization of the contrast limited adaptive histogram equalization (CLAHE) for real-time image enhancement. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 38(1), 35–44. https://doi.org/10.1023/B:VLSI.0000028532.53893.82.

Rubaiyat A. H., Toma T. T., Kalantari-Khandani M., Rahman S. A., Chen L., Ye Y., & Pan C. S. (2016). Automatic detection of helmet […]

[…]

Wu J., Cai N., Chen W., Wang H., & Wang G. (2019). Automatic detection of hardhats worn by construction personnel: A deep learning approach and benchmark dataset. Automation in Construction, 106, 102894. https://doi.org/10.1016/j.autcon.2019.102894.

Wu Y., Feng S., Huang X., & Wu Z. (2021). L4Net: An anchor-free generic object detector with attention mechanism for autonomous driving. IET Computer Vision, 15(1), 36–46. https://doi.org/10.1049/cvi2.12015.

Xiong R., & Tang P. (2021). Pose guided anchoring for detecting proper use of personal protective equipment. Automation in Construction, 130, 103828. https://doi.org/10.1016/J.AUTCON.2021.103828.

Xu Q., Deng H., Zhang Z., Liu Y., Ruan X., & Liu G. (2022a). A ConvNeXt-based and feature enhancement anchor-free Siamese network for visual tracking. Electronics, 11(15), 2381. https://doi.org/10.3390/electronics11152381.

Zhang H., Cisse M., Dauphin Y. N., & Lopez-Paz D. (2017). MixUp: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.

Zhang H., Liu C., Ho J., & Zhang Z. (2022a). Crack detection based on ConvNeXt and normalization. Journal of Physics: Conference Series, 2289(1), 012022. https://doi.org/10.1088/1742-6596/2289/1/012022.

Zhang S., Teizer J., Pradhananga N., & Eastman C. M. (2015). Workforce location tracking to model, visualize and analyze workspace requirements in building information models for construction safety planning. Automation in Construction, 60, 74–86. https://doi.org/10.1016/j.autcon.2015.09.009.

Zhang L., Wang J., Wang Y., Sun H., & Zhao X. (2022b). Automatic construction site hazard identification integrating construction scene graphs with BERT based domain knowledge. Automation in Construction, 142, 104535. https://doi.org/10.1016/j.autcon.2022. […]