Abstract: To improve the efficiency of safety management, it is important to classify massive and complex construction site safety hazard
texts in large-scale projects. High-precision safety hazard text classification is a lengthy and challenging process. Most existing safety
hazard text classification methods capture semantic information using machine learning or deep learning, ignoring the syntactic dependency
between words. However, syntactic dependency contains rich structural information that is useful to alleviate information loss and enrich
text features. To address these issues, this study proposes a graph structure–based hybrid deep learning method to achieve the automatic
classification of large-scale project safety hazard texts. The method uses syntactic dependency and Bidirectional Encoder Representations
from Transformers to express the syntactic structure and semantic information of text, and a graph structure fusing the syntactic structure
and semantic information is constructed to quantify text information. Further, an encoding-decoding mechanism is built using a graph
convolutional neural network and bidirectional long short-term memory to address graph structure data and classify safety hazard texts.
Our proposed method is used to classify hydraulic engineering construction safety hazard texts, and the classification accuracy reaches
86.56%. Meanwhile, the experimental results demonstrate that our model achieves superior performance compared to existing methods. This
proves the ability of our model to capture and analyze text information and verifies the reliability and effectiveness of this method in large-
scale project safety hazard management. DOI: 10.1061/(ASCE)CO.1943-7862.0002382. © 2022 American Society of Civil Engineers.
Author keywords: Large-scale project; Construction safety hazard; Text classification; Graph structure; Bidirectional Encoder
Representations from Transformers (BERT); Graph convolutional network (GCN); Bidirectional long short-term memory (BiLSTM).
tic information for complex safety hazard texts (Zhang et al. 2019; Chen et al. 2022). The existing methods still leave considerable room for improvement in classification accuracy. Fortunately, there are many technical terms in a safety hazard text. Although the semantic difference between technical terms is ubiquitous, the function of technical terms of the same type is similar in syntactic structure, which allows for an effective approach to strengthening the relationship between different expressions of safety hazards. Thus, it is necessary to establish an automatic classification method suitable for the management of large-scale construction site safety hazards.

Proposed Solution

Motivated by the preceding discussion, this study proposes a graph structure–based hybrid deep learning method (GSHDLM) that integrates syntactic structure and semantic information. It aims to improve text classification accuracy and solve large-scale project safety hazard management problems. The method uses syntactic dependency to express text structure. The Bidirectional Encoder Representations from Transformers (BERT) method is adopted for training and obtaining word vectors. The text semantic information is expressed as the similarity between words. The syntactic structure and semantic information are used to build a safety hazard information network graph as the model input. The nodes indicate words in a document and the edges indicate the syntactic and semantic relations from one node to another. Furthermore, a graph convolutional network (GCN) and bidirectional long short-term memory (BiLSTM) are used to establish a text encoding and decoding framework that can extract key features from information network graphs and achieve end-to-end safety hazard recognition.

The text classification model proposed in this study combines the safety hazard text features of large-scale projects, not only capturing semantic correlations but also considering syntactic structure relationships. By constructing an information network graph, the model strengthens the relationship between words, enhances the features of text information, and reduces the impact of information fragmentation on safety hazard classification. Furthermore, the encoding and decoding framework based on a GCN and BiLSTM can fully integrate the node information and discover the potential syntactic and semantic relationships, which means the model is capable of selectively utilizing relevant information that is helpful for safety hazard classification.

The main innovations and contributions of this paper can be summarized as follows. (1) An encoding-decoding structure framework based on a GCN and BiLSTM is proposed to fuse syntactic structure and semantic information, which enhances safety hazard text features and improves the text understanding effect. (2) To incorporate the syntactic structure and semantic information into text features, BERT is used to calculate the semantic similarity between words; a graph structure is constructed to capture the syntactic and semantic information with words as nodes based on the syntactic

El-Gohary 2016; Liu et al. 2021; Zhang et al. 2020). Machine learning–based text classification methods are used to extract features by analyzing text semantics. Chokor et al. (2016) used unsupervised machine learning methods to analyze safety incidents and establish an accident report classification model based on the k-means method. The model evaluated the capabilities of machine learning methods in safety management. Goh and Ubeynarayana (2017) evaluated six machine learning algorithms, including a support vector machine (SVM), linear regression, random forest, k-nearest neighbor, decision tree, and naive Bayes, and found that the SVM outperformed other classifiers, which verified the applicability of machine learning algorithms for safety management text analysis. Zhang et al. (2019) proposed an ensemble model integrating five machine learning methods to classify the causes of accidents, and an actual case was used to verify the accuracy and robustness of the method. The proposed joint learning model has been reported to have better accuracy than a single machine learning method. Shallow machine learning can only use shallow features specified by humans, and the process of feature extraction is influenced by human domain knowledge; thus, it is difficult to achieve deep mining of text features (Zhong et al. 2020a; Fang et al. 2020). Unlike shallow machine learning, deep learning algorithms can automatically identify text features and use nonlinear combinations of multiple functions to learn complex tasks from training data (LeCun et al. 2015; Alam et al. 2020). Zhong et al. (2020b) built a text classification model fusing the Latent Dirichlet Allocation (LDA) method and a convolutional neural network (CNN) model to automatically recognize construction safety hazards, which confirmed the operability and reliability of deep learning methods in construction text classification. Fang et al. (2020) developed an improved deep learning method based on BERT to achieve automatic text classification and validate the effectiveness and feasibility of the method, which proved the superiority of BERT for safety text classification. Cheng et al. (2020) proposed a symbiotic gated recurrent unit (GRU) incorporating a GRU and symbiotic organisms search (SOS) to classify construction site accident texts, which can search for the best parameters of the GRU to ensure optimal performance. Feng and Chen (2021) proposed a natural language data augmentation–based small sample training framework, and then the BiLSTM–conditional random field was used to classify construction safety accident features, which carried out the automatic extraction of safety hazard information.

These studies verified the practicability and reliability of machine learning and deep learning in safety text classification. However, they mainly used semantic analysis to extract text features, and text features contain not only semantic information but also syntactic structure information (Xu et al. 2021b). A syntactic structure provides a rule for a sentence to express the structural relationship between words. Especially for large-scale project safety hazard texts, the syntactic structure can strengthen the text features and improve the accuracy of text analysis. The syntactic structure is the key to extracting the correlation information between words,
Downloaded from ascelibrary.org by SOUTH CHINA UNIVERSITY OF on 08/17/22. Copyright ASCE. For personal use only; all rights reserved.
[Figure: syntactic dependency relation labels (HED, CMP, ATT, ADV) and BERT masked-word pretraining. Input: [CLS] The lamp is burnt out [SEP] There exist safety hazard problem [SEP]. Masked: [CLS] The lamp is [Mask] out [SEP] There exist safety [Mask] problem [SEP]. Each token is mapped to an embedding E, and BERT outputs the token representations T1 … Tn.]
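The masked-word input shown in the figure can be assembled in a few lines of Python. This is only a sketch of how the masked sequence is built: the helper name `mask_tokens` and the fixed masked positions are assumptions for illustration, whereas real BERT pretraining picks positions at random and works on subword tokens.

```python
from typing import List, Sequence

MASK = "[Mask]"


def mask_tokens(tokens: Sequence[str], positions: Sequence[int]) -> List[str]:
    """Return a copy of `tokens` with the given positions replaced by MASK."""
    hidden = set(positions)
    return [MASK if i in hidden else tok for i, tok in enumerate(tokens)]


sentence_a = "The lamp is burnt out".split()
sentence_b = "There exist safety hazard problem".split()

# Sentence pair framed by the special tokens used in the figure.
tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]

# Positions 4 ("burnt") and 10 ("hazard") are hidden, as in the figure.
masked = mask_tokens(tokens, positions=[4, 10])

# The model is trained to recover these original words.
targets = {i: tokens[i] for i in (4, 10)}

print(" ".join(masked))
```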
correct word for this position (Devlin et al. 2018; Li et al. 2021a). The other is the prediction of the next sentence, inputting sentences into the BERT model to predict the sequence of sentences. By pretraining on safety hazard text, the semantic features are deeply extracted, and the word vector is obtained.

Inputting the construction site safety hazard text into the BERT model calculates the text word vector, and then the word vector is used to express the correlation between words, which is defined as

$$S_{ij} = \frac{V_i \cdot V_j}{|V_i| \times |V_j|} \tag{3}$$

where $S_{ij}$ = similarity between words $i$ and $j$ in the sentence; and $V_i$ and $V_j$ = word vectors of words $i$ and $j$ calculated by BERT.

A semantic correlation matrix is built by combining the word co-occurrence and similarity to represent the semantic information, which is defined as follows:

$$S = \begin{bmatrix} 1 & S_{12} & S_{13} & \cdots & S_{1n} \\ S_{21} & 1 & S_{23} & \cdots & S_{2n} \\ S_{31} & S_{32} & 1 & \cdots & S_{3n} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ S_{n1} & S_{n2} & S_{n3} & \cdots & 1 \end{bmatrix} \tag{4}$$

where $S$ = semantic correlation matrix; and $S_{ij}$ = similarity between the $i$th and $j$th words in the sentence.

GCN-Based Text Information Encoding

Based on the syntactic structure–based adjacency matrix and semantic correlation matrix, a GCN is used to encode the syntactic structure and semantic information. The information transfer mechanism of the GCN is similar to that of a multilayer perceptron, and the difference between them is that the GCN has graph data to capture the information of adjacent nodes (Hu et al. 2021; Wang et al. 2020b). Unlike the existing GCN model (Zhou et al. 2020; Bai et al. 2021), this study takes graph data built by an adjacency matrix and semantic correlation matrix as input, which contains the syntactic structure and semantic information. For the text information graph $G = (N, E)$, $N$ is the set of nodes, that is, the words, and $E$ is the set of edges, that is, the syntactic structure correlation and semantic information correlation obtained in the section “Construction Text Information Quantization” (Fig. 3).

For a $k$-layer GCN, the initial input data are $x \in \mathbb{R}^{n \times m}$, where $n$ is the node number of graph $G$ and $m$ is the dimension of the word vector. Thus, the input of the GCN is defined as

$$X^{(0)} = x \tag{5}$$

where $X^{(0)}$ = initial input of the GCN model; and $x$ = word vector set calculated by BERT. The normalized matrix $X^{(0)}$ is regarded as input, text features are propagated among layers of the GCN, and the feature propagation mode is defined as

$$X^{(k)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X^{(k-1)} W^{(k-1)}\right) \tag{6}$$

where $X^{(k)}$ = $k$th-layer output of GCN; $X^{(k-1)}$ = $k$th-layer input of GCN; $\sigma(\cdot)$ = activation function; $W^{(k-1)}$ = weight matrix; $\tilde{D}$ = degree matrix of graph $G$; and $\tilde{A}$ = adjacency matrix that contains the syntactic structure and semantic information, which is determined as

$$\tilde{A} = A + S \tag{7}$$

where $A$ = syntactic structure–based adjacency matrix; and $S$ = semantic correlation matrix of Eq. (4).

The quantified construction site safety hazard text is input into the GCN model, and the text syntactic structure and semantic information are analyzed by Eq. (6); the encoding result is obtained as follows:

$$X = \{X_1^{(k)}, X_2^{(k)}, X_3^{(k)}, \ldots, X_C^{(k)}\} \tag{8}$$

where $X$ = text feature with GCN encoding; $X_1^{(k)}$ = first sentence in safety hazard text; and $C$ = total number of sentences.

BiLSTM-Based Text Information Decoding

Considering the long-distance dependencies between text features, BiLSTM is used to decode the text features obtained using GCN encoding. BiLSTM provides a bidirectional mechanism that can deeply analyze text features from two different directions (Zhang et al. 2020; Zhong et al. 2020c). It is composed of multiple long short-term memory (LSTM) units, and each LSTM unit includes an input gate, forget gate, and output gate (Ren et al. 2021) (Fig. 4). The input gate selectively inputs new information into the LSTM unit. The forget gate selectively forgets information in the LSTM unit, controlling the amount of information that needs to be forgotten in the previous LSTM unit. The output gate controls how much information in the current LSTM unit is passed to the external state (Li et al. 2021b).

The conversion relationship of the text feature encoded by the GCN model between the three gates is defined as follows:

$$\begin{aligned} f_t &= \sigma\left(w_f \cdot [h_{t-1}, X_t^{(k)}] + b_f\right) \\ i_t &= \sigma\left(w_i \cdot [h_{t-1}, X_t^{(k)}] + b_i\right) \\ o_t &= \sigma\left(w_o \cdot [h_{t-1}, X_t^{(k)}] + b_o\right) \end{aligned} \tag{9}$$

where $f_t$ = forget gate; $i_t$ = input gate; $o_t$ = output gate; $X_t^{(k)}$ = text feature encoded by GCN; $h_{t-1}$ = external state of previous LSTM unit; and $w$ and $b$ = weight vector and bias vector of each gate, respectively. The external state $h_t$ of the LSTM unit is calculated by combining the input, output, and forget gates, defined as

$$\begin{aligned} h_t &= o_t \odot \tanh(C_t) \\ C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \end{aligned} \tag{10}$$

where $C_t$ = internal state at current moment; $C_{t-1}$ = internal state of previous LSTM unit; and $\tilde{C}_t$ = candidate state of text information at current moment.

The text features encoded by the GCN are input into the BiLSTM model to calculate the forward hidden state $\overrightarrow{H} = \{\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_u\}$ and the backward hidden state $\overleftarrow{H} = \{\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_u\}$. The forward and backward hidden states are connected to form a text feature decoding result:

$$H = [\overrightarrow{H}, \overleftarrow{H}] \tag{11}$$

where $H$ = text feature decoding result based on BiLSTM model; and $[,]$ = concatenation operation.

Construction Safety Hazard Text Classification Model

We propose a GSHDLM combining BERT, a GCN, and BiLSTM to fuse syntactic structure and semantic information and achieve intelligent large-scale project safety hazard recognition (Fig. 5). The method includes five layers: input layer, graph construction layer, text encoding layer, text decoding layer, and output layer. They are described as follows:

Input layer: The input layer includes two aspects. One is the syntactic structure analysis of the safety hazard text, which uses HanLP to analyze and extract the syntactic relationship between words. The other is the semantic analysis of safety hazard text. BERT is used to analyze the textual context, extract the text semantic information, and obtain a word vector from the semantic level.

Graph construction layer: The graph construction layer uses the syntactic structure relationship and the word vector input from the input layer to build a safety hazard graph structure, which takes the words as nodes and uses the syntactic and semantic relationship between words as edges. Notably, semantic relations between words are expressed as the similarity calculated from the word vectors.

Text encoding layer: The text encoding layer uses the GCN model to analyze the graph structure obtained in the graph construction layer. It aims to realize safety hazard text encoding from the level of the syntactic structure and semantic information and convert the graph data into a text feature vector, which is regarded as the key data for feature decoding to realize in-depth safety hazard analysis.

Text decoding layer: The text decoding layer takes the encoded text features as input, and the BiLSTM model is used to analyze the text features in the forward and backward directions and obtain long-distance dependency. The purpose of this layer is to strengthen the text feature through the decoding process, which helps improve the accuracy of safety hazard text classification.

Output layer: The output layer classifies the safety hazard according to the extracted text feature. The text feature decoded using BiLSTM is taken as input, and dropout is used to prevent overfitting. The softmax classifier is adopted to calculate the probability of each safety hazard category and obtain the text classification result:

$$Y = \mathrm{softmax}(H W') \tag{12}$$

where $W'$ = weight matrix; $H$ = text feature obtained from encoding-decoding mechanism; and $Y$ = safety hazard text category prediction matrix. The category of each safety hazard text can be defined by analyzing the prediction matrix.

[Fig. 5: Structure of the GSHDLM, showing the input layer (syntactic structure and BERT), graph structure layer (syntactic adjacency matrix and semantic similarity matrix over words w1 … wn), encoding layer, decoding layer (LSTM units), and output layer.]

Case Study

Text Data Collection and Preprocessing

Taking the safety hazard text derived from a hydraulic engineering construction site as the data source, a total of 28,756 safety hazard texts were collected; each text recorded a safety hazard on the construction site, which contained the location, time, and content of safety hazards. Combining the text characteristics and national standards, the hydraulic engineering safety hazard was divided into 12 categories: vehicle damage, electrocution, falling accident, collapse, incivilization construction, object strike, mechanical damage, lifting injury, violation behavior (e.g., illegal operations, illegal command, violation of labor discipline), fire, explosion, and drowning.

Owing to the wide construction scope and complex procedures of hydraulic engineering, the investigation of safety hazards mainly depends on employees. Employees upload the safety hazards at the construction site through the safety hazard management app, and then managers compile a list of safety hazards based on the uploaded safety hazards (Lin et al. 2019). The safety hazard list can reflect in real time the safety management status of the construction site. However, the safety hazard category is not clearly marked in the list. It is necessary to mark safety hazard texts before classification calculation. The safety hazard texts are marked by analyzing the relationship between key information and category. For example, “the insulation rubber of the cable is damaged and the cable core is exposed” includes key information, for example, “cable,” “insulation rubber,” and “damaged,” which is related to electrocution, so the category of this text is defined as electrocution.

The marked text is divided into three parts: (1) training set, which is used to extract text features and optimize text data samples; the training set contains 20,540 texts; (2) validating set, which is used to adjust the parameter values and evaluate the computing ability of the model; the validating set contains 4,108 texts; and (3) testing set, which is used to evaluate the generalization ability of the model; the testing set contains 4,108 texts. Table 1 shows the division of the texts.
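The split sizes quoted above can be checked with simple arithmetic: 20,540 + 4,108 + 4,108 = 28,756, which is a 5:1:1 ratio. The sketch below reproduces a split of that shape by shuffling indices. The function name `split_indices`, the fixed seed, and the plain random shuffle are assumptions for illustration; the paper does not say how texts were assigned to the three sets.

```python
import random
from typing import List, Tuple


def split_indices(
    n: int, ratios: Tuple[int, int, int] = (5, 1, 1), seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle range(n) and cut it into training/validating/testing parts."""
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded so the split is repeatable
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]      # remainder goes to the testing set
    return train, val, test


train, val, test = split_indices(28_756)
print(len(train), len(val), len(test))
```

With 28,756 texts and a 5:1:1 ratio this yields parts of 20,540, 4,108, and 4,108 indices, matching the reported set sizes.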
Category | Example text | Training set | Validating set | Testing set
Mechanical damage | Woodworking chainsaws did not meet the requirements for tool and equipment. | 1,150 | 230 | 230
Lifting injury | The slag cleaning workers did not take avoidance measures when the excavator was operated. | 910 | 182 | 182
Violation behavior | Workers did not wear safety helmets correctly. | 910 | 182 | 182
Fire | The place where the welding machine is put was not equipped with a fire extinguisher. | 1,615 | 323 | 323
Explosion | Oxygen and acetylene cylinder were placed in the open air, which may cause explosion hazards. | 1,305 | 261 | 261
Drowning | There were no warnings in the water-collecting well, and there were safety hazards. | 870 | 174 | 174
Total | | 20,540 | 4,108 | 4,108
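As a reading aid for the graph construction and encoding steps (Eqs. 3, 6, and 7), the following pure-Python sketch computes a cosine similarity, forms $\tilde{A} = A + S$, and applies one propagation step $\sigma(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X W)$. It is a toy illustration under stated assumptions, not the authors' PyTorch implementation: the 4-word matrices reuse the example values visible in Fig. 5 with the elided columns dropped, the degree of a node is taken as the row sum of $\tilde{A}$, and the feature matrix, weight matrix, ReLU activation, and all function names are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Eq. (3): similarity of two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matmul(p: Matrix, q: Matrix) -> Matrix:
    """Plain matrix product of two nested lists."""
    return [
        [sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
        for i in range(len(p))
    ]


def gcn_layer(a_tilde: Matrix, x: Matrix, w: Matrix) -> Matrix:
    """Eq. (6) with ReLU as the activation function (assumed)."""
    n = len(a_tilde)
    # Node degree = row sum of the fused adjacency matrix (assumed definition).
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    a_hat = [
        [d_inv_sqrt[i] * a_tilde[i][j] * d_inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]
    h = matmul(matmul(a_hat, x), w)
    return [[max(0.0, value) for value in row] for row in h]


# Syntactic adjacency matrix A and semantic matrix S for four words
# (example values from Fig. 5, elided columns dropped).
A = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
S = [[1, 0.8, 0.7, 0.5], [0.8, 1, 0.6, 0.7], [0.7, 0.6, 1, 0.6], [0.5, 0.7, 0.6, 1]]

# Eq. (7): fuse syntactic and semantic relations.
A_tilde = [[a + s for a, s in zip(row_a, row_s)] for row_a, row_s in zip(A, S)]

X0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # toy word vectors (assumed)
W0 = [[1.0, -1.0], [0.5, 1.0]]                          # toy weights (assumed)
X1 = gcn_layer(A_tilde, X0, W0)

print(A_tilde[3])
print(cosine(X0[0], X0[2]))
print(X1)
```

In the paper's setting the word vectors have 768 dimensions and the GCN layer has 128 neurons, so `X0` would be an n × 768 matrix and `W0` a 768 × 128 matrix.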
Python 3.6 was used as the programming language for developing a computing environment, with PyTorch being imported for handling the GCN and BiLSTM models. The model was designed with the following parameters: number of LSTM units = 128, embedding dimension = 768, hidden layer size = 128, number of iterations = 100, dropout probability = 0.5, learning rate = 0.001, and neurons in the GCN layer = 128.

Model Performance Evaluation

For construction site safety hazard texts, BERT is used to calculate word vectors and obtain similarities between words. An adjacency matrix is obtained by combining similarity and syntactic structure correlations. Taking the graph constructed by the correlation matrix as the input, the GCN is used to carry out text feature encoding, and then BiLSTM is used to complete text feature decoding and output the safety hazard classification result. Precision, recall, and F1 score (F1) are widely used as performance indicators to evaluate a model’s performance (Zhong et al. 2020a; Tian et al. 2021). The results are shown in Table 2.

The results in Table 2 show that the average precision rate, recall rate, and F1 value of the GSHDLM proposed in this study are 87.12%, 86.59%, and 86.46%, respectively, so the model can accurately identify potential safety hazards on construction sites. The method performs well on categories such as vehicle damage, electrocution, incivilization construction, mechanical damage, lifting injury, fire, and explosion, for which the F1 value exceeds 85%. The accuracy of the model was verified by comparison with the TextCNN and BiLSTM methods. As seen in Table 2, the average F1 value of the GSHDLM was 86.46%, and the accuracy rate was 86.56%; the average F1 values of the TextCNN and BiLSTM methods were 80.42% and 79.56%, respectively. Compared with the TextCNN method, the precision rate, recall rate, and F1 value of the model in this study were 5.60%, 5.92%, and 6.04% higher, respectively. Compared with the BiLSTM method, the precision rate, recall rate, and F1 value of the model in this study were 6.21%, 6.53%, and 6.90% higher, respectively. Except for electrocution, the F1 value of the GSHDLM was higher than that of the TextCNN and BiLSTM methods in all categories, demonstrating that the GSHDLM has higher accuracy in safety hazard recognition.
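Precision, recall, and F1 as used in this evaluation can be computed per category and then averaged. The sketch below assumes macro averaging (an unweighted mean over categories), which the text does not state explicitly; the function name `macro_prf` and the toy labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def macro_prf(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall, and F1 over the labels in y_true."""
    labels = sorted(set(y_true))
    precisions: List[float] = []
    recalls: List[float] = []
    f1s: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with three of the twelve hazard categories.
y_true = ["fire", "fire", "drowning", "explosion"]
y_pred = ["fire", "explosion", "drowning", "explosion"]
precision, recall, f1 = macro_prf(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

In the toy example one "fire" text is misclassified as "explosion", which lowers the recall of "fire" and the precision of "explosion" to 0.5 each.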
The method proposed in this study includes four parts: syntactic structure, BERT-based semantic information, a GCN, and BiLSTM. To understand the GSHDLM in depth, an ablation study was conducted on the proposed model by removing some components or features from the model to verify the importance of each part. Table 3 provides the results.

In Table 3, we observe that removing BiLSTM from the GSHDLM model led to a poor F1 score of 0.8201, the accuracy of text classification dropped by 4.22%, and the precision, recall, and F1 values dropped by 4.03%, 4.17%, and 4.45%, respectively, which confirmed the significance of BiLSTM for safety hazard classification. It has been proven that BiLSTM can capture contextualized information and strengthen text features. Without BERT-based semantic information, the accuracy rate of the GSHDLM decreased by 2.50%. This confirmed the importance of the semantic information extracted by BERT in safety hazard analysis. When the GCN was removed from the GSHDLM, the F1 score and accuracy rate dropped to 0.8040 and 0.8089, respectively. We speculate that the GCN can accurately and fully capture the syntactic structure and semantic information and can avoid the loss of information in the feature extraction process. Removing the syntactic structure degraded our model’s accuracy rate by nearly 5% because after dropping syntactic structure the GSHDLM could not apply all the syntactic information that was necessary to the text classification task. This showed the importance of syntactic structure in safety hazard text analysis.

Discussion

Efficient and accurate safety hazard recognition is an important approach to improving the efficiency of safety management for construction site safety hazard text. Existing text classification methods use text semantics to deeply extract potential relationships. However, syntactic structure and semantic information have the same importance in the text analysis process, which can affect the text classification effect. There are many technical terms in a large-scale project safety hazard text, the same technical term may have different utilities in the same safety hazard text, and it is difficult to use semantic analysis to identify differences in syntactic structures. Therefore, it is necessary to analyze the syntactic structure and obtain the syntactic information of a text. Although existing studies demonstrated the importance of syntactic structure in text classification, it is rarely applied in the text analysis of construction site safety hazards. This study proposes a safety hazard text classification model based on syntactic structure and semantic information that aims to improve the safety hazard classification efficiency. Taking a hydraulic engineering construction site safety hazard as an example, the superiority of our model over existing text classification methods was confirmed (Fig. 6).

Fig. 6 shows the classification effect of different methods. Note that the model proposed in this study had the best effect on the classification of hydraulic engineering text, and the accuracy rate was 86.56%. Compared with deep learning models, such as Region-CNN (RCNN), TextCNN, BiLSTM, bidirectional gated recurrent unit (BiGRU), and BERT, the accuracy of our model is 5.13%, 5.89%, 6.50%, 7.06%, and 7.88% higher, respectively. The aforementioned methods only consider the semantic relationship and lack syntactic structure analysis, which leads to the drop. FastText has the lowest accuracy, with an accuracy rate of 72.36%. FastText is a shallow network, and it struggles to carry out the deep mining of safety hazard features. This indicates that a
[Fig. 6: Accuracy of text classification methods (y-axis: accuracy; x-axis: text classification methods). Our model 86.56%; RCNN 81.43%; TextCNN 80.67%; BiLSTM 80.06%; BiGRU 79.50%; BERT 78.68%; FastText 72.36%.]
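The accuracy gaps discussed around Fig. 6 follow directly from the plotted values. The short sketch below recomputes them; the dictionary simply transcribes the bar labels of Fig. 6, and the variable names are illustrative.

```python
# Accuracy (%) of each method as labeled in Fig. 6.
accuracy = {
    "Our model": 86.56,
    "RCNN": 81.43,
    "TextCNN": 80.67,
    "BiLSTM": 80.06,
    "BiGRU": 79.50,
    "BERT": 78.68,
    "FastText": 72.36,
}

ours = accuracy["Our model"]

# Gap of each baseline to the proposed model, in percentage points.
gaps = {
    name: round(ours - value, 2)
    for name, value in accuracy.items()
    if name != "Our model"
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} percentage points below our model")
```

The gaps for RCNN, TextCNN, BiLSTM, BiGRU, and BERT come out as 5.13, 5.89, 6.50, 7.06, and 7.88 percentage points, the five values quoted in the discussion of Fig. 6.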
words. A graph structure fusing syntactic and semantic informa- method is helpful to improve the efficiency of hazard management.
tion is used to quantify safety hazard texts, which enriches the text The safety hazard information of large-scale projects is highly frag-
quantification method and improves the accuracy and comprehen- mented, which increases the difficulty of text classification. Large-
siveness of text feature extraction. An ablation study indicated that scale project construction texts record involve a large number of
when only semantic information or syntactic structure was consid- technical terms, and the accuracy of technical term descriptions
ered, the text classification accuracy was 84.06% and 80.89%, will directly affect text semantic analysis. Therefore, an intelligent
respectively, lower than our model considering syntactic structure classification method for large-scale projects combining syntactic
and semantic information. This illustrates the importance of syn- structure and semantic information was built to automatically rec-
tactic structure and semantic information in safety hazard text ognize construction site safety hazards.
classification. 1. BERT is used to calculate a word vector and obtain the similarity
Second, a GCN model is introduced into the safety hazard text between words, and a similarity-based adjacency matrix is built
deep analysis method to process the graph structure and carry out to represent semantic information. The syntactic structure ex-
the encoding of syntactic structure and semantic information. tracted by Hanlp is used to calculate the syntactic relationship
Furthermore, a BiLSTM-based safety hazard text decoding mecha- between words and form an adjacency matrix. A graph structure
nism is proposed to mine text features and represent the long- fusing syntactic and semantic information is constructed to
distance dependency of text features, which can show the complex
quantify text content. Based on the adjacency matrix, words are
relationship between text features and ensure the accuracy of syn-
taken as nodes, and semantic and syntactic information is taken
tactic and semantic analysis. This method enriches the construction
as edges to construct the graph structure and quantify text con-
text intelligent analysis theory system, which has important guid-
tent. The graph data are regarded as the basic data of text mining
ing significance for improving the accuracy of text classification.
to carry out text feature mining from the syntactic structure and
By comparing the safety hazard classification effects of BiLSTM
semantic information levels.
and a GCN, the classification accuracy of a single method is
2. We proposed a graph structure–based hybrid deep learning
80.89% and 82.34%, respectively, which are lower than our model.
method to determine complex syntactic and semantic relation-
The results prove that the GCN and BiLSTM-based safety hazard
encoding–decoding mechanism can effectively improve the text ships and identify large-scale project construction site safety
classification accuracy. hazards. A GCN model was used to encode syntactic and se-
Third, this study provides an intelligent and accurate safety haz- mantic features, and the encoded features were decoded using
ard recognition method that can quickly determine the category of safety hazards on construction sites.
Furthermore, it is unnecessary to preprocess uploaded safety hazard records; the records can be input
directly into the model to obtain the corresponding safety hazard categories, which improves the
application efficiency of the model and ensures the timeliness of safety hazard analysis. The output
safety hazard categories can be used to cluster texts, analyze the occurrence patterns of safety hazards,
and formulate safety hazard management measures. This helps manage safety hazard texts systematically.

Limitation

The method proposed in this study has high accuracy and reliability and can efficiently identify
construction site safety hazards. However, there is still room for further refinement. Construction site
safety hazard texts contain a lot of information, and a safety hazard text may contain one or more safety
hazard categories. This study defines each safety hazard text as corresponding to a single safety hazard
category and does not consider the multilabel safety hazard text problem, which affects the
comprehensiveness of text information mining. Therefore, future work should build a multilabel safety
hazard text classification model to comprehensively mine safety hazard information and improve the
accuracy of safety hazard recognition. Meanwhile, current research mainly focuses on safety

BiLSTM. Twelve safety hazard categories were established from hydraulic engineering construction site
texts and relevant standards. Safety hazard texts were used to train the intelligent classification model,
and the recognition accuracy was 86.56%. To assess the model's performance in safety hazard text
processing, it was compared with other deep learning models, such as CNN, RNN, and RCNN, and the
model proposed in this study was found to be superior to these text classification models. Meanwhile,
an ablation study was executed to show the effect of syntactic structure, BERT-based semantic
information, GCN, and BiLSTM in the model and to verify the accuracy and reliability of the safety
hazard intelligent classification model.
3. The GSHDLM adopts a new deep learning method and a graph structure fusing syntactic structure and
semantic information to carry out intelligent classification of large-scale project construction site
safety hazards within a short time, so that text features are not limited to text semantics but also
account for the impact of syntactic structure. This helps improve the classification efficiency of safety
hazards. The research effort enriches the theoretical system of large-scale project construction text
analysis, provides key information for construction site safety hazard management and decision-making,
and is an important prerequisite for intelligent construction safety management.
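
As a rough illustration of how a graph can fuse syntactic structure with semantic information, the sketch below applies one graph convolution step over a toy dependency graph. It is only a minimal sketch under stated assumptions, not the GSHDLM implementation: the propagation rule is the commonly used symmetric-normalized form ReLU(D^{-1/2}(A + I)D^{-1/2} H W), the dependency edges are written by hand rather than produced by a parser, and the two-dimensional node features are stand-ins for BERT token vectors. All function and variable names are hypothetical.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected dependency graph."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loop so each token keeps its own features
    for head, dependent in edges:
        a[head][dependent] = 1.0
        a[dependent][head] = 1.0
    degree = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def matmul(x, y):
    """Plain dense matrix product for lists of lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def gcn_layer(a_hat, features, weight):
    """One graph convolution step: ReLU(a_hat @ features @ weight)."""
    hidden = matmul(matmul(a_hat, features), weight)
    return [[max(0.0, value) for value in row] for row in hidden]


# Toy hazard text (assumption): "scaffold lacks guardrail".
tokens = ["scaffold", "lacks", "guardrail"]
# Hand-written (head, dependent) pairs; a real pipeline would use a parser.
dependency_edges = [(1, 0), (1, 2)]
# Stand-ins for BERT token embeddings (assumption: 2-dimensional).
node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Identity weight so the effect of neighbor mixing is easy to read.
weight = [[1.0, 0.0], [0.0, 1.0]]

a_hat = normalized_adjacency(len(tokens), dependency_edges)
updated = gcn_layer(a_hat, node_features, weight)

for token, vector in zip(tokens, updated):
    print(token, [round(v, 3) for v in vector])
```

After this step each token vector is a weighted mix of its own features and those of its syntactic neighbors, which is the sense in which text features are no longer limited to semantics alone. In a full model, the updated vectors would be passed to a sequence decoder such as a BiLSTM and then to a classifier over the hazard categories.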
/(ASCE)CO.1943-7862.0002165.
Hu, G., G. Lu, and Y. Zhao. 2021. “FSS-GCN: A graph convolutional
References networks with fusion of semantic and structure for emotion cause
analysis.” Knowl.-Based Syst. 212 (Jan): 106584. https://doi.org/10
Alam, K. M., N. Siddique, and H. Adeli. 2020. “Dynamic ensemble learn- .1016/j.knosys.2020.106584.
ing algorithm for neural networks.” Neural Comput. Appl. 32 (12): Kim, T., and S. Chi. 2019. “Accident case retrieval and analyses: Using
8675–8690. https://doi.org/10.1007/s00521-019-04359-7. natural language processing in the construction industry.” J. Constr.
Bai, Y., C. Li, Z. Lin, Y. Wu, Y. Miao, Y. Liu, and Y. Xu. 2021. “Efficient Eng. Manage. 145 (3): 04019004. https://doi.org/10.1061/(ASCE)CO
data loader for fast sampling-based GNN training on large graphs.” .1943-7862.0001625.
IEEE Trans. Parallel Distrib. Syst. 32 (10): 2541–2556. https://doi Ko, T., and H. D. Jeong. 2020. “Syntactic approach to extracting key
.org/10.1109/TPDS.2021.3065737. elements of work modification cause in change-order documents.”
Baker, H., M. R. Hallowell, and A. J.-P. Tixier. 2020. “Automatically learn- In Proc., Construction Research Congress (CRC) on Construction
ing construction injury precursors from text.” Autom. Constr. 118 (Oct): Research and Innovation to Transform Society, 134–142. Tucson: Con-
103145. https://doi.org/10.1016/j.autcon.2020.103145. struct Res Council.
Barnes, J., R. Kurtz, S. Oepen, L. Ovrelid, and E. Velldal. 2021. “Structured Lecun, Y., Y. Bengio, and G. Hinton. 2015. “Deep learning.” Nature
sentiment analysis as dependency graph parsing.” In Proc., Joint 521 (7553): 436–444. https://doi.org/10.1038/nature14539.
Conf. of 59th Annual Meeting of the Association-for-Computational- Li, R., L. Wang, Z. Jiang, D. Liu, M. Zhao, and X. Lu. 2021a. “Incremental
Linguistics (ACL)/11th Int. Joint Conf. on Natural Language Process- BERT with commonsense representations for multi-choice reading
ing (IJCNLP)/6th Workshop on Representation Learning for NLP comprehension.” Multimedia Tools Appl. 80 (21–23): 32311–32333.
(RepL4NLP), 3387–3402. Stroudsburg, PA: Association for Computa- https://doi.org/10.1007/s11042-021-11197-0.
tional Linguistics. Li, X., M. Cui, J. Li, R. Bai, Z. Lu, and U. Aickelin. 2021b. “A hybrid
Chen, S., J. Xi, Y. Chen, and J. Zhao. 2022. “Association mining of near medical text classification framework: Integrating attentive rule con-
misses in hydropower engineering construction based on convolutional struction and neural network.” Neurocomputing 443 (Jul): 345–355.
neural network text classification.” Comput. Intell. Neurosci. https://doi.org/10.1016/j.neucom.2021.02.069.
2022 (Jan): 1–16. https://doi.org/10.1155/2022/4851615.
Lin, J.-R., Z.-Z. Hu, J.-L. Li, and L.-M. Chen. 2020. “Understanding
Cheng, M.-Y., D. Kusoemo, and R. A. Gosno. 2020. “Text mining-based
on-site inspection of construction projects based on keyword extraction
construction site accident classification using hybrid supervised ma-
and topic modeling.” IEEE Access 8 (Nov): 198503–198517. https://doi
chine learning.” Autom. Constr. 118 (Oct): 103265. https://doi.org/10
.org/10.1109/ACCESS.2020.3035214.
.1016/j.autcon.2020.103265.
Lin, P., P. Wei, Q. Fan, and W. Chen. 2019. “CNN model for mining safety
Chi, N.-W., K.-Y. Lin, N. El-Gohary, and S.-H. Hsieh. 2016. “Evaluating
hazard data from a construction site.” [In Chinese.] J. Tsinghua Univ.
the strength of text classification categories for supporting construction
59 (8): 628–634. https://doi.org/10.16511/j.cnki.qhdxxb.2019.26.008.
field inspection.” Autom. Constr. 64 (Apr): 78–88. https://doi.org/10
.1016/j.autcon.2016.01.001. Liu, J., Z. S. Y. Wong, H.-Y. So, and K. L. Tsui. 2021. “Evaluating resam-
pling methods and structured features to improve fall incident report
Chokor, A., H. Naganathan, W. K. Chong, and M. El Asmar. 2016.
“Analyzing Arizona OSHA injury reports using unsupervised machine identification by the severity level.” J. Am. Med. Inf. Assoc. 28 (8):
learning.” Procedia Eng. 145 (Jan): 1588–1593. https://doi.org/10.1016 1756–1764. https://doi.org/10.1093/jamia/ocab048.
/j.proeng.2016.04.200. Lu, J., J. Xuan, G. Zhang, and X. Luo. 2018. “Structural property-
Devlin, J., M. W. Chang, K. Lee, and K. Toutanova. 2018. “Bert: aware multilayer network embedding for latent factor analysis.” Pattern
Pre-training of deep bidirectional transformers for language understand- Recognit. 76 (Apr): 228–241. https://doi.org/10.1016/j.patcog.2017
ing.” Preprint, submitted October 11, 2018. https://arxiv.org/abs/1810 .11.004.
.04805. Park, C., J. Park, and S. Park. 2020. “AGCN: Attention-based graph con-
Ding, L. Y., and H. Li. 2013. “Information technologies in safety manage- volutional networks for drug-drug interaction extraction.” Expert Syst.
ment of large-scale infrastructure projects.” Autom. Constr. 34 (Sep): Appl. 159 (Nov): 113538. https://doi.org/10.1016/j.eswa.2020.113538.
1–2. https://doi.org/10.1016/j.autcon.2012.10.016. Qiu, Z., Q. Liu, X. Li, J. Zhang, and Y. Zhang. 2021. “Construction and
Fang, W., H. Luo, S. Xu, P. E. D. Love, Z. Lu, and C. Ye. 2020. “Automated analysis of a coal mine accident causation network based on text min-
text classification of near-misses from safety reports: An improved deep ing.” Process Saf. Environ. Prot. 153 (Sep): 320–328. https://doi.org/10
learning approach.” Adv. Eng. Inf. 44 (Apr): 101060. https://doi.org/10 .1016/j.psep.2021.07.032.
.1016/j.aei.2020.101060. Ren, Q., M. Li, H. Li, and Y. Shen. 2021. “A novel deep learning predic-
Feng, D., and H. Chen. 2021. “A small samples training framework for tion model for concrete dam displacements using interpretable mixed
deep Learning-based automatic information extraction: Case study of attention mechanism.” Adv. Eng. Inf. 50 (Oct): 101407. https://doi.org
construction accident news reports analysis.” Adv. Eng. Inf. 47 (Jan): /10.1016/j.aei.2021.101407.
101256. https://doi.org/10.1016/j.aei.2021.101256. Salama, D. M., and N. M. El-Gohary. 2016. “Semantic text classifica-
Gao, W., and H. Huang. 2021. “A gating context-aware text classification tion for supporting automated compliance checking in construction.”
model with BERT and graph convolutional networks.” J. Intell. Fuzzy J. Comput. Civ. Eng. 30 (1): 04014106. https://doi.org/10.1061
Syst. 40 (3): 4331–4343. https://doi.org/10.3233/JIFS-201051. /(ASCE)CP.1943-5487.0000301.
Goh, Y. M., and C. U. Ubeynarayana. 2017. “Construction accident nar- Tian, D., M. Li, J. Shi, Y. Shen, and S. Han. 2021. “On-site text classifi-
rative classification: An evaluation of text mining techniques.” Accid. cation and knowledge mining for large-scale projects construction by