Abstract: Knowledge management for construction accident cases can identify dangerous conditions and prevent accidents by controlling
risks on-site. However, because accident cases are recorded as unstructured text data, significant time and effort are required to retrieve and
analyze the knowledge a user wants. To overcome these limitations, this research proposes a knowledge management system for construction
accident cases using natural language processing. For this purpose, two models were developed that can retrieve appropriate cases according
to user intentions and automatically analyze tacit knowledge from construction accident cases. In the retrieval model, the query is expanded
using a construction accident case thesaurus. Ranking is calculated using Okapi BM25 and weighting according to the thesaurus. In the
analysis model, knowledge is automatically extracted using rule-based and conditional random field (CRF) methods. The proposed system
can retrieve results that are 97% relevant to the accident cases the user intended and can automatically analyze knowledge with accuracies of
93.75% and 84.13% for the rule-based and CRF models, respectively. The results demonstrate the potential of knowledge discovery from
accident reports for more-effective safety management. DOI: 10.1061/(ASCE)CO.1943-7862.0001625. © 2019 American Society of Civil
Engineers.
Author keywords: Construction accident case; Tacit knowledge; Knowledge management; Natural language processing; Information
retrieval; Information extraction.
knowledge from construction accident cases by linking IR and IE with the application of machine learning techniques.

accident contents for accident analysis (KSCE 2014).
Research Methodology

Research Framework

The research conducted here attempted to overcome the limitations of retrieving and analyzing unstructured data using NLP. The research developed a framework which retrieves a user's intended information and analyzes relevant tacit knowledge automatically (Fig. 1). This system consists of (1) the semantic retrieval model; and (2) the tacit knowledge extraction model. For model development, the authors collected accident case data and created a tokenizer that can accurately recognize and process construction-related textual information from the data.

Data Collection and Preprocessing

Data Collection
A total of 4,263 accident case reports were collected from the following government organizations' accident databases in Korea:

Preprocessing (Tokenization)
A tokenizer is required to accurately recognize and process textual information. Tokenization is the process of breaking a document into pieces, called tokens (Hotho et al. 2005). A construction dictionary was constructed to recognize construction-related terms as one token. A total of 15,564 construction terms were collected from the KISTEC and the National Institute of the Korean Language (NIKL) (KISTEC 2014b; NIKL 2013a, b, 2016). For instance, the sentence "John lost his balance and fell down" would be converted into "lost, balance, fall, down" after tokenizing and eliminating less-informative tokens such as grammatical components and the proper noun.

Semantic Retrieval Model
Fig. 1 shows the detailed framework of the semantic retrieval model, comprising query expansion and ranking [Figs. 1(c and d)]. In the query expansion stage, the query is expanded through a prebuilt thesaurus. In the ranking stage, similarities to query documents are calculated and compared based on the BM25 method and thesaurus weight.
Fig. 1. System framework: (a) semantic retrieval model; (b) tacit knowledge extraction model; (c) query expansion; (d) ranking; (e) defining tacit
knowledge of accident cases; (f) rule-based labeling; and (g) machine learning model development.
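The preprocessing (tokenization) step described above can be sketched as follows. This is a minimal illustrative sketch: the toy `CONSTRUCTION_TERMS` and `STOP_TOKENS` sets and the English example stand in for the paper's 15,564-term Korean dictionary and morphological analysis, and no lemmatization is performed (the paper's pipeline reduces "fell" to "fall").

```python
import re

# Toy stand-ins; the paper's dictionary holds 15,564 construction terms
# collected from KISTEC and NIKL.
CONSTRUCTION_TERMS = {"tower crane", "back hoe", "temporary facility"}
# Tokens treated as less informative (grammatical words, proper nouns).
STOP_TOKENS = {"john", "his", "and", "the", "a", "it", "by"}

def tokenize(sentence: str) -> list[str]:
    """Break a sentence into tokens, keeping dictionary terms as one token."""
    text = sentence.lower()
    # Protect multiword construction terms so they survive as single tokens.
    for term in CONSTRUCTION_TERMS:
        text = text.replace(term, term.replace(" ", "_"))
    tokens = []
    for tok in re.findall(r"[a-z_/]+", text):
        if tok not in STOP_TOKENS:
            tokens.append(tok.replace("_", " "))
    return tokens

print(tokenize("John lost his balance and fell down"))
# → ['lost', 'balance', 'fell', 'down']
```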
as tower crane (T/C), back hoe (B/H), and concrete (conc). Therefore, in this research, the thesaurus was constructed using the following approaches to expand the query (Fig. 2): (1) a construction thesaurus of commonly used terms in the construction industry; and (2) the Word2vec thesaurus of terms used in construction accident cases.

For the first approach, the construction thesaurus was developed through a dictionary of construction-related words provided by NIKL (2016). The related words are classified as synonym, abbreviation, hypernym, hyponymy, and reference (NIKL 1999).

For the second approach, accident reports were preprocessed (especially tokenized) to build a list of accident-related words. Word2vec was then applied to automatically analyze various expressions of terms used in accident cases. Word2vec provides the means to efficiently estimate the meaning of words in a vector space (Mikolov et al. 2013). Word2vec assumes that words used in the same context have similar meanings (Wolf et al. 2014). Word2vec expresses each word as a vector in a space of several hundred dimensions. It learns text documents and 5–10 words neighboring the target word using artificial neural networks. Because words of related meaning are likely to appear in similar positions in a document, the probabilities of two words gradually become closer to each other as they appear repeatedly in the learning process. For example, Word2vec analyzes different expressions such as "while lifting it by a tower crane," "during lifting steel by T/C," "while lifting it by a tower crane hook," and "during lifting it by a crane," and places the words tower crane, T/C, crane, and hook in a similar vector space (Fig. 2). Word2vec finds semantic relationships based on the expressions instead of considering the meanings of words. Thus, a Word2vec thesaurus was constructed by clustering terms using the similarities of term usage in the

score(D, Q) = Σ_{i=1}^{n} log[(N − n(q_i) + 0.5) / (n(q_i) + 0.5)] · [f(q_i, D) · (k_1 + 1)] / [f(q_i, D) + k_1 · (1 − b + b · |D| / avgdl)]    (1)

where Q means a query; q_1, …, q_n represent words contained in query Q; D is the document composed of the words in Q; N = total number of documents; n(q_i) = number of documents containing q_i; f(q_i, D) = frequency of occurrence of word q_i in document D; |D| = total number of words in D; avgdl = average number of words in the set of all documents to be compared; and k_1 and b are free parameters.

Six thesauri were constructed to reflect different semantic relationships among words through query expansion. Then, as a way to control the process effectively, each thesaurus was classified according to the semantic relationship, and different weights were assigned to each (Gong et al. 2005). Finally, the corresponding weight of the classified thesaurus was multiplied by the Okapi BM25 score to calculate the weighted score. In this case, the parameter values were set to their default values (k_1 = 2 and b = 0.75). The weighted score was calculated using

score(D, Q) = Σ_{i=1}^{n} log[(N − n(q_i) + 0.5) / (n(q_i) + 0.5)] · [f(q_i, D) · (k_1 + 1)] / [f(q_i, D) + k_1 · (1 − b + b · |D| / avgdl)] · thesaurus weight    (2)
Fig. 2. Construction and relationship of thesaurus: (a) construction thesaurus; and (b) Word2vec thesaurus.
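Eqs. (1) and (2) can be transcribed directly. The function below is an illustrative Python sketch of the thesaurus-weighted Okapi BM25 score, not the Elasticsearch scoring actually used in the system:

```python
import math

def bm25_score(query_tokens, doc_tokens, docs, k1=2.0, b=0.75,
               thesaurus_weight=1.0):
    """Okapi BM25 [Eq. (1)] scaled by a thesaurus weight [Eq. (2)]."""
    N = len(docs)                                    # total documents
    avgdl = sum(len(d) for d in docs) / N            # average doc length
    score = 0.0
    for q in query_tokens:
        n_q = sum(1 for d in docs if q in d)         # docs containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))
        f = doc_tokens.count(q)                      # f(q, D)
        denom = f + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score * thesaurus_weight

# Rank toy documents for the query term "scaffold"; an expanded term
# would instead carry its thesaurus class weight (e.g., 3).
docs = [["tower crane", "fall"], ["scaffold", "collapse"], ["crane", "fall"]]
print(bm25_score(["scaffold"], docs[1], docs))   # positive score
print(bm25_score(["scaffold"], docs[0], docs))   # 0.0, term absent
```

Multiplying the per-term score by the class weight of the thesaurus that produced the expanded term reproduces the weighted ranking of Eq. (2).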
sauri because it extracts all the semantic relationships used in the same context. Thus, the Word2vec thesaurus was given a weight of 3, which is the average of the semantic relatedness weights; these weights are summarized in Table 1.

Tacit Knowledge Extraction Model

Fig. 1 shows the detailed framework of the tacit knowledge extraction model. The first step is to define the tacit knowledge of accident cases to be extracted [Fig. 1(e)]. The second step is to extract the tacit knowledge. Because the accident cases used in this study were unlabeled data, rule-based labeling for training the CRF model was conducted [Fig. 1(f)]. A machine learning model was then developed to automatically extract tacit knowledge [Fig. 1(g)]. This machine learning process was applied because the rule-based method requires manual analysis and thus it is difficult to cope with exceptions during analysis; the number of rules tends to increase exponentially as the number of data increases. Meanwhile, the machine learning model trains itself and becomes more powerful with increasing data.

Defining Tacit Knowledge of Accident Cases
Four categories of tacit knowledge were defined in the accident cases to determine what caused the accident, where it occurred, when it occurred, and the result: hazard object (HO), hazard position (HP), work process (WP), and accident result (AR) (KSCE 2014). HO is defined as a direct hazard that can potentially cause a disaster, such as form and scaffolding; HP is a place where there is a high risk of accidents occurring, such as high place and temporary facility; WP explains during which work activity an accident occurred, such as excavation and transport; and AR is defined as the type of damage caused by accidents (physical damage/personal injury), such as collapse and fall.

Rule-Based
Phase 1: Identifying Semantic Role of Tacit Knowledge. In Korean IE, semantic analysis is the most popular and commonly

role that indicates where events occur or where things are located. It is often used with relatums such as {to/at/on/from}. WP corresponds to purpose. The purpose is a semantic role that represents the purpose of an action. It is often used with relatums such as {work/during}.

Phase 2: Creating Rules Based on Semantic Role. In Phase 1, the semantic role of tacit knowledge and its pattern are confirmed as critical rules. However, most construction accident cases are not grammatically perfect and contain various expressions. Thus, there are limitations in extracting the necessary information using only critical rules, so additional rules have been created through manual data analysis to compensate for these limitations. As a result, three types of rules were developed in this research. The absolute rule is the critical rule. A rule that should be added beyond the absolute rule is called a plus rule, and a rule to be excluded is called a minus rule. For example, there is a critical rule for HO, i.e., {A is/are}. However, even if it is included as the critical rule, it is not recognized as a HO if A is used for a person's name. Other rules are shown in Table 2 and Fig. 3 (original rule).

Machine Learning
The conditional random field was used, which is a widely used method for labeling data in IE. CRF is used for optimal classification, using information in the context of a sentence. This model explains conditional probabilities to separate and classify text data (Lafferty et al. 2001) using CRF models P(y|x), and includes complex dependencies between dependent variables. Assuming that x is a random variable for the input data and y is a random variable of the label corresponding to the input data, the parameter Λ = (λ_1, λ_2, …, μ_1, μ_2, …) is defined by a conditional probability (Wallach 2004)

p_Λ(y|x) = (1 / Z(x)) · exp( Σ_j Σ_{i=1}^{n} λ_j t_j(y_{i−1}, y_i, x, i) + Σ_k Σ_{i=1}^{n} μ_k s_k(y_i, x, i) )    (3)

y = arg max_y p_Λ(y|x)    (4)

where Z(x) is a normalization constant that causes the sum of the label probabilities for the input data to equal 1; t_j(y_{i−1}, y_i, x, i) is the transition feature function; and s_k(y_i, x, i) is the state feature function. CRF sets context information to a feature and then learns it. In this case, λ_j and μ_k are weights for each feature function and can be obtained from the labeled training data. The parameter Λ, which controls the degree of overfitting, is calculated using maximum likelihood estimation (MLE). This research used the widely used Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm because of its high processing speed (Wallach 2004). The parameter Λ is calculated from the training data, and the most probable label y for the given input x is obtained by Eq. (4) and the Viterbi algorithm; tacit knowledge is the result of the tacit knowledge extraction model. The statistical analysis results of extracted tacit knowledge can be visualized as word clouds, pie charts, and graphs.

(Peng et al. 2004). In this study, the features considered in the CRF-based machine learning process were simple consecutive words with labels such as HO, HP, WP, and AR, which are contextual characteristics that determine tacit knowledge.

System Prototype Development

This research used the Python programming language as the basis for implementing the semantic retrieval and knowledge extraction methodologies, Elasticsearch and pyCRFsuite. Elasticsearch (2018a) is an open-source search engine with analysis support. It was used to implement the semantic retrieval model because it is able to process large amounts of text data at high speed

For tacit knowledge extraction, the retrieval results become inputs (Fig. 5), and the tacit knowledge is labeled through the CRF model. As a result, T/C (hazard object), installation (work process), and fall (accident result) are extracted. The extracted knowledge is visualized in word clouds and graphs after statistical analysis.
Prototype Implementation Process
This section describes the process of the system prototype using the tower crane fall example. The entire process is illustrated in Fig. 5.

Semantic Retrieval Model
When the user inputs the query tower crane fall into the query expansion, the system processes the query through the tokenizer (Fig. 5). The result is tower crane and fall. Each token is expanded by means of six thesauri including related words; in other words,

Evaluation and Discussion

Semantic Retrieval Model
The evaluation of the semantic retrieval model used normalized discounted cumulative gain (NDCG), which evaluates the quality of the results in the IR field. Discounted cumulative gain (DCG) is an evaluation method that begins with the assumption that documents with higher relevance to the query are more useful, and it is better to

DCG_P = Σ_{i=1}^{|REL|} rel_i / log_2(i + 1)    (5)

IDCG_P = Σ_{i=1}^{|REL|} rel_i^{ideal} / log_2(i + 1)    (6)

NDCG_P = DCG_P / IDCG_P    (7)

where rel_i^{ideal} denotes the relevance scores sorted in descending order.
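The NDCG computation used for this evaluation can be sketched as follows (a direct transcription of the DCG/IDCG definitions, with the survey's 5-to-1 relevance points as toy input):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: rel_i discounted by log2 of its rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Survey relevance points (5 = very relevant ... 1 = barely relevant)
# in the order the retrieval model ranked the documents:
print(round(ndcg([5, 4, 5, 3, 2]), 3))   # → 0.989
```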
where rel_i = relevance of query results to the ith accident case retrieval results derived from this research. Each accident case receives a penalty on the relevance score if it is ranked lower. The ideal DCG_P for rank position P is IDCG_P [Eq. (6)]. In Eq. (5), |REL| is equivalent to the rank position P; however, the Pth rank in the list of related documents is used as the answer rank for model evaluation. In Eq. (7), NDCG_P is the normalized DCG_P with IDCG_P, and its maximum value is 1.

In this research, the standard rel_i for the degree of relevance between the accident case retrieval result and the query for the DCG evaluation was from 5 to 1 points (for an accident case with a very high level of relevance to the query to a case with a very low level of relevance, respectively).

The evaluation of rel_i was conducted through surveys, and targeted 16 experts who had been engaged in the construction industry for less than 5 years. The 16 experts were divided into 4 groups of 4. Four test queries were given to the experts: collapse and fall during bridge concrete placement; collapse and burying during tunnel boring; fall from scaffold; and landslide during a trench operation. These are the most common types of construction accidents that occurred in Korea in 2016. Each group was provided with 10 documents for 1 different query, and ranked them based on the relativity of the documents to the given query. Finally, the retrieval results were compared with the test set (the questionnaire ranks) and the differences were quantified based on NDCG.

An example of NDCG results is given in Table 3. The four respondents in Group 1 evaluated the relevance of documents to the query, and based on this, DCG and IDCG were calculated to eventually derive NDCG results of 99%, 98%, 97%, and 98%. In Group 2, the results were 98%, 98%, 97%, and 98%. The results of Group 3 were 97%, 93%, 97%, and 90%, and the results of Group 4 were 97%, 99%, 98%, and 99.7%. In the case of a construction accident, there is no ground truth: there is no correct answer involving relativity, so the individual NDCG results were slightly different for the same query. However, the average NDCG of the four groups was 98%, 98%, 94%, and 98%, respectively, which was almost the same value. Thus, it can be confirmed that the performance of the semantic retrieval model does not change significantly based on the query type.

Overall, the priorities of the semantic retrieval model were similar to the relevance priorities obtained through the surveys. Some inconsistencies arose because the respondents tended to consider both the frequency-related information and the semantic roles of the words, but Okapi BM25 relied only on the frequency. These limitations remain a major challenge, not only in this research but also in the IR field in general. As a result, an average of 97% for the NDCG values is considered to be an appropriate value for the retrieval model.

Tacit Knowledge Extraction Model

The tacit knowledge extraction model was evaluated by comparing the expert-labeled results with the results of the rule-based model and the CRF model to determine their consistency with the expert results. Because no labeled test set was available, it was necessary to collect labeled data from the experts through a survey. The survey was undertaken by the 16 experts who participated in the previous retrieval model evaluation, and the final data were collected by labeling 101 randomly selected accident cases. Table 5 compares the results of the two models: expert-labeled and rule-based-labeled test sets.

The rule-based model achieved a high accuracy of 93.75% (308/325) on average, which was mostly consistent with the results of the experts' labeling. The rule-based model was also used to construct training data for the CRF model with the intent of improving the reliability of the training data. However, the CRF model achieved a lower average accuracy, 84.13% (265/315). In particular, HO and HP yielded low accuracies of 67% (57/85) and 79% (34/43), respectively. Overall accuracy was also lower than that of the rule-based model.

Additionally, this study conducted rule-based labeling for training the CRF model, so verification was performed by comparing the test results with the results of the rule-based approach. The verification focused on evaluating whether the CRF model can learn not only rule-based critical rules but also additional exceptional rules by itself. The CRF model's precision, recall, and F1 measure were also calculated to evaluate the performance of the CRF model in detail (Powers 2011). The precision indicates whether the results
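Precision, recall, and F1 follow directly from the extraction counts; the counts below are illustrative, not the paper's reported figures:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard metrics from true-positive, false-positive, and
    false-negative extraction counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one knowledge category:
p, r, f = precision_recall_f1(tp=57, fp=10, fn=28)
print(round(p, 2), round(r, 2), round(f, 2))   # → 0.85 0.67 0.75
```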
Acknowledgments