Automation in Construction 12 (2003) 395 – 406

www.elsevier.com/locate/autcon

Automating hierarchical document classification for construction management information systems
Carlos H. Caldas, Lucio Soibelman *
Department of Civil and Environmental Engineering, University of Illinois at Urbana-Champaign,
Newmark CE Lab. MC 250, 205 North Mathews Avenue, Urbana, IL 61801, USA
Accepted 8 January 2003

Abstract

The widespread use of information technologies for construction is considerably increasing the number of electronic text
documents stored in construction management information systems. Consequently, automated methods for organizing and
improving the access to the information contained in these types of documents become essential to construction information
management. This paper describes a methodology developed to improve information organization and access in construction
management information systems based on automatic hierarchical classification of construction project documents according to
project components. A prototype system for document classification is presented, as well as the experiments conducted to verify
the feasibility of the proposed approach.
© 2003 Elsevier Science B.V. All rights reserved.

Keywords: Construction management; Classification systems; Information management; Information systems; Text/data mining

* Corresponding author. Tel.: +1-217-333-4759; fax: +1-217-333-9464.
E-mail addresses: caldas@uiuc.edu (C.H. Caldas), soibelma@uiuc.edu (L. Soibelman).

0926-5805/03/$ - see front matter © 2003 Elsevier Science B.V. All rights reserved.
doi:10.1016/S0926-5805(03)00004-9

1. Introduction

The use of communications and information technologies in the construction industry is creating new opportunities for collaboration, coordination, and information exchange among organizations that work on a construction project. Inter-organizational construction management information systems are increasingly being used for this purpose. They comprise a set of interrelated components that collect, retrieve, process, store, and distribute data to support planning, control, and decision-making among project organizations. In the distributed and dynamic construction environment, the ability to exchange and integrate data from different sources and in different formats becomes crucial to the development of the construction processes supported by these management information systems. Furthermore, the data collected in these systems provide a valuable source for data mining [11,28]. Discovered knowledge can be used to increase the performance of future activities and projects.

Given that a large percentage of the project documents is generated in text format, methods for organizing and improving access to the information contained in these types of documents become essential to construction information management. Construction information classification systems (CICSs) can be used to support this information management process. The classification structure in a construction information
classification system (CICS) defines concept hierarchies that can be used for document classification, providing a common framework for document organization and management among project organizations. These classification frameworks can be embedded in inter-organizational information systems, like project websites, project management software, and document management systems. Examples of CICSs include: the CSI MasterFormat [17], CSI UniFormat [33], CI/SfB, Uniclass, and the Overall Construction Classification System [20].

One limitation of the existing inter-organizational information systems is the reliance on manual classification methods conducted by human experts. With the growth in the use of information technologies by construction companies, the increasing availability of electronic documents, and the development of model-based systems, manual classification becomes impractical. One example of the limitations of manual classification is the time and effort that would be required to classify all documents created in a construction project (contracts, specifications, meeting minutes, change orders, field reports, and requests for information, among others), according to all components of a CICS.

Another limitation of the current systems is the consideration of documents as single units for the purpose of classification and retrieval. Many construction documents, including specifications and meeting minutes, should clearly be divided and then assigned to more than one item of a CICS. This limitation can be illustrated by the case in which a project manager wants to access information contained in meeting minutes regarding a specific CSI MasterFormat item in order to solve an issue. Using current technologies, the project manager would need to manually search and analyze each document individually in order to obtain the desired information.

A third problem that exists in available systems is the lack of support for differences in vocabularies and naming conventions. This problem can be illustrated by the case in which an architect gives a name for a particular object in a project model. Since there is usually no standard vocabulary among organizations that participate in a construction project, references to that particular object in project documents are often done using different names. Using current technologies, project managers would need to map the model object's name to the terms being used in the different construction documents.

The previously mentioned limitations and the push towards fully integrated and automated project processes justify the need for the development of automated classification methods for construction project documents that can explore the internal characteristics of these documents and adapt to different classification frameworks.

This paper presents a unique way to improve information organization and access in inter-organizational construction management systems based on methods for automated hierarchical classification of construction project documents according to CICS items. In order to accomplish this goal, a combination of techniques from the areas of information retrieval and text mining was explored. As a result, a methodology for automated hierarchical document classification was devised and implemented. A prototype of a construction document classification system was also developed to provide easy deployment and scalability to the classification process. The developed prototype automated all steps of the text classification process. Experiments were conducted to validate the results and demonstrate the applicability of the implemented techniques.

2. Construction management information systems

The escalating globalization and complexity of construction projects have increased the participation of companies from diverse locations in project teams [3]. In this environment, effective inter-organizational construction management information systems able to minimize time and distance constraints are necessary. Examples of such systems are described extensively in the literature [18,19,22,27,32,34,39]. In the distributed and dynamic construction environment, the ability to exchange and integrate information from different sources and in different data formats becomes crucial to the improvement of the construction processes supported by these systems. Simoff and Maher [26] argue that a key issue in managing construction information is the diversity of data types, including:

• structured data files, stored in database management systems or specific applications, such as data
warehousing, enterprise resource planning, cost estimating, scheduling, payroll, finance, and accounting;
• semi-structured data files, such as HyperText Markup Language (HTML), Extensible Markup Language (XML), or Standard Generalized Markup Language (SGML) files;
• unstructured text data files, such as contracts, specifications, catalogs, change orders, requests for information, field reports, and meeting minutes;
• unstructured graphic files stored in binary format, such as 2D and 3D drawings; and
• unstructured multimedia files, such as pictures, audio, and video files.

For instance, let us consider a typical construction situation where a construction manager wants to find all available information about one construction activity, say, placing concrete in a slab. He/she will probably find the drawings in computer-aided design (CAD) files, the cost estimates in files produced by cost estimation systems, the schedule in files generated by project management software, the specifications and contracts in text documents, the communications among project members in e-mail files, and price quotes in files collected from different websites. A major task is how to retrieve, classify, and integrate information in these different file formats, especially considering that the files can also be stored in different organizations, computers, or file systems.

Information integration methodologies have been investigated worldwide in order to improve information organization and access in inter-organizational construction management information systems. Teicholz [31] argues that project information should be integrated in three dimensions: "(1) horizontal integration of multiple disciplines that take part in a construction project; (2) vertical integration of multiple stages in the life cycle of a facility; and (3) longitudinal integration over time, which is also related with the capture of knowledge that allows improved performance or better decisions in the future."

Fisher and Kunz [8] argue that technical and managerial strategies have been used to improve information integration. On the technical side, there are four approaches to achieve integration [21,40]: "(i) communication between applications; (ii) knowledge-based interfaces linking multiple applications and multiple databases; (iii) integration through geometry; and (iv) integration through a shared project model holding all the information relating to a project according to a common infrastructure model."

The technical integration through a shared project model can be based on the creation of model-based systems using 3D/4D CAD [1] or on the use of distributed software architectures to facilitate the integration of decentralized project information [29,32]. The adoption of data standards can support these integration approaches. Examples of initiatives in this area are presented by Eastman [7], and include the ISO-STEP, the Industry Foundation Classes (IFC) created by the International Alliance for Interoperability [12], and the aecXML specification [2].

Currently, the majority of the architecture, engineering, construction, and facilities management (AEC/FM) information integration initiatives focus on structured data types. Nevertheless, Soibelman and Caldas [27] argue that a large percentage of the construction data is stored in semi-structured and unstructured files. Recent research work addressed some of the issues related to unstructured data integration. Fruchter [9] describes tools to capture, share, and reuse project information. Garrett et al. [10] explore the use of text analysis for building up classifications of regulation sections. Wood [38] describes an approach to extracting concepts from textual design documentation. Brüggemann et al. [4] proposed the use of arbitrarily structured metadata to mark up documents. Scherer and Reul [24] use text clustering techniques to group similar documents and retrieve project knowledge from heterogeneous AEC/FM documents. Yang et al. [35] and Kosovac et al. [16] proposed the use of controlled vocabularies (thesauri) to integrate heterogeneous data representations. Since a great percentage of AEC/FM information is exchanged using text data files, the management of the information contained in these types of documents becomes crucial to construction information management.

3. Construction information classification systems

Construction management information systems generate a significant quantity of data that needs to
be organized, stored, accessed, and used by all project organizations. The increase in the amount and types of information generated and the construction industry's subsequent reliance on it motivated the creation of classification standards that can comprehend the full scope of construction information. These standards enable the organization of project information and facilitate the communication between project organizations throughout the project's life cycle.

The information classification standards created by the AEC/FM industry are called construction information classification systems [13]. They can be defined as a standard representation of construction project information. According to Kang and Paulson [13,14], a construction information classification system provides a common method for improving organization and coordination of information in construction projects. Examples of CICSs include the CSI MasterFormat [17], the CSI UniFormat [33], the Overall Construction Classification System (OCCS) [20], and Uniclass [14]. For instance, in OCCS project facilities, constructed entities, spaces, elements, work results, products, process phases, process services, process participants, process aids, process information, and attributes are all defined in a standard manner. Therefore, CICSs provide a common framework for information organization and access in construction management information systems as well as knowledge dissemination, being an essential component in the integration of construction project information.

4. Automated hierarchical construction document classification

From the observations and problems presented in Sections 1 and 2, we can infer that information integration, organization, and access should be considered in construction management. Since a great percentage of the information exchanged among construction organizations is stored in text data files, the management of the information contained in these types of documents becomes essential. In order to improve the management of text-based information, an automated document classification method was devised and implemented. The method was designed according to the construction document classification process developed by the authors and described in Ref. [6]. The importance of this study is that automated document classification methods can be used to improve information organization and access in current information management systems as well as being a foundation for integration of construction documents in emerging model-based systems.

Experiments were conducted in order to evaluate the alternative methods that could be applied in each of the phases of the document classification process. The database selected for this evaluation was the Sweet's Product Marketplace [30]. This database stores data from over 10,700 manufacturers and 61,300 products for the construction industry. Construction products are classified using the hierarchical structure of CSI MasterFormat [17] in this database. The experiments were conducted using 3030 randomly selected documents from the Sweet's database. The goal was to verify the classification accuracy of the proposed automated document classification method, using the classification decisions already defined in the Sweet's database as a benchmark. The selected documents were originally classified in the database according to a subset of 121 CSI MasterFormat items. These items were distributed according to the CSI MasterFormat classification hierarchy and were composed of 16 items on level one, 52 items on level two, and 53 items on level three.

The activity diagram of the proposed document classification process is presented in Fig. 1. The definition of the classes and the selection of the training positive, training negative, testing positive, and testing negative documents that will be used to create the classification models and verify their accuracy are the initial activities that should be conducted.

The documents used to create the classification models as well as the new documents to be classified are usually stored in different data formats including: word processor, spreadsheet, e-mail, HTML, XML, PostScript (PS), and Portable Document Format (PDF) files. In order to apply the classification algorithms, these files need to be converted to text file format. This is usually done using file converter systems in order to create a text version of each document, while keeping the original documents in their native formats and locations. The text versions can then be used in the remaining activities of the classification process.
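The selection of training positive, training negative, testing positive, and testing negative documents for one class, mentioned above as an initial activity, could be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `split_for_class`, the input format (a mapping from document id to the set of class codes assigned to it), and the `test_fraction` value are all assumptions introduced here.

```python
import random


def split_for_class(doc_labels, class_code, test_fraction=0.3, seed=0):
    """Partition document ids into the four groups used for one class.

    doc_labels: dict mapping a document id to the set of class codes it was
        assigned (hypothetical input format).
    class_code: the class (CICS item) for which a model will be built.
    test_fraction: share of documents held out for testing (assumed value).
    """
    rng = random.Random(seed)
    # A document is positive for the class if it carries the class code.
    positives = sorted(d for d, codes in doc_labels.items() if class_code in codes)
    negatives = sorted(d for d, codes in doc_labels.items() if class_code not in codes)

    groups = {}
    for name, docs in (("positive", positives), ("negative", negatives)):
        rng.shuffle(docs)
        n_test = int(round(len(docs) * test_fraction))
        groups["testing_" + name] = docs[:n_test]
        groups["training_" + name] = docs[n_test:]
    return groups


# Toy usage with made-up document ids and two level-one codes.
labels = {"doc%d" % i: ({"03000"} if i < 4 else {"04000"}) for i in range(10)}
groups = split_for_class(labels, "03000", test_fraction=0.5, seed=1)
for group_name in sorted(groups):
    print(group_name, groups[group_name])
```

Keeping the positive and negative documents in separate pools before splitting means both the training and the testing groups contain examples of each kind, even for classes with few positive documents.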
Fig. 1. UML activity diagram of CDCS.


The next two steps require decisions regarding removal of stopwords and stemming. Stopwords are frequent words that do not carry information relevant to text classification, like conjunctions, prepositions, and pronouns. Stemming is the process of prefix and/or suffix removal to generate word stems. This is done to group words that have the same conceptual meaning. Our experiments revealed that the removal of stopwords, as well as the use of stemming algorithms, improves classification accuracy in most of the cases. The index terms were obtained in one of the steps of the document classification process. Therefore, predefined index terms were not used in the process.

According to Sebastiani [25], a major characteristic, or difficulty, of text classification problems is the high dimensionality of the feature space. Many classification algorithms cannot deal with such a large feature set, since processing is extremely costly in computational terms. Hence, in many cases, there is a need to reduce the original feature set, which is commonly known as dimensionality reduction (DR) or attribute selection in the pattern recognition literature.

Various DR methods have been tested in this research. These methods are grounded on concepts from the areas of information theory and linear algebra [36]. In our experiments, the information gain method gave satisfactory results. In the information gain method, the expected reduction in entropy caused by selecting a term that will be used to classify the documents is calculated for all terms that occur in the documents belonging to each class. Terms with highest information gain are selected. The information gain is calculated using the following formula:

$$\mathrm{Gain}(T,C) = \mathrm{Entropy}(T,C) - \frac{N_{hasT}}{N_{total}}\,\mathrm{Entropy}(T,C_{hasT}) - \frac{N_{noT}}{N_{total}}\,\mathrm{Entropy}(T,C_{noT}),$$

where:

$\mathrm{Gain}(T,C)$ = information gain for term T in class C;

$$\mathrm{Entropy}(T,C) = -\frac{N_{pos}}{N_{total}}\log_2\frac{N_{pos}}{N_{total}} - \frac{N_{neg}}{N_{total}}\log_2\frac{N_{neg}}{N_{total}};$$

$$\mathrm{Entropy}(T,C_{hasT}) = -\frac{N_{poshasT}}{N_{hasT}}\log_2\frac{N_{poshasT}}{N_{hasT}} - \frac{N_{neghasT}}{N_{hasT}}\log_2\frac{N_{neghasT}}{N_{hasT}};$$

$$\mathrm{Entropy}(T,C_{noT}) = -\frac{N_{posnoT}}{N_{noT}}\log_2\frac{N_{posnoT}}{N_{noT}} - \frac{N_{negnoT}}{N_{noT}}\log_2\frac{N_{negnoT}}{N_{noT}};$$

$N_{total}$ = total number of training documents in class C;
$N_{pos}$ = total number of positive training documents in class C;
$N_{neg}$ = total number of negative training documents in class C;
$N_{hasT}$ = total number of training documents in class C that contain term T;
$N_{noT}$ = total number of training documents in class C that do not contain term T;
$N_{poshasT}$ = total number of positive training documents in class C that contain term T;
$N_{neghasT}$ = total number of negative training documents in class C that contain term T;
$N_{posnoT}$ = total number of positive training documents in class C that do not contain term T;
$N_{negnoT}$ = total number of negative training documents in class C that do not contain term T.

The research demonstrated that the effectiveness of DR methods depends on the classification method used. For instance, the results for support vector machines [15] without dimensionality reduction were slightly better than when dimensionality reduction was used. Table 1 presents the classification accuracy results for support vector machines in different CSI MasterFormat levels without dimensionality reduction, as well as the best classification result obtained from the test cases where dimensionality reduction was used.

Table 1
Effect of dimensionality reduction on classification accuracy using SVM

CSI MasterFormat level    Classification accuracy
                          Without dimensionality reduction (%)    With dimensionality reduction (%)
Level 1                   95.88                                   94.33
Level 2                   91.53                                   88.64
Level 3                   86.37                                   83.17
All levels                92.05                                   89.53

Classification algorithms cannot directly interpret text documents. For this reason, a preparation and indexing procedure that maps a text document into a compact representation of its content needs to be uniformly applied to training and test documents. The vector space model was selected for document representation because the resulting model can be uniformly applied to the different classification algorithms analyzed. In the vector space model, vectors represent documents. The collection of documents is represented by an $m \times n$ term-by-document weighted frequency matrix $A = \{a_{ij}\}$, where $a_{ij}$ was defined as the weight of a term i in document j. Each of the m
unique terms in the document collection is assigned a row in the matrix, while each of the n documents in the collection is assigned a column in the matrix. A non-zero element, $a_{ij}$, indicates not only that term i occurred in document j, but also the number of times the term appears in that document or its relative weight. Since the number of terms in a given document is typically far less than the number of terms in the entire document collection, the matrix A is usually very sparse. For each class (defined here as a CICS item), only the terms selected after the dimensionality reduction step are used to create the vector space model. An independent vector space model needs to be created for each class.

Several ways of determining the weights $a_{ij}$ were investigated, including: Boolean weighting, absolute frequency, term frequency-inverse document frequency (tf-idf) weighting, and normalized term frequency-inverse document frequency (tfc) weighting [23]. These approaches were originally developed based on two empirical observations regarding text documents: (i) the more times a word occurs in a document, the more relevant it is to the subject of the document, and (ii) the more times the word occurs throughout all documents in the collection, the more poorly it discriminates between documents.

In Boolean weighting, a value of 1 is given to each cell, $a_{ij}$, in which the term i occurred in document j. In absolute frequency weighting, the cell $a_{ij}$ value is given by the absolute frequency of the term i in document j. The tf-idf weighting uses the following formula to calculate the cell values:

$$\text{tf-idf}_{ki} = f_{ki} \times \log_2 (N / d_k),$$

where: $\text{tf-idf}_{ki}$ = the tf-idf weight of term k in document i; $f_{ki}$ = the absolute frequency of term k in document i; $N$ = the number of documents in the collection; $d_k$ = the number of documents containing term k.

The reasoning behind the tf-idf weighting is that if the term occurs in many of the documents in the collection, then it does not serve well as a document identifier and should be given a low weight as a potential index term. In tfc weighting, the value for each cell $a_{ij}$ is calculated by the formula:

$$\mathrm{tfc}_{ki} = \frac{\text{tf-idf}_{ki}}{\sqrt{\sum_{s=1}^{T} \left(\text{tf-idf}_{si}\right)^2}},$$

where: $\mathrm{tfc}_{ki}$ = the tfc weight of term k in document i; $\text{tf-idf}_{ki}$ = the tf-idf weight of term k in document i; $\text{tf-idf}_{si}$ = the tf-idf weight of term s in document i; $T$ = set of all terms that occur at least once in the collection.

In tfc weighting, the values of tf-idf weighting are normalized to minimize the effect of length differences among documents. Our experiments demonstrated that these different weighting schemes have different classification accuracies. Table 2 presents the accuracy results in different CSI MasterFormat levels, using the index weighting methods previously described.

Table 2
Effect of the index weighting methods on classification accuracy

CSI MasterFormat level    Classification accuracy by index weighting method
                          Boolean (%)    Abs. frequency (%)    tf-idf (%)    tfc (%)
Level 1                   89.11          81.48                 82.98         95.88
Level 2                   78.89          65.12                 64.70         91.53
Level 3                   69.49          50.05                 50.32         86.37
All levels                80.58          67.83                 68.30         92.05

The machine learning algorithms used to create the classification models have their own data input format and requirements. Usually, their data input is made using text files containing the data that will be processed. The data transformation step aims to create the data input files required by the classification algorithms. Basically, the information from the vector space model is converted into the appropriate text file format.

Pattern classification algorithms are used to create the classification models. In this case, the classes are represented by the items of a Construction Information Classification System. Hence, construction document classification is defined as the task of assigning a Boolean value to each pair $\{d_j, c_i\} \in D \times C$, where D is a domain of project documents and C is a set of CICS items (classes). A value of T (true) assigned to $\{d_j, c_i\}$ indicates a decision that document $d_j$ is related to item $c_i$, while a value of F (false) indicates that $d_j$ is not related to item $c_i$.

Several algorithms were tested, including: naive Bayes, k-nearest neighbors, Rocchio, and support vector machines (SVM). Table 3 presents the classification accuracy results in different CSI MasterFormat levels using different classification algorithms. Since SVM outperformed the other classification methods in this experiment, and was also the method with best performance in other experiments conducted by the authors and reported in Ref. [6], a support vector machine [15] was the algorithm selected for the implementation of the automated hierarchical document classification process.

Table 3
Effect of the classification method on classification accuracy

CSI MasterFormat level    Classification accuracy without dimensionality reduction, by classification method
                          Naive Bayes (%)    k-nn (%)    Rocchio (%)    SVM (%)
Level 1                   94.18              81.80       93.81          95.88
Level 2                   87.87              68.47       88.35          91.53
Level 3                   81.93              58.19       84.48          86.37
All levels                88.88              71.15       89.53          92.05

By using an SVM classifier, a classification model can be created for each class by observing the characteristics of a set of documents that have previously been classified manually by a domain expert. This approach relies on the existence of an initial corpus of documents previously classified according to their relevance to a set of project components. A document $d_j$ is called a positive example of $c_i$ if $\{d_j, c_i\} = T$ and a negative example of $c_i$ if $\{d_j, c_i\} = F$.

Since each construction document can belong to more than one class (one individual document can be related to more than one CICS item), the classification process was designed to handle multiple binary classifications. In this case, each document is compared with each class. For each class, a binary decision is made in order to define whether or not the document is related to that particular class (CICS item). The large number of classes that usually need to be defined in order to classify construction documents imposes another challenge on the classification task. For multiple binary classifications, a classification model has to be created for each of the existing classes.

In support vector machines, each model is defined by a specific multidimensional space composed of all training document vectors for that class. The SVM classifier aims to find a decision surface that best separates the positive and negative training document vectors for each class in a high dimensional feature space. Each dimension in this feature space is represented by an index term, and the coordinate for each dimension is defined by the corresponding index term weight. In its simplest linear separable case, SVM finds a hyperplane that separates the set of positive examples from the set of negative examples with maximum margin. Fig. 2 illustrates the linear separating hyperplane. The points x which lie on the hyperplane satisfy $w \cdot x + b = 0$, where w is normal to the hyperplane, $|b| / \|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of w [5].

This problem can be solved using constrained quadratic programming optimization methods in which the margin, given by $2 / \|w\|$, is maximized subject to the constraints $y_i (w \cdot x_i + b) \geq 1$, where $x_i$ represents each individual training document vector for the class being considered and $y_i$ corresponds to the classification decision (+1 for positive documents and -1 for negative documents) for document vector $x_i$. Data about all multidimensional spaces and hyperplanes (support vectors) need to be stored efficiently since these data will be required in order to classify new/unseen documents.

After generating the classification model, its effectiveness is evaluated. The alternative adopted for this evaluation was to randomly split the initial collection of documents into two sets.

• Training set: set of documents that were used to create the classification model.
• Test set: set of documents that were used for testing the effectiveness of the classifier.

In our experiments, the random selection of training and testing sets was repeated 10 times and the results were averaged in order to calculate the accuracy of each classification model.

In each trial, the documents in the test set did not participate in the training set. If this condition was not satisfied, then the experimental results obtained would probably be unrealistically good. The definition of the size of the training set was also crucial to avoid overfitting. This happens when the classifier performs with few errors on the training set and does not generalize to the new test cases.
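The evaluation protocol described above (a random split into disjoint training and test sets, repeated 10 times, with the accuracies averaged) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `repeated_holdout_accuracy`, the `train_fn`/`predict_fn` callables, and the `test_fraction` default are hypothetical, since the paper does not report the split proportions.

```python
import random


def repeated_holdout_accuracy(documents, labels, train_fn, predict_fn,
                              test_fraction=0.3, repeats=10, seed=0):
    """Average test accuracy over several random train/test splits.

    train_fn(train_docs, train_labels) returns a model; predict_fn(model, doc)
    returns a Boolean decision. Both are placeholders for any binary
    classifier. test_fraction is an assumed value.
    """
    rng = random.Random(seed)
    indices = list(range(len(documents)))
    n_test = max(1, int(round(len(indices) * test_fraction)))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)
        # The two index lists are disjoint, so no test document is used in training.
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        model = train_fn([documents[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, documents[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Toy usage: "documents" are numbers and the classifier learns a threshold
# halfway between the smallest positive and the largest negative example.
def train_threshold(train_docs, train_labels):
    pos = [d for d, y in zip(train_docs, train_labels) if y]
    neg = [d for d, y in zip(train_docs, train_labels) if not y]
    return (min(pos) + max(neg)) / 2.0


def predict_threshold(threshold, doc):
    return doc > threshold


docs = list(range(-20, 20))
truth = [d >= 0 for d in docs]
print(repeated_holdout_accuracy(docs, truth, train_threshold, predict_threshold))
```

Averaging over several random splits reduces the dependence of the reported accuracy on any single, possibly lucky or unlucky, partition of the corpus.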
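As a worked illustration of the information gain, tf-idf, and tfc formulas given earlier in this section, the following sketch computes them from raw counts. It is not the authors' implementation: the function names and the dictionary-based inputs are assumptions made here, and the usual convention that a zero-count term contributes nothing to the entropy (0 · log 0 = 0) is also assumed, since the paper does not state how empty cells are handled.

```python
import math


def entropy(n_pos, n_neg):
    """Binary entropy of a set with n_pos positive and n_neg negative documents."""
    total = n_pos + n_neg
    h = 0.0
    for n in (n_pos, n_neg):
        if n > 0:  # assumed convention: 0 * log2(0) contributes 0
            p = n / total
            h -= p * math.log2(p)
    return h


def information_gain(n_pos_has, n_neg_has, n_pos_no, n_neg_no):
    """Gain(T, C) from the four document counts for term T in class C."""
    n_has = n_pos_has + n_neg_has
    n_no = n_pos_no + n_neg_no
    n_total = n_has + n_no
    gain = entropy(n_pos_has + n_pos_no, n_neg_has + n_neg_no)
    if n_has:
        gain -= (n_has / n_total) * entropy(n_pos_has, n_neg_has)
    if n_no:
        gain -= (n_no / n_total) * entropy(n_pos_no, n_neg_no)
    return gain


def tfc_weights(term_counts, doc_freq, n_docs):
    """tfc weights for one document.

    term_counts: {term: f_ki}, absolute frequency of each term in the document.
    doc_freq: {term: d_k}, number of documents containing each term.
    n_docs: N, number of documents in the collection.
    """
    tfidf = {t: f * math.log2(n_docs / doc_freq[t]) for t, f in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in tfidf.values()))
    return {t: (w / norm if norm else 0.0) for t, w in tfidf.items()}


# Toy usage with made-up counts.
print(information_gain(n_pos_has=5, n_neg_has=0, n_pos_no=0, n_neg_no=5))
print(tfc_weights({"concrete": 3, "slab": 1}, {"concrete": 2, "slab": 4}, n_docs=8))
```

Because the tfc weights of a document are divided by the Euclidean length of its tf-idf vector, every document vector ends up with unit length, which is how the scheme reduces the effect of document length differences.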

Fig. 2. SVM Classification.
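The linear separating hyperplane discussed above can be made concrete with a small numerical sketch. It does not solve the quadratic programming problem; the weight vector `w`, the offset `b`, and the two-dimensional training vectors are hand-picked, hypothetical values used only to check the constraints $y_i (w \cdot x_i + b) \geq 1$ and to evaluate the margin $2 / \|w\|$.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(w, b, x):
    """Value of w . x + b; it is zero for points on the hyperplane."""
    return dot(w, x) + b


def margin_width(w):
    """Margin 2 / ||w|| between the two supporting hyperplanes."""
    return 2.0 / math.sqrt(dot(w, w))


def satisfies_constraints(w, b, vectors, labels, tol=1e-9):
    """Check y_i * (w . x_i + b) >= 1 for every training vector."""
    return all(y * decision_value(w, b, x) >= 1.0 - tol
               for x, y in zip(vectors, labels))


# Hand-picked toy data: two positive and two negative document vectors.
vectors = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0)]
labels = [+1, +1, -1, -1]
w, b = (1.0, 0.0), -1.0

print(satisfies_constraints(w, b, vectors, labels))
print(margin_width(w))
print(abs(b) / math.sqrt(dot(w, w)))  # distance from the hyperplane to the origin
```

In this toy case the vectors (2, 0) and (0, 0) meet their constraints with equality, so they lie on the margin boundaries and play the role of support vectors.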

Whenever a new document needs to be classified, it must be projected into the multidimensional space of each of the existing classes considering the same data preparation options (e.g., use of the stemmer, index term weighting method). This projection must be conducted carefully, since the index terms in the document to be classified need to match the right multidimensional space dimensions. Considering that the new document vector is $x_{new}$, the classification decision for a new document for a given class is given by the sign of $(w \cdot x_{new} + b)$. A positive value means that the new document is related to this class. A negative value means that the new document is not a member of this class.

Since there are several classification models (one for each class), the new document needs to be projected in several multidimensional spaces. Therefore, this process needs to be repeated for each of the existing classification models.

5. Implementing automated hierarchical document classification

A prototype system, called the Construction Document Classification System (CDCS), was implemented in order to test the feasibility of the proposed approach. The system enables the classification of construction documents according to the specific classification items found in construction information classification systems. CDCS automates the steps involved in the document classification process previously described. It is currently composed of seven main modules: data selection, data conversion, dimensionality reduction, data preparation, data transformation, learning, and classification. The system was implemented in the programming language Java and uses Java Database Connectivity (JDBC) to communicate with a database management system (SQL Server). This database stores the data generated during the creation of the classification models; this data will also be used in the classification of new documents.

In CDCS, the classification structure can be defined according to a hierarchy of classes. For instance, considering the CSI MasterFormat [17] as the classification structure, the document is initially classified according to each element of the first level (CSI MasterFormat level one, Divisions). For the elements in the first level in which the classification decision was true (meaning that the document was related to that particular CSI MasterFormat level one item, or Division), the binary classification can then be conducted for the second hierarchical level (CSI MasterFormat level two). Following the same process, for
the elements in the second level in which the classification decision was true (meaning that the document was related to that particular CSI MasterFormat level two item), the binary classification can then be conducted for the third hierarchical level (CSI MasterFormat level three).

Tests using CDCS were conducted to evaluate the performance of the proposed automated hierarchical classification method. Hierarchical classification is more challenging than flat classification because the accuracy tends to decrease at the lower hierarchical levels. This usually happens because it is more difficult to differentiate the classes at the lower levels, since they contain fewer training documents and the documents are more similar.

Preliminary results indicated that the highest classification accuracy was achieved using SVM as the classification algorithm, tfc as the index weighting method, and no dimensionality reduction. This configuration achieved an average accuracy of 95.88% for the first hierarchical level, 91.53% for the second level, and 86.37% for the third. The average classification accuracy for SVMs, considering the tests conducted in all class levels, was 92.05%, which is comparable to human performance in similar manual document classification tasks [37]. Table 4 and Fig. 3 present the hierarchical classification accuracy results for this case.

Table 4
Hierarchical classification results (level one)

CSI MasterFormat code   Class name                        Classification accuracy (%)
01000                   General Requirements              93.90
02000                   Site Construction                 95.23
03000                   Concrete                          91.13
04000                   Masonry                           95.40
05000                   Metals                            90.51
06000                   Wood and Plastics                 94.87
07000                   Thermal and Moisture Protection   96.04
08000                   Doors and Windows                 97.39
09000                   Finishes                          96.27
10000                   Specialties                       96.81
11000                   Equipment                         99.34
12000                   Furnishings                       93.96
13000                   Special Construction              96.53
14000                   Conveying Systems                 98.41
15000                   Mechanical                        99.19
16000                   Electrical                        97.61
Level 1 (average)                                         95.88

At first, the fact that the results using dimensionality reduction were slightly lower than when no dimensionality reduction method was used seems surprising. However, according to Joachims [15], this happens because in text classification there are only very few irrelevant features (index terms). He demonstrated that even the features ranked lowest still contain considerable information and that aggressive dimensionality reduction may result in a loss of information. Similar behavior occurred in our experiments. The tfc indexing method considered the frequency of the index term both in the document and in the project collection, and used a normalization method to minimize document vector length differences. Support vector machines performed well because of the high dimensionality of the feature space, composed of document vectors that had only a few nonzero entries. This happens because each document contained only some of the index terms that occurred in the project collection.

Fig. 3. Hierarchical classification results (average by level). Classification accuracy by CSI MasterFormat hierarchical level: Level 1, 95.88%; Level 2, 91.53%; Level 3, 86.37%.
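The top-down procedure used for hierarchical classification, in which the classes of a lower level are tested only when the parent class accepted the document, can be sketched as a recursive traversal. This is a simplified illustration, not the CDCS code: `HierarchyNode` is a hypothetical type, and each `Predicate` stands in for a trained binary classification model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative top-down traversal of a class hierarchy; names are hypothetical. */
final class HierarchyNode<D> {
    final String code;
    final Predicate<D> binaryClassifier; // stands in for a trained binary model
    final List<HierarchyNode<D>> children = new ArrayList<>();

    HierarchyNode(String code, Predicate<D> binaryClassifier) {
        this.code = code;
        this.binaryClassifier = binaryClassifier;
    }

    HierarchyNode<D> addChild(HierarchyNode<D> child) {
        children.add(child);
        return this;
    }

    /**
     * Returns the codes of all accepted classes. The children of a class are
     * evaluated only if the classification decision for that class was true.
     */
    static <D> List<String> classify(List<HierarchyNode<D>> level, D document) {
        List<String> accepted = new ArrayList<>();
        for (HierarchyNode<D> node : level) {
            if (node.binaryClassifier.test(document)) {
                accepted.add(node.code);
                accepted.addAll(classify(node.children, document));
            }
        }
        return accepted;
    }
}
```

For example, with a level-one node for code 03000 (Concrete) whose children are placeholder sub-classes, a document rejected at level one is never passed to the level-two models of that branch. This also means that an error at an upper level propagates to all levels below it.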


The proposed methodology can also be used to improve the organization and access to more unstructured text documents. It has been successfully tested in other types of construction documents, such as meeting minutes, requests for information, change orders, and design review documents.

6. Conclusions

In this paper, a methodology for automated hierarchical document classification was described and evaluated. Automatic hierarchical classification is part of an ongoing research project that aims to improve the organization and access of unstructured text documents in construction management information systems and facilitate the integration of such documents in model-based systems. This is a very important issue for construction information management because a large percentage of project information is stored in text documents and these documents contain valuable information for decision-making, data analysis, and knowledge discovery.

The methodology supports the generation of classification models based on project information classification structures, such as construction information classification systems or project model objects. After creating these classification models, new construction documents can be effectively classified. The main characteristics of the proposed methodology are:

• It does not require the manual assignment of metadata (keywords or index terms) to all documents in the information system. Manual assignment of metadata is a tedious task. It is also hard to achieve consistency when a large number of users from different organizations are adding documents to the system.
• It does not need the utilization of a controlled vocabulary that would only be effective if it was accepted as a standard by the AEC/FM organizations and adopted by all users of a construction management information system.
• It uses already existing AEC/FM standards to define the categories that will be used for classification; and
• It facilitates the creation of automated mapping mechanisms from documents to project components.

Experiments were conducted to verify the classification accuracy for hierarchical classification structures. A construction products' database, originally classified according to a hierarchical structure, was used in this analysis. The results demonstrated the effectiveness and applicability of automated document classification methods for construction management information systems. Examples of other problems that can benefit from the proposed automated classification method include: analysis of construction project documentation, organization of multimedia project inspection files based on their description, facilitation of automated access to project specifications in proactive project controls systems, identification of problem areas and potential causes of delays, cost overruns, or quality deviations, and generation of lessons learned that could be applied in future activities and projects.

Acknowledgements

The authors would like to thank the National Science Foundation for the support under the grant number 0201299.

References

[1] F.B. Aalami, M. Fischer, J.C. Kunz, AEC 4D-CAD production model: definition and automated generation. CIFE WP 052, 1998.
[2] aecXML, <http://www.iai-na.org/domains/aecxml/about/aecxml_about.html> (Aug 28, 2002).
[3] C.J. Anumba, N.F.O. Evbuomwan, A taxonomy for communication facets in concurrent life-cycle design and construction, Computer-Aided Civil and Infrastructure Engineering 14 (1999) 37–44.
[4] B.M. Brüggemann, K. Holz, F. Molkenthin, Semantic documentation in engineering, Proceedings of the ICCCBE-VIII, Palo Alto, CA, ASCE, Reston, VA, August, 2000, pp. 828–835.
[5] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (2) (1998) 121–167.
[6] C.H. Caldas, L. Soibelman, J. Han, Automated classification of construction project documents, Journal of Computing in Civil Engineering, 2002 (October) 16 (4), pp. 234–243.
[7] C.M. Eastman, Building Product Models: Computer Environments Supporting Design and Construction, CRC Press, Boca Raton, FL, USA, 1999.
[8] M. Fischer, J. Kunz, The circle: architecture for integrating software, Journal of Computing in Civil Engineering 9 (2) (1995) 122–133.
[9] R. Fruchter, A/E/C teamwork: a collaborative design and learning space, Journal of Computing in Civil Engineering 13 (4) (1999) 261–269.
[10] J.H. Garrett Jr., S.J. Fenves, D.M. Stasiak, A WWW-based regulation broker, CIB Proceedings Publication 198: Construction on the Information Highway, CIB, Rotterdam, 1996, pp. 219–230.
[11] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA, 2001.
[12] IAI, <http://www.iai-international.org/iai_international/> (Aug 28, 2002).
[13] L.S. Kang, B.C. Paulson, Adaptability of information classification systems for civil works, Journal of Construction Engineering and Management 123 (4) (1997) 419–426.
[14] L.S. Kang, B.C. Paulson, Information classification for civil engineering projects by Uniclass, Journal of Construction Engineering and Management 126 (2) (2000) 158–167.
[15] T. Joachims, Text categorization with support vector machines: learning with many relevant features, Proceedings of ECML-98, Chemnitz, Germany, Springer, Berlin, 1998, pp. 137–142.
[16] B. Kosovac, T. Froese, D. Vanier, Integrating heterogeneous data representations in model-based AEC/FM systems, Proceedings of CIT 2000, Reykjavik, Iceland, CIB, Rotterdam, vol. 1, 2000, pp. 556–566.
[17] MasterFormat, MasterFormat 1995 Edition, Construction Specifications Institute, Alexandria, VA, 1995.
[18] W.J. O'Brien, Implementation issues in project web-sites: a practitioner's viewpoint, Journal of Management in Engineering 16 (3) (2000) 34–39.
[19] OSMOS, Open System for Inter-enterprise Information Management in Dynamic Virtual Environments-OSMOS Project, <http://cic.vtt.fi/projects/osmos/index.html> (Aug 28, 2002).
[20] OCCS, Overall Construction Classification System, <http://www.occsnet.org> (Aug 28, 2002).
[21] Y. Rezgui, Y. Brown, G. Cooper, J. Yip, P. Brandon, J. Kirkham, An information management model for concurrent construction engineering, Journal of Automation in Construction 5 (4) (1996) 343–355.
[22] E.M. Rojas, A.D. Songer, Web-centric systems: a new paradigm for collaborative engineering, Journal of Management in Engineering 15 (1) (1999) 39–45.
[23] G. Salton, C. Buckley, Term weighting approaches in automatic text retrieval, Information Processing and Management 2 (5) (1988) 513–523.
[24] R.J. Scherer, S. Reul, Retrieval of project knowledge from heterogeneous AEC documents, Proceedings of the ICCCBE-VIII, Palo Alto, CA, ASCE, Reston, VA, August, 2000, pp. 812–819.
[25] F. Sebastiani, Machine learning in automated text categorisation. Technical Report IEI-B4-31-1999, Istituto di Elaborazione dell'Informazione, CNR, Pisa, Italy, 1999.
[26] S.J. Simoff, M.L. Maher, Ontology-based multimedia data mining for design information retrieval, Proc. of Computing in Civil Engineering, ASCE, Reston, VA, 1998, pp. 212–223.
[27] L. Soibelman, C. Caldas, Project extranets for construction management: the American experience, Proceedings of Entac-2000, May, 2000, Salvador, Brazil.
[28] L. Soibelman, H. Kim, Generating construction knowledge with knowledge discovery in databases, Journal of Computing in Civil Engineering, vol. 16 (1), ASCE, 2002, pp. 39–48.
[29] L. Soibelman, F. Peña-Mora, A distributed multi-reasoning mechanism to support the conceptual phase of structural design, Journal of Structural Engineering 126 (6) (2000) 733–742.
[30] Sweet's. Sweet's Product Marketplace, <http://sweets.construction.com/default.jsp> (Aug 28, 2002).
[31] P. Teicholz, Vision of future practice, Berkeley-Stanford Workshop on Defining a Research Agenda for AEC Process/Product Development in 2000 and Beyond, Stanford, CA, 1999.
[32] ToCEE-Towards a Concurrent Engineering Environment Project, The ToCEE client-server system for concurrent engineering. Final Report-ESPRIT Project No. 20587, 2000.
[33] UniFormat, UniFormat 1998 Edition, Construction Specifications Institute, Alexandria, VA, 1998.
[34] VEGA, Virtual Enterprises using Groupware Tools and Distributed Architecture-VEGA Project <http://cic.cstb.fr/ILC/ecprojec/vega/home.htm> (Aug 28, 2002).
[35] M.C. Yang, W.H. Wood, M.R. Cutkosky, Data mining for thesaurus generation in informal design information retrieval, Proceedings of the International Computing Congress, ASCE, Reston, VA, 1998, pp. 189–200.
[36] Y. Yang, J.O. Pedersen, A comparative study on feature selection in text categorization, Proceedings of ICML-97, 1997, pp. 412–420, Nashville, TN.
[37] S.A. Weiss, S. Kasif, E. Brill, Text Classification in USENET Newsgroups: A Progress Report, Department of Computer Science, The Johns Hopkins University, Baltimore, MD, 1997 (April).
[38] W.H. Wood, The development of modes in textual design data, Proceedings of the ICCCBE-VIII, Palo Alto, CA, CESE, Reston, CA, 2000 (August), pp. 882–889.
[39] A. Zarli, Y. Rezgui, A survey of internet-oriented technologies for document-driven applications in construction open dynamic virtual environments, Proceedings of CIT 2000-International Conf., vol. 1, Construction Information Technology, Reykjavik, Iceland, 2000, pp. 1089–1101.
[40] Y. Zhu, R.R. Issa, Web-based construction document processing via malleable frame, Journal of Computing in Civil Engineering 15 (3) (2001) 157–169.