
Journal of Cleaner Production 109 (2015) 144–151


Early prediction of the performance of green building projects using pre-project planning variables: data mining approaches
Hyojoo Son, Changwan Kim*
Department of Architectural Engineering, Chung-Ang University, Seoul 156-756, Republic of Korea

Article info

Article history:
Received 29 August 2013
Received in revised form 15 August 2014
Accepted 21 August 2014
Available online 2 September 2014

Abstract
Early prediction of the success of green building projects is an important and challenging issue. The aim
of this study was to develop a model to predict the cost and schedule performance of green building
projects based on the level of definition during the pre-project planning phase. To this end, a three-step
process was proposed: pre-processing, variable selection, and prediction model construction. Data from
53 certified green buildings were used to develop the models. After balancing the data set with respect to
the proportion of cases in each of the outcome categories by pre-processing, the number of input variables was reduced from 64 to 13 and 7 for cost and schedule performance prediction, respectively, using
the ReliefF-W variable selection method. Then, cost and schedule performance prediction models were
constructed using the selected variables and four different classifiers: a support vector machine (SVM), a
back-propagation neural network (BPNN), a C4.5 decision tree algorithm (C4.5), and a logistic regression
(LR). The classification performance of the four models was compared to assess their applicability. The
SVM models exhibited the highest accuracy, sensitivity, and specificity in predicting both the cost and
schedule performance of green building projects. The results of this study empirically validated that the
cost and schedule performance of green building projects is highly dependent on the quality of definition
in the pre-project planning phase.
© 2014 Elsevier Ltd. All rights reserved.

Keywords:
Green building project
Certified green building
Pre-project planning
Cost performance
Schedule performance
Data mining

* Corresponding author. Tel.: +82 2 820 5726; fax: +82 2 812 4150.
E-mail addresses: changwan@cau.ac.kr, changwan72@gmail.com (C. Kim).
http://dx.doi.org/10.1016/j.jclepro.2014.08.071
0959-6526/© 2014 Elsevier Ltd. All rights reserved.

1. Introduction
Green building, a practice that is two decades old, has become
more prevalent in recent years. The U.S. Environmental Protection
Agency defines green building as the practice of creating structures and using processes that are environmentally responsible and
resource-efficient throughout a building's lifecycle, from siting to
design, construction, operation, maintenance, renovation, and
deconstruction (U.S. Environmental Protection Agency, 2010a).
Many countries have developed and adopted various green building rating systems (Zhang et al., 2013). The exemplar systems are
the U.K.'s Building Research Establishment Environmental Assessment Method (BREEAM), the U.S. Leadership in Energy and Environmental Design (LEED), Australian Green Star, Germany's
Deutsche Gütesiegel Nachhaltiges Bauen (DGNB), the Japanese
Comprehensive Assessment System for Building Environmental
Efficiency (CASBEE), and the Korean Green Building Certification
System (K-GBCS). These systems outline specific guidelines for
implementing green practices into the lifecycle of a building, thus
guiding the stakeholders of the building project in the green delivery of their project.
The number of buildings certified with green building rating
systems has rapidly increased as a result of the rapid spread of such
systems and the recognition of the benefits of green buildings,
which include reduced operating costs; the creation, expansion,
and development of markets for green products and services;
improved occupant productivity; and optimized life-cycle economic performance (Zhang, 2014; Atlee, 2011; U.S. Environmental
Protection Agency, 2010b; Shen et al., 2010). For example, the total U.S. green building market value is expected to increase from
$10 billion in 2005 to between $98 and $106 billion in 2013
(McGraw-Hill Construction, 2012). This rapid growth of the green
building market has raised concerns among project stakeholders
about the risks involved, especially the high level of uncertainty
with respect to project performance in the delivery of green
building projects (Hwang and Leong, 2013; Robichaud and
Anantatmula, 2011). Hwang and Leong (2013) reported that the
failure rate of 39 green building projects in their survey was 33% in
terms of schedule performance, more than twice that of the 40
traditional building projects.
The successful delivery of green building projects is more difficult and complex than for traditional building projects, particularly
due to the additional considerations which must be taken into
account in the pre-project planning phase (Zhang et al., 2011a).
These include advanced simulation and analysis, higher construction standards, the contribution of multidisciplinary experts, site
precautions, and knowledge of sustainable building practices
(Zhang et al., 2011b; Pulaski et al., 2006). Poor or incomplete pre-project planning is at the root of failures of green building projects (Chandramohan et al., 2012; Robichaud and Anantatmula,
2011).
In previous studies, pre-project planning practice was recognized as an important contributor to the success of green building
projects. Pulaski et al. (2006) reported that significant improvements in the performance of green building projects can be made
by giving consideration to constructability in the early design phase.
Robichaud and Anantatmula (2011) highlighted the value of the
involvement of a cross-disciplinary team during the pre-project
planning phase in determining the success of green building projects. Swarup et al. (2011) stated that the involvement of the
contractor in the early design phase is vital to achieve success in
green building projects. Zhang et al. (2011a) acknowledged the
importance of engaging the stakeholders and encouraging them to
communicate efficiently during the planning and design
phases for the successful delivery of green building projects.
Although previous studies have emphasized the importance of
specific pre-project planning practices, none have explored an
empirical relationship between the level of definition in the pre-project planning phase and the performance of green building
projects. Furthermore, the predictability of success by taking the
implementation of pre-project planning activities into account has
not been validated empirically. Early prediction of the performance
of a green building project is critical for project stakeholders to
make decisions that could influence project success.
The aim of this study is to develop a model to predict the cost
and schedule performance of green building projects based on the
level of definition during the pre-project planning phase. The rest
of this paper is organized as follows: Related studies on the influence of pre-project planning on the performance of building projects
are discussed in Section 2. Data collection and the characteristics of
the data set are outlined in Section 3. The research methodology is
presented in Section 4 and the experimental results are given in
Section 5. Finally, conclusions are presented in Section 6, which also
highlights the contribution of this research and mentions areas
that merit future research.
2. Related studies
Considerable research has been devoted to investigating the
influence of pre-project planning on the performance of building
projects. Some researchers have applied statistical methods, such as
mean value difference comparison and statistical regression analysis, to validate the relationship between pre-project planning and
project performance (for example, Wang and Gibson, 2010;
Wang, 2002; Cho and Gibson, 2001).
More recent studies have applied data mining techniques to
explore the relationship between the pre-project planning phase and
the financial performance of building projects. Wang et al. (2012)
used a support vector machine model to examine the relationship
between sub-scores from three different sections of the Project
Definition Rating Index (PDRI) and the cost and schedule performance of 92 building projects. The models had a predictive accuracy of 92% and 81% for cost and schedule performance,
respectively. Son et al. (2012) developed a hybrid predictive model
to understand the impact of pre-project planning on cost performance, combining principal component analysis (PCA) with a
support vector regression (SVR) model. Values for 64 PDRI variables
and cost performance data from 84 commercial building projects
were used to develop the model. The cost performance values
predicted by the PCA–SVR method were very close to the actual
values (mean absolute percentage error less than 10%). All of the
previous studies which have shown that the level of project definition influences the performance of building projects focused on
traditional construction projects. None have investigated the predictability of the success of green building projects based on the
level of definition in the pre-project planning phase. The question
then arises whether the level of project definition during the pre-project planning phase influences the performance of green
building projects.
3. Data collection
3.1. Interview design
Data collection was based on a face-to-face interview followed
by a questionnaire for 53 respondents. The survey respondents
included the key project stakeholders (i.e. architect, developer, or
project manager) for each project. These persons had access to
detailed information about the progress of the project during the
pre-project planning phase and had responsibility for the project at
that time. The questionnaire was designed to measure the level of
definition in the pre-project planning phase and the cost and
schedule performance of each project.
In this study, the 64 scope definition elements in the PDRI for
buildings were employed as the independent variables to measure
the level of definition in the pre-project planning phase of each
project. The PDRI for buildings, developed by the Construction
Industry Institute (CII, 1999), provides a 64-item checklist that
encompasses all the project activities during the pre-project
planning phase. The elements in the checklist were developed
with the participation of more than 100 industry practitioners and
with consideration of their impact on overall
project performance (Cho and Gibson, 2001). These practitioners
included engineers, architects, and other industry professionals
directly involved in planning and executing building projects.
Hence, the checklist comprises comprehensive and extensive elements that are most suitable for measuring the level of definition
in the pre-project planning phase of each project (Cho and Gibson,
2001).
In the PDRI for building projects, the 64 variables are divided
into 11 categories, which in turn are grouped into three sections:
basis of project decision (18 variables), basis of design (32 variables), and execution approach (14 variables). The variables and
their groupings are listed in Appendix I. Detailed descriptions of the
variables are published by the CII (1999). To collect information on
the level of definition in the pre-project planning phase for each
project, the interviewer asked the respondents to evaluate each of
the 64 variables on a five-point Likert-type scale. The scale ranged
from 1 (complete definition) to 5 (incomplete or poor definition).
Respondents were also asked to provide data about the actual
recorded cost and duration at the time of project completion, as
well as the budgeted cost and planned duration estimated at the
time that the decision was made to proceed with the project. From
these data, a project was classified as a cost failure if its actual cost
exceeded budgeted cost, or as a schedule failure if its actual duration exceeded planned duration. Otherwise, the project was
considered successful.
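
The labeling rule above can be written as a short Python sketch. The function name `label_outcomes` and the sample figures are hypothetical and only illustrate the rule; they are not taken from the survey data.

```python
def label_outcomes(actual_cost, budgeted_cost, actual_duration, planned_duration):
    """Return (cost_label, schedule_label) for one project.

    A project is a failure on a dimension only if the actual value
    strictly exceeds the planned value; otherwise it is a success.
    """
    cost_label = "failure" if actual_cost > budgeted_cost else "success"
    schedule_label = "failure" if actual_duration > planned_duration else "success"
    return cost_label, schedule_label


# Hypothetical project: 5% over budget, finished exactly on the planned date.
print(label_outcomes(105.0, 100.0, 30, 30))
```

Cost and schedule outcomes are labeled independently, so a single project can be a cost failure and a schedule success at the same time.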
3.2. Data profile
The survey targets were 53 certified green buildings constructed
in South Korea over the five years from 2008 to 2012. These
were certified under either the Leadership in Energy and Environmental Design (LEED), developed by the U.S. Green Building Council
(USGBC), or the Korean Green Building Certification System (K-GBCS), developed by the Korea Green Building Council (KGBC). The
median cost of the projects was $131.5 million, with a range of $13.8
to $499.8 million. The duration of these projects was 31.2 months
on average, with a range of 12–56 months. A detailed overview of
the questionnaire data is presented in Table 1.
Thirty-seven projects (70%) had unfavorable cost variance
(actual cost exceeded budgeted cost), while only 16 projects had
favorable cost variance. The average of the unfavorable variances,
i.e. cost escalation, for the 37 failed projects was 13.7%. Databases
on green building projects typically indicate a large proportion of
cost-overrun projects compared to on-budget projects; therefore,
green building projects require careful management with respect
to cost performance. In contrast to cost performance, 32 projects
were completed on schedule, while 21 projects (40%) had unfavorable schedule variance (actual duration exceeded planned
duration). In the case of delays in the completion of projects,
where the agreed planned duration is exceeded, penalties for delays are generally imposed on the contractor. Such penalties usually take the form of withheld payments. Should the contractor
make up time and deliver on the contractual completion date, any
penalties incurred for delay during construction are reimbursed.
The average unfavorable schedule variance for the 21 failed projects was 11.7%.
4. Methodology
The main objective of this study is to develop a model to predict
the cost and schedule performance of green building projects
based on the level of definition during the pre-project planning
phase. To this end, a three-step process is proposed: pre-processing, variable selection, and prediction model construction.
Pre-processing of the data set is carried out to deal with imbalances in the proportions of the outcomes in the data. Variable
selection is used to remove redundant or irrelevant variables, to
reduce the relatively high number of variables compared to the
number of cases. Then, prediction models are constructed using
the selected variables and four different classifiers (a support
vector machine (SVM), a back-propagation neural network (BPNN),
a C4.5 decision tree algorithm (C4.5), and a logistic regression (LR))
to predict the cost and schedule performance of green building
projects. Finally, the classification performance of the four models
is compared.

Table 1
Type, size and duration of the green building projects in the study (N = 53).

Characteristic                  Frequency   Percentage (%)a
Project type (building use)
  Residential                   21          39.6
  Office                        16          30.2
  Educational                   8           15.1
  Retail                        3           5.7
  Cultural                      2           3.8
  Research                      2           3.8
  Other                         1           1.9
Project size
  < $50 million                 10          18.9
  $50–$100 million              12          22.6
  > $100 million                31          58.5
Project duration
  < 24 months                   6           11.3
  24–36 months                  37          69.8
  > 36 months                   10          18.9

a Percentages may not sum to 100% as a result of rounding.

4.1. Pre-processing
The data set comprised 37 failed and 16 successful projects with
respect to cost performance; the failure-to-success ratio was 7:3.
There were 21 failed and 32 successful projects with respect to
schedule performance; the failure-to-success ratio was 4:6. The
data set was unbalanced because different outcome classes were
not equally represented, a prevalent characteristic in the construction industry. Such imbalances lead to a bias toward the majority class, because the standard classifiers tend to predict the class
with the higher number of cases, which are positively weighted
during training (Fernández et al., 2009). Such a bias results in
over-prediction of the majority class (Fernández et al., 2009). Therefore,
pre-processing is necessary to deal with the imbalance problem
before constructing prediction models, by balancing the class distribution in the data set. In this study, the synthetic minority over-sampling technique (SMOTE), proposed by Chawla
et al. (2002), was employed.
SMOTE over-samples the minority class by generating synthetic
cases at random intervals between existing minority cases rather
than duplicating existing minority cases (Gao et al., 2011; Li et al.,
2010). Because SMOTE over-samples the minority class without
data duplication, the over-fitting problem can be avoided (Gao
et al., 2011; Fernández et al., 2009). The technique first finds the
k-nearest neighbors of each minority case. In this study, the value of
k was set to 5, as recommended by Chawla et al. (2002). Next,
depending on the amount of over-sampling required, there are
several iterations in which one neighbor is randomly selected from
the k-nearest neighbors. If the amount of over-sampling required is
200% and k = 5, then there are two iterations in which one neighbor
is randomly selected from the five nearest. Then the difference
between the case under consideration and its neighbor is calculated. This difference is multiplied by a random number between
0 and 1 and added to the case under consideration to form a synthetic case.
Finally, the new synthetic cases are added to the data set
and assigned to the minority class.
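
The interpolation steps described above can be sketched in Python as follows. This is a minimal illustration of the idea, not the implementation used in the study: the function name `smote`, its parameters, and the sample cases are assumptions, and Euclidean distance is assumed for the neighbor search.

```python
import math
import random


def smote(minority, n_per_case=1, k=5, seed=0):
    """Generate n_per_case synthetic cases for every minority case.

    Assumes at least two minority cases; n_per_case=2 corresponds to
    200% over-sampling in the terminology of the text.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, case in enumerate(minority):
        # k nearest minority neighbors of the case, excluding the case itself
        others = [m for j, m in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda m: math.dist(case, m))[:k]
        for _ in range(n_per_case):
            neighbor = rng.choice(neighbors)   # one neighbor picked at random
            gap = rng.random()                 # random number in [0, 1)
            # case + gap * (neighbor - case), computed per variable
            synthetic.append([c + gap * (nb - c) for c, nb in zip(case, neighbor)])
    return synthetic


# Hypothetical minority cases described by two variables each.
minority_cases = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0], [4.0, 2.0], [3.0, 1.0]]
new_cases = smote(minority_cases, n_per_case=1, k=5)
print(len(new_cases))  # one synthetic case per minority case
```

Because every synthetic case lies on the segment between a minority case and one of its neighbors, the new cases stay inside the region already occupied by the minority class instead of repeating existing points.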
Fig. 1 shows the class distribution before and after pre-processing with respect to cost and schedule performance. Using
SMOTE, the original number of cases of the minority class was
doubled, changing the failure-to-success ratio from 7.0:3.0 to 5.4:4.6 for
cost performance and from 4.0:6.0 to 5.7:4.3 for schedule
performance.
4.2. Variable selection
After pre-processing, 69 and 74 cases were obtained in total for
cost and schedule performance prediction, respectively. However,
64 independent variables in the data set represented a large
number compared to the relatively small number of cases. This
would make the standard classifiers complex and difficult to train
(Ng et al., 2008). In such cases, it is advisable to discard less relevant
variables through variable selection, which identifies the more relevant
variables in the data set (Liu and Setiono, 1998). Through this
process, both generalization and classification accuracy could be
higher than they would have been otherwise (Ng et al., 2008). In
this study, the ReliefF-W algorithm (weighted by distance) was
employed for variable selection.
Fig. 1. Class distribution before and after pre-processing using SMOTE: (a) original data set; (b) after pre-processing.

ReliefF selects variables based on the estimation of their relevance according to their ability to distinguish between different
classes (Chen and Yu, 2012; Liu et al., 2010). The Relief algorithm
assigns relevance as a weight to each variable; thus irrelevant
variables can be discarded (Kohavi and John, 1997). In the implementation, cases are first selected randomly from the data set.
Then, the nearest neighbors in the same class (nearest hits) and in
the other class (nearest misses) are determined. Finally, the weight
of each variable is updated by computing the difference in the value
of the variable between a randomly selected case and its nearest
hits and misses, assuming that the variable values for cases in the
same class should have greater similarity than those for cases in
different classes. After exhausting all cases, the variables whose
weights are greater than a pre-established threshold value remain
(Aarabi et al., 2006). In this study, the significance threshold value
was set at 0.05 to select more relevant variables and also limit the
number of relevant variables. ReliefF-W is similar to ReliefF, except
that ReliefF-W applies weights to nearest neighbors according to
their distances before adjusting variable scores (van Hulse et al.,
2012).
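
The hit-and-miss weight update can be illustrated with a simplified Relief sketch in Python. It is not ReliefF-W: it handles two classes only, uses a single nearest hit and miss per case, and applies no distance weighting to neighbors. The function name `relief_weights`, the range scaling, the Manhattan distance, and the toy data are all assumptions made for illustration.

```python
def relief_weights(X, y):
    """Basic Relief with one nearest hit and one nearest miss per case."""
    n_vars = len(X[0])
    # Range of each variable, used to scale differences to [0, 1]
    spans = [(max(r[v] for r in X) - min(r[v] for r in X)) or 1.0
             for v in range(n_vars)]

    def diff(a, b, v):
        return abs(a[v] - b[v]) / spans[v]

    def distance(a, b):
        return sum(diff(a, b, v) for v in range(n_vars))

    weights = [0.0] * n_vars
    for i, case in enumerate(X):
        hits = [X[j] for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [X[j] for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda r: distance(case, r))
        miss = min(misses, key=lambda r: distance(case, r))
        for v in range(n_vars):
            # Reward separation from the miss, penalize separation from the hit
            weights[v] += (diff(case, miss, v) - diff(case, hit, v)) / len(X)
    return weights


# Toy data: variable 0 separates the classes, variable 1 is noise.
X = [[1, 3], [2, 1], [1, 5], [5, 2], [4, 4], [5, 1]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
selected = [v for v, w in enumerate(weights) if w >= 0.05]
print(weights, selected)
```

In this toy example only the class-separating variable clears the 0.05 threshold, which mirrors how the threshold was used to retain the more relevant variables.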
Tables 2 and 3 summarize the results of variable selection for
cost and schedule performance prediction, respectively. Only 13 of
the 64 candidate variables had a score of at least 0.05 for cost
performance prediction. These variables and their scores are given
in Table 2, listed in descending score magnitude.
Table 3 presents the 7 variables which had a score of at least 0.05
for schedule performance prediction, listed in order of descending
score magnitude. Three variables, viz. business justification, reliability philosophy, and economic analysis, were selected for both cost
and schedule performance prediction. As a result of the variable
selection, the number of independent variables was reduced
considerably from the original 64 variables.

Table 2
ReliefF-W scores for the 13 selected independent variables for cost performance prediction.

Rank   Variable description                                  Score
1      A2. Business Justification                            0.1183
2      C6. Project Cost Estimate                             0.1154
3      B1. Reliability Philosophy                            0.1012
4      C5. Project Schedule                                  0.0913
5      D3. Civil/Geotechnical Information                    0.0839
6      F4. Mechanical Design                                 0.0752
7      E10. Building Finishes                                0.0742
8      F2. Architectural Design                              0.0667
9      A4. Economic Analysis                                 0.0636
10     E5. Growth and Phased Development                     0.0632
11     L3. Project Delivery Method                           0.0538
12     A5. Facility Requirements                             0.0520
13     E13. Window Treatment                                 0.0514

Table 3
ReliefF-W scores for the 7 selected variables for schedule performance prediction.

Rank   Variable description                                  Score
1      A2. Business Justification                            0.0946
2      B1. Reliability Philosophy                            0.0901
3      D6. Utility Sources with Supply Conditions            0.0856
4      A4. Economic Analysis                                 0.0773
5      D4. Governing Regulatory Requirements                 0.0674
6      E7. Functional Relationship Diagram/Room by Room      0.0598
7      B4. Design Philosophy                                 0.0546

4.3. Prediction models
The reduced sets of variables were used as input for four
different types of learning classifiers (SVM, BPNN, C4.5, and LR) to
predict the cost and schedule performance of green building projects based on the level of definition during the pre-project planning phase. These four classifiers were chosen for this study
because they are well established and widely used in solving binary classification problems (for example, Olson et al., 2012; Xia
and Jin, 2008; Quintana et al., 2008; Dreiseitl et al., 2001). All
four classifiers were evaluated and their performance was then
compared.
4.3.1. Support vector machine (SVM)
The SVM, developed by Vapnik (1995), is a powerful, supervised
learning method that has strong discriminative power. It can
handle non-linear classification problems by using a kernel function to implicitly map data to a high-dimensional space. It then
constructs a hyperplane as a discriminant function to maximize
the margin of separation between two classes in the high-dimensional feature space (Vapnik, 1995). The SVM offers good
generalization ability, based on the principle of structural risk
minimization and with the help of multiplier parameters such as
Lagrange multipliers (Ali and Smith, 2006; Dreiseitl et al., 2001).
The SVM achieves this by solving an optimization problem using
training data.
Given a labeled set of $M$ training data $(x_i, y_i)$, where $x_i \in \mathbb{R}^N$ and $y_i$
is the associated label ($y_i \in \{-1, 1\}$), the discriminant hyperplane
is defined as follows:

$$ f(x) = \sum_{i=1}^{M} \alpha_i y_i k(x_i, x) + b, \tag{1} $$

where $k(\cdot, \cdot)$ is the kernel function. Constructing an optimal hyperplane is equivalent to estimating all the nonzero coefficients $\alpha_i$
(which correspond to the support vectors) and the bias $b$. For further information, see
Vapnik (1995, 1998).
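
As an illustration of the SVM discriminant function in Eq. (1), the following Python sketch evaluates f(x) with a radial basis function kernel. The support vectors, coefficients, bias, and kernel parameter are hand-picked, hypothetical values rather than a trained model, and the function names are assumptions.

```python
import math


def rbf_kernel(u, v, gamma):
    """Radial basis function kernel: exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_function(x, support_vectors, labels, alphas, b, gamma):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b, as in Eq. (1)."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b


def predict(x, support_vectors, labels, alphas, b, gamma):
    """The predicted class is the sign of the discriminant function."""
    return 1 if decision_function(x, support_vectors, labels, alphas, b, gamma) >= 0 else -1


# Hypothetical model with two support vectors, one per class.
model = dict(support_vectors=[[0.0, 0.0], [2.0, 2.0]], labels=[-1, 1],
             alphas=[1.0, 1.0], b=0.0, gamma=0.5)
print(predict([1.8, 1.9], **model), predict([0.1, 0.2], **model))
```

Only cases with nonzero coefficients contribute to the sum, which is why a trained SVM needs to store the support vectors alone rather than the full training set.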


4.3.2. Back-propagation neural network (BPNN)


An artificial neural network is a mathematical and computational model that attempts to simulate a biological neural structure.
One of the most widely used configurations of an artificial neural
network is the BPNN, which combines a feedforward multi-layer
perceptron with a back-propagation algorithm (Gardner and
Dorling, 1998). This is a modification of the standard linear perceptron that uses three or more layers of nodes, or neurons, with
nonlinear activation functions; it is more powerful than the linear
perceptron in that it can distinguish data that are not linearly
separable (Delen et al., 2005; Cybenko, 1989).
Multiple nodes are connected in layers, with the output of each
node being the thresholded weighted sum of its inputs from the
previous layer. The network weights are usually learned using a
gradient-descent algorithm, known as back-propagation. The
weight-update rule in a back-propagation algorithm is defined as
follows:

$$ \Delta w_{ji}(n) = \eta \delta_j x_{ji} + \alpha \Delta w_{ji}(n - 1), \tag{2} $$

where $\Delta w_{ji}(n)$ is the weight update performed during the $n$th
iteration through the main loop of the algorithm, $\eta$ is a positive
constant known as the learning rate, $\delta_j$ is the error term associated
with node $j$, $x_{ji}$ is the input from node $i$ to node $j$, and $\alpha$ ($0 \le \alpha < 1$) is a
constant (known as the momentum).
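
A minimal Python sketch of this weight-update rule for a single connection is given below. The function name `weight_update` and the numerical values (error terms, input, learning rate 0.7, momentum 0.1) are hypothetical, and the update is assumed to be added to the weight.

```python
def weight_update(eta, delta_j, x_ji, alpha, previous_update):
    """Delta w_ji(n) = eta * delta_j * x_ji + alpha * Delta w_ji(n - 1)."""
    return eta * delta_j * x_ji + alpha * previous_update


# Two successive iterations for one connection (hypothetical error terms).
weight, previous = 0.5, 0.0
for delta_j, x_ji in [(0.2, 1.0), (0.1, 1.0)]:
    previous = weight_update(eta=0.7, delta_j=delta_j, x_ji=x_ji,
                             alpha=0.1, previous_update=previous)
    weight += previous
print(round(weight, 3))
```

The momentum term carries a fraction of the previous update into the current one, so successive updates in the same direction reinforce each other.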
4.3.3. C4.5 decision tree (C4.5)
Decision trees have been popular in practice due to their
simplicity, high evaluation speed, and interpretability (Wang et al.,
2005). A decision tree classifier uses a divide-and-conquer
approach, and is typically built recursively in a top-down
manner. It applies simple if-then rules that use a set of input
variables to split a set of training data into progressively smaller
subgroups, based on some measure of disparity, usually entropy
(Geethanjali and Ray, 2011). It has a flow-chart-like tree structure,
where each internal node represents a test of a particular variable,
each branch represents an outcome of that test,
and each leaf node represents the class assigned to the corresponding branch.
In this study, the C4.5 algorithm (implemented in WEKA as J48) was used to
construct a decision tree. This is the advanced version of the decision tree algorithm ID3 (Quinlan, 1993). The C4.5 algorithm has
two phases: building (growing) and pruning. It uses entropy-based
information gain as the selection criterion. Conceptually, the algorithm breaks a complex decision into several simpler decisions,
with the assumption that the final solution obtained in this way
resembles the intended solution. Here, a pruning process is
carried out to prevent the omnipresent problem of over-fitting
(D'Hean and van den Poel, 2013; Li et al., 2004). An unknown
case is routed down the tree according to the values of its variables
tested in successive nodes; when a leaf is reached, the case is
classified on the basis of the variables along the corresponding
branch. A path is traced along a branch, from the root to the leaf
node that holds the class prediction for that case.
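
The entropy-based information gain used as the splitting criterion can be sketched in Python as follows. This is only an illustration of the criterion, not of the full tree-growing and pruning procedure, and it omits the gain-ratio normalization that C4.5 applies on top of information gain. The function names and the toy ratings are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting the cases on one variable."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Toy example: two hypothetical Likert-rated variables and four outcomes.
outcomes = ["failure", "failure", "success", "success"]
print(information_gain([5, 5, 1, 1], outcomes))  # split separates the classes
print(information_gain([1, 5, 1, 5], outcomes))  # split is uninformative
```

A variable that splits the cases into pure subgroups yields the maximum gain, so it would be tested closer to the root of the tree.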
4.3.4. Logistic regression (LR)
The LR is a statistical regression method commonly used in
classification problems (Delen et al., 2005). Unlike linear regression,
it allows prediction of a discrete outcome from a set of input variables that may be continuous, discrete, dichotomous, or a mix of
any of these (Stojanova et al., 2012). The logit (logarithm of the odds
ratio) is assumed to be linear with respect to these variables
(Hosmer and Lemeshow, 2000). In the LR, the probability of a case
belonging to a certain class can be found (Caesarendra et al., 2010).
The probability of a dichotomous class is related to a set of potential
input variables. Also, the relationships among the variables and the
strengths of these relationships can be obtained from the LR model
(Stojanova et al., 2012).
The major advantage of LR is that it can produce a simple
probabilistic formula for the classification. In addition, it does not
assume linearity in the relationship between the input variables
and the classes, nor does it require normally distributed input
variables (Keramati and Yousefi, 2011). The LR model is defined as
follows:

$$ \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n, \tag{3} $$

where $p$ is the probability of the outcome of interest, $\beta_0$ is the
intercept term, and $\beta_i$ represents the coefficient associated with
the corresponding independent variable $x_i$ ($i = 1, \ldots, n$).
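
Solving the LR model for p gives the logistic function, which can be sketched in Python as follows. The function name `predict_probability`, the intercept, the coefficients, and the input ratings are hypothetical values chosen only for illustration.

```python
import math


def predict_probability(x, intercept, coefficients):
    """Invert the logit: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn)))."""
    logit = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical two-variable model and one case.
p = predict_probability([2.0, 4.0], intercept=1.5, coefficients=[-0.8, 0.3])
print(round(p, 3))
print(round(math.log(p / (1.0 - p)), 3))  # recovers the linear predictor
```

A case is then typically assigned to the class of interest when p exceeds a chosen cut-off such as 0.5, although the cut-off is a modeling choice.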
4.4. Evaluation
The evaluation of the four classifiers was based on comparing
their performance with respect to the average prediction accuracy,
sensitivity, and specificity. These statistics were calculated by
computing four quantities: the number of correctly classified successes (true positives, TP), the number of correctly classified failures
(true negatives, TN), the number of failures incorrectly classified as
successes (false positives, FP), and the number of successes incorrectly classified as failures (false negatives, FN) (Sokolova and
Lapalme, 2009). The three performance measures are given by
the following formulae:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4} $$

$$ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{5} $$

$$ \text{Specificity} = \frac{TN}{FP + TN} \tag{6} $$
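
The three measures can be computed from the four counts as in the following Python sketch. The function name `classification_metrics` and the counts are hypothetical and do not correspond to any result reported in this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of successes correctly identified
        "specificity": tn / (fp + tn),   # share of failures correctly identified
    }


# Hypothetical counts for 100 classified cases.
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```

Reporting sensitivity and specificity alongside accuracy shows whether a classifier performs well on both outcome classes or mainly on one of them.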

For the evaluation of the classification performance of each
classifier, k-fold cross-validation was used. With the data set
comprising green building projects, splitting the data into
training and test sets is not feasible because the number of
projects constructed within recent years is so small. In such a
case, k-fold cross-validation is one solution to evaluate the
generalization performance of a model (An et al., 2007; Duda
et al., 2001). This method is known for its tendency to yield
minimal bias and variance in comparison to all other validation
methods, including the leave-one-out method (Kohavi, 1995). In
addition, it has been shown that k-fold cross-validation can
prevent the over-fitting problem in the estimation of performance (e.g., Ma et al., 2008; Ding et al., 2008; Min and Lee, 2005;
Hsu et al., 2003).
Extensive studies on numerous data sets with different classifiers have demonstrated that 10-fold cross-validation is optimal in
terms of computation and estimation of error, which is also
confirmed theoretically (Hastie et al., 2001). Thus, 10-fold cross-validation was used to assess the performance of the classifiers.
The data set was split into 10 approximately equal folds (sets),
with each fold in turn used for testing and the remaining nine folds
used for training. Thus, every case was used exactly once for testing.
The estimate of the overall performance is then calculated by
averaging the 10 sets of performance evaluation measures. Such a
procedure minimizes the impact of data dependence and improves
the reliability of the resultant estimates (West, 2000).
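
The fold construction and averaging described above can be sketched in Python as follows. This is an illustrative sketch rather than the WEKA procedure used in the study: the function names, the random shuffling, and the `evaluate` callback (which would train a classifier on the training indices and return a measure on the test indices) are assumptions.

```python
import random


def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train, test) index lists; every case is tested exactly once."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k approximately equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_cases, evaluate, k=10):
    """Average a performance measure over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_cases, k)]
    return sum(scores) / len(scores)


# With 69 cases, the 10 folds contain 6 or 7 test cases each.
print(sorted(len(test) for _, test in k_fold_indices(69)))
```

Averaging over the folds means the reported measure does not depend on a single, possibly unrepresentative, split of a small data set.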


5. Experimental results

Table 4
Cost performance prediction results for the four classification models.

5.1. Construction of prediction models


In this study, to show the reliability of the evaluation of classifiers, parameter optimization for each of three classifiers (SVM,
BPNN, and C4.5) was performed. The LR does not require parameter
optimization. The description of the parameter optimization process, the method, and the results are as follows. In all experiments,
this study employed algorithms from WEKA release 3.7.9 (Witten
and Frank, 2005), which is a Java-based machine learning tool
that enables the objective evaluation of classifiers.
5.1.1. Construction of the SVM model
There are two key tasks in using the SVM: choosing the kernel
function and setting the best kernel parameters. Selecting an
appropriate kernel function and setting appropriate kernel parameters can greatly improve the SVM classification accuracy
(Akay, 2009). The SVM for this study used a radial basis function,
which is the most commonly used kernel function. The parameters
that therefore have to be optimized for the SVM with a radial basis
function kernel are the penalty parameter $C$ and the kernel function
parameter $\gamma$. Improper selection of parameters can cause over-fitting or under-fitting problems in the SVM model (Min and Lee,
2005). In other words, these parameters provide indicative measures that help control the over-fitting of the SVM model (Bui et al.,
2012).
In this study, the grid search approach was used. The grid
search approach has been demonstrated as an efficient way to find
the best C and γ (Hsu et al., 2003). In the grid search, (C,γ) pairs
were tested and the pair with the best 10-fold cross-validation
accuracy was chosen. In this study, a grid space consisting of all
(C,γ) with log2 C ∈ {-5,-4,...,39,40} and log2 γ ∈ {-30,-29,...,19,20}
was used. It is interesting to note that the prediction performance
of the SVM was sensitive to both penalty and kernel function
parameters in the grid search, indicating that simultaneous
optimization of parameters was required for optimal prediction. After
conducting the grid search, the optimal (C,γ) for cost performance
prediction was found to be (22, 22), with a cross-validation accuracy
of 91.3%. For schedule performance prediction, the optimal
(C,γ) was found to be (20, 23) with a cross-validation accuracy of
90.5%.
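The grid search over (C, γ) can be sketched as follows. This is an illustration under stated assumptions, not the study's WEKA setup: the exponent ranges are read as -5 to 40 for C and -30 to 20 for γ, `cv_accuracy` is a hypothetical callback that would train an RBF-kernel SVM and return its 10-fold cross-validation accuracy, and `toy_cv_accuracy` is a made-up surface whose peak is unrelated to the reported optima.

```python
import itertools


def power_of_two_grid(log2_c_range, log2_gamma_range):
    """Yield (C, gamma) pairs with C = 2**i and gamma = 2**j."""
    for i, j in itertools.product(log2_c_range, log2_gamma_range):
        yield 2.0 ** i, 2.0 ** j


def grid_search(candidates, cv_accuracy):
    """Return the candidate pair with the highest cross-validation accuracy."""
    best_params, best_score = None, float("-inf")
    for params in candidates:
        score = cv_accuracy(*params)
        if score > best_score:  # ties keep the first pair encountered
            best_params, best_score = params, score
    return best_params, best_score


def toy_cv_accuracy(c, gamma):
    # Placeholder for "train an SVM and cross-validate it" (assumption):
    # an artificial surface that peaks at C = 8, gamma = 0.0625.
    return 1.0 / (1.0 + abs(c / 8.0 - 1.0) + abs(gamma / 0.0625 - 1.0))


# Assumed exponent ranges: log2 C in -5..40 and log2 gamma in -30..20.
grid = power_of_two_grid(range(-5, 41), range(-30, 21))
best, score = grid_search(grid, toy_cv_accuracy)
print(best, score)
```

Searching on a logarithmic grid keeps the number of candidates manageable (46 x 51 pairs here) while still spanning many orders of magnitude for both parameters. The same loop applies to the (η, α) and (F, M) grids of the other classifiers once the candidate lists are swapped.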
5.1.2. Construction of the BPNN model
A three-layered multi-layer perceptron was used in this study. It
has been shown that a multiple hidden-layer neural network can
approximate any function (Irie and Miyake, 1988). In the hidden
layer, the number of hidden nodes is preset as the average of the
number of selected variables and the number of classes. The
parameters to be optimized for a BPNN are the learning rate η and the
momentum α. To achieve this, a grid search was performed testing
(η,α) pairs, in a grid space consisting of all (η,α) with
η ∈ {0.1,...,0.8,0.9} and α ∈ {0.1,...,0.8,0.9}. Then, the pair with the
best 10-fold cross-validation accuracy was chosen. In the grid
search, it was observed that the prediction performance of the
BPNN was sensitive to both learning rate and momentum parameters.
After conducting the grid search, the optimal (η,α) for cost
performance prediction was found to be (0.7,0.1) with a
cross-validation accuracy of 87.0%. For schedule performance
prediction, the optimal (η,α) was found to be (0.5,0.1) with a
cross-validation accuracy of 75.7%.
5.1.3. Construction of the C4.5 model
The parameters that are generally adjusted in the design of a
C4.5 algorithm include the confidence factor (F) and the minimum
number of samples per leaf (M). To achieve higher prediction
accuracy and to avoid over-fitting of data, a grid search was carried
out by varying F from 0.15 to 0.35 in steps of 0.05 and M from 1 to
10 in steps of 1. Then, the pair with the best 10-fold
cross-validation accuracy was chosen. In the grid search, it was
observed that the C4.5 algorithm was also sensitive to the
confidence factor as well as the number of cases per leaf. After
conducting the grid search, the optimal (F,M) for cost performance
prediction was found to be (0.25,3) with a cross-validation
accuracy of 82.6%. For schedule performance prediction, the optimal
(F,M) was found to be (0.35,3) with a cross-validation accuracy of
77.0%.

Table 4
Cost performance prediction results for the four classification models.

Model   Accuracy (%)   Sensitivity (%)   Specificity (%)
SVM     91.30          96.88             86.49
BPNN    86.96          87.50             86.49
C4.5    82.61          90.63             75.68
LR      76.81          81.25             72.97

5.2. Prediction performance comparisons
The average prediction accuracy, as well as sensitivity and
specificity, for the four different classifiers for cost and schedule
performance prediction are shown in Tables 4 and 5. They are
ranked primarily in descending order of their average prediction
accuracy. The SVM models exhibited the highest accuracy,
sensitivity, and specificity in predicting both cost and schedule
performance, whereas the LR fared the worst. These results
clearly indicate that the cost and schedule performance of
green building projects can be predicted at an early stage with
the SVM models. The BPNN and C4.5 models showed comparable
performance in terms of accuracy, sensitivity, and specificity for
both cost and schedule performance predictions. The
statistical LR model had the worst predictive accuracy and the
largest errors of the four models in terms of sensitivity and
specificity.
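The three measures compared above follow directly from a binary confusion matrix. The sketch below uses the standard definitions. The counts are not reported in this section: they are back-calculated as one set of whole numbers consistent with the SVM cost-performance percentages (91.30, 96.88, 86.49), and the function and variable names are illustrative.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, and specificity as percentages."""
    total = tp + fn + tn + fp
    return {
        "accuracy": 100.0 * (tp + tn) / total,    # share of all cases classified correctly
        "sensitivity": 100.0 * tp / (tp + fn),    # share of positive cases detected
        "specificity": 100.0 * tn / (tn + fp),    # share of negative cases detected
    }


# Assumed counts, inferred from the reported percentages rather than taken
# from the paper: 32 positive and 37 negative cases.
svm_cost = classification_metrics(tp=31, fn=1, tn=32, fp=5)
for name, value in svm_cost.items():
    print(f"{name}: {value:.2f}")
```

Reporting sensitivity and specificity alongside accuracy shows whether a model's errors are concentrated in one outcome class, which accuracy alone would hide.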
6. Conclusion
The prediction of cost and schedule performance of green
building projects during the pre-project planning phase is an
important and challenging issue. Pre-project planning is generally
acknowledged to be a key contributor to success in green building
projects. The aim of this study was to develop a model to predict
the cost and schedule performance of green building projects based
on the level of definition during the pre-project planning phase.
A three-step process was proposed to achieve this objective:
pre-processing, variable selection, and prediction model
construction. Data from 53 certified green buildings were used to
develop the models. The data set was balanced in terms of the
relative proportion of the outcome classes by pre-processing. The
number of input variables was reduced from 64 to 13 and 7 for cost
and schedule performance prediction, respectively, using the
variable selection method, ReliefF-W. Then, cost and schedule
performance prediction models were constructed using the selected
variables and four different classifiers. The classification
performance of these classifiers was compared in terms of their
accuracy, sensitivity, and specificity, to evaluate the applicability
of the models.

Table 5
Schedule performance prediction results for the four classification models.

Model   Accuracy (%)   Sensitivity (%)   Specificity (%)
SVM     90.54          84.38             95.24
BPNN    75.68          71.88             78.57
C4.5    77.03          62.50             88.10
LR      60.81          46.88             71.43

Several interesting findings stem from this study. First, the
SVM model outperformed the BPNN, C4.5, and LR models in predicting
both the cost and schedule performance of green building projects.
Project performance can
be predicted at an early stage using the SVM model, with 91.3%
and 90.5% average accuracy, 96.9% and 84.4% sensitivity, and
86.5% and 95.2% specificity for cost and schedule performance,
respectively.
Second, the results of this study empirically validated that the
cost and schedule performance of green building projects was
highly dependent on the quality of definition in the pre-project
planning phase. The prediction models were appropriate for
describing the relationship between the level of definition in the
pre-project planning phase and the cost and schedule performance
of green building projects; selected variables encompassing project
activities in the pre-project planning phase were decisive factors
affecting the predictive capability of the models. Consequently, this
again points to the SVM models in the pre-project planning phase
as accurate predictors.
Third, the proposed models could be utilized to credibly
assess the cost and schedule performance of green building
projects, based on the level of definition during the pre-project
planning phase. Hence, the model could be used to assist project
stakeholders in assessing the potential success or failure of
cost and schedule performance of their green building projects
and in making decisions on whether, and how, to proceed with
the project. During the pre-project planning phase, such a reliable and useful assessment tool is crucial for project stakeholders
in assessing and identifying their green building project's potential success or failure in advance of the beginning of actual
design and execution, and therefore without having to undertake
a large investment in money, time, and effort. Ultimately, an
accurate model for future cost and schedule performance will
serve as a guide to the early development of strategies for
distributing the control of cost and schedule performance among
various stakeholders, thereby contributing to the overall
enhancement of green building project success in the construction industry.
It is recommended that future studies empirically compare the
impact of pre-project efforts on the cost and schedule performance
between green building projects and traditional projects. This may
allow project stakeholders to better justify additional pre-project
planning efforts for their projects, which may help to deliver
their green projects more successfully in terms of cost and schedule
performance. Further, this study was performed exclusively based
on the data from 53 green building projects. Thus, to determine the
generalizability of the results across this industry, additional
empirical studies based on a larger number of green building projects are needed.

Acknowledgments
This research was supported by the Basic Science Research
Program through the National Research Foundation of
Korea (NRF) funded by the Ministry of Education
(NRF-2013R1A1A2A10058175).

Appendix I. Description of the 64 PDRI variables (CII, 1999)

Section/category/variable description
I. Basis of Project Decision
A. Business Strategy
A1. Building Use Requirements
A2. Business Justification
A3. Business Plan
A4. Economic Analysis
A5. Facility Requirements
A6. Future Expansion/Alteration Considerations
A7. Site Selection Considerations
A8. Project Objectives Statement
B. Owner Philosophies
B1. Reliability Philosophy
B2. Maintenance Philosophy
B3. Operating Philosophy
B4. Design Philosophy
C. Project Requirements
C1. Value-Analysis Process
C2. Project Design Criteria
C3. Evaluation of Existing Facilities
C4. Scope of Work Overview
C5. Project Schedule
C6. Project Cost Estimate
II. Basis of Design
D. Site Information
D1. Site Layout
D2. Site Surveys
D3. Civil/Geotechnical Information
D4. Governing Regulatory Requirements
D5. Environmental Assessment
D6. Utility Sources with Supply Conditions
D7. Site Life Safety Considerations
D8. Special Water and Waste Treatment Requirements
E. Building Programming
E1. Program Statement
E2. Building Summary Space List
E3. Overall Adjacency Diagrams
E4. Stacking Diagrams
E5. Growth and Phased Development
E6. Circulation and Open Space Requirements
E7. Functional Relationship Diagram/Room by Room
E8. Loading/Unloading/Storage Facilities Requirements
E9. Transportation Requirements
E10. Building Finishes
E11. Room Data Sheets
E12. Furnishings, Equipment, and Built-Ins
E13. Window Treatment
F. Building/Project Design Parameters
F1. Civil/Site Design
F2. Architectural Design
F3. Structural Design
F4. Mechanical Design
F5. Electrical Design
F6. Building Life Safety Requirements
F7. Constructability Analysis
F8. Technological Sophistication
G. Equipment
G1. Equipment List
G2. Equipment Location Drawings
G3. Equipment Utility Requirements
III. Execution Approach
H. Procurement Strategy
H1. Identify Long-lead/Critical Equipment and Materials
H2. Procurement Procedures and Plans
J. Deliverables
J1. CADD/Model Requirements
J2. Documentation/Deliverables
K. Project Control
K1. Project Quality Assurance and Control
K2. Project Cost Control
K3. Project Schedule Control
K4. Risk Management
K5. Safety Procedures

L. Project Execution Plan
L1. Project Organization
L2. Owner Approval Requirements
L3. Project Delivery Method
L4. Design/Construction Plan and Approach
L5. Substantial Completion Requirements

References
Aarabi, A., Wallois, F., Grebe, R., 2006. Automated neonatal seizure detection: a multistage classification system through feature selection based on relevance and redundancy analysis. Clin. Neurophysiol. 117, 328–340.
Akay, M.F., 2009. Support vector machines combined with feature selection for breast cancer diagnosis. Expert Syst. Appl. 36, 3240–3247.
Ali, S., Smith, K.A., 2006. On learning algorithm selection for classification. Appl. Soft Comput. 6, 119–138.
An, S., Liu, W., Venkatesh, S., 2007. Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression. Pattern Recogn. 40, 2154–2162.
Atlee, J., 2011. Selecting safer building products in practice. J. Clean. Prod. 19, 459–463.
Bui, D.T., Pradhan, B., Lofman, O., Revhaug, I., 2012. Landslide susceptibility assessment in Vietnam using support vector machines, decision tree, and naïve Bayes models. Math. Probl. Eng. 2012, 1–26.
Caesarendra, W., Widodo, A., Yang, B.-S., 2010. Application of relevance vector machine and logistic regression for machine degradation assessment. Mech. Syst. Signal Process. 24, 1161–1171.
Chandramohan, A., Narayanan, S.L., Gaurav, A., Krishna, N., 2012. Cost and time overrun analysis for green construction projects. Int. J. Green. Econ. 6, 167–177.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357.
Chen, Y.-H., Yu, S.-N., 2012. Selection of effective features for ECG beat recognition based on nonlinear correlations. Artif. Intell. Med. 54, 43–52.
Cho, C.-S., Gibson Jr., G.E., 2001. Building project scope definition using project definition rating index. J. Archit. Eng. 7, 115–125.
Construction Industry Institute (CII), 1999. Project Definition Rating Index (PDRI) – Building Projects. Implementation Resource 155-2, Austin, TX.
Cybenko, G., 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signal 2, 303–314.
Delen, D., Walker, G., Kadam, A., 2005. Predicting breast cancer survivability: a comparison of three data mining methods. Artif. Intell. Med. 34, 113–127.
Ding, Y., Song, X., Zen, Y., 2008. Forecasting financial condition of Chinese listed companies based on support vector machine. Expert Syst. Appl. 34, 3081–3089.
Dreiseitl, S., Ohno-Machado, L., Kittler, H., Vinterbo, S., Billhardt, H., Binder, M., 2001. A comparison of machine learning methods for the diagnosis of pigmented skin lesions. J. Biomed. Inf. 34, 28–36.
Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification. Chapter 9. John Wiley & Sons, New York, NY.
D'Haen, J., van den Poel, D., 2013. Model-supported business-to-business prospect prediction based on an iterative customer acquisition framework. Ind. Mark. Manag. 42, 544–551.
Fernández, A., del Jesus, M.J., Herrera, F., 2009. Hierarchical fuzzy rule based classification systems with genetic rule selection for imbalanced data-sets. Int. J. Approx. Reason 50, 561–577.
Gao, M., Hong, X., Chen, S., Harris, C.J., 2011. A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems. Neurocomputing 74, 3456–3466.
Gardner, M.W., Dorling, S.R., 1998. Artificial neural networks (the multilayer perceptron) – a review of applications in the atmospheric sciences. Atmos. Environ. 32, 2627–2636.
Geethanjali, P., Ray, K.K., 2011. Identification of motion from multi-channel EMG signals for control of prosthetic hand. Australas. Phys. Eng. Sci. Med. 34, 419–427.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference and Prediction, first ed. Springer-Verlag, New York, NY.
Hosmer, D.W., Lemeshow, S., 2000. Applied Logistic Regression, second ed. John Wiley & Sons, New York, NY.
Hsu, C.-W., Chang, C.-C., Lin, C.-J., 2003. A Practical Guide to Support Vector Classification. Technical Report. National Taiwan University, Taipei, Taiwan.
Hwang, B.-G., Leong, L.P., 2013. Comparison of schedule delay and causal factors between traditional and green construction projects. Technol. Econ. Dev. Eco. 19, 310–330.
Irie, B., Miyake, S., 1988. Capabilities of three-layered perceptrons. In: Proc. IEEE Int. Conf. on Neural Networks, 24–27 July 1988, San Diego, CA, pp. 641–648.
Keramati, A., Yousefi, N., 2011. A proposed classification of data mining techniques in credit scoring. In: Proc. 2011 Int. Conf. on Industrial Engineering and Operations Management, 22–24 January 2011, Kuala Lumpur, Malaysia, pp. 416–424.
Kohavi, R., 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proc. Int. Joint Conf. on Artificial Intelligence, 20–25 August 1995, Montréal, Canada, pp. 1137–1145.
Kohavi, R., John, G.H., 1997. Wrappers for feature subset selection. Artif. Intell. 97, 273–324.
Li, T., Zhang, C., Ogihara, M., 2004. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20, 2429–2437.
Li, D.-C., Liu, C.-W., Hu, S.C., 2010. A learning method for the class imbalance problem with medical data sets. Comput. Biol. Med. 40, 509–518.
Liu, H., Setiono, R., 1998. Some issues on scalable feature selection. Expert Syst. Appl. 15, 333–339.
Liu, J.-W., Cheng, C.-H., Chen, Y.-H., Chen, T.-L., 2010. OWA rough set model for forecasting the revenues growth rate of the electronic industry. Expert Syst. Appl. 37, 610–617.
Ma, C.-Y., Yang, S.-Y., Zhang, H., Xiang, M.-L., Huang, Q., Wei, Y.-Q., 2008. Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA–CG–SVM method. J. Pharm. Biomed. 47, 677–682.
McGraw-Hill Construction, 2012. Green Building Outlook Strong for Both Nonresidential & Residential Sectors Despite Soft Economy. Press Release. http://www.construction.com/about-us/press/green-building-outlook-strong-forboth-non-residential-and-residential.asp.
Min, J.H., Lee, Y.-C., 2005. Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Syst. Appl. 28, 603–614.
Ng, W.W.Y., Yeung, D.S., Firth, M., Tsang, E.C.C., Wang, X.-Z., 2008. Feature selection using localized generalization error for supervised classification problems using RBFNN. Pattern Recogn. 41, 3706–3719.
Olson, D.L., Delen, D., Meng, Y., 2012. Comparative analysis of data mining methods for bankruptcy prediction. Decis. Supp. Syst. 52, 464–473.
Pulaski, M.H., Horman, M.J., Riley, D.R., 2006. Constructability practices to manage sustainable building knowledge. J. Archit. Eng. 12, 83–92.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning, first ed. Morgan Kaufmann, San Mateo, CA.
Quintana, D., Saez, Y., Mochon, A., Isasi, P., 2008. Early bankruptcy prediction using ENPC. Appl. Intell. 29, 157–161.
Robichaud, L.B., Anantatmula, V.S., 2011. Greening project management practices for sustainable construction. J. Manage. Eng. 27, 48–57.
Shen, L.-Y., Tam, V.W.Y., Tam, L., Ji, Y.-B., 2010. Project feasibility study: the key to successful implementation of sustainable and socially responsible construction management practice. J. Clean. Prod. 18, 254–259.
Sokolova, M., Lapalme, G., 2009. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 45, 427–437.
Son, H., Kim, C., Kim, C., 2012. Hybrid principal component analysis and support vector machine model for predicting the cost performance of commercial building projects using pre-project planning variables. Autom. Constr. 27, 60–66.
Stojanova, D., Kobler, A., Ogrinc, P., Ženko, B., Džeroski, S., 2012. Estimating the risk of fire outbreaks in the natural environment. Data Min. Knowl. Disc. 24, 411–442.
Swarup, L., Korkmaz, S., Riley, D., 2011. Project delivery metrics for sustainable, high-performance buildings. J. Constr. Eng. Manage. 137, 1043–1051.
U.S. Environmental Protection Agency, 2010a. Basic Information. EPA Web site. http://www.epa.gov/greenbuilding/pubs/about.htm.
U.S. Environmental Protection Agency, 2010b. Why Build Green? EPA Web site. http://www.epa.gov/greenbuilding/pubs/whybuild.htm.
van Hulse, J., Khoshgoftaar, T.M., Napolitano, A., Wald, R., 2012. Threshold-based feature selection techniques for high-dimensional bioinformatics data. Netw. Model Anal. Health Inf. Bioinforma. 1, 47–61.
Vapnik, V.N., 1995. The Nature of Statistical Learning Theory, first ed. Springer, New York, NY.
Vapnik, V., 1998. The support vector method of function estimation. In: Suykens, J.A.K., Vandewalle, J.P.L. (Eds.), Nonlinear Modeling: Advanced Black Box Techniques. Springer, New York, NY, pp. 55–85.
Wang, Y.-R., 2002. Applying the PDRI in Project Risk Management. Ph.D. dissertation. The University of Texas at Austin, Austin, TX.
Wang, Y.-R., Gibson, G.E., 2010. A study of preproject planning and project success using ANNs and regression models. Autom. Constr. 19, 341–346.
Wang, Y., Tetko, I.V., Hall, M.A., Frank, E., Facius, A., Mayer, K.F.X., Mewes, H.W., 2005. Gene selection from microarray data for cancer classification – a machine learning approach. Comput. Biol. Chem. 29, 37–46.
Wang, Y.-R., Yu, C.-Y., Chan, H.-H., 2012. Predicting construction cost and schedule success using artificial neural networks ensemble and support vector machines classification models. Int. J. Proj. Manage. 30, 470–478.
West, D., 2000. Neural network credit scoring models. Comput. Oper. Res. 27, 1131–1152.
Witten, I.H., Frank, E., 2005. Data Mining: Practical Machine Learning Tools and Techniques, second ed. Morgan Kaufmann, San Francisco, CA.
Xia, G.-E., Jin, W.-D., 2008. Model of customer churn prediction on support vector machine. Syst. Eng. Theory Pract. 28, 71–77.
Zhang, X., 2014. Paradigm shift toward sustainable commercial project development. Habitat Int. 42, 186–192.
Zhang, X., Platten, A., Shen, L., 2011a. Green property development practice in China: costs and barriers. Build. Environ. 46, 2153–2160.
Zhang, X., Shen, L., Wu, Y., 2011b. Green strategy for gaining competitive advantage in housing development: a China study. J. Clean. Prod. 19, 157–167.
Zhang, X., Shen, L., Zhang, L., 2013. Life cycle assessment of the air emissions during building construction process: a case study in Hong Kong. Renew. Sust. Energ. Rev. 17, 160–169.
