
2007 IEEE International Conference on Granular Computing

Analyzing Software System Quality Risk Using Bayesian Belief Network

Hu Yong
Guangdong University of Foreign Studies, Sun Yat-sen University, 510275, China henryhu200211@163.com

Chen Juhua Sun Yat-sen University, 510275, China isscjh@ mail.

Jiaxing Huang Sun Yat-sen University, 510275, China

Mei Liu University of Kansas, 66045, U.S.A

Kang Xie Sun Yat-sen University, 510275, China mnsxk@mail.sysu.

Uncertainty during software project development often exposes contractors and clients to large risks. A method that predicts the cost and quality of a software project at its outset, based on facts such as project characteristics and the cooperation capability of the two sides, can help find ways to reduce these risks. The Bayesian Belief Network (BBN) is a good tool for analyzing uncertain consequences, but it is difficult to produce a precise network structure and conditional probability table. In this paper, we build the network structure with the Delphi method as the basis for conditional probability table learning, and we continuously update the probability table and the confidence levels of the nodes according to application cases. This gives the evaluation network a learning ability and lets it evaluate software development risks in organizations more accurately. This paper also introduces the EM algorithm to improve the handling of hidden nodes caused by varying software projects.

1. Introduction
Software development technologies and tools have progressed rapidly in recent years, yet many software projects still fail because of schedule delays, cost overruns, or unsatisfactory products. According to a report delivered by the Standish Group in 2004, 18% of the software projects investigated were considered failures, and 58% of them were unsuccessful due to schedule delay, budget overrun, or an unsatisfactory product. A successful software development project relies on many factors, and it is complicated and difficult to control all of them and to ensure that they interact smoothly. Risks can be managed effectively if they are detected beforehand. The goal of this paper is to introduce a mathematical model and to demonstrate that a software development team can rely on it to accurately predict the risks and their impact on the success of the project.

2. Related Works
Boehm [1] proposed dividing software risk management into two parts: risk assessment and risk control. Roger S. Pressman [2] suggested a risk driver

Research supported by Guangdong Software Science Foundation (2005B70101096) and National Natural Science Foundation (70572053, 60673135). Corresponding author: Xie Kang

method modeled on the risk assessment method of the Air Force. This driver method has a complete network structure, but it cannot use historical data effectively. Ramamoorthy used a rule-based expert system and a knowledge system based on influence diagrams to assess risks [3]. This method integrated historical data with mathematical methods. Clyde Chittister [4] posed three questions that software risk assessors must answer: What are the problems? How often do the problems occur? What troubles will the problems bring? Bayesian Belief Networks (BBN) can provide good answers to these questions. The nodes of a BBN that correspond to risk events answer the first question, the confidence levels of the nodes answer the second, and the conditional probability relationships answer the third.

Sunita Chulani [5] proposed applying the Bayesian Belief Network to software cost assessment. This approach integrates predictive and empirical knowledge with sample data, and it learns well from incomplete and sparse data samples. Pendharkar [6] suggested using a Bayesian Belief Network as a software cost assessment model that integrates historical information and expert assessment to obtain more accurate cost predictions.

Another important application of BBN in the software development process is software quality management. Norman Fenton [7][8] proposed applications of BBN in software defect prediction, quality control, and management. Through comparison with six other types of methods, he indicated that BBN is the most effective model in software quality management [9]. He built the BBN model mostly on expert judgment: the probability distributions are determined from an analysis of the literature or from common-sense assumptions about the direction and strength of the relations between variables [10].

Anthony Kwok Tai Hui et al. [11] introduced a method to build a Bayesian Belief Network for software risk assessment. They used a professional research report as a reference to build the network and its conditional probability table. Bayesian reasoning on the assessment network then yields the probability distribution of every risk event. With this method, a Bayesian Belief Network can be applied to the whole process of software risk assessment, and it is reliable. It has two drawbacks, however. First, since the conditional probability table is based on expert experience, the result can be subjective. Second, when additional nodes need to be introduced into the network, new research must be performed to obtain the new conditional probability table. Therefore, the expandability

0-7695-3032-X/07 $25.00 2007 IEEE DOI 10.1109/GrC.2007.83


is limited. The main work of this paper is to analyze the risk factors, to extend and adjust the Bayesian Belief Network in order to build a better network, and to learn the conditional probability table from sample data.

3. Build a Bayesian Belief Assessment Network

A BBN is a special directed acyclic graph. A complete BBN consists of a network structure and a CPT (conditional probability table). Once the network structure is fixed, the conditional independence assumptions among the probability events are determined. There are two ways to build a Bayesian Belief Network. One is to learn the network structure from given sample data with a suitable algorithm; the other is to fix the network structure according to expert experience using the Delphi method. In this paper, we adopt the second approach, for two reasons. First, constructing the network structure by learning requires a large amount of sample data and counting work. Second, we believe that experts in software development can judge the independence relationships between two events. Therefore, we present a set of event factors and judge the independence relationships among them according to expert opinion. First, we collected the event factors from previous work [12][13][14]. We then classified and filtered the factors and selected 50 risk factors to form a list. Next, we used this list to fix the topology and rate of the network by the Delphi method. Finally, we presented the list to about 30 experts and let them draw connections between pairs of risk factors that may have a causal relationship. To simplify the network and to satisfy the properties of a directed acyclic graph, we made some necessary changes by deleting the connections that had the least agreement among the experts. The partial network is shown in Figure 1.
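
The pruning step described above can be illustrated with a minimal Python sketch. It is not the procedure used in the paper: the vote threshold, the function names (`find_cycle`, `build_dag`), and the example factors and vote counts are all hypothetical, and the sketch covers only the aggregation of one round of expert answers, not the iterative Delphi rounds themselves. It keeps the connections that enough experts drew and then, while a directed cycle remains, deletes the connection in that cycle with the least agreement.

```python
from collections import defaultdict

def find_cycle(edges):
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)   # every node starts WHITE
    path = []                  # nodes on the current DFS path

    def dfs(u):
        color[u] = GRAY
        path.append(u)
        for v in graph[u]:
            if color[v] == GRAY:            # back edge closes a cycle
                cyc = path[path.index(v):] + [v]
                return list(zip(cyc, cyc[1:]))
            if color[v] == WHITE:
                found = dfs(v)
                if found:
                    return found
        color[u] = BLACK
        path.pop()
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = dfs(node)
            if found:
                return found
    return None

def build_dag(votes, min_votes):
    """votes maps (cause, effect) to the number of experts who drew that link."""
    kept = {edge: n for edge, n in votes.items() if n >= min_votes}
    while True:
        cycle = find_cycle(list(kept))
        if cycle is None:
            return kept
        weakest = min(cycle, key=lambda edge: kept[edge])
        del kept[weakest]       # drop the least-agreed link in the cycle

# Hypothetical answers from 30 experts (illustrative numbers only).
votes = {
    ("requirements_change", "schedule_delay"): 24,
    ("schedule_delay", "quality_defect"): 21,
    ("quality_defect", "requirements_change"): 9,
    ("team_inexperience", "quality_defect"): 18,
    ("team_inexperience", "schedule_delay"): 5,
}
dag = build_dag(votes, min_votes=8)
for (cause, effect), n in sorted(dag.items()):
    print(f"{cause} -> {effect} ({n} votes)")
```

Removing the weakest link of each detected cycle is only one possible reading of "least agreement"; other tie-breaking or threshold rules would fit the description equally well.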

4. Bayesian Belief Network Parameter Learning

Once the network structure is fixed, we can start working on CPT learning. So that the model retains its assessing ability when there are not enough data samples, the learning model integrates predictive knowledge with the information in the collected data. Here, the predictive knowledge is the probability of each event node and the conditional probability between adjacent nodes. The experts gave their subjective opinions when building the network structure, and each CPT entry is the average of the values given by the experts. This CPT is used as the initial table for network parameter learning. We gave every network node two values: {0: never happened, 1: happened}. We denote the nodes of the network structure as $v_i$; the CPT we need then defines the joint probability of the whole Bayesian Belief Network. There are two methods to learn this joint probability. When all the data samples in the collection are complete, we can use the gradient ascent algorithm [15]. However, through investigation, we found that we cannot give all the items a value in all project samples, because projects differ. If the value of some event nodes cannot be determined, the project sample is considered incomplete. When the sample set contains samples with missing data, our model automatically calls the EM algorithm for learning. Lauritzen investigated how the EM algorithm can be applied to learning Bayesian Belief Network parameters [16]. If there is missing data, the nodes with missing values are first given hypothesized values, which are then corrected according to the Bayesian Belief Network being learned.

We now show how EM is applied in our learning model. Because each event node takes only a few discrete values, let $N_{ijk}$ stand for the number of samples in the sample set $D$ where $v_i = v_{ij}$ and $\pi_i = \pi_{ik}$, where $\pi_i$ denotes the parent nodes of $v_i$ and $\pi_{ik}$ their $k$-th configuration. Let $N_{ik}$ stand for the number of samples where $\pi_i = \pi_{ik}$, and let $\theta_{ijk}$ stand for the probability of $v_i = v_{ij}$ when $\pi_i = \pi_{ik}$. We take the current CPT as the hypothesized initial value $\theta^{0}$. The next hypothesis $\theta^{t+1}$ is calculated from the present hypothesis $\theta^{t}$ in the following steps. First, let the value of $\theta_{ijk}$ under the present hypothesis be $\theta^{t}_{ijk}$. We define the likelihood function

$$l(\theta \mid \theta^{t}) = \sum_{l} \sum_{i,j,k} P(v_i = v_{ij}, \pi_i = \pi_{ik} \mid d_l, \theta^{t}) \ln \theta_{ijk} \qquad (1)$$

where $d_l$ is the $l$-th sample in $D$. Given this likelihood function, each iteration of the EM algorithm has two steps. In the first step, the E step, we calculate for each $i, k, j$ the expectation of $N_{ijk}$ under hypothesis $\theta^{t}$:

$$E[N_{ijk} \mid D, \theta^{t}] = \sum_{l} P(v_i = v_{ij}, \pi_i = \pi_{ik} \mid d_l, \theta^{t}) \qquad (2)$$

Then we can calculate formula (1) using the expectations of all $N_{ijk}$, since formula (1) equals $\sum_{i,j,k} E[N_{ijk} \mid D, \theta^{t}] \ln \theta_{ijk}$. In the second step, the M step, we choose $\theta^{t+1}$ to be the $\theta$ that maximizes formula (1):

$$\theta^{t+1} = \arg\max_{\theta} \, l(\theta \mid \theta^{t}) \qquad (3)$$

The iteration continues until the value of formula (1) converges. Lauritzen proved that this convergence point exists and that it can be reached in a few steps [16]. When it converges, we have the value of $\theta_{ijk}$ for each $i, k, j$. Then we update the CPT according to formula (4) so that the probabilities sum to 1:

$$\theta_{ijk} \leftarrow \frac{\theta_{ijk}}{\sum_{j'} \theta_{ij'k}} \quad \text{for each } i, k \qquad (4)$$
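
As an illustration of the E step, M step, and normalization described above, the following Python sketch runs EM on a deliberately tiny, hypothetical network with one parent node A and one child node B, both binary, where `None` marks a missing value. The network, the sample data, the initial CPT, and the function names `em_step` and `fit` are assumptions made for the example, not the model of the paper. The sketch also stops when the parameters stop changing rather than when the value of the likelihood function converges, which is a simplification.

```python
def em_step(samples, p_a, p_b):
    """One EM iteration for the network A -> B with binary nodes.

    p_a[a] is P(A = a); p_b[a][b] is P(B = b | A = a).
    A sample is a pair (a, b); None marks a missing value.
    """
    n_a = [0.0, 0.0]                     # expected counts of A = a
    n_ab = [[0.0, 0.0], [0.0, 0.0]]      # expected counts of (A = a, B = b)
    for a_obs, b_obs in samples:
        # E step: weight every completion consistent with the observed values
        # by its probability under the current parameters.
        completions = [(a, b) for a in (0, 1) for b in (0, 1)
                       if a_obs in (None, a) and b_obs in (None, b)]
        weights = [p_a[a] * p_b[a][b] for a, b in completions]
        total = sum(weights)
        for (a, b), w in zip(completions, weights):
            n_a[a] += w / total
            n_ab[a][b] += w / total
    # M step: normalizing the expected counts maximizes the expected
    # log-likelihood and makes every CPT row sum to 1.
    new_p_a = [n / sum(n_a) for n in n_a]
    new_p_b = [[n / sum(row) for n in row] for row in n_ab]
    return new_p_a, new_p_b

def fit(samples, p_a, p_b, tol=1e-6, max_iter=200):
    """Repeat EM steps until the parameters change by less than tol."""
    for _ in range(max_iter):
        new_p_a, new_p_b = em_step(samples, p_a, p_b)
        change = max(abs(new_p_a[1] - p_a[1]),
                     abs(new_p_b[0][1] - p_b[0][1]),
                     abs(new_p_b[1][1] - p_b[1][1]))
        p_a, p_b = new_p_a, new_p_b
        if change < tol:
            break
    return p_a, p_b

# Hypothetical project samples: (risk factor occurred, project failed).
samples = [(1, 1), (1, 1), (None, 1), (1, None), (0, 0), (0, None), (0, 0), (1, 0)]
# Hypothetical initial CPT, standing in for the averaged expert estimates.
initial_p_a = [0.5, 0.5]
initial_p_b = [[0.7, 0.3], [0.4, 0.6]]

p_a, p_b = fit(samples, initial_p_a, initial_p_b)
print("P(A=1) =", round(p_a[1], 3))
print("P(B=1 | A=0) =", round(p_b[0][1], 3))
print("P(B=1 | A=1) =", round(p_b[1][1], 3))
```

When every sample is complete, a single call to `em_step` reduces to counting, which is a convenient sanity check on the expected-count logic.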



Figure 1. Software System Quality Risk Assessment Network

5. Introduction of the Assessment Tool Based on the Model

An operational tool for software risk assessment and simulation can be developed on the basis of our model. This tool may help the project management team to analyze and control the risks of software projects. On the one hand, it may help the user organization to assess the ability of the contractor organization. On the other hand, it may help the contractor organization to perform a self-assessment of the project, to configure resources, and to avoid risk. The input of the tool is the probability vector of the top-level nodes. Usually we set the probability of the existing event to 1 and that of the other event to 0. Once all the input has been entered, the reasoning of the model can start. The tool can also help project managers to trace the project through its whole life cycle. When some event happens, they can change the probability vector manually and start the simulation, which gives them a real-time risk prediction. This kind of simulation can also support the decision makers of the project. The tool can be used to monitor the project until its completion. When the project is finished, the result can be added to the data sample collection of the model learning process to analyze future projects.
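
To make the input and reasoning steps concrete, here is a minimal Python sketch of how such a tool might turn a probability vector over top-level nodes into a failure probability. The two top-level nodes, the CPT values, and the function name `predict_failure` are hypothetical, and the sketch assumes a single outcome node whose parents are independent top-level nodes, which is far simpler than the network in Figure 1.

```python
from itertools import product

def predict_failure(top_level_probs, cpt_fail):
    """Return P(project fails) by summing over all top-level node states.

    top_level_probs maps each top-level node to P(node = 1).
    cpt_fail maps a tuple of node states (in sorted node-name order)
    to P(fail = 1 | those states).
    """
    nodes = sorted(top_level_probs)
    p_fail = 0.0
    for states in product((0, 1), repeat=len(nodes)):
        weight = 1.0
        for node, state in zip(nodes, states):
            p = top_level_probs[node]
            weight *= p if state == 1 else 1.0 - p
        p_fail += weight * cpt_fail[states]
    return p_fail

# Hypothetical CPT; keys are (requirements_unstable, team_inexperienced).
cpt_fail = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}

# An event known to have happened gets probability 1, the other gets 0.
observed = {"requirements_unstable": 1.0, "team_inexperienced": 0.0}
print(predict_failure(observed, cpt_fail))

# During the project, a manager can edit the vector and rerun the reasoning.
uncertain = {"requirements_unstable": 0.5, "team_inexperienced": 0.5}
print(predict_failure(uncertain, cpt_fail))
```

Enumerating every state combination is only practical for a handful of nodes; a network with 50 risk factors would need a proper inference algorithm.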

6. Experiment and Validation

The Bayesian Belief Network has strong analyzing and learning capabilities and retains them in the presence of missing data, which matches the diversity of software development. This paper introduced the Bayesian Belief Network to simulate and analyze the changing risks of software development. We combined the current literature and expert experience to construct the network. The CPT obtained through learning and updating can greatly reduce the subjectivity of the assessment model and make the assessment result more reliable. We introduced the EM algorithm to learn the CPT, which enhanced the model's ability to analyze and predict changing projects.

To validate the model, we collected data from real projects with a questionnaire. We distributed 300 questionnaires in total and collected 135 samples back. After evaluation, 120 of them were considered valid. These data samples came from a broad range of industries, including software development, communication, Internet service, transportation, and government, located in Guangzhou, Shanghai, Shenzhen, and Nanjing. Among the participants, 72% have at least 4 years of software project experience, 50% are company managers or project managers, and 36.7% are project developers. This suggests that they have the expertise and are qualified to provide the data. We then separated the samples into two sets: 20 samples were chosen for network validation, and the remaining 100 samples were used to train the model. The precision reaches 80% (16 of the 20 validation samples are predicted correctly). As expected, most correct predictions (11 of 16) have a high probability, larger than 0.95. The detailed results are shown in Table 1.

7. Conclusion
In this research, we analyzed the potential risks involved in software system quality using a Bayesian Belief Network. The network structure is constructed with the Delphi method and serves as the basis for conditional probability table learning. The probability table and the confidence levels of the nodes are updated and learned continuously from application cases, which gives the evaluation network a learning ability and lets it evaluate software development risks in organizations more accurately. The EM algorithm is introduced to improve the handling of hidden nodes caused by varying software projects. Our model is validated through training and evaluation on 120 real-life development projects. The experimental results show that the model achieves a prediction accuracy of 80% (shown in Table 1). The confidence levels of the correct predictions are mostly larger than 0.95, which suggests that the model is reliable. Based on our model, an operational tool can be developed for software risk assessment and simulation, which can help the project management team to analyze and control the risks of software projects and help project managers to trace the project through its whole life cycle.

Table 1. Comparison of sample results and the prediction of the model

No   Sample Results   Prediction   Probability   True-false
1    Fail             Fail         0.97          T
2    Fail             Fail         0.99          T
3    Fail             Fail         0.97          T
4    Fail             Fail         1.0           T
5    Fail             Fail         0.97          T
6    Fail             Success      0.81          F
7    Fail             Fail         0.63          T
8    Success          Fail         0.79          F
9    Success          Success      0.97          T
10   Fail             Success      0.92          F
11   Fail             Fail         0.96          T
12   Success          Success      0.99          T
13   Success          Success      0.85          T
14   Fail             Success      0.64          F
15   Success          Success      0.61          T
16   Success          Success      0.98          T
17   Success          Success      0.98          T
18   Success          Success      0.98          T
19   Success          Success      0.94          T
20   Success          Success      0.64          T

Accuracy: 80%
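
The accuracy figure can be recomputed from the rows of Table 1 with a short Python sketch. The data below are copied from the table; the variable names are illustrative only.

```python
# Rows of Table 1: (sample result, prediction, probability of the prediction).
rows = [
    ("Fail", "Fail", 0.97), ("Fail", "Fail", 0.99), ("Fail", "Fail", 0.97),
    ("Fail", "Fail", 1.0), ("Fail", "Fail", 0.97), ("Fail", "Success", 0.81),
    ("Fail", "Fail", 0.63), ("Success", "Fail", 0.79), ("Success", "Success", 0.97),
    ("Fail", "Success", 0.92), ("Fail", "Fail", 0.96), ("Success", "Success", 0.99),
    ("Success", "Success", 0.85), ("Fail", "Success", 0.64), ("Success", "Success", 0.61),
    ("Success", "Success", 0.98), ("Success", "Success", 0.98), ("Success", "Success", 0.98),
    ("Success", "Success", 0.94), ("Success", "Success", 0.64),
]

# A prediction is correct when it matches the observed sample result.
correct = [(actual, predicted, prob) for actual, predicted, prob in rows
           if actual == predicted]
accuracy = len(correct) / len(rows)

# Correct predictions whose probability exceeds 0.95.
high_confidence = [row for row in correct if row[2] > 0.95]

print(f"accuracy = {accuracy:.0%}")                               # accuracy = 80%
print(f"correct with p > 0.95: {len(high_confidence)} of {len(correct)}")
```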

8. References
[1] Boehm. Software Risk Management: Principles and Practice. IEEE Software. 1991, (1):32-41.
[2] Roger S. Pressman. A Manager's Guide to Software Engineering. McGraw-Hill, Inc., New York, NY, 1993.
[3] C. V. Ramamoorthy, C. Chandra. Knowledge Based Tools for Risk Assessment in Software Development and Reuse. Proceedings of the 1993 IEEE International Conference on Tools with AI, Boston, Massachusetts, Nov. 1993. 1993, 364-371.
[4] Clyde Chittister, Yacov Y. Haimes. Assessment and Management of Software Technical Risk. IEEE Transactions on Systems, Man, and Cybernetics. 1994, 24(2):187-202.
[5] Sunita Chulani, Barry Boehm. Bayesian Analysis of Empirical Software Engineering Cost Models. IEEE Transactions on Software Engineering. 1999, 25(4):573-583.
[6] Parag C. Pendharkar, Girish H. Subramanian, James A. Rodger. A Probabilistic Model for Predicting Software Development Effort. IEEE Transactions on Software Engineering. 2005, 31(7):615-624.
[7] Martin Neil, Norman Fenton. Predicting Software Quality using Bayesian Belief Networks. Proceedings of 21st Annual Software Engineering Workshop, NASA/Goddard Space Flight Centre, December 4-5, 1996. 1996, 217-230.
[8] Norman Fenton, Martin Neil. Probabilistic Modeling for Software Quality Control. S. Benferhat and P. Besnard (Eds.): ECSQARU 2001, LNAI 2143. 2001, 444-453.
[9] Norman Fenton, Martin Neil. A Critique of Software Defect Prediction Models. IEEE Transactions on Software Engineering. 1999, 25(5):675-689.
[10] Norman Fenton, Paul Krause, Martin Neil. A Probabilistic Model for Software Defect Prediction. IEEE Transactions on Software Engineering. 2000.
[11] Anthony Kwok Tai Hui, Dar Biau Liu. A Bayesian Belief Network Model and Tool to Evaluate Risk and Impact in Software Development Projects. Reliability and Maintainability, 2004 Annual Symposium RAMS. 2004, 297-301.
[12] Sarma Nidumolu. The Effect of Coordination and Uncertainty on Software Project Performance: Residual Performance Risk as an Intervening Variable. Information Systems Research. 1995, 6(3):191-219.
[13] Linda Wallace, Mark Keil, Arun Rai. Understanding Software Project Risk: A Cluster Analysis. Information and Management. 2004, 42(1):115-125.
[14] Carr, M., Konda, S. Taxonomy Based Risk Identification. Software Engineering Institute Technical Report SEI-93-TR-006, Pittsburgh, PA. Software Engineering Institute (SEI internal report), 1993.
[15] Tom M. Mitchell. Machine Learning. McGraw-Hill Companies, Inc. 1997.
[16] Steffen L. Lauritzen. The EM Algorithm for Graphical Association Models with Missing Data. Computational Statistics & Data Analysis. 1995, 19(2):191-201.