Conrady Applied Science, LLC - Bayesia's North American Partner for Sales and Consulting
Table of Contents

Introduction
About the Authors: Stefan Conrady, Lionel Jouffe
Model Application
  Interactive Inference
  Target Interpretation Tree
Summary
References
Contact Information: Conrady Applied Science, LLC; Bayesia SAS
Copyright
www.conradyscience.com | www.bayesia.com
Introduction
Data classification is one of the most common tasks in statistical analysis, and countless methods have been developed for this purpose over time. A common approach is to develop a model based on known historical data, i.e. data in which the class membership of each record is known, and to use this generalization to predict the class membership of a new set of observations. Applications of data classification permeate virtually all fields of study, including the social sciences, engineering, and biology.

In the medical field, classification problems often appear in the context of disease identification, i.e. making a diagnosis about a patient's condition. The medical sciences have, over a long history, developed a large body of knowledge linking observable symptoms to known types of illnesses. It is the physician's task to use the available medical knowledge to make inferences based on the patient's symptoms, i.e. to classify the medical condition, in order to enable appropriate treatment. Over the last two decades, so-called medical expert systems have emerged, which are meant to support physicians in their diagnostic work. Given the sheer amount of medical knowledge in existence today, it should not be surprising that significant benefits are expected from such machine-based support for medical reasoning and inference.

In this context, several papers by Wolberg, Street, Heisey and Mangasarian became much-cited examples. They proposed an automated method for the classification of Fine Needle Aspirates1 through image processing and machine learning, with the objective of achieving greater accuracy in distinguishing between malignant and benign cells for the diagnosis of breast cancer. At the time of their study, the practice of visual inspection of FNA yielded an inconsistent diagnostic accuracy. The proposed new approach would reliably increase this accuracy to over 95%.
This research was quickly translated into clinical practice and has since been applied with continued success. As part of their studies in the late 1980s and 1990s, the research team generated what became known as the Wisconsin Breast Cancer Database, which contains measurements of hundreds of FNA samples and the associated diagnoses. This database has been extensively studied, especially outside the medical field. Statisticians and computer scientists have proposed a wide range of techniques for this classification problem and have continuously raised the benchmark for predictive performance.

Our objective with this paper is to present Bayesian networks as a very practical framework for working with this kind of classification problem. Furthermore, we intend to demonstrate how quickly and simply the BayesiaLab software can create a Bayesian network model whose performance is on par with virtually all existing models. Also, while most of our previous white papers focused on marketing science applications, we hope that this case study from the medical field can demonstrate the universal applicability of Bayesian networks. We speculate that our modeling approach with Bayesian networks (as the framework) and BayesiaLab (as the software tool) achieves 99% of the performance of the best conceivable, custom-developed model, while requiring only 10% of the development time. This allows researchers to focus more on the subject matter of their studies, because they are less distracted by the technicalities of traditional statistical tools. As a result, Bayesian networks and BayesiaLab are very important innovations for accelerating research and pursuing translational science.

1 Fine needle aspiration (FNA) is a percutaneous (through the skin) procedure that uses a fine gauge needle (22 or 25 gauge) and a syringe to sample fluid from a breast cyst or remove clusters of cells from a solid mass. With FNA, the cellular material taken from the breast is usually sent to the pathology laboratory for analysis.
10. Mitoses (1 - 10)
11. Class (benign/malignant)

Attributes 2 through 9 were computed from digital images of fine needle aspirates (FNA) of breast masses. These features describe the characteristics of the cell nuclei in the image. The class membership was established via subsequent biopsies or via long-term monitoring of the tumor.
Upon exclusion of the row identifier, this database is also ideally suited for the evaluation version of BayesiaLab.
We will not go into detail here regarding the definition of the attributes and their measurement. Rather, we refer the reader to the papers referenced in the bibliography. The Wisconsin Breast Cancer Database is available to any interested researcher from the UC Irvine Machine Learning Repository.3 We use this database in its original format, without any further transformation, so our results can be directly compared to the dozens of methods that have been developed since the original study.
Notation
To clearly distinguish between natural language, software-specific functions, and study-specific variable names, the following notation is used:

- BayesiaLab-specific functions, keywords, commands, etc., are shown in bold type.
- Attribute/variable/node names are capitalized and italicized.
Data Import
Our modeling process begins with importing the database, which is available in a CSV format, into BayesiaLab. The Data Import Wizard guides the analyst through the required steps.
In the first dialogue box of the Data Import Wizard, we can click Define Typing and specify that we wish to set aside a test set from the database. Following common practice, we randomly select 20% of the 699 records as test data; the remaining 80% will serve as our training data set.
In the next step, the Data Import Wizard suggests the data type for each variable (or attribute4). Attributes 2 through 10 are identified as continuous variables, and Class is read as a discrete variable. Only the first variable, Sample code, has to be manually set to Row Identifier by the analyst, so that it is not mistaken for a continuous predictor variable.
For the import process of this study, the most important step is the selection of the discretization algorithm. As we know that the exclusive objective is classification, we choose the Decision Tree algorithm, which discretizes each variable for optimum information gain with respect to the target variable Class. Bayesian networks are entirely non-parametric, probabilistic models, and their estimation requires a certain minimum number of observations. To help with the selection of the number of discretization levels, we use the heuristic of five observations per parameter and probability cell. Given that we have a relatively small database with only 560 observations,5 three discretization intervals per variable appear to be an appropriate choice. With a higher number of discretization levels, we would most likely need more observations for reliable estimation of the parameters.
4 Attribute and variable are used interchangeably throughout the paper.
5 560 cases are in the training set (80%) and 139 are in the test set (20%).
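The Decision Tree discretization idea can be sketched outside of BayesiaLab as well: fit a shallow decision tree of a single predictor against the class label and reuse its split points as interval boundaries. The sketch below, using scikit-learn and synthetic data, illustrates the principle only; it is not a reproduction of BayesiaLab's implementation.

```python
# Sketch of supervised, tree-based discretization: a shallow decision tree
# of one predictor against the class label yields split points that maximize
# information gain. Synthetic stand-in data; not BayesiaLab's algorithm.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_bins(x, y, n_intervals=3):
    """Return cut points that split x into n_intervals bins chosen
    to maximize information gain with respect to y."""
    tree = DecisionTreeClassifier(criterion="entropy",
                                  max_leaf_nodes=n_intervals,
                                  random_state=0)
    tree.fit(x.reshape(-1, 1), y)
    # Internal nodes carry the learned thresholds; leaves are marked -2.
    thresholds = tree.tree_.threshold[tree.tree_.feature >= 0]
    return sorted(thresholds)

rng = np.random.default_rng(0)
# Hypothetical stand-in for one attribute (scale 1-10) and a binary class.
x = rng.integers(1, 11, size=500).astype(float)
y = (x + rng.normal(0, 2, size=500) > 6).astype(int)

cuts = tree_bins(x, y, n_intervals=3)
print(cuts)  # two cut points -> three intervals
```

Requesting three leaves yields two thresholds, i.e. the three intervals per variable chosen above.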
Upon clicking Finish, we will immediately see a representation of the newly imported database in the form of a fully unconnected Bayesian network. Each variable is now represented as a blue node in the graph panel of BayesiaLab.
The question mark symbol, which is associated with the Bare Nuclei node, indicates that there are missing values for this variable. Hovering over the question mark with the mouse pointer while pressing the i key will show the number of missing values.
Unsupervised Learning
When working with BayesiaLab, it is recommended to always perform Unsupervised Learning first on any newly imported database. This holds even when the exclusive objective is predictive modeling, for which Supervised Learning will later be the main tool. Learning>Association Discovering>EQ initiates the EQ algorithm, which, in this case, is suitable for an initial review of the database. For larger databases with significantly more variables, Maximum Weight Spanning Tree is a very fast algorithm and can be used first instead.
The analyst can visually review the learned network structure and compare it to his or her domain knowledge. This quickly provides a sanity check for the database and the variables and it may highlight any inconsistencies.
Furthermore, one can also display the Pearson correlation between the nodes by selecting Analysis>Graphic>Pearson's Correlation and clicking the Display Arc Comments button in the toolbar.
For instance, a potentially incorrect sign of a correlation would be noticed immediately by the analyst, as the arcs are color-coded: red and blue arcs indicate negative and positive Pearson correlations, respectively.
In most cases, the Markov Blanket algorithm is a good starting point for any predictive model. This algorithm is extremely fast and can even be applied to databases with thousands of variables and millions of records, although database size is not a concern in this particular study. The Markov Blanket of a node A is the set of nodes composed of A's parents, its children, and its children's other parents (spouses).
The Markov Blanket of node A contains all the variables which, if we know their states, shield node A from the rest of the network. This means that the Markov Blanket of a node holds all the knowledge needed to predict the behavior of that node. Learning a Markov Blanket selects the relevant predictor variables, which is particularly helpful when there is a large number of variables in the database. (In fact, this can also serve as a highly efficient variable selection method in preparation for other types of modeling outside the Bayesian network framework.) Upon Markov Blanket learning for our database, the resulting Bayesian network looks as follows:
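The definition above (parents, children, and the children's other parents) can be sketched in a few lines of code. The toy DAG below is hypothetical and unrelated to the network learned in this paper.

```python
# Minimal sketch of a Markov Blanket: parents, children, and the
# children's other parents (spouses). Toy DAG for illustration only.
def markov_blanket(dag, node):
    """dag maps each node to the set of its parents."""
    parents = set(dag.get(node, ()))
    children = {n for n, ps in dag.items() if node in ps}
    spouses = {p for c in children for p in dag[c]} - {node}
    return parents | children | spouses

dag = {
    "A": set(),
    "B": {"A"},
    "C": {"A", "D"},   # D is A's spouse via their common child C
    "D": set(),
    "E": {"B"},        # E is outside A's Markov Blanket
}
print(sorted(markov_blanket(dag, "A")))  # ['B', 'C', 'D']
```

Note that E, although connected to A through B, is not in A's Markov Blanket: once B is known, E adds nothing about A.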
This suggests that Class has a direct probabilistic relationship with all variables except Marginal Adhesion and Single Epithelial Cell Size, which are disconnected. The lack of a connection with the Target indicates that these nodes are independent of Class given the nodes in the Markov Blanket. For a better visual interpretation, we apply the Force Directed Layout algorithm and obtain a view with Class at its center. Both unconnected variables are shown at the bottom of the graph.
Beyond distinguishing between predictors (connected nodes) and non-predictors (disconnected nodes), we can further examine the relationships with the Target Node Class by highlighting the Mutual Information of the arcs connecting the nodes. This function is accessible within the Validation Mode via Analysis>Graphic>Arcs' Mutual Information.
The thickness of the arcs is now proportional to the Mutual Information, i.e. the strength of the relationship between the nodes. Intuitively, Mutual Information measures the information that X and Y share: it measures how much knowing one of these variables reduces our uncertainty about the other. For example, if X and Y are independent, then knowing X does not provide any information about Y and vice versa, so their Mutual Information is zero. At the other extreme, if X and Y are identical then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa.
Formal Definition of Mutual Information
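The formula itself appears to have been lost in this document's layout; the standard definition for two discrete variables X and Y, consistent with the intuition given above, is:

```latex
I(X;Y) \;=\; \sum_{x \in X} \sum_{y \in Y} p(x,y)\,\log \frac{p(x,y)}{p(x)\,p(y)}
\;=\; H(X) - H(X \mid Y) \;=\; H(Y) - H(Y \mid X)
```

The entropy identities make the "reduction of uncertainty" reading explicit: Mutual Information is the drop in the entropy of one variable once the other is known.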
We can also show the values of the Mutual Information on the graph by clicking Display Arc Comments. In the top part of the comment box attached to each arc, the Mutual Information of the arc is shown. Below, expressed as a percentage and highlighted in blue, we see the relative Mutual Information in the direction of the arc (parent node → child node). And, at the bottom, we have the relative Mutual Information in the opposite direction (child node → parent node).
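For illustration, Mutual Information (and a normalized, "relative" version) can be computed directly from a joint probability table. Treating relative Mutual Information as MI divided by the entropy of the receiving node is our assumption about BayesiaLab's definition, and the joint distribution below is made up.

```python
# Mutual Information from a joint probability table, plus a "relative"
# version normalized by the entropy of the target node. Whether this matches
# BayesiaLab's exact definition is an assumption; toy numbers only.
import math

def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def mutual_information(joint):
    """joint[x][y] = p(X=x, Y=y); returns I(X;Y) in bits."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Strongly dependent toy joint distribution.
joint = [[0.45, 0.05],
         [0.05, 0.45]]
mi = mutual_information(joint)
py = [sum(col) for col in zip(*joint)]
relative_mi = mi / entropy(py)   # share of the child's uncertainty removed
print(round(mi, 3), round(relative_mi, 3))  # -> 0.531 0.531
```

For an independent joint distribution the function returns 0, matching the intuition stated above; for identical variables it returns the full entropy.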
Model 1 Performance
As we are not equipped with specific domain knowledge about the variables, we will not further interpret these relationships, but rather run an initial test of Network Performance: we want to know how well this Markov Blanket model can predict the states of the Class variable, i.e. benign versus malignant. This test is available via Analysis>Network Performance>Targeted.
Using our previously defined test set for validating our model, we obtain the following, rather encouraging results: Markov Blanket - Test Set
Of the 87 benign cases in the test set, 96.5% were correctly identified (true negatives), which corresponds to a false positive rate of 3.5%. More importantly, of the 52 malignant cases, 100% were identified correctly (true positives), with no false negatives. This yields a total precision of 97.8%. Analogous to the original papers on this topic, we also perform a K-Fold Cross Validation, which iteratively selects different test and training sets and, based on those, learns and tests the model. The Cross Validation can be performed via Tools>Cross Validation>Targeted.
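Working backwards from the quoted percentages, the test-set confusion matrix must contain 84 true negatives and 3 false positives among the 87 benign cases. These counts are our inference, not stated in the source; note that 84/87 is 96.55%, which rounds to 96.6% rather than the quoted 96.5%.

```python
# Reconstructing the quoted test-set metrics from the implied raw counts
# (87 benign, 52 malignant; 3 benign cases misclassified as malignant).
tn, fp = 84, 3          # 84/87 true negatives among the benign cases
tp, fn = 52, 0          # 52/52 true positives among the malignant cases
total = tn + fp + tp + fn                # 139 test cases

true_negative_rate = tn / (tn + fp)
false_positive_rate = fp / (tn + fp)
overall_precision = (tn + tp) / total    # BayesiaLab's "total precision"

print(f"{true_negative_rate:.1%} {false_positive_rate:.1%} {overall_precision:.1%}")
# -> 96.6% 3.4% 97.8%
```

The overall figure, 136 correct out of 139, reproduces the 97.8% total precision exactly.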
We choose 10 samples, i.e. 10 iterations with 69 cases as test samples and 630 training cases.
The results from the Cross Validation confirm the good performance of this model: the overall precision is 96.7%, with a false negative rate of 2.9%.
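The K-Fold procedure itself is straightforward to sketch: split the data into K folds, train on K-1 of them, test on the held-out fold, and rotate through all K. The example below uses synthetic data and a Gaussian naive Bayes classifier purely as a stand-in; BayesiaLab's learner and the actual database are not reproduced here.

```python
# Sketch of K-fold cross validation on synthetic data, with a Gaussian
# naive Bayes classifier standing in for the learned Bayesian network.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 699                              # same size as the Wisconsin database
X = rng.normal(size=(n, 9))          # hypothetical continuous attributes
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(int)

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

print(f"mean accuracy over 10 folds: {np.mean(accuracies):.3f}")
```

With 699 records and 10 folds, each iteration holds out roughly 69-70 cases for testing and trains on the remainder, mirroring the split described above.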
At this point we might be tempted to conclude our analysis, as our Markov Blanket model is already performing at a level comparable to the most sophisticated (and complex) models ever developed from this database. More remarkable, though, is the minimal effort required to create our model with the Supervised Learning algorithms in BayesiaLab. Even a new user of BayesiaLab would be expected to replicate the above steps in less than 30 minutes.
As can be expected, the resulting network is somewhat more complex than the standard Markov Blanket.
The additional arcs (compared to the Markov Blanket network) are highlighted with green markers.
Model 2 Performance
With this Augmented Markov Blanket network we now proceed to performance evaluations, analogous to the Markov Blanket model. Initially, we evaluate the performance on the test set.
To complete the evaluation of this model, we will also perform a K-Fold Cross Validation. Augmented Markov Blanket - Cross-Validation
Despite the greater complexity of the model, we only see a marginal improvement in overall precision.
Structural Coefficient
Up to this point, we have not addressed the Structural Coefficient (SC), which is the only adjustable parameter for all the learning algorithms in BayesiaLab. This parameter is available to manage network complexity. By default, the Structural Coefficient is set to 1, which reliably prevents the learning algorithms from overfitting the model to the data. In studies with relatively few observations, the analyst's judgment is needed to determine a potential downward adjustment of this parameter. On the other hand, when data sets are very large, increasing the parameter to values higher than 1 helps manage network complexity. Given the fairly simple network structure of Model 1, complexity was of no concern. Model 2 is more complex, but still very manageable. The question is: could a more complex network provide greater precision without overfitting? To answer this question, we perform the Structural Coefficient Analysis, which generates several metrics that help in making the trade-off between complexity and precision. The function Tools>Cross Validation>Structural Coefficient Analysis starts this process.
We are prompted to specify the range of the Structural Coefficient to be examined and the number of iterations. The Number of Iterations determines the interval steps to be taken within the specified range of the Structural Coefficient. Given the relatively light computational load, we choose 50 iterations. With more complex models, we might be more conservative, as each iteration re-learns and re-evaluates the network. Furthermore, we select Compute Structure/Target's Precision Ratio to compute our target metric.
The resulting report will show us how the network structure changes as a function of the Structural Coefficient. This can be interpreted as the degree of confidence the analyst should have in any particular arc in the structure.
Clicking Graphs will show a synthesized network consisting of all structures generated during the iterative learning process.
The reference structure is represented by black arcs, which show the original network learned prior to the start of the Structural Coefficient Analysis. The blue-colored arcs are not contained in the reference structure, but they appear in networks that have been learned as a function of the different Structural Coefficients (SC). The thickness of the arcs is proportional to how frequently individual arcs appear in the learned networks. More important for us, however, is determining the correct level of network complexity for reliable and accurate prediction performance while avoiding overfitting the data. We can plot several metrics in this context by clicking Curve. The Structure/Target's Precision Ratio is the most relevant metric in our case, and the corresponding plot is shown below. This first plot shows the metric computed for the whole database.
Typically, the elbow of the L-shaped curve identifies a suitable value for the Structural Coefficient (SC). More formally, we would look for the point on the curve where the second derivative is maximized. By visual inspection, an SC value of around 0.4 appears to be a good candidate for that point. The portion of the curve where SC values approach 0 shows the characteristic pattern of overfitting, which is to be avoided. To further validate this interpretation, we also compute the same metric for the training/test database.
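The elbow heuristic can be made concrete in a few lines. One caveat: on a strictly convex, monotonically decreasing curve, the literal maximum of the second derivative falls at the leftmost sample, so the sketch below uses the closely related maximum-curvature criterion, which picks an interior elbow. The curve itself is a made-up stand-in for the actual Structure/Target's Precision Ratio plot.

```python
# Locating the elbow of an L-shaped curve via the point of maximum
# curvature (a refinement of the max-second-derivative rule mentioned
# above). The curve is a hypothetical stand-in for the real metric.
import numpy as np

sc = np.linspace(0.1, 2.0, 200)     # candidate Structural Coefficients
ratio = 1.0 / sc + 0.2 * sc         # hypothetical L-shaped metric

d1 = np.gradient(ratio, sc)
d2 = np.gradient(d1, sc)
curvature = np.abs(d2) / (1.0 + d1 ** 2) ** 1.5
elbow = sc[np.argmax(curvature)]
print(f"elbow of the synthetic curve at SC ~ {elbow:.2f}")
```

On real, noisy cross-validation curves, a visual check of the candidate elbow (as done in the text) remains the prudent final step.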
This graph has the same properties as the previous one and suggests a similar SC value. As a result, we can have some confidence in this new value for the Structural Coefficient. We also plot the Target's Precision alone as a function of the SC. On the surface, this curve resembles an L-shape, too, but it moves only within roughly 1 percentage point, i.e. between 97% and 98%. For practical purposes, this means that the curve is virtually flat.
[Figure: Structure and Target's Precision (%) as functions of the Structural Coefficient]

The Structure/Target's Precision Ratio is therefore driven almost entirely by the structure, as the denominator, Target's Precision, is nearly constant across a wide range of SC values, as per the graph above. The joint interpretation of Target's Precision and the Structure/Target's Precision Ratio indicates that little can be gained by lowering the SC, but that there is a definite risk of overfitting. Nevertheless, we relearn the network with an SC of 0.4, generating, as expected, a more complex network, which is displayed below.
The performance of the model (with SC=0.4) on the test set appears to be virtually the same, and the result from the K-Fold Cross Validation is not materially different from the previous performance with SC=1.

Augmented Markov Blanket (SC=0.4) - Cross-Validation
Conclusion
The models reviewed, Markov Blanket and Augmented Markov Blanket (SC=0.4 and SC=1), performed at virtually indistinguishable levels of classification accuracy. The greater complexity of either Augmented Markov Blanket specification did not yield the expected precision gain. Precision and false negatives are shown as the key metrics in the summary table below.
[Table: Summary — Markov Blanket; Augmented Markov Blanket (SC=1); Augmented Markov Blanket (SC=0.4)]
In this situation, the choice of model should be determined by the most parsimonious specification, as this provides the best prospect of good generalization beyond the samples observed in this study. The originally specified Markov Blanket model is thus recommended as the model of choice. Re-estimating these models with more observations could potentially change this conclusion and might more clearly differentiate the classification performance. For now, however, we select the Markov Blanket model; it will serve as the basis for the next section of this paper, Model Application.
Model Application
Interactive Inference
Without further discussion of the merits of each model specification, we now show how the learned Markov Blanket model can be applied in practice. For instance, we can use BayesiaLab to review the individual classification predictions made by the model. This feature is called Interactive Inference and can be accessed via Inference>Interactive Inference.
This will bring up Monitors for all variables in the Monitor Panel, and the navigation bar above allows scrolling through each record of the test set. Record #0 can be seen below with all the associated observations highlighted in green. Given the observations shown, the model predicts a 99.76% probability that the cells from this FNA sample are malignant (the Monitor is highlighted in red).
For reference, we will also show record #22, which is classified as benign.
Most cases are rather clear-cut, as above, with record #19 being one of the few exceptions. Here, the probability of malignancy is 73%.
Target Interpretation Tree
In our particular example, this may not be relevant, as all pieces of evidence, i.e. all observations regarding the FNA are obtained simultaneously. However, in the context of other diagnostic methods, such as mammography and surgical biopsy, a tree-based decision structure can help prioritize the sequence of exams, given the evidence obtained up to that point.
Summary
By using Bayesian networks as the framework, we have shown a practical new modeling approach based on the widely studied Wisconsin Breast Cancer Database. Our prediction accuracy is comparable with the results of all known studies on this topic. With BayesiaLab as the software tool, modeling with Bayesian networks becomes accessible to a very broad range of analysts and researchers, including non-statisticians. The speed of modeling, analysis, and subsequent implementation makes BayesiaLab a suitable tool in many areas of research, especially translational science.
References
Abdrabou, E. A. M. L., and A. E. B. M. Salem. "A Breast Cancer Classifier Based on a Combination of Case-Based Reasoning and Ontology Approach."

El-Sebakhy, E. A., K. A. Faisal, T. Helmy, F. Azzedin, and A. Al-Suhaim. "Evaluation of Breast Cancer Tumor Classification with Unconstrained Functional Networks Classifier." In the 4th ACS/IEEE International Conference on Computer Systems and Applications, 281-287, 2006.

Hung, M. S., M. Shanker, and M. Y. Hu. "Estimating Breast Cancer Risks Using Neural Networks." Journal of the Operational Research Society 53, no. 2 (2002): 222-231.

Karabatak, M., and M. C. Ince. "An Expert System for Detection of Breast Cancer Based on Association Rules and Neural Network." Expert Systems with Applications 36, no. 2 (2009): 3465-3469.

Mangasarian, Olvi L., W. Nick Street, and William H. Wolberg. "Breast Cancer Diagnosis and Prognosis via Linear Programming." Operations Research 43 (1995): 570-577.

Mu, T., and A. K. Nandi. "Breast Cancer Diagnosis from Fine-Needle Aspiration Using Supervised Compact Hyperspheres and Establishment of Confidence of Malignancy."

Wolberg, W. H., W. N. Street, D. M. Heisey, and O. L. Mangasarian. "Computer-Derived Nuclear Features Distinguish Malignant from Benign Breast Cytology." Human Pathology 26, no. 7 (1995): 792-796.

Wolberg, William H., W. Nick Street, and O. L. Mangasarian. "Machine Learning Techniques to Diagnose Breast Cancer from Image-Processed Nuclear Features of Fine Needle Aspirates." http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.2109.

Wolberg, William H., W. Nick Street, and Olvi L. Mangasarian. "Breast Cytology Diagnosis via Digital Image Analysis" (1993). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.9894.
Contact Information
Conrady Applied Science, LLC
312 Hamlets End Way
Franklin, TN 37067
USA
+1 888-386-8383
info@conradyscience.com
www.conradyscience.com

Bayesia SAS
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com
Copyright
© 2011 Conrady Applied Science, LLC and Bayesia SAS. All rights reserved. Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following: You may print or download this document for your personal and noncommercial use only. You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady Applied Science, LLC and Bayesia SAS as the source of the material. You may not, except with our express written permission, distribute or commercially exploit the content; nor may you transmit it or store it in any other website or other form of electronic retrieval system.