Published by ijcsis on Aug 13, 2011. Copyright: Attribution Non-commercial.
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 7, July 2011, ISSN 1947-5500, http://sites.google.com/site/ijcsis/
Effective Classification Algorithms to Predict the Accuracy of Tuberculosis - A Machine Learning Approach
Asha. T
Dept. of Info. Science & Engg., Bangalore Institute of Technology, Bangalore, INDIA

S. Natarajan
Dept. of Info. Science & Engg., P.E.S. Institute of Technology, Bangalore, INDIA

K.N.B. Murthy
Dept. of Info. Science & Engg., P.E.S. Institute of Technology, Bangalore, INDIA
Abstract: Tuberculosis is a disease caused by mycobacterium which can affect virtually all organs, not sparing even the relatively inaccessible sites. India has the world's highest burden of tuberculosis (TB), with an estimated million incident cases per year. Studies suggest that active tuberculosis accelerates the progression of Human Immunodeficiency Virus (HIV) infection. Tuberculosis is much more likely to be a fatal disease among HIV-infected persons than among persons without HIV infection. Diagnosis of pulmonary tuberculosis has always been a problem. Classification of medical data is an important task in the prediction of any disease, and it even helps doctors in their diagnosis decisions. In this paper we propose a machine learning approach to compare the performance of both basic learning classifiers and ensembles of classifiers on tuberculosis data. The classification models were trained using real data collected from a city hospital. The trained models were then used to predict tuberculosis in two categories: Pulmonary Tuberculosis (PTB) and Retroviral PTB (RPTB), i.e. TB along with Acquired Immune Deficiency Syndrome (AIDS). The prediction accuracy of the classifiers was evaluated using 10-fold cross-validation, and the results were compared to obtain the best prediction accuracy. The results indicate that the Support Vector Machine (SVM) performs best among the basic learning classifiers and Random Forest performs best among the ensembles, each with an accuracy of 99.14%. Various other measures such as specificity, sensitivity, F-measure and ROC area have been used in the comparison.
Keywords: Machine learning; Tuberculosis; Classification; PTB; Retroviral PTB
I. INTRODUCTION

There is an explosive growth of bio-medical data, ranging from those collected in pharmaceutical studies and cancer therapy investigations to those identified in genomics and proteomics research. The rapid progress in data mining research has led to the development of efficient and scalable methods to discover knowledge from these data. Medical data mining is an active research area under data mining, since medical databases have accumulated large quantities of information about patients and their clinical conditions. Relationships and patterns hidden in these data can provide new medical knowledge, as has been proved in a number of medical data mining applications.

Data classification using knowledge obtained from known historical data has been one of the most intensively studied subjects in statistics, decision science and computer science. Data mining techniques have been applied to medical services in several areas, including prediction of the effectiveness of surgical procedures, medical tests and medication, and the discovery of relationships among clinical and diagnosis data. To help clinicians diagnose the type of disease, computerized data mining and decision support tools are used; these are able to process the large amount of data available from previously solved cases and suggest the probable diagnosis based on the values of several important attributes. There have been numerous comparisons of the different classification and prediction methods, and the matter remains a research topic. No single method has been found to be superior over all others for all data sets.

India has the world's highest burden of tuberculosis (TB), with an estimated million incident cases per year. It also ranks [20] among the world's highest HIV burdens, with an estimated 2.3 million persons living with HIV/AIDS. Tuberculosis is much more likely to be a fatal disease among HIV-infected persons than among persons without HIV infection. It is a disease caused by mycobacterium which can affect virtually all organs, not sparing even the relatively inaccessible sites. The microorganisms usually enter the body by inhalation through the lungs, and spread from the initial location in the lungs to other parts of the body via the blood stream. They present a diagnostic dilemma even for physicians with a great deal of experience with this disease.
II. RELATED WORK

Orhan Er and Temuritus [1] present a study on tuberculosis diagnosis carried out with the help of Multilayer Neural Networks (MLNNs). For this purpose, an MLNN with two hidden layers, trained with a genetic algorithm, has been used. A data mining approach was adopted to classify the genotype of Mycobacterium tuberculosis using the C4.5 algorithm [2]. Rethabile Khutlang et al. present methods for the automated identification of Mycobacterium tuberculosis in images of Ziehl-Neelsen (ZN) stained sputum smears obtained using a bright-field microscope; they segment candidate bacillus objects using a combination of two-class pixel classifiers [3]. Sejong Yoon and Saejoon Kim [4] propose a mutual information-based Support Vector Machine Recursive Feature Elimination (SVM-RFE) as a classification method with feature selection. Diagnosis of breast cancer using different classification techniques was carried out in [5, 6, 7, 8]. A new constrained-syntax genetic programming algorithm [9] was developed to discover classification rules for diagnosing certain pathologies. Kwokleung Chan et al. [10] used several machine learning and traditional classifiers in the classification of glaucoma disease and compared their performance using ROC. Various classification algorithms based on statistical and neural network methods were presented and tested for quantitative tissue characterization of diffuse liver disease from ultrasound images [11], and a comparison of classifiers was made for sleep apnea [18]. Ranjit Abraham et al. [19] propose a new feature selection algorithm, CHI-WSS, to improve the classification accuracy of Naive Bayes with respect to medical datasets.

Minou Rabiei et al. [12] use tree-based ensemble classifiers for the diagnosis of excess water production; their results demonstrate the applicability of this technique in the successful diagnosis of water production problems. Hongqi Li, Haifeng Guo et al. [13] present a comprehensive comparative study on petroleum exploration and production using five feature selection methods, including expert judgment, CFS, LVF, Relief-F and SVM-RFE, and fourteen algorithms from five distinct kinds of classification methods, including decision trees, artificial neural networks, support vector machines (SVM), Bayesian networks and ensemble learning. The paper "Mining Several Data Bases with an Ensemble of Classifiers" [14] analyzes two types of conflicts, one created by data inconsistency within the area of the intersection of the databases, and the second created when the meta method selects different data mining methods with inconsistent competence maps for the objects of the intersected part and their combinations, and suggests ways to handle them. Referenced paper [15] studies medical data classification methods, comparing decision trees and system reconstruction analysis as applied to heart disease medical data mining. Under most circumstances single classifiers, such as neural networks, support vector machines and decision trees, exhibit worse performance; to further enhance performance, a combination of these methods in a multi-level combination scheme was proposed that improves efficiency [16]. Paper [17] demonstrates the use of abductive network classifier committees trained on different features for improving classification accuracy in medical diagnosis.
III. DATA SOURCE

The medical dataset we are classifying includes 700 real records of patients suffering from TB, obtained from a city hospital. The entire dataset is kept in one file containing many records, where each record corresponds to the most relevant information of one patient. Initial queries by the doctor about symptoms, together with some required test details of the patients, have been considered as the main attributes. In total there are 11 attributes (symptoms) and one class attribute. The symptoms of each patient, such as age, chronic cough (weeks), loss of weight, intermittent fever (days), night sweats, sputum, blood cough, chest pain, HIV, radiographic findings and wheezing, together with the class, are considered as attributes.

Table I shows the names of the 12 attributes considered, along with their data types (DT), where N indicates numerical and C indicates categorical.

Table I. List of Attributes and their Datatypes

No  Name                   DT
1   Age                    N
2   Chroniccough(weeks)    N
3   WeightLoss             C
4   Intermittentfever      N
5   Nightsweats            C
6   Bloodcough             C
7   Chestpain              C
8   HIV                    C
9   Radiographicfindings   C
10  Sputum                 C
11  Wheezing               C
12  Class                  C
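Since the experiments were run in Weka, a dataset with the attributes of Table I would typically be stored as an ARFF file. The following is only a sketch of what such a header might look like: the attribute names follow Table I, but the sets of nominal values and the sample data row are assumptions for illustration, as the paper does not list them.

```
% Hypothetical ARFF header for the TB dataset (attribute names from Table I;
% the nominal value sets and the data row are assumed, not given in the paper)
@relation tuberculosis

@attribute Age numeric
@attribute Chroniccough_weeks numeric
@attribute WeightLoss {yes,no}
@attribute Intermittentfever numeric
@attribute Nightsweats {yes,no}
@attribute Bloodcough {yes,no}
@attribute Chestpain {yes,no}
@attribute HIV {positive,negative}
@attribute Radiographicfindings {normal,abnormal}
@attribute Sputum {yes,no}
@attribute Wheezing {yes,no}
@attribute Class {PTB,RPTB}

@data
45,3,yes,7,yes,no,yes,negative,abnormal,yes,no,PTB
```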
IV. CLASSIFICATION ALGORITHMS

SVM (SMO)
The original SVM algorithm was invented by Vladimir Vapnik. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input is a member of, which makes the SVM a non-probabilistic binary linear classifier. A support vector machine constructs a hyperplane, or set of hyperplanes, in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.
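The separating-hyperplane idea above can be sketched as follows: a linear SVM ultimately classifies a point x by the sign of w . x + b, where w and b are learned during training. This minimal sketch uses hand-picked w and b rather than a trained model, and illustrates only the decision rule, not the margin-maximizing optimization that SMO performs.

```python
# Decision rule of a linear SVM: sign(w . x + b).
# w and b are illustrative values, not weights learned by SMO.
def svm_predict(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w = [2.0, -1.0]   # hypothetical weight vector
b = -1.0          # hypothetical bias
print(svm_predict(w, b, [3.0, 1.0]))   # 2*3 - 1 - 1 = 4  -> class 1
print(svm_predict(w, b, [0.0, 2.0]))   # 0 - 2 - 1 = -3   -> class -1
```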
K-Nearest Neighbors (IBk)
The k-nearest neighbors algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space [22]. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. Here an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).
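The majority-vote rule just described can be written in a few lines. This is a minimal stdlib sketch of the idea behind Weka's IBk (Euclidean distance, unweighted votes), with a toy training set invented for illustration; it is not the paper's actual configuration.

```python
# Minimal k-NN: sort training points by Euclidean distance to the query,
# then take a majority vote among the k closest labels.
import math
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of (feature_vector, label) pairs
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy 2-D training set (illustrative, not the TB data)
train = [([1.0, 1.0], "PTB"), ([1.2, 0.9], "PTB"),
         ([5.0, 5.0], "RPTB"), ([5.2, 4.8], "RPTB"), ([4.9, 5.1], "RPTB")]
print(knn_classify(train, [1.1, 1.0], k=3))   # -> PTB
```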
Naive Bayesian Classifier (NaiveBayes)
This is a Bayes classifier, a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions [23]. In probability theory, Bayes' theorem shows how one conditional probability (such as the probability of a hypothesis given observed evidence) depends on its inverse (in this case, the probability of that evidence given the hypothesis). In more technical terms, the theorem expresses the posterior probability (i.e. after evidence E is observed) of a hypothesis H in terms of the prior probabilities of H and E and the probability of E given H. It implies that evidence has a stronger confirming effect if it was more unlikely before being observed.
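For categorical attributes such as those in Table I, the naive independence assumption makes the posterior proportional to P(C) times the product of P(x_i | C) over the attributes. The sketch below shows this rule with Laplace smoothing on a tiny invented symptom table; it is a sketch of the principle, not of Weka's NaiveBayes implementation.

```python
# Naive Bayes on categorical features: P(C|x) proportional to
# P(C) * product_i P(x_i | C), with Laplace (+1) smoothing.
from collections import Counter, defaultdict

def train_nb(records):
    # records: list of (features_tuple, label)
    class_counts = Counter(label for _, label in records)
    feat_counts = defaultdict(Counter)  # (feature_index, label) -> value counts
    for feats, label in records:
        for i, v in enumerate(feats):
            feat_counts[(i, label)][v] += 1
    return class_counts, feat_counts

def predict_nb(model, feats):
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for label, c in class_counts.items():
        p = c / total  # class prior P(C)
        for i, v in enumerate(feats):
            # smoothed conditional P(x_i | C), assuming binary-valued features
            p *= (feat_counts[(i, label)][v] + 1) / (c + 2)
        if p > best_p:
            best, best_p = label, p
    return best

# toy records (invented for illustration, not the paper's dataset)
records = [(("yes", "yes"), "PTB"), (("yes", "no"), "PTB"),
           (("no", "yes"), "RPTB"), (("no", "no"), "RPTB")]
model = train_nb(records)
print(predict_nb(model, ("yes", "yes")))   # -> PTB
```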
C4.5 Decision Tree (J48 in Weka)
The C4.5 algorithm, developed by Quinlan, is perhaps the most popular tree classifier [21]. A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs and utility. The Weka classifier package has its own version of C4.5 known as J48, which is an optimized implementation of C4.5 revision 8.
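A learned decision tree amounts to nested attribute tests. Purely as an illustration of that model shape (the split attribute and rule below are hypothetical, not induced from the paper's data by C4.5):

```python
# Shape of a (tiny, hypothetical) learned decision tree as nested rules.
# A real C4.5/J48 tree would be induced from the data, not hand-written.
def classify(record):
    # record: dict of attribute -> value, e.g. from Table I
    if record["HIV"] == "positive":
        return "RPTB"   # TB together with HIV/AIDS
    return "PTB"

print(classify({"HIV": "positive"}))   # -> RPTB
```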
Bagging (bagging)
Bagging (Bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve classification by combining the classifications of randomly generated training sets. The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies in predictive data mining to combining the predicted classifications from multiple models, or from the same type of model trained on different learning data. It is a technique that generates multiple training sets by sampling with replacement from the available training data and assigns a vote to each classification.
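The two mechanics named above, sampling with replacement and voting, can be sketched as follows. The base learner here is a deliberately trivial stub standing in for any classifier; the data and seed are invented for illustration.

```python
# Bagging sketch: bootstrap-sample the training set, fit one model per
# sample, and combine predictions by majority vote.
import random
from collections import Counter

def bootstrap(data, rng):
    # sample len(data) items with replacement
    return [rng.choice(data) for _ in data]

def fit_stub(sample):
    # trivial base learner: always predicts the majority class of its sample
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def bagged_predict(models, x):
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(42)
data = [([0], "PTB")] * 8 + [([1], "RPTB")] * 2   # toy skewed dataset
models = [fit_stub(bootstrap(data, rng)) for _ in range(11)]
print(bagged_predict(models, [0]))
```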
AdaBoost (AdaBoostM1)
AdaBoost is an algorithm for constructing a "strong" classifier as a linear combination of "simple" "weak" classifiers. Instead of resampling, each training sample carries a weight that determines its probability of being selected for a training set. The final classification is based on a weighted vote of the weak classifiers. AdaBoost is sensitive to noisy data and outliers; however, in some problems it can be less susceptible to the overfitting problem than most learning algorithms.
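The final weighted vote mentioned above is the sign of an alpha-weighted sum of weak-classifier outputs. The sketch fixes three hypothetical decision stumps and their alphas by hand; the boosting loop that would learn these weights from the sample-weight updates is omitted.

```python
# AdaBoost's final decision: sign of the alpha-weighted sum of the weak
# classifiers' +1/-1 votes. Stumps and alphas below are illustrative only.
def weighted_vote(weak_learners, alphas, x):
    s = sum(a * h(x) for h, a in zip(weak_learners, alphas))
    return 1 if s >= 0 else -1

# three hypothetical decision stumps over a 1-D feature
h1 = lambda x: 1 if x > 2 else -1
h2 = lambda x: 1 if x > 5 else -1
h3 = lambda x: -1 if x > 8 else 1
alphas = [0.9, 0.6, 0.3]   # in real AdaBoost, alpha = 0.5*ln((1-err)/err)

print(weighted_vote([h1, h2, h3], alphas, 6))   # 0.9 + 0.6 + 0.3 > 0 -> 1
print(weighted_vote([h1, h2, h3], alphas, 1))   # -0.9 - 0.6 + 0.3 < 0 -> -1
```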
Random Forest (or random forests)
The algorithm for inducing a random forest was developed by Leo Breiman [25]. The term came from "random decision forests", first proposed by Tin Kam Ho of Bell Labs in 1995. It is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. It is a popular algorithm which builds a randomized decision tree in each iteration of the bagging algorithm and often produces excellent predictors.
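The combination rule, the mode of the individual trees' outputs, is the same majority vote used in bagging. A one-liner makes it concrete (the tree outputs below are stand-in values, not predictions from real trees):

```python
# Random-forest combination rule: predicted class = mode of tree outputs.
from collections import Counter

def forest_predict(tree_outputs):
    return Counter(tree_outputs).most_common(1)[0][0]

print(forest_predict(["PTB", "RPTB", "PTB", "PTB", "RPTB"]))   # -> PTB
```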
V. EXPERIMENTAL SETUP

The open source tool Weka was used in the different phases of the experiment. Weka is a collection of state-of-the-art machine learning algorithms [26] for a wide range of data mining tasks such as data preprocessing, attribute selection, clustering and classification. Weka has been used in prior research both in the field of clinical data mining and in bioinformatics.

Weka has four main graphical user interfaces (GUIs), the principal ones being the Explorer and the Experimenter. Our experiment has been tried under both the Explorer and the Experimenter GUIs of Weka. In the Explorer we can flip back and forth between the results we have obtained, evaluate the models that have been built on different datasets, and visualize graphically both the models and the datasets themselves, including any classification errors the models make. The Experimenter, on the other hand, allows us to automate the process by making it easy to run classifiers and filters with different parameter settings on a corpus of datasets, collect performance statistics, and perform significance tests. Advanced users can employ the Experimenter to distribute the computing load across multiple machines using Java Remote Method Invocation.
A. Cross-Validation

Cross-validation with 10 folds has been used for evaluating the classifier models. Cross-validation (CV) is the standard data mining method for evaluating the performance of classification algorithms, mainly to evaluate the error rate of a learning technique. In CV a dataset is partitioned into n folds, where each fold is used for testing and the remainder for training; the procedure of testing and training is repeated n times so that each partition or fold is used exactly once for testing. The standard way of predicting the error rate of a learning technique given a single, fixed sample of data is to use stratified 10-fold cross-validation. Stratification means making sure that, when sampling is done, each class is properly represented in both the training and test datasets. This is achieved by randomly sampling the dataset when creating the n fold partitions.

In a stratified 10-fold cross-validation the data is divided randomly into 10 parts in which each class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme is trained on the remaining nine-tenths; its error rate is then calculated on the holdout set. The learning procedure is thus executed a total of 10 times on different training sets, and finally the 10 error rates are averaged to yield an overall error estimate. When seeking an accurate error estimate, it is standard procedure to repeat the CV process 10 times, which means invoking the learning algorithm 100 times. Given two models M1 and M2 with different accuracies tested on different instances of a data set,
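The stratified partitioning step described above can be sketched as follows: shuffle the records within each class, then deal them round-robin into the k folds so that every fold keeps roughly the full dataset's class proportions. The 500/200 class split below is invented to mirror the 700-record dataset size; the paper does not report its actual class distribution.

```python
# Sketch of stratified k-fold partitioning: shuffle within each class,
# then deal records round-robin into k folds to preserve class proportions.
import random
from collections import defaultdict

def stratified_folds(records, k=10, seed=0):
    # records: list of (features, label)
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[r[1]].append(r)
    folds = [[] for _ in range(k)]
    i = 0
    for label, items in by_class.items():
        rng.shuffle(items)
        for r in items:
            folds[i % k].append(r)
            i += 1
    return folds

# 700 toy records (class split assumed for illustration) -> 10 folds of 70
data = [((n,), "PTB") for n in range(500)] + [((n,), "RPTB") for n in range(200)]
folds = stratified_folds(data, k=10)
print([len(f) for f in folds])   # ten folds of 70 records each
```

Each fold then serves once as the holdout set while the other nine are merged for training.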