(IJCSIS) International Journal of Computer Science and Information Security,Vol. 9, No. 7, July 2011
automated identification of Mycobacterium tuberculosis in images of Ziehl–Neelsen (ZN) stained sputum smears obtained using a bright-field microscope. They segment candidate bacillus objects using a combination of two-class pixel classifiers [3]. Sejong Yoon and Saejoon Kim [4] propose a mutual information-based Support Vector Machine Recursive Feature Elimination (SVM-RFE) as a classification method with feature selection. Diagnosis of breast cancer using different classification techniques was carried out in [5,6,7,8]. A new constrained-syntax genetic programming algorithm [9] was developed to discover classification rules for diagnosing certain pathologies. Kwokleung Chan et al. [10] used several machine learning and traditional classifiers in the classification of glaucoma and compared their performance using ROC. Various classification algorithms based on statistical and neural network methods were presented and tested for quantitative tissue characterization of diffuse liver disease from ultrasound images [11], and classifiers were compared for sleep apnea [18].
Ranjit Abraham et al. [19] propose a new feature selection algorithm, CHI-WSS, to improve the classification accuracy of Naïve Bayes on medical datasets.

Minou Rabiei et al. [12] use tree-based ensemble classifiers for the diagnosis of excess water production. Their results demonstrate the applicability of this technique to the successful diagnosis of water production problems. Hongqi Li, Haifeng Guo et al. [13] present a comprehensive comparative study on petroleum exploration and production using five feature selection methods (expert judgment, CFS, LVF, Relief-F, and SVM-RFE) and fourteen algorithms from five distinct kinds of classification methods: decision tree, artificial neural network, support vector machines (SVM), Bayesian network and ensemble learning. The paper "Mining Several Data Bases with an Ensemble of Classifiers" [14] analyzes two types of conflicts: one created by data inconsistency within the intersection of the databases, and a second created when the meta method selects different data mining methods with inconsistent competence maps for the objects of the intersected part and their combinations. It also suggests ways to handle them. The paper in [15] studies medical data classification methods, comparing decision tree and system reconstruction analysis as applied to heart disease medical data mining. Under most circumstances, single classifiers, such as neural networks, support vector machines and decision trees, exhibit the worst performance. To further enhance performance, a combination of these methods in a multi-level combination scheme was proposed that improves efficiency [16]. The paper in [17] demonstrates the use of abductive network classifier committees trained on different features for improving classification accuracy in medical diagnosis.

III. DATA SOURCE
The medical dataset we classify comprises 700 real records of patients suffering from TB, obtained from a city hospital. The entire dataset is stored in a single file, and each record holds the most relevant information for one patient. The doctor's initial queries about symptoms, together with some required test details, are taken as the main attributes. In total there are 11 attributes (symptoms) and one class attribute: age, chronic cough (weeks), loss of weight, intermittent fever (days), night sweats, sputum, blood cough, chest pain, HIV, radiographic findings, wheezing, and class.

Table I shows the names of the 12 attributes along with their data types (DT). Type N indicates numerical and C indicates categorical.
Table I. List of Attributes and their Datatypes

No   Name                   DT
1    Age                    N
2    Chroniccough(weeks)    N
3    WeightLoss             C
4    Intermittentfever      N
5    Nightsweats            C
6    Bloodcough             C
7    Chestpain              C
8    HIV                    C
9    Radiographicfindings   C
10   Sputum                 C
11   Wheezing               C
12   Class                  C
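The attribute list in Table I can be written down as a small schema, which is one way to keep numerical and categorical fields apart when records are read in. The sketch below is illustrative only: the paper does not describe the file format or the category labels, so the `parse_record` helper, the field order within a record, and every value in the example record are assumptions made up for this illustration.

```python
# Attribute schema taken from Table I: "N" = numerical, "C" = categorical.
ATTRIBUTES = [
    ("Age", "N"),
    ("Chroniccough(weeks)", "N"),
    ("WeightLoss", "C"),
    ("Intermittentfever", "N"),
    ("Nightsweats", "C"),
    ("Bloodcough", "C"),
    ("Chestpain", "C"),
    ("HIV", "C"),
    ("Radiographicfindings", "C"),
    ("Sputum", "C"),
    ("Wheezing", "C"),
    ("Class", "C"),
]


def parse_record(fields):
    """Convert one raw record (12 strings, in Table I order) to typed values.

    Hypothetical helper: numerical fields become floats, categorical
    fields are kept as stripped strings.
    """
    if len(fields) != len(ATTRIBUTES):
        raise ValueError(f"expected {len(ATTRIBUTES)} fields, got {len(fields)}")
    record = {}
    for (name, dtype), raw in zip(ATTRIBUTES, fields):
        raw = raw.strip()
        record[name] = float(raw) if dtype == "N" else raw
    return record


# Invented example record; the labels ("yes", "negative", ...) are assumptions.
example = parse_record(
    ["45", "3", "yes", "10", "yes", "no", "yes", "negative", "yes", "positive", "no", "TB"]
)
numeric_names = [name for name, dtype in ATTRIBUTES if dtype == "N"]
```

Keeping the schema in one place means the same list can drive both parsing and any later encoding step for the categorical attributes.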
IV. CLASSIFICATION ALGORITHMS
SVM (SMO)
The original SVM algorithm was invented by Vladimir Vapnik. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input is a member of, which makes the SVM a non-probabilistic binary linear classifier.

A support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.
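The margin idea can be illustrated with a few lines of plain Python. The sketch below does not train an SVM and is not the SMO procedure; it only compares two hand-picked separating hyperplanes on an invented two-dimensional toy set and measures, for each, the distance to the nearest training point. The data, the hyperplane coefficients, and the helper names are all assumptions chosen for the illustration.

```python
import math


def min_distance_to_hyperplane(w, b, points, labels):
    """Smallest signed distance y * (w.x + b) / ||w|| over all training points.

    A positive result means every point lies on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


def predict(w, b, x):
    """Binary decision: +1 on one side of the hyperplane, -1 on the other."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Invented toy data: two classes along the diagonal.
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

# Two candidate hyperplanes that both separate the classes.
margin_a = min_distance_to_hyperplane((1.0, 1.0), -6.0, points, labels)  # x1 + x2 = 6
margin_b = min_distance_to_hyperplane((1.0, 1.0), -5.0, points, labels)  # x1 + x2 = 5

# The hyperplane with the larger minimum distance is the preferred separator.
best = "a" if margin_a > margin_b else "b"
```

Both hyperplanes classify the toy points correctly, but the first one sits midway between the closest points of the two classes and therefore has the larger margin, which is the property an SVM optimizes for.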
K-Nearest Neighbors (IBK)
The k-nearest neighbors algorithm (k-NN) is a method [22] for classifying objects based on closest training examples in the
90 http://sites.google.com/site/ijcsis/ ISSN 1947-5500