You are on page 1of 9

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 8, AUGUST 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.

ORG

Comparison of the C4.5 and a Naive Bayes Classifier for the Prediction of Lung Cancer Survivability
George Dimitoglou, James A. Adams, and Carol M. Jim
AbstractNumerous data mining techniques have been developed to extract information and identify patterns and predict trends from large data sets. In this study, two classification techniques, the J48 implementation of the C4.5 algorithm and a Naive Bayes classifier are applied to predict lung cancer survivability from an extensive data set with fifteen years of patient records. The purpose of the project is to verify the predictive effectiveness of the two techniques on real, historical data. Besides the performance outcome that renders J48 marginally better than the Naive Bayes technique, there is a detailed description of the data and the required pre-processing activities. The performance results confirm expectations while some of the issues that appeared during experimentation, underscore the value of having domain-specific understanding to leverage any domain-specific characteristics inherent in the data. Index TermsData mining, mining methods and algorithms, text mining

GeorgeDimitoglouiswiththeDepartmentofComputerScienceatHoodCollege,Frederick,MD21701,USA.Email:dimitoglou@cs.hood.edu JamesA.AdamsiswithMarriottInternational,Bethesda,MD20817,USA. CarolM.JimiswiththeSchoolofEngineering&AppliedScienceatTheGeorgeWashingtonUniversity,Washington,DC20052,USA.

INTRODUCTION
ted for approximately 29% of all cancerrelated deaths (158,599patients)[25].Thesemortalityratesremainedstable in 2010 indicating that more people die of lung cancer than ofcolon,breast,andprostatecancercombined[1]. Developing treatments and better understanding the characteristicsofthediseaseisalmostexclusivelybasedon clinical and biological research. However, the abundance of cancer research related data provides new opportunities to developdataandstatisticsdriventhatmaycomplementthe traditionalresearchapproaches. Oneofthemostinterestingandchallengingareasissur vivability or disease outcome prediction. This area lends itself wellforanalysisusingdataminingtechniques.Withsurviv al analysis, diseasespecific historical data is examined and analyzedoveracertainperiodoftimetodevelopaprognos is. The increased interest in health informatics to provide more effectivehealthcareandthe abundanceof computing power, massive data storage capacity and the development of automated tools, has allowed the collection of large volumes of medical data. As a result, methods and tech niques from the computing sciences related to knowledge discovery and data mining have become very useful in identifyingpatternsandanalyzingrelationshipsinthedata thatmayleadtobetterdiseaseoutcomeprediction[7,8, 23, 26].Thistypeofpatternanalysiscanbeusedinthedevelop mentofpatientsurvivabilitypredictionmodels.Itisthepre valenceandseriouseffectsoflungcancerandhowwellex istingdatalendsitselffordataminingthatprovidedthemo tivationtoselectthissubjectastheplatformforthecompar isonofthetwoalgorithmictechniques. Oneofthespecialaspectsofourstudyisthevolumeand authenticity of the experimentation dataset. We used data providedtheSEERprogram, oneofthemostcomprehens ive sources on cancer incidence and survival in the U.S. providedbytheNationalCancerInstitute. Fortheanalysis, we used two different classification algorithms: the Naive

The proliferation of computing technologies in every aspect ofmodernlife,hassubstantiallyincreasedthevolumeofdata being collected and stored. The availability of these massive data sets introduces the challenge of having to analyze, un derstand and use the data. Using data mining techniques, datacanbeclassifiedtobringforwardinterestingpatternsor serveasapredictorforfuturetrends.Numerousalgorithmic techniques have been developed to extract information and discern patterns that may be useful for decision support. Such techniques are routinely applied in various areas and problems ranging from business, marketing and money fraud detection, to security surveillance, scientific research andhealthcare. ThispaperusestheSurveillance,EpidemiologyandEndRes ults (SEER) dataset [24], to compare the predictive effective nessoftwoalgorithmictechniques. Ituseslungcancersur vivability as the objective criterion to assess how well each techniqueisabletopredictpatientsurvivabilitygivenasetof historical data. This type of analysis is not unique, in fact it wasinspiredbyasimilaranalysisonbreastcancersurvivab ilitytechniques[2]. The particular disease was selected because of its severity. Lungcancerisoneofthemostprevalentandthemostdeadly typeofcancer.Itmanifestsasanuncontrolledcellgrowthin lung tissues. Both sexes are susceptible, with men having higherprobability(1in13)ofdevelopingthediseaseinalife time than women (1 in 16). These probabilities include smokers and nonsmokers alike with smokers being at a much higher risk. Smoking has been identified as the greatestcontributingcause[3,12,22]butexposuretoasbes tos[6,16]andradongas[5]havealsobeenlinkedwiththein cidenceofthedisease. In the United States, while lung cancer diagnosis comes secondtoprostatecancerformenandbreastcancerforwo men,itistheleadingcauseofcancerdeathacrossthesexes. In2006,lungcancerdiagnosesaccountedfor196,454patients ofallnewcancerpatientsandlungcancermortalityaccoun

2012 Journal of Computing Press, NY, USA, ISSN 2151-9617

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 8, AUGUST 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

BayesandJ48,animplementationoftheC4.5algorithm. Thefollowingsectioncontainsabriefoverviewoftherel evantalgorithmsandpredictionmodels.Section3providesa detaileddescriptionofthedataandpreprocessingactivities. Section4describessurvivabilityinthecontextofthespecific SEER datasetanditisfollowedby Section 5which presents theexperimentationandresults.Thefinalsectionsprovidea discussion of the results along with some conclusions and areasforfuturework.

ALGORITHMS AND PREDICTION MODELS

Todevelopapredictionmodel,classifiersarethedatamining algorithmofchoice.Thebasicpremiseisbasedontreatinga collectionofcasesasinput,eachbelongingtoasmallnumber ofclassesdescribedbyafixedsetofattributes[17].Theclas sifier's output is the prediction of the class to which a new case belongs. The accuracy of the prediction varies depend ingontheclassifierandthetypesofattributesandclassesin thedataset.Classificationmaybeviewedasmappingfroma setofattributestoaparticularclass[17].Forthedataanalys is, two classification techniques were used, the Naive Bayes andtheopensourceversionoftheC4.5statisticalclassifier.

2.1 Naive Bayes


TheNaiveBayesalgorithmisasimpleprobabilisticclassi fier that calculates a set of probabilities by counting the frequencyandcombinationsofvaluesinagivendataset. Theprobabilityofaspecificfeatureinthedataappears asamemberinthesetofprobabilitiesandisderivedby calculating the frequency of each feature value within a classofatrainingdataset.The training datasetis a sub set, used to train a classifier algorithm by using known valuestopredictfuture,unknownvalues. ThealgorithmusesBayestheorem[15]andassumesall attributestobeindependent giventhevalueoftheclass variable. This conditional independence assumption rarely holds true in realworld applications, hence the characterization as Naive yet the algorithm tends to per form welland learnrapidlyin various supervised classi ficationproblems[18]. This "naivety" allows the algorithm to easily construct classifications out of large data sets without resorting to complicatediterativeparameterestimationschemes.

take.Then,foreachoftheleavesitrepeatstheprocessusing thetrainingsetdataclassifiedbythisleaf[18].Thecorefunc tionofthealgorithmisdeterminingthemostappropriateat tributetobestpartitionthedataintovariousclasses.TheID3 algorithm[21]usesentropyandinformationgain[29], bor rowed from information theory, to measure the impurity of thedataitems. Smaller entropy values indicate full membership of the datatoasingleclasswhilehigherentropyvaluesindicateex istenceofmoreclassesthedatashouldbelongto. Informa tiongain[29]measuresthedecreaseoftheweightedaverage entropy of the attributes compared with the entropy of the complete dataset. Therefore, the attributes with higher in formation gain are better classification candidates for data items.OnelimitationofID3isitssensitivitytofeatureswith alargepopulationofvalues.C4.5,anID3extension,wasde veloped to address this limitation by finding high accuracy hypotheses,basedonpruningtherulesissuedfromthecon structed tree during the leaf population phase. However, C4.5iscomputationallymoreintensive,intermsoftimeand spacecomplexity. The comparison of the two techniques in the literature, Naive Bayes and J48, does not render conclusive results. Various studies in different domains have found each tech niqueperformingdifferentlythantheother.Forexample,in the context of intrusion detection systems, one study [10] foundJ48performingbetterthananothersimilarstudy[16], whilestudiesinthepredictionofacutecystitisandnephritis [14] and the classification of arrhythmia data [23] the per formanceofNaiveBayeswassuperior.Similarly,inthecon textofspamemailidentification[27],J48 performedbetter in terms of accuracy of overall classification despite the widespreaduseofBayesianbasedspanfilteringtechniques.

DATA

3.1Data Description
ThedatausedinthisstudywasacquiredfromtheSEER database. We used version 3.4.10 of the open source toolkit Weka to Environment for Knowledge Analysis (WEKA)[13]todeterminetheaccuracyoftwoalgorithms inpredictinglungcancersurvivability. Thedatausedin the experiments is sourced from the RESPIR incidence datafilesincludedinthreeSEERdatasets(Table1).
TABLE 1 SEER DATASETS AND YEARS OF COVERAGE. DatasetName yr19732004.seer9 yr19922004.sjlargak Range 19732004 19922004 Years 31 12 4

2.2 J48
J48 is an implementation of C4.5 [20] which is the suc cessoroftheID3algorithm[19].ID3isbasedoninductive logicprogrammingmethods,constructingadecisiontree based on a training set of data and using an entropy measuretodeterminewhichfeaturesofthetrainingcases areimportanttopopulatetheleavesofthetree. Thealgorithmfirstidentifiesthedominantattributeof thetrainingsetandsetsitastherootofthetree.Second,it
creates a leaf for each of the possible values the root can

yr20002004.cakylonj 20002004

2012 Journal of Computing Press, NY, USA, ISSN 2151-9617

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 8, AUGUST 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

Fig. 1. Diagram of the four phases of pre-processing activities. Starting from the raw data a series of data integrity and consistency checks prepare the data for mining.

Thereisatotalof481,432recordsinthedataset.Itcon tains records of patients diagnosed between 19732004. However, due to the modification and addition of data fields,recordspriorto1988wereremovedforourstudy. The remaining records were filtered to include only those that contain the Extent of Disease (EOD) 10digit codingsystem.Thiscodingsystemprovidesdetailed,rel evantandusefulinformation,suchastumorsize,number ofpositivenodes,numberofprimariesandotherquanti fiersthatgreatlyenrichthedataset. Theeffectofthisfil ter was to restrict the dataset to records containing dia gnoses only between the complete years in the dataset (19882003). 3.2PreprocessingActivities The records in the dataset are not diseasespecific so they include multiple cancer types and disease profiles. Topreparethedataforminingnumerouspreprocessing activities were performed. Figure 1 depicts the prepro cessing workflow. It is divided in four interconnected stageswiththeoverallobjectivetoensurethatdataisap propriate,completeandwellformatted. The input to the preprocessing workflow is the source, unprocessed,raw,datafiles. The first phase (Phase I) comprises of two activities. The first activity is to perform a structural consistency checktoensurethatfieldnamesandrecordstructuresare homogeneousbetweendifferentsourcefiles. Itisnotun common for different year data sets to have different fieldsduetochangesinthedatacollectionmethodology. The second activity is to identify anomalous data ele mentssuchasmissingvaluesanderroneousdata.Forex ample,emptyfieldsmayoftenbedenotedbyleavingthe field blank, labeling it as null or having a designation suchasN/Aor0.Whileanylabeltoindicateamiss ing value is acceptable, the convention followed in the datamustbeidentifiedandconfirmed. Phase II examines the data with respect to relevancy and completeness. Checking for relevancy ensures that pertinentinformationisincludedtoaccomplishthestudy objective. For example, data not containing any explicit informationaboutthepatient'stypeofcancer. Inthiscase,thedatahastobefurtherprocessedtode termineifasynthesisofmultiplefieldscanhelptoderive and substitute the missing information. If it is not pos sible, the dataset may be deemed inappropriate and be discarded.Anotherresultfromthisphaseistoensurethat datagranularityissufficientforthestudyobjective.Ifthe

objective is to identify individual patientbased relation shipsbutthecollectiononlyprovidesinformationatthe group level, then the data would not have the necessary granularityandshouldbediscarded. Inthethirdphase,thedataisexaminedforqualitative consistency. The objective is to ensure the dataset as a wholeismeaningfulanduseful.Forexample,whenmin ingfortheonsettimeofprostatecancer,atypeofcancer appearing exclusively on male patients, qualitative con sistency may indicate that the data includes female pa tients. Suchinclusionpollutesthedatasetandwillmost likely skew the results. Similarly, extraneous ages, such aspatientsover120yearsofageoryoungerthan0years ofage,wouldbeindicatorsofquestionableentriesinthe dataset. Thefourthphaserequiresdatafieldstobecodifiedus ingdifferenttypesofvalues(ex.ordinal,categorical,etc.). Some of the fields can be immediately used for analysis whileothersmustbenormalizedtoprovidemoremean ingfulnumericrangesortobetteridentifydescribedcon ditions. Thefinalpreprocessingtaskisrelatedtoexpec tedformatandsyntaxmodificationsrequiredbythedata miningalgorithmsandsoftware. Ineachofthefourstages,itispossiblethatsomedata maybedeemedincompleteortoocoarseformining. In such cases the data may be discarded, unless the dis cardedsetsignificantlyimpactstheintegrityofthedata set. It is also possible that the mining algorithms can handleanomalousconditionsbylabelingorignoringcer taindata. 3.3Metadata The SEERdata iswelldescribedproviding itemdescrip tions,codes,andcodinginstructions.However,thedata set includes incidence records for all possible cancer types,treatmentsanddeathcausesalongwithawealthof nonidentifiablepatientinformation. After the data is preprocessed, certain data elements present themselves with interesting information while others become candidates for omission. In the next two sections, a brief description of the data elements is provided,followedbyapostprocessingfiltering. Whilethisprocessisiterativeaftereachpreprocessing phase,herewe present thefinal resultofthe endtoend preprocessing. The following data elements were included in the SEERdataset: Age.Patientageatinitiallungcancerdiagnosis.

2012 Journal of Computing Press, NY, USA, ISSN 2151-9617

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 8, AUGUST 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

Race.Patientracedescribedusingoneofsevencategories: White, Black, American Indian/Alaskan Native, Asian or Pa cificIslander,OtherUnspecifiedandUnknown. MaritalStatus.Patientmaritalstatus(Single,Married,Sep arated,Divorced,WidowedandUnknown)atthetimeofdia gnosisforthereportabletumor. PrimarySiteCode.Tumorlocationinthebody,asclassi fiedbytheInternationalClassificationofDiseasesforOn cology(ICDO3)[9]fortopographycodes. Histologic Type. Tumor morphology using 178 distinct ICDO3codes. BehaviorCode.Indicatesiftumorismalignant. TumorSize.Indicatesthetumorsize. Grade. Subjective description of tumor's appearance betweenvisits. Extension of Tumor. Farthest documented extension of tumorawayfromtheprimarysite. Lymph Node Involvement. (Binary value). Indicates if thetumorinvolvesLymphNodechains. Site Specific Surgery Code. Describes body and organ locationsthatcouldhavebeenoperatedon. Radiation.Radiationtherapymethodonfirsttreatment. StageofCancer.Physicallocationandspread. RadiationsequencewithSurgery.Sequenceofadminis tering radiation such as presurgery, postsurgery and pre/postsurgeryradiation. SurvivalCode.Denotesifpatienthassurvived.A set of "survivalcodes"wascreatedtoidentifyclassesofsurviv ing patients. Survival Time Recode expresses the survival time in months, Vital Status Recode indicates if patient is aliveandDeathCodeindicatestherecordedcauseofdeath basedonICD10[28]. 3.4Filtering Miningalgorithmsassumethedatatobenoisefree. The output of particularly the third preprocessing phase (qualitativeconsistencychecks),revealsanumberofinter estingcharacteristicsinthedata. For example, there are two fields that are eliminated fromthedataset.TheBehaviorCodeindicatesifatumoris malignant. In our dataset, the majority (99.9%) of the re cords are identified as malignant and the remaining (00.1%) contain a null value. This field is not meaningful forourparticularstudybecauseunlikeothercancertypes, lung cancers, are assumed to be malignant for all cases. Similarly, it is questionable how meaningful the Grade field is. It holds a codified, subjective description of the tumor'sappearancebetweenphysicianvisits.Whilethere is ordering of the possible values (Table 2) yet it is un known how much worse (or better) the different grades varybetweeneachother. A preliminary analysis indicates that almost half (48.36%) of all the data fall in the Grade IV category. Withinthiscategory,approximatelyhalfoftheentriesare recordedasnotavailable(N/A). Therefore, the Grade field is discarded because it is populatedby subjective observationaldata. Inaddition, the unclear demarcation of categories and information unavailabilitywouldincreasethenoiseinthedatarather

provideameaningfuldiscriminatoryattribute.
TABLE 2 THE GRADE FIELD WITH THE POSSIBLE RANGE OF VALUES. Grade GradeI GradeII GradeIII GradeIV PossibleTumorSizeDescriptions Welldifferentiated Differentiated Moderatelydifferentiated Welldifferentiated Poorlydifferentiated Differentiated Undifferentiated Anaplastic,celltypenotdetermined Notstated NotAvailable(N/A)

Twootherfieldsrequirespecialattention.ThePrimary SiteCodecontains28distinctcodestodescribetheincid enceofcancer.Notallofthecodesareapplicablesoonly the records with codes identifying lung cancer (C340 C343,C348C349)arepreserved. The Age field on the other hand, has values ranging from 1 to 106. This is a very wide value spread which may generate many, nonsignificant data points. As the significance of age ordinality becomes a concern, it is probably not very meaningful to differentiate between a 24yearoldanda25yearoldpatient.Toaddressthis,the dataisdiscretizedintoan8binaryscaleformatwiththe following entries: (024), (2534), (3544), (4554), (5564), (6574),(7584),>=85. 3.5PreliminaryStatisticalAnalysisResults Atthisstageanumberofstatisticalmeasuresandcounts areappliedtoidentifyifthereareanyinterestingpatterns simplybymeasuringthe frequency of occurrence of cer tainelementsandeventsinthedata.Earlyresultsforthe TumorSizekeywordindicateatrendofsurvivingpatient records having this attribute with a value of less than 40mm. Another interesting observation can be made about the Site Specific Surgery Code which describes the body and organ locations that could have been operated on. Themajority(95%) of the records are identified as Lung andBronchusandtheremainingasPleuraandTrachea,Me diastinum,andOtherrespiratoryorgans. On the treatment side, the keyword Radiation, indic atesthemethodofradiationtherapyperformedaspartof the first treatment. While there are almost ten possible categories, 92% of the records in the dataset are in cat egories under Diagnosed at Autopsy and Beam Radiation. The remaining categories are: Radioactive Implants, Ra dioisotopes, Combination Treatment, No Method Specified, Patient Refused Therapy, Radiation Recommended and Un knownIfAdministered. Whilenoneoftheabovemeasurementscanleadtoany conclusions about the correlation of tumor size and sur

2012 Journal of Computing Press, NY, USA, ISSN 2151-9617

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 8, AUGUST 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

vivabilityor,surgerylocationandtreatmentoptions,they offeranalternativeperspectiveonthedata.Perhapsthese findings could prove useful during an association rule learning experiment either as a validation or reenforce mentmechanism.

percentage of patients who survived after being dia gnosedwithlungcancer. 4.1SurvivalTime Thereisaninterestingpatterninthedatawhencompar ingthemediansurvivaltime(MST)inmonthsfromyear toyear. TheMSTisverystablefromyeartoyear,gradu allyincreasing,asexpected.However.betweentheyears 1999and2000,thereisadramaticdecrease.

SURVIVABILITY

Survival rates and statistics indicate the percentage of patientswhomanagetosurvivefromcancerforaspecific amountoftime,typicallyfiveyears(60months).Survival mustbedeterminedascausespecificsurvival,represent ingsurvivalfromlungcancerandnotanyothercause. The ICD10 Death Code designation plays a significant role in this determination as patient records indicating failure to survive due to causes other than lung cancer must be eliminated. This assumes the data contains a single,accurate,diseasespecificcauseofdeathcode.Itis often hard to depend on such variable particularly with cancerpatients,asthecauseofdeathmayoftenbeattrib utedtoadifferentsiteduetotometastasis,insteadofthe primary site. In addition, the survival rates are often vague inspecifying whether a patientis stillundergoing treatmentandifthepatienthasachievedfullorpartialre mission.
ALGOTITHM 1 DETERMINING SURVIVAL STATUS. 1 2 3 4 5 6 7 8 9 10 11 12 DeathCode(fromICD10designation) if(SurvivalTimeRecode>=60months) (VitalStatusRecode="alive")then Survivedtrue; elseif(SurvivalTimeRecode<60months) (causeDeathCode)then Survivedfalse; elseifSurvivalCode=nullthen removerecord; else{Survived=inconclusive;} removerecord; endif

Fig.2.Mediansurvivaltime(inmonths)peryear.

Itis thereforenecessarytoderivea"survived"flagfor eachpatientrecord.ThisflagisderivedusingAlgorithm 1andtakingintoaccounttheDeathCodefieldwiththepa tient'sreportedstatus(VitalStatusRecode)andthereported survivaltime(SurvivalTimeRecode).


TABLE 3 PROFILEOF THE SURVIVED PATIENT CLASS. Period 19881999 Total Status Survived NotSurvived Numberof Patients 14,368 160,123 174,491 % 8.23% 91.77% 100.00%

Themediannumberofsurvivedmonthsinthe16year range is constant for the first seven years, gradually in creasesbetweenyeareightandtwelve,andthendramat icallydecreaseseachyearbyapproximately33%through theendofthedataset(Figure2). Theincreaseinsurvivalrate,whichgoesagainstthe datainFig.2,couldbejustifiedwiththeintroductionofa revolutionary treatment and very early prediction after year2000.Mostlikely,thisdecreaseisduetoamajorin creaseinthenumberofpatientrecords(Table4). Afterfurtherinvestigation,wediscoveredthatthisde clinecoincideswithcodingdifferencesbetweenthemul tipleSEERreportingsites.In2000,anewSEERreporting site was introduced, collecting data from densely and highly populated areas of the U.S. including California andNewJersey.

Any records that contain a null value in the survival code after applying the above algorithm are removed. Table3showsthenumberofinstancesandtherespective

Fig.3.Numberofdiagnosedpatients(inthousands).

2012 Journal of Computing Press, NY, USA, ISSN 2151-9617

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 8, AUGUST 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

Thedatafromthesereportingsitesincreasedthenum berofrecords(Table4)andalongwiththeuseofdifferent coding standards may explain the MST decline in the data. Regionaltrendsoranomaliesmayalsobeanothercon tributingfactorsofurtherexaminationiswarrantedinor dertousethespecificdatasegmentwithconfidence.Asa result,dataafter1999werediscardedfromthestudy.
TABLE 4 MEDIAN SURVIVAL TIME AND NUMBER OF PATIENTS PER YEAR OF DIAGNOSIS (1988-2003). Year 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 MedianSurvival NumberofPatients (months) 7 7 7 7 7 7 7 8 8 8 8 9 6 6 5 4 11,148 11,405 11,726 12,069 16,734 16,764 16,856 17,276 17,530 17,743 18,226 18,395 32,198 31,666 29,959 26,244

TABLE 5 RULES FOR SURVIVABILITY DETERMINATION. Rule Number Rule1 Expression ifpatientlives>60monthsand (status=aliveordiedafterfollowup cutoffdate)thenpatient=survivor ifnosurvivaltimedataexistsorpatient diagnosedinlastyearofour datasetandnotdeadandsurvival<Ts thenpatientrecordisremoved ifpatientdiesandcauseofdeathiscan ceroflungorbronchusand survivaltime<Tsthenpatient=short termsurvivor ifpatientsurvivaltime>Tsthen patient=longtermsurvivor ifnoneofRules24applicablethen patientrecordisremoved

Rule2

Rule3

Rule4 Rule5

EXPERIMENTATION

The purpose of this study is to test the accuracy of two learning algorithms (Naive Bayes and J48) in predicting survivability of patients diagnosed with lung cancer. Afterpreprocessing,thedataisreadytobeprocessedby thetwoalgorithms.
TABLE 6: TRAINING DATA Years 19881992 19921998 19881997 NumberofYears 5 7 10 TargetYear 1993 1999 1998

Another aspect of survivability is the survival length By settingalimit,patientrecordscanbeclassifiedintwosurviv alcategories:shorttermandlongterm.Thislimit,foundby themediansurvivalrate,indicatesthenumberofpatientsin ayear diagnosed within twelve months. To implement this thresholdandidentifylongtermsurvivability,Rule1(Table 5)isusedtoidentifyalongtermsurvivorandclassifythepa tientasa"survivor". Just by applying this rule, 135,878 (30.23%) entries of the 449,446 total patient records werediscardeddueto insuffi cient information thatforbidus to properly classify patients assurvivedorasnotsurvived. WhileRule1establishesmeaningfulconstraintsforlong termsurvivability,itisverystrictanddiscriminatingforthe shortandmidterm. Thisisbecausethemediansurvivaltimeisconsistently lessthanayear,andintheearlyyearsofthedata,itisfairly consistentsothat3540%ofpatientssurvivelongerthana year,albeitthisnumbersignificantlydecreasesafter2000. Toaddresstheseissues,weestablishedadditionalrulesto improvethedeterminationofsurvivability(Table5).

Three experimentation executions were performed, eachunderacommonscenario:usingavariablenumber of sequential years as a training dataset, attempt to pre dict the accuracy of lung cancer incidence for a sub sequentyear(targetyear). Then, to determine the level of accuracy for each al gorithm,comparethepredictedoutcomeswiththetarget data.Severaldatasetsareused,identifiedbythenumber ofyearsoftrainingandtargetdata(Table6).

5.1 Execution
The Naive Bayes algorithm was executed over each one, five,sevenandtenyearTrainingDatadaterangesproducing correctlyclassifiedinstancesrangingfrom88.27%to92.38% (Table 7). For each experiment, we obtained the Correctly Classified Instances (CCI), the Incorrectly Classified In stances (ICI) the Kappa statistic () and the rootmean squareerror (RMSE). The CCI is the sample accuracy withoutanychancecorrectionsanditisexpressedasaper

2012 Journal of Computing Press, NY, USA, ISSN 2151-9617

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 8, AUGUST 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

centageofcorrectlyclassifiedinstances.TheKappastatisticis a chancecorrected measure and reflects the agreement between the classifications and the true classes. Values over zero indicate that the classifier is performing better than chance. The RMSE measures the differences between pre dictedorestimatedvaluesagainstobservedvalues. TABLE 7: DETAILEDEXPERIMENT RESULTS FOR THE NAIVE BAYES ALGORITHM. 5years CCI ICI
RMSE

using the RMSE, the low error rate ofbothalgorithms canserveasanindirectmeasureofeffectiveness.

DISCUSSION

6.1 Classification Results


The objective of the experiments was to compare the strength of the two techniques in survivability prediction of patients diagnosed with lung cancer. While an approximately 90%overallcorrectclassificationaccuracyisachieved,thereis anunexpectedresult. For both the Naive Bayes and the J48 algorithm, the 5 year training dataset produces a higher accuracy rate than the 7yearand10yeartrainingdatasets.Infact,the5year training dataset outperformsandis slightlymoreaccurate than the accuracy provided from the 7 year and 10year trainingdatasets.Itwouldbeexpectedtohaveahigherpre dictionaccuracywithmoretrainingdatabutthisisnotthe case.Therearetwopossiblefactorsatplay.First,thedifference inthenumberofpatientrecordsbetweenthethreedatasetsis not proportional. The 5year training set includes 15,780 lung cancer patient records and the 7year training set includes 17,192.Thisisadifferenceof1,412(an8.94%increase)ofpatient records. On the other hand, the 10year training set includes 17,063 lung cancer patient records, which is a 0.75% decrease (129 patients) than the the 7year data set. So the difference betweenthe5yearand7yeardatasetisnotsignificantlydif ferent.Second,itislikelythatthelargertrainingdatasetmay contain inconsistencies due to changes in the data collection processes and locations. The 5year training set contains data primarilyfromoneSEERreportinglocation.In1992,asecond SEER reporting location began collecting data, which was in cluded in the bulk of the dataset for the 7 year and 10year studies.

7years 15,175 (88.27%) 2,017 (11.73%) 0.53610 0.30750 17,192

10years 15,208 (89.13%) 1,855 (10.87%) 0.51360 0.29770 17,063

14,577 (92.38%) 1,203 (7.62%) 0.49100 0.24710 15,780

Total

The J48 algorithm was also executed over the same Training Data date ranges and produced correctly classi fiedinstancesrangingfrom89.63%to94.44%(Table8).
TABLE 8: DETAILEDEXPERIMENT RESULTS FOR THE J48 ALGORITHM. 5years CCI ICI
RMSE

7years 15,409 (89.63%) 2,017 (11.73%) 0.45090 0.29030 17,192

10years 15,461 (90.61%) 1,855 (10.87%) 0.40210 0.27280 17,063

14,902 (94.44%) 878 (5.56%) 0.38360 0.20970 15,780

Total

CONCLUSION & FUTURE WORK

Comparingtheoverallaccuracyofthetwoalgorithms, it seems that both algorithms performed fairly well, achievingapproximatelya90%correctclassificationrate. The J48 algorithm consistently performed slightly better thantheNaiveBayesalgorithminsuccessfullypredicting thecorrectsurvivabilityoutcome(Table9).
TABLE 9: ACCURACY COMPARISON BETWEEN THE TWO ALGORITHMS PER EXPERIMENT. ID T93 T99 T98 Prediction foryear 1993 1999 1998 Yearsof training 5 7 10 Naive Bayes 92.3764% 88.2678% 89.1285% J48 94.4360% 89.6289% 90.6113%

For all executions, the valuesofthe kappa statistic in dicate the strength of the classifier's stability and in our experimentation,bothalgorithmsperformwell.Similarly,

Thisstudyexaminestheabilityofdataminingandmachine learning methods to accurately predict the survivability of patients diagnosed with lung cancer. Survivability is definedassomeonewholivesbeyonda5yearperiod. Generally,weachieveanapproximatepredictionaccuracy around 90% using either the Naive Bayes or the J48 al gorithm.Asaresultofthisstudy,atreatingphysiciancan theoreticallycollectahandfulofmedicalmeasurementssuch astumorsizeandlocation,treatmentoptions,patientsback ground,etc.,andpredictwithafairlyhighdegreeofaccur acywhetherthepatientislikelytoliveforfiveormoreyears. Consideringthehighmortalityrate(>90%)ofthepatients in the study, it seems reasonable and useful to examine the survivability of patients over a shorter time period, forex ample,betweentwelveandeighteenmonths. Inlightofthe findings and observations of the study, future research will be focused on examining regional trends and the impact of locationonsurvivability.Moreindepthanalysiscanalsobe performed by comparing survivability results with data re lated to regional medical insurance carriers, state laws that affect the medical insurance industry, and other external factors.

2012 Journal of Computing Press, NY, USA, ISSN 2151-9617

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 8, AUGUST 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

Inthispaper,wecomparedtheeffectivenessoftheNaive BayesclassificationalgorithmandtheJ48implementationof theC4.5decision tree algorithm in the prediction of surviv abilityforaspecificdisease. Fromtheresults,theNaiveBayestechniqueremainedtrue toitsreputationanditsportrayalintheliterature:itisrobust, easytointerpret,itoftendoessurprisinglywellbutmaynot bethebestclassifierinanyparticularapplication[15].Inthis case, Naive Bayes performed consistently worse than J48. Yet,itssimplicityandfairlycompetitiveperformancemakeit anappealingalternative.Overall,bothareviableanduseful algorithms and the performance differences between them weremarginal.Moreworkwillbedonetoinvestigateifthese marginaldifferencesexistduetothewaythealgorithmsare implementedorduetoinherentcharacteristicsinthedata.

[10] Garca,V.,Sanchez,J.,Mollinenda,R.(2008)"AnEmpiricalStudyof
the Behavior of Classifiers on Imbalanced and Overlapped Data Sets",LectureNotesinComputerScience,Volume4756,Progress in Pattern Recognition, Image Analysis andApplications ,Pages397406. [11] Hackshaw,AKandLaw,MRandWald,NJ.1997.Theaccumulated evidence on lung cancer and environmental tobacco smoke. British MedicalJournal,315(7114),980988. [12] HallMandEibeFandHolmesGandPfahringerBandReutemann PandWittenIH.2009.TheWEKADataMiningSoftware:AnUp date;SIGKDDExplorations,11(1). [13] Kowsalya,R.Sasikala,G,SangeethaPriyaJ.2010.AcuteCystitisand Acute Nephritis Prediction Using Machine Learning Techniques. GlobalJournalofComputerScienceandTechnology,10(8),1416. [14] Lavesson, Niklas and Davidson, Paul. 2004. A multidimensional measure function for classifier performance. In Proceedings of the 2ndIEEEInternationalConferenceonIntelligentSystems,June2004, pp.508513. [15] OReilly,KatherineMAandMclaughlin, AnneMarieandBeckett, William S and Sime, Patricia J. 2007. Asbestosrelated lung disease., AmericanFamilyPhysician,75(5),6838. [16] PandaMandPatraMR.2008.AComparativeStudyofDataMin ing Algorithms for Network Intrusion Detection. In Proceedings of the 2008 First International Conference on Emerging Trends in En gineering and Technology (ICETET 08). IEEE Computer Society, Washington,DC,USA,504507. [17] Panda M and Patra M R. 2007. Network Intrusion Detection using NaiveBayes,InternationaljournalofComputerScienceandNetwork Security,DecAZ302007,pp.258263. [18] Peddabachigari,S.,Abraham,A.,Grosan,G.,Thomas,J..2007.Mod eling intrusion detection system using hybrid intelligent systems. J. Netw.Comput.Appl.30,1(January2007),114132. [19] QuinlanRJ.1990.DecisionTreesandDecisionMaking.IEEETrans actiononSystemManCyber,March/April1990,20(2). [20] Quinlan R J. 1993. C4.5: Programs for Machine Learning. Morgan KaufmannPublishersInc.,SanFrancisco,CA,USA. [21] Quinlan R J. 1996. Improved use of continuous attributes in C4.5. JournalofArtificialIntelligenceResearch,4:7790,1996. [22] Samet J M and Wiggins C L and Humble C G and Pathak D R. 1998.CigarettesmokingandlungcancerinNewMexico.TheAmer icanReviewofRespiratoryDisease [23] Soman,T.andBobbie,P.O.2005,ClassificationofArrhythmiaUsing Machine Learning Techniques. WSEAS Transactions on Computers, Vol.4,issue6,June2005,pp.548552. [24] Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) Research Data (19732004), National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch, released April 2010, based on the November 2009 submis sion. [25] U.S. Cancer Statistics Working Group. 2010. United States Cancer Statistics: 19992006 Incidence and Mortality Webbased Report. At lanta(GA):DepartmentofHealthandHumanServices,Centersfor DiseaseControlandPrevention,andNationalCancerInstitute;2010. URL:http://www.cdc.gov/uscs. [26] Yanwei Xing, Jie Wang, Zhihong Zhao, and and Yonghong Gao. 2007.CombinationDataMiningMethodswithNewMedicalDatato Predicting Outcome of Coronary Heart Disease. In Proceedings of the2007InternationalConferenceonConvergenceInformationTech nology (ICCIT07). IEEE Computer Society, Washington, DC, USA, 868872.

inComputationalSciencesandTechnology,3(3),321334.

ACKNOWLEDGMENT
TheauthorswishtothankSEERandtheNationalCancerIn stituteforthedata.

REFERENCES
[1] AmericanCancerSociety,Cancer Facts&Figures.American Cancer [2] Bellaachia,A.,Guven,E.PredictingBreastCancerSurvivabilityusing
DataMiningTechniques,NinthWorkshoponMiningScientificand EngineeringDatasetsinconjunctionwiththeSixthSIAMInternation alConferenceonDataMining(SDM2006),Saturday,April22,2006. Biesalski, H K and Bueno de Mesquita, B and Chesson, A and Chytil,FandGrimble,RandHermus,RJandKohrle,JandLotan, R and Norpoth, K and Pastorino, U and Thurnham, D, European Consesus Statement on Lung Cancer: risk factors and prevention. LungCancerPanel..CACancerJClin,48,3,16776;discussion1646, URL: http://www.biomedsearch.com/nih/EuropeanCon sensusStatementLungCancer/9594919.html. Catelinois,OlivierandRogel,AgnesandLaurier,DominiqueandBil lon, Solenne and Hemon, Denis and Verger, Pierre and Tirmarche, Margot,2006,Lungcancerattributabletoindoorradonexposurein France:impactoftheriskmodelsanduncertaintyanalysis.,Environ mentalHealthPerspectives,114(9),13616. Darnton,AndrewJandMcElvenny,DamienMandHodgson,John T,2006,Estimatingthenumberofasbestosrelatedlungcancerdeaths inGreatBritainfrom1980to 2000. TheAnnalsofoccupational hy giene,50(1),2938. Dechang Chen, Kai Xing, Donald Henson, Li Sheng, Arnold M. Schwartz,andXiuzhenCheng.2008.AClusteringApproachinDe velopingPrognosticSystemsofCancerPatients.InProceedingsofthe 2008SeventhInternationalConferenceonMachineLearningandAp plications (ICMLA 08). IEEE Computer Society, Washington, DC, USA,723728. Dursun Delen and Nainish Patil. 2006. Knowledge Extraction from ProstateCancerData.InProceedingsofthe39thAnnualHawaiiIn ternationalConferenceonSystemSciences(HICSS06),Vol.5.IEEE ComputerSociety,Washington,DC,USA,92.2. Fritz, A, Jack, A Percy,C Sobin, L Parkin, D.M..(eds.) International classificationofdiseasesforoncology(ICDO).3ded.Geneva:World HealthOrganization,2000. Gandhi, G M and Srivatsa, S.K. (2010). Classification Algorithms in ComparingClassifierCategoriestoPredicttheAccuracyoftheNet workIntrusionDetectionAMachineLearningApproach.Advances Society,Atlanta,GA,2010,URL:http://www.cancer.org/research/.

[3]

[4]

[5]

[6]

[7]

[8] [9]

2012 Journal of Computing Press, NY, USA, ISSN 2151-9617

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 8, AUGUST 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

[27] YounS.andMcLeodD.2006.AComparativeStudyforEmailClassi
fication,ProceedingsofInternationalJointConferencesonComputer, Information,SystemSciences,andEngineering(CISSE06),Bridgeport CT,December2006. [28] WorldHealthOrganization.19921994.InternationalStatisticalClassi ficationofDiseasesandRelatedHealthProblems(ICD10),10thRe vision.3vols.Geneva:WHO. [29] WittenIanHandFrankEibe.2005.DataMining:PracticalMachine Learning Tools and Techniques. The Morgan Kaufmann Series in Data Management Systems. 2nd Ed. Morgan Kaufmann, 2005. 1988;137(5):11103. George Dimitoglou is Associate Professor of computer science at Hood College. He holds a doctorate in computer science from The George Washington University. He is a member of the IEEE, the ACM and the Mathematical Association of America. James A. Adams is Sr. Systems Analyst with Marriott International. He holds a B.S. from George Mason University in Accounting and Management Information Systems a M.S. in computer science from Hood College where he received the 2008 Departmental Scholar Award. Carol M. Jim is a doctoral student in computer science at The George Washington University. She holds a B.A. in mathematics and computer science and a M.S. in computer science from Hood College where she received the 2010 Departmental Scholar Award.

2012 Journal of Computing Press, NY, USA, ISSN 2151-9617