Data Mining Lab

INDEX

The objective of the lab exercises is to use data mining techniques to identify customer segments and understand their buying behaviour, and to use the standard databases available to understand DM processes using WEKA (or any other DM tool).

1. Gain insight into running pre-defined decision trees and explore the results using MS OLAP Analytics.
2. Using IBM OLAP Miner: understand the use of data mining for evaluating the content of multidimensional cubes.
3. Using Teradata Warehouse Miner: create mining models that are executed in SQL.

BI Portal Lab: The objective of the lab exercises is to integrate pre-built reports into a portal application.

4. Publish Cognos cubes to a business intelligence portal.

Metadata and ETL Lab: The objective of the lab exercises is to implement metadata import agents to pull metadata from leading business intelligence tools and populate a metadata repository, and to understand ETL processes.

5. Import metadata from specific business intelligence tools and populate a metadata repository.
6. Publish metadata stored in the repository.
7. Load data from heterogeneous sources, including text files, into a predefined warehouse schema.

CONTENTS

S.No  Experiment                                                        Week No
 1    Defining weather relation for different attributes                   1
 2    Defining employee relation for different attributes                  2
 3    Defining labor relation for different attributes                     3
 4    Defining student relation for different attributes                   4
 5    Exploring weather relation using Experimenter and obtaining
      results in various schemes                                           5
 6    Exploring employee relation using Experimenter                       6
 7    Exploring labor relation using Experimenter                          7
 8    Exploring student relation using Experimenter                        8
 9    Setting up a flow to load an ARFF file (batch mode) and perform
      a cross-validation using J48                                         9
10    Designing a Knowledge Flow layout to load a data set, apply
      attribute selection, normalize the attributes and store the
      result in a CSV saver                                               10

Aim: Implementation of data mining algorithms using Attribute-Relation File Format (ARFF) data sets.

Introduction to the Weka Data Mining Tool

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset (using the GUI) or called from your own Java code (using the Weka Java library). Tools (or functions) in Weka include: Data preprocessing (e.g., Data Filters), Classification (e.g., BayesNet, KNN, C4.5 Decision Tree, Neural Networks, SVM), Regression (e.g., Linear Regression, Isotonic Regression, SVM for Regression), Clustering (e.g., Simple K-means, Expectation Maximization (EM)), Association rules (e.g., Apriori Algorithm, Predictive Accuracy, Confirmation Guided), Feature Selection (e.g., CFS Subset Evaluation, Information Gain, Chi-squared Statistic), and Visualization (e.g., viewing different two-dimensional plots of the data).
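As a taste of the "own Java code" route mentioned above, the following is a minimal sketch that loads a data set and builds a C4.5 tree (J48). It assumes weka.jar is on the classpath and that a weather.arff file exists in the working directory; both are assumptions, not part of this manual.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WekaFromJava {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // file name is an assumption
            data.setClassIndex(data.numAttributes() - 1);     // last attribute is the class
            J48 tree = new J48();                             // Weka's C4.5 implementation
            tree.buildClassifier(data);
            System.out.println(tree);                         // prints the learned tree
        }
    }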

Launching WEKA

The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI ("multiple document interface") appearance, then this is provided by an alternative launcher called "Main" (class weka.gui.Main). The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus. The buttons can be used to start the following applications:

Explorer: An environment for exploring data with WEKA (the rest of this documentation deals with this application in more detail).

Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.

KnowledgeFlow: This environment supports essentially the same functions as the Explorer, but with a drag-and-drop interface. One advantage is that it supports incremental learning.

Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands, for operating systems that do not provide their own command line interface.
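The GUI Chooser can also be started from a terminal; a typical invocation, assuming weka.jar is in the current directory:

    java -cp weka.jar weka.gui.GUIChooser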

Working with Explorer (Weka Data File Format Input)

The most popular data input format of Weka is "arff" (with .arff being the extension name of your input data file).

Experiment 1: WEATHER RELATION

% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes

PREPROCESSING:

In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept, and three options for loading data into the program:

Open File - allows the user to select files residing on the local machine or recorded medium.
Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.
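The first two loading routes are also available programmatically; a minimal sketch, assuming weka.jar is on the classpath (the file name and URL below are placeholders):

    import weka.core.Instances;
    import weka.core.converters.ArffLoader;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadingRoutes {
        public static void main(String[] args) throws Exception {
            // "Open file": read a local ARFF file
            Instances fromFile = DataSource.read("weather.arff");

            // "Open URL": ArffLoader can also read from an HTTP location
            ArffLoader loader = new ArffLoader();
            loader.setURL("http://example.org/datasets/weather.arff"); // placeholder URL
            Instances fromUrl = loader.getDataSet();

            System.out.println(fromFile.numInstances() + " / " + fromUrl.numInstances());
        }
    }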

CLASSIFICATION:

The user has the option of applying many different algorithms to the data set in order to produce a representation of the information. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices. Figure 5 shows some of the categories.
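What the Classify tab does when Start is pressed can be reproduced in code; a minimal sketch of a 10-fold cross-validation (J48 stands in here for whichever classifier is chosen, and the file name is an assumption):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            data.setClassIndex(data.numAttributes() - 1);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());  // accuracy and error measures
            System.out.println(eval.toMatrixString());   // confusion matrix
        }
    }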

Output:

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Kappa statistic                          0
Mean absolute error                      0.4762
Root mean squared error                  0.4934
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
  1         1         0.643       1        0.783       0.178     yes
  0         0         0           0        0           0.178     no
Weighted Avg.  0.643    0.643     0.413    0.643       0.503     0.178

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 5 0 | b = no

CLUSTERING:

The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to cluster evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.

Output:

=== Run information ===

Scheme:     weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation:   weather
Instances:  14
Attributes: 5
            outlook
            temperature
            humidity
            windy
            play
Test mode:  evaluate on training data

=== Model and evaluation on training set ===

Number of clusters selected by cross validation: 1

Cluster
Attribute              0
                     (1)
========================
outlook
  sunny                6
  overcast             5
  rainy                6
  [total]             17
temperature
  mean           73.5714
  std. dev.       6.3326
humidity
  mean           81.6429
  std. dev.       9.9111
windy
  TRUE                 7
  FALSE                9
  [total]             16
play
  yes                 10
  no                   6
  [total]             16

Clustered Instances

0      14 (100%)

Log likelihood: -9.4063
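The EM run above can be reproduced from Java; a minimal sketch (no class index is set, since the clusterer works on unlabeled data, and the file name is an assumption):

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.EM;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClusterWeather {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            EM em = new EM();               // defaults match "-I 100 -N -1 -M 1.0E-6 -S 100"
            em.buildClusterer(data);
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(em);
            eval.evaluateClusterer(data);   // "evaluate on training data"
            System.out.println(eval.clusterResultsToString());
        }
    }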

[Figure: choosing the relation for clustering]

ASSOCIATION:

The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are few options for this window; one of the most popular, Apriori, is shown in the figure below.
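The run below uses the Explorer's default FilteredAssociator, which wraps Apriori. Calling Apriori directly from Java looks like this; a minimal sketch (Apriori needs nominal attributes, so the nominal version of the weather data is assumed):

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MineRules {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.nominal.arff"); // assumed file
            Apriori apriori = new Apriori(); // defaults: 10 rules, min confidence 0.9
            apriori.buildAssociations(data);
            System.out.println(apriori);     // prints the best rules found
        }
    }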

=== Run information ===

Scheme:     weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:   weather
Instances:  14
Attributes: 5
            outlook
            temperature
            humidity
            windy
            play

SELECTING ATTRIBUTES:

The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wanted to exclude certain categories of the data, they would deselect those specific choices from the list. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options: an attribute evaluator and a search method. Once this is done, the program evaluates the data based on the subset of the attributes, then performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.
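The evaluator/search pair used in the output below (CfsSubsetEval with BestFirst) can be driven from Java as well; a minimal sketch (the file name is an assumption):

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SelectAttributes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff");
            data.setClassIndex(data.numAttributes() - 1);
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new CfsSubsetEval()); // attribute evaluator
            selector.setSearch(new BestFirst());        // search method
            selector.SelectAttributes(data);            // note the capitalised method name
            System.out.println(selector.toResultsString());
        }
    }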

OUTPUT:

=== Run information ===

Evaluator:  weka.attributeSelection.CfsSubsetEval
Search:     weka.attributeSelection.BestFirst -D 1 -N 5
Relation:   weather
Instances:  14
Attributes: 5
            outlook
            temperature
            humidity
            windy
            play
Evaluation mode: evaluate on all training data

=== Attribute Selection on all input data ===

Search Method:
    Best first.
    Start set: no attributes
    Search direction: forward
    Stale search after 5 node expansions
    Total number of subsets evaluated: 11
    Merit of best subset found: 0.196

Attribute Subset Evaluator (supervised, Class (nominal): 5 (play)):
    CFS Subset Evaluator
    Including locally predictive attributes

Selected attributes: 1,4 :
    outlook
    windy

VISUALIZATION:

The last tab in the window is the Visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two-dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.


Experiment 2: EMPLOYEE RELATION

INPUT:

% ARFF file for employee data with some numeric features
@relation employee
@attribute ename {john, tony, ravi}
@attribute eid numeric
@attribute esal numeric
@attribute edept {sales, admin}
@data
john, 85, 8500, sales
tony, 85, 9500, admin
john, 85, 8500, sales

OUTPUT

PREPROCESSING:

The data set is loaded into WEKA as described under Experiment 1 (Open File, Open URL, or Open Database).

CLASSIFICATION:

The Classify tab applies the chosen classifier to the employee relation (see Experiment 1). The default ZeroR scheme predicts the most frequent class value.

OUTPUT:

=== Run information ===

Scheme:     weka.classifiers.rules.ZeroR
Relation:   employee
Instances:  3
Attributes: 4
            ename
            eid
            esal
            edept
Test mode:  10-fold cross-validation

=== Classifier model (full training set) ===

ZeroR predicts class value: sales

Time taken to build model: 0 seconds

CLUSTERING:

The Cluster tab groups occurrences within the data set (see Experiment 1 for the available options).

OUTPUT:

=== Run information ===

Scheme:     weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation:   employee
Instances:  3
Attributes: 4
            ename
            eid
            esal
            edept
Test mode:  evaluate on training data

=== Model and evaluation on training set ===

Number of clusters selected by cross validation: 1

Cluster
Attribute             0
                    (1)
=======================
ename
  john                3
  tony                2
  ravi                1
  [total]             6
eid
  mean               85
  std. dev.           0
esal
  mean        8833.3333
  std. dev.    471.4045
edept
  sales               3
  admin               2
  [total]             5

Clustered Instances

0      3 (100%)

Log likelihood: 3.84763

ASSOCIATION:

The Associate tab finds association rules within the data set (see Experiment 1).

=== Run information ===

Scheme:     weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:   employee
Instances:  3
Attributes: 4
            ename
            eid
            esal
            edept

SELECTING ATTRIBUTES:

The Select attributes tab chooses an attribute evaluator and a search method, as described under Experiment 1.

OUTPUT:

=== Attribute Selection on all input data ===

Search Method:
    Best first.
    Start set: no attributes
    Search direction: forward
    Stale search after 5 node expansions
    Total number of subsets evaluated: 11
    Merit of best subset found: 0.196

Attribute Subset Evaluator (supervised, Class (nominal): 5 (play)):
    CFS Subset Evaluator
    Including locally predictive attributes

Selected attributes: 1,4 :
    outlook
    windy

VISUALIZATION:

The Visualization tab displays the data in two-dimensional plots, as described under Experiment 1.

Experiment 3: STUDENT RELATION

% ARFF file for student data with some numeric features
@relation student
@attribute sname {john, tony, ravi}
@attribute sid numeric
@attribute sbranch {ECE, CSE, IT}
@attribute sage numeric
@data
john, 285, ECE, 19
tony, 385, IT, 20
john, 485, ECE, 19

PREPROCESSING:

The data set is loaded into WEKA as described under Experiment 1 (Open File, Open URL, or Open Database).

CLASSIFICATION:

The Classify tab applies the chosen classifier to the student relation (see Experiment 1). Since the class attribute sage is numeric, ZeroR predicts the class mean.

Output:

=== Run information ===

Scheme:     weka.classifiers.rules.ZeroR
Relation:   student
Instances:  3
Attributes: 4
            sname
            sid
            sbranch
            sage
Test mode:  2-fold cross-validation

=== Classifier model (full training set) ===

ZeroR predicts class value: 19.33333333333333

Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient         -0.5
Mean absolute error              0.5
Root mean squared error          0.6455
Relative absolute error        100      %
Root relative squared error    100      %
Total Number of Instances        3

CLUSTERING:

The Cluster tab groups occurrences within the data set (see Experiment 1 for the available options).

Scheme:     weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation:   weather
Instances:  14
Attributes: 5
            outlook
            temperature
            humidity
            windy
            play
Test mode:  evaluate on training data

=== Model and evaluation on training set ===

Number of clusters selected by cross validation: 1

Cluster
Attribute              0
                     (1)
========================
outlook
  sunny                6
  overcast             5
  rainy                6
  [total]             17
temperature
  mean           73.5714
  std. dev.       6.3326
humidity
  mean           81.6429
  std. dev.       9.9111
windy
  TRUE                 7
  FALSE                9
  [total]             16
play
  yes                 10
  no                   6
  [total]             16

Clustered Instances

0      14 (100%)

Log likelihood: -9.4063

ASSOCIATION:

The Associate tab finds association rules within the data set (see Experiment 1).

=== Run information ===

Scheme:     weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:   student
Instances:  3
Attributes: 4
            sname
            sid
            sbranch
            sage

SELECTING ATTRIBUTES:

The Select attributes tab chooses an attribute evaluator and a search method, as described under Experiment 1.

Search Method:
    Best first.
    Start set: no attributes
    Search direction: forward
    Stale search after 5 node expansions
    Total number of subsets evaluated: 7
    Merit of best subset found: 1

Attribute Subset Evaluator (supervised, Class (numeric): 4 (sage)):
    CFS Subset Evaluator
    Including locally predictive attributes

Selected attributes: 1,3 :
    sname
    sbranch

VISUALIZATION:

The Visualization tab displays the student data in two-dimensional plots, as described under Experiment 1.

Experiment 4: LABOR RELATION

% ARFF file for labor data with some numeric features
@relation labor
@attribute name {rom, tony, santu}
@attribute wage-increase-first-year numeric
@attribute wage-increase-second-year numeric
@attribute working-hours numeric
@attribute pension numeric
@attribute vacation numeric
@data
rom, 500, 600, 8, 200, 15
tony, 400, 450, 8, 200, 15
santu, 600, 650, 8, 200, 15

PREPROCESSING:

The data set is loaded into WEKA as described under Experiment 1 (Open File, Open URL, or Open Database).

CLASSIFICATION:

The Classify tab applies the chosen classifier to the labor relation (see Experiment 1). Since the class attribute vacation is numeric, ZeroR predicts the class mean.

Output:

=== Run information ===

Scheme:     weka.classifiers.rules.ZeroR
Relation:   labor
Instances:  3
Attributes: 6
            name
            wage-increase-first-year
            wage-increase-second-year
            working-hours
            pension
            vacation
Test mode:  2-fold cross-validation

=== Classifier model (full training set) ===

ZeroR predicts class value: 15.0

Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient          0
Mean absolute error              0
Root mean squared error          0
Relative absolute error        NaN %
Root relative squared error    NaN %
Total Number of Instances        3

CLUSTERING:

The Cluster tab groups occurrences within the data set (see Experiment 1 for the available options).

Scheme:     weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation:   labor
Instances:  3
Attributes: 6
            name
            wage-increase-first-year
            wage-increase-second-year
            working-hours
            pension
            vacation
Test mode:  evaluate on training data

=== Model and evaluation on training set ===

Number of clusters selected by cross validation: 1

Cluster
Attribute                      0
                             (1)
================================
name
  rom                          2
  tony                         2
  santu                        2
  [total]                      6
wage-increase-first-year
  mean                       500
  std. dev.              81.6497
wage-increase-second-year
  mean                  566.6667
  std. dev.              84.9837
working-hours
  mean                         8
  std. dev.                    0
pension
  mean                       200
  std. dev.                    0
vacation
  mean                        15
  std. dev.                    0

Clustered Instances

0      3 (100%)

Log likelihood: 25.90833

ASSOCIATION:

The Associate tab finds association rules within the data set (see Experiment 1).

=== Run information ===

Scheme:     weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:   labor
Instances:  3
Attributes: 6
            name
            wage-increase-first-year
            wage-increase-second-year
            working-hours
            pension
            vacation

SELECTING ATTRIBUTES:

The Select attributes tab chooses an attribute evaluator and a search method, as described under Experiment 1.

=== Attribute Selection on all input data ===

Search Method:
    Best first.
    Start set: no attributes
    Search direction: forward
    Stale search after 5 node expansions
    Total number of subsets evaluated: 19
    Merit of best subset found: 0

Attribute Subset Evaluator (supervised, Class (numeric): 6 (vacation)):
    CFS Subset Evaluator
    Including locally predictive attributes

Selected attributes: 1 :
    name

VISUALIZATION:

The Visualization tab displays the labor data in two-dimensional plots, as described under Experiment 1. Once a specific plot has been selected, the user can change the attributes from one view to another.

EXPERIMENTER:

The Weka Experiment Environment enables the user to create, run, modify, and analyse experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyse the results to determine if one of the schemes is (statistically) better than the other schemes. The Experiment Environment can also be run from the command line using the Simple CLI.

Experiment 5

COMMAND LINE:

java weka.experiment.Experiment -r -T data/weather.arff
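The Experimenter GUI described below can likewise be launched from a terminal; a typical invocation, assuming weka.jar is in the current directory:

    java -cp weka.jar weka.gui.experiment.Experimenter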

Defining an Experiment

When the Experimenter is started, the Setup window (actually a pane) is displayed. Click "New" to initialize an experiment. This causes default parameters to be defined for the experiment.

To define the dataset to be processed by a scheme, first select "Use relative paths" in the Datasets panel of the Setup window, and then click "Add new" to open the dialog box below.

Data Mining Lab

Select iris.arff and click Open to select the iris dataset.

The dataset name is now displayed in the Datasets panel of the Setup window.

Saving the Results of the Experiment

To identify a dataset to which the results are to be sent, click on the "CSVResultListener" entry in the Destination panel. Note that this window (and other similar windows in Weka) is not initially expanded, and some of the information in the window is not visible. Drag the bottom right-hand corner of the window to resize the window until the scroll bars disappear.

The output file parameter is near the bottom of the window, beside the text "outputFile". Click on this parameter to display a file selection window.


Type the name of the output file, click Select, and then click close (x). The file name is displayed in the outputFile panel. Click on OK to close the window.

The dataset name is displayed in the Destination panel of the Setup window.

Saving the Experiment Definition

The experiment definition can be saved at any time. Select "Save..." at the top of the Setup window. Type the dataset name with the extension "exp" (or select the dataset name if the experiment definition dataset already exists).

The experiment can be restored by selecting Open in the Setup window and then selecting Experiment1.exp in the dialog window.

Running an Experiment

To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 randomized train and test runs on the iris dataset, using 66% of the patterns for training and 34% for testing, and using the ZeroR scheme.

Click Start to run the experiment.

If the experiment was defined correctly, the 3 messages shown above will be displayed in the Log panel. The results of the experiment are saved to the dataset Experiment1.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998542537265609,1.5998542537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48409218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?

Experiment 6

Aim: To set up standard experiments that are run locally on a single machine, or remote experiments that are distributed between several hosts, for the employee relation.

Type this command in the Simple CLI:

java weka.experiment.Experiment -r -T data/emp.arff

Add a new relation using the "Add new" button on the right panel, give the database connection using JDBC, and click OK.

Choose the relation and click the OK button.

Choose ZeroR from the "Choose" menu by clicking the "Add new" button on the right panel and click OK.

Click on the Run tab to get the output.

The results of the experiment are saved to the dataset Experiment2.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998542537265609,1.5998542537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48409218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?

Experiment 7

Aim: To set up standard experiments that are run locally on a single machine, or remote experiments that are distributed between several hosts, for the labor relation.

Type this command in the Simple CLI:

java weka.experiment.Experiment -r -T data/labor.arff

Add a new relation using the "Add new" button on the right panel, give the database connection using JDBC, and click OK.

Choose the relation and click the OK button.


Choose ZeroR from the "Choose" menu by clicking the "Add new" button on the right panel and click OK.

Click on the Run tab to get the output.

The results of the experiment are saved to the dataset Experiment3.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998542537265609,1.5998542537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48409218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?

Experiment 8

Aim: To set up standard experiments that are run locally on a single machine, or remote experiments that are distributed between several hosts, for the student relation.

Type this command in the Simple CLI:

java weka.experiment.Experiment -r -T data/student.arff

Add a new relation using the "Add new" button on the right panel, give the database connection using JDBC, and click OK.

Choose the relation and click the OK button.

Choose ZeroR from the "Choose" menu by clicking the "Add new" button on the right panel and click OK.

Click on the Run tab to get the output.

The results of the experiment are saved to the dataset Experiment4.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998542537265609,1.5998542537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48409218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?

KNOWLEDGE FLOW

The Knowledge Flow provides an alternative to the Explorer as a graphical front end to Weka's core algorithms. The Knowledge Flow is a work in progress, so some of the functionality from the Explorer is not yet available. On the other hand, there are things that can be done in the Knowledge Flow but not in the Explorer.


The Knowledge Flow presents a "data-flow" inspired interface to Weka. The user can select Weka components from a tool bar, place them on a layout canvas and connect them together in order to form a "knowledge flow" for processing and analyzing data. At present, all of Weka's classifiers, filters, clusterers, loaders and savers are available in the KnowledgeFlow, along with some extra tools.

Features of the KnowledgeFlow:
* intuitive data flow style layout
* process data in batches or incrementally
* process multiple batches or streams in parallel (each separate flow executes in its own thread)
* chain filters together
* view models produced by classifiers for each fold in a cross validation
* visualize performance of incremental classifiers during processing (scrolling plots of classification accuracy, RMS error, predictions etc.)

Components available in the KnowledgeFlow:

DataSources: All of Weka's loaders are available.
DataSinks: All of Weka's savers are available.
Filters: All of Weka's filters are available.
Classifiers: All of Weka's classifiers are available.
Clusterers: All of Weka's clusterers are available.

Evaluation:
  TrainingSetMaker - make a data set into a training set.
  TestSetMaker - make a data set into a test set.
  CrossValidationFoldMaker - split any data set, training set or test set into folds.
  TrainTestSplitMaker - split any data set, training set or test set into a training set and a test set.
  ClassAssigner - assign a column to be the class for any data set, training set or test set.
  ClassValuePicker - choose a class value to be considered as the "positive" class. This is useful when generating data for ROC style curves (see below).
  ClassifierPerformanceEvaluator - evaluate the performance of batch trained/tested classifiers.
  IncrementalClassifierEvaluator - evaluate the performance of incrementally trained classifiers.
  ClustererPerformanceEvaluator - evaluate the performance of batch trained/tested clusterers.
  PredictionAppender - append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions.

Visualization:
  DataVisualizer - component that can pop up a panel for visualizing data in a single large 2D scatter plot.
  ScatterPlotMatrix - component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot).
  AttributeSummarizer - component that can pop up a panel containing a matrix of histogram plots, one for each of the attributes in the input data.
  ModelPerformanceChart - component that can pop up a panel for visualizing threshold (i.e. ROC style) curves.
  TextViewer - component for showing textual data. Can show data sets, classification performance statistics etc.
  GraphViewer - component that can pop up a panel for visualizing tree based models.
  StripChart - component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers).

Launching the KnowledgeFlow

The Weka GUI Chooser window is used to launch Weka's graphical environments. Select the button labeled "KnowledgeFlow" to start the KnowledgeFlow. Alternatively, you can launch the KnowledgeFlow from a terminal window by typing "java weka.gui.beans.KnowledgeFlow". At the top of the KnowledgeFlow window are seven tabs: DataSources, DataSinks, Filters, Classifiers, Clusterers, Evaluation and Visualization. The names are pretty much self-explanatory.


Experiment 9

Aim: Setting up a flow to load an ARFF file (batch mode) and perform a cross-validation using J48 (Weka's C4.5 implementation).

First start the KnowledgeFlow. Next click on the DataSources tab and choose "ArffLoader" from the toolbar (the mouse pointer will change to "cross hairs").

>ext place the Arff%oader component on the layout area by clic7ing some2here on the layout A copy of the Arff%oader icon 2ill appear on the layout area!) >ext specify an arff file to load by first right clic7ing the mouse over the Arff%oader icon on the layout) A pop'up menu 2ill appear) 5elect N=onfigureN under NEditN in the list from this menu and bro2se to the location of your arff file)

Alternatively8 you can


SVCET,Chittoor 49

Data Mining Lab


double'clic7 on the icon to bring up the configuration dialog

>ext clic7 the NEvaluationN tab at the top of the 2indo2 and choose the N=lassAssignerN allo2s you to choose 2hich column to be the class! component from the toolbar) $lace this on the layout)

>o2 connect the Arff%oader to the =lassAssigner& first right clic7


SVCET,Chittoor 50

Data Mining Lab


over the Arff%oader and select the Ndata5etN under N=onnectionsN in the menu) A Nrubber bandN line 2ill appear)

Move the mouse over the ClassAssigner component and left click; a red line labeled "dataSet" will connect the two components.


>ext right clic7 over the =lassAssigner and choose N=onfigureN from the menu) This 2ill pop up a 2indo2 from 2hich you can specify 2hich column is the class in your data last is the default!)


>ext grab a N=ross?alidation<oldMa7erN component from the Evaluation toolbar and place it on the layout)

=onnect the =lassAssigner to the =ross?alidation<oldMa7er by right clic7ing over N=lassAssignerN and selecting Ndata5etN from under N=onnectionsN in the menu)


>ext clic7 on the N=lassifiersN tab at the top of the 2indo2 and scroll along the toolbar until you reach the N60,N component in the NtreesN section)

$lace a 60, component on the layout)

=onnect the =ross?alidation<oldMa7er to 60, TW#=E by first choosing Ntraining5etN and then Ntest5etN from the pop'up menu for the =ross?alidation<oldMa7er)


>ext go bac7 to the NEvaluationN tab and place a N=lassifier$erformanceEvaluatorN component on the layout)


Connect J48 to this component by selecting the "batchClassifier" entry from the pop-up menu for J48.


>ext go to the N?isuali9ationN toolbar and place a NText?ie2erN component on the layout)

=onnect the =lassifier$erformanceEvaluator to the Text?ie2er by selecting the NtextN entry from the pop'up menu for =lassifier$erformanceEvaluator)


>o2 start the flo2 executing by selecting N5tart loadingN from the pop'up menu for Arff%oader)


When finished, you can view the results by choosing "Show results" from the pop-up menu for the TextViewer component.

Simple CLI

The Simple CLI provides full access to all Weka classes, i.e., classifiers, filters, clusterers, etc., but without the hassle of the CLASSPATH (it facilitates the one with which Weka was started). It offers a simple Weka shell with separated command line and output.

Commands

The following commands are available in the Simple CLI:

java <classname> [<args>] - invokes a java class with the given arguments (if any)
break - stops the current thread, e.g., a running classifier, in a friendly manner
kill - stops the current thread in an unfriendly fashion
cls - clears the output area
exit - exits the Simple CLI
help [<command>] - provides an overview of the available commands if used without a command name as argument, otherwise more help on the specified command

Command redirection

Starting with this version of Weka one can perform a basic redirection:

java weka.classifiers.trees.J48 -t test.arff > j48.txt

Note: the > must be preceded and followed by a space, otherwise it is not recognized as redirection but as part of another parameter.

Command completion

Commands starting with java support completion for classnames and filenames via Tab (Alt+BackSpace deletes parts of the command again). In case there are several matches, Weka lists all possible matches.

Package name completion:

java weka.cl<Tab>

results in the following output of possible matches of package names:

Possible matches:
  weka.classifiers
  weka.clusterers

Classname completion:

java weka.classifiers.meta.A<Tab>

lists the following classes:

Possible matches:
  weka.classifiers.meta.AdaBoostM1
  weka.classifiers.meta.AdditiveRegression
  weka.classifiers.meta.AttributeSelectedClassifier

Filename completion:

In order for Weka to determine whether the string under the cursor is a classname or a filename, filenames need to be absolute (Unix/Linux: /some/path/file; Windows: C:\Some\Path\file) or relative and starting with a dot (Unix/Linux: ./some/other/path/file; Windows: .\Some\Other\Path\file).


EXPERIMENT 10

AIM: To design a Knowledge Flow layout to load a data set, apply attribute selection, normalize the attributes, and store the result with a CSV saver.

Procedure:
1) Click on "KnowledgeFlow" in the Weka GUI Chooser. It opens a window called "Weka Knowledge Flow environment".
2) Click on "DataSources" and select "ArffLoader" to read data in the ARFF format.
3) Now click on the knowledge flow layout area, which places the ArffLoader in the layout.
4) Click on "Filters" and select an attribute selection filter ("AttributeSelection") from the supervised filters. Place it on the design layout.
5) Now select another filter, "Normalize", to normalize the numeric attribute values, from the unsupervised filters. Place it on the design layout.
6) Click on "DataSinks" and choose "CSVSaver", which writes to a destination in CSV format. Place it on the design layout of the knowledge flow.
7) Now right click on "ArffLoader" and click on "dataSet" to direct the flow to "AttributeSelection".
8) Now right click on "AttributeSelection" and select "dataSet" to direct the flow to "Normalize", from which the flow is directed to the CSVSaver in the same way.
9) Right click on the CSVSaver and click on "Configure" to specify the destination where the results are to be stored; here z:\weka@ravi was selected.
10) Now right click on "ArffLoader" and select "Configure" to specify the source data. Here the iris relation was selected as the source.
11) Now again right click on the "ArffLoader" and click on "Start loading", which executes the knowledge flow layout built above.
12) We can observe the results of the above process by opening the resulting file (z:\weka@ravi\iris-weka.filters.supervised.attribute..., a comma separated values file) in Notepad, which displays the results in comma separated form:

petallength   petalwidth   class
0.067797      0.041667     Iris-setosa
0.067797      0.041667     Iris-setosa
0.050847      0.041667     Iris-setosa
0.627119      0.541667     Iris-versicolor
0.830508      0.833333     Iris-virginica
0.677966      0.791667     Iris-virginica
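The same flow can be expressed in Java; a minimal sketch chaining the two filters and the CSV saver (file names are assumptions, and the supervised AttributeSelection filter uses CFS with BestFirst by default):

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.CSVSaver;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.AttributeSelection;
    import weka.filters.unsupervised.attribute.Normalize;

    public class FlowAsCode {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Supervised attribute selection filter
            AttributeSelection select = new AttributeSelection();
            select.setInputFormat(data);
            Instances selected = Filter.useFilter(data, select);

            // Normalize numeric attributes to [0,1]
            Normalize norm = new Normalize();
            norm.setInputFormat(selected);
            Instances normalized = Filter.useFilter(selected, norm);

            // Write the result as CSV
            CSVSaver saver = new CSVSaver();
            saver.setInstances(normalized);
            saver.setFile(new File("iris-selected-normalized.csv")); // path is an assumption
            saver.writeBatch();
        }
    }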

Description of the Ger'an credit dataset in = ttri)ute 6elation 7ile 7or'at> 7or'at5
Structure of 677 7or'at5
Dcomment lines Erelation relation name Eattribute attribute name EData 5et of data items separated by commas) % *) Title& :erman =redit data % % -) 5ource #nformation D % $rofessor Dr) Wans Wofmann % #nstitut fNur 5tatisti7 und NC7onometrie % ;niversitNat Wamburg % <" Wirtschafts2issenschaften % ?on'Melle'$ar7 1 % -444 Wamburg */ % % /) >umber of #nstances& *444 % % T2o datasets are provided) the original dataset8 in the form provided % by $rof) Wofmann8 contains categoricalRsymbolic attributes and % is in the file Ngerman)dataN) D % <or algorithms that need numerical attributes8 5trathclyde ;niversity % produced the file Ngerman)data'numericN) This file has been edited % and several indicator variables added to ma7e it suitable for % algorithms 2hich cannot cope 2ith categorical variables) 5everal % attributes that are ordered categorical such as attribute *+! have % been coded as integer) This 2as the form used by 5tat%og) D D % 3) >umber of Attributes german& -4 + numerical8 */ categorical! % >umber of Attributes german)numer& -0 -0 numerical! D D % +) Attribute description for german % % Attribute *& qualitative! % 5tatus of existing chec7ing account D A** & ))) J 4 DM % A*- & 4 JI ))) J -44 DM % A*/ & ))) YI -44 DM R D salary assignments for at least * year % A*0 & no chec7ing account

% Attribute 2: (numerical)
% Duration in month
%
% Attribute 3: (qualitative)
% Credit history
% A30 : no credits taken/ all credits paid back duly
% A31 : all credits at this bank paid back duly
% A32 : existing credits paid back duly till now
% A33 : delay in paying off in the past
% A34 : critical account/ other credits existing (not at this bank)
%
% Attribute 4: (qualitative)
% Purpose
% A40 : car (new)
% A41 : car (used)
% A42 : furniture/equipment
% A43 : radio/television
% A44 : domestic appliances
% A45 : repairs
% A46 : education
% A47 : (vacation - does not exist?)
% A48 : retraining
% A49 : business
% A410 : others
%
% Attribute 5: (numerical)
% Credit amount
%
% Attribute 6: (qualitative)
% Savings account/bonds
% A61 : ... < 100 DM
% A62 : 100 <= ... < 500 DM
% A63 : 500 <= ... < 1000 DM
% A64 : ... >= 1000 DM
% A65 : unknown/ no savings account
%
% Attribute 7: (qualitative)
% Present employment since
% A71 : unemployed
% A72 : ... < 1 year
% A73 : 1 <= ... < 4 years
% A74 : 4 <= ... < 7 years
% A75 : ... >= 7 years
%
% Attribute 8: (numerical)

% Installment rate in percentage of disposable income

% Attribute 9: (qualitative)
% Personal status and sex
% A91 : male : divorced/separated
% A92 : female : divorced/separated/married
% A93 : male : single
% A94 : male : married/widowed
% A95 : female : single
%
% Attribute 10: (qualitative)
% Other debtors / guarantors
% A101 : none
% A102 : co-applicant
% A103 : guarantor
%
% Attribute 11: (numerical)
% Present residence since
%
% Attribute 12: (qualitative)
% Property
% A121 : real estate
% A122 : if not A121 : building society savings agreement/ life insurance
% A123 : if not A121/A122 : car or other, not in attribute 6
% A124 : unknown / no property
%
% Attribute 13: (numerical)
% Age in years
%
% Attribute 14: (qualitative)
% Other installment plans
% A141 : bank
% A142 : stores
% A143 : none
%
% Attribute 15: (qualitative)
% Housing
% A151 : rent
% A152 : own
% A153 : for free
%
% Attribute 16: (numerical)
% Number of existing credits at this bank
%
% Attribute 17: (qualitative)
% Job
% A171 : unemployed/ unskilled - non-resident

% A172 : unskilled - resident
% A173 : skilled employee / official
% A174 : management/ self-employed/

% highly qualified employee/ officer
%
% Attribute 18: (numerical)
% Number of people being liable to provide maintenance for
%
% Attribute 19: (qualitative)
% Telephone
% A191 : none
% A192 : yes, registered under the customer's name
%
% Attribute 20: (qualitative)
% foreign worker
% A201 : yes
% A202 : no
%
% 8. Cost Matrix
%
% This dataset requires use of a cost matrix (see below):
%
%        1    2
% 1      0    1
% 2      5    0

% (1 = Good, 2 = Bad)
%
% The rows represent the actual classification and the columns
% the predicted classification.
%
% It is worse to class a customer as good when they are bad (5),
% than it is to class a customer as bad when they are good (1).
%
% Relabeled values in attribute checking_status

% From: A11  To: "<0"
% From: A12  To: "0<=X<200"
% From: A13  To: ">=200"
% From: A14  To: "no checking"
%
% Relabeled values in attribute credit_history
% From: A30  To: "no credits/all paid"
% From: A31  To: "all paid"
% From: A32  To: "existing paid"
% From: A33  To: "delayed previously"
% From: A34  To: "critical/other existing credit"
% Relabeled values in attribute purpose
% From: A40  To: "new car"
% From: A41  To: "used car"
% From: A42  To: furniture/equipment
% From: A43  To: radio/tv
% From: A44  To: "domestic appliance"
% From: A45  To: repairs
% From: A46  To: education
% From: A47  To: vacation
% From: A48  To: retraining
% From: A49  To: business
% From: A410 To: other
%
% Relabeled values in attribute savings_status
% From: A61  To: "<100"
% From: A62  To: "100<=X<500"
% From: A63  To: "500<=X<1000"
% From: A64  To: ">=1000"
% From: A65  To: "no known savings"
%
% Relabeled values in attribute employment
% From: A71  To: unemployed
% From: A72  To: "<1"
% From: A73  To: "1<=X<4"
% From: A74  To: "4<=X<7"
% From: A75  To: ">=7"
%
% Relabeled values in attribute personal_status
% From: A91  To: "male div/sep"
% From: A92  To: "female div/dep/mar"
% From: A93  To: "male single"
% From: A94  To: "male mar/wid"
% From: A95  To: "female single"
%
% Relabeled values in attribute other_parties
% From: A101 To: none
% From: A102 To: "co applicant"

% From: A103 To: guarantor
%
% Relabeled values in attribute property_magnitude
% From: A121 To: "real estate"
% From: A122 To: "life insurance"

% From: A123 To: car
% From: A124 To: "no known property"
%
% Relabeled values in attribute other_payment_plans
% From: A141 To: bank
% From: A142 To: stores
% From: A143 To: none
%
% Relabeled values in attribute housing
% From: A151 To: rent
% From: A152 To: own
% From: A153 To: "for free"
%
% Relabeled values in attribute job
% From: A171 To: "unemp/unskilled non res"
% From: A172 To: "unskilled resident"
% From: A173 To: skilled
% From: A174 To: "high qualif/self emp/mgmt"
%
% Relabeled values in attribute own_telephone
% From: A191 To: none
% From: A192 To: yes
%
% Relabeled values in attribute foreign_worker
% From: A201 To: yes
% From: A202 To: no
%
% Relabeled values in attribute class
% From: 1 To: good
% From: 2 To: bad

@relation german_credit

@attribute checking_status {"<0","0<=X<200",">=200","no checking"}
@attribute duration real
@attribute credit_history {"no credits/all paid","all paid","existing paid","delayed previously","critical/other existing credit"}
@attribute purpose {"new car","used car",furniture/equipment,radio/tv,"domestic appliance",repairs,education,vacation,retraining,business,other}
@attribute credit_amount real
@attribute savings_status {"<100","100<=X<500","500<=X<1000",">=1000","no known savings"}
@attribute employment {unemployed,"<1","1<=X<4","4<=X<7",">=7"}
@attribute installment_commitment real

@attribute personal_status {"male div/sep","female div/dep/mar","male single","male mar/wid","female single"}
@attribute other_parties {none,"co applicant",guarantor}
@attribute residence_since real
@attribute property_magnitude {"real estate","life insurance",car,"no known property"}
@attribute age real
@attribute other_payment_plans {bank,stores,none}
@attribute housing {rent,own,"for free"}
@attribute existing_credits real
@attribute job {"unemp/unskilled non res","unskilled resident",skilled,"high qualif/self emp/mgmt"}
@attribute num_dependents real
@attribute own_telephone {none,yes}
@attribute foreign_worker {yes,no}
@attribute class {good,bad}
@data
"<0",6,"critical/other existing credit",radio/tv,1169,"no known savings",">=7",4,"male single",none,4,"real estate",67,none,own,2,skilled,1,yes,yes,good
"0<=X<200",48,"existing paid",radio/tv,5951,"<100","1<=X<4",2,"female div/dep/mar",none,2,"real estate",22,none,own,1,skilled,1,none,yes,bad
"no checking",12,"critical/other existing credit",education,2096,"<100","4<=X<7",2,"male single",none,3,"real estate",49,none,own,1,"unskilled resident",2,none,yes,good
"<0",42,"existing paid",furniture/equipment,7882,"<100","4<=X<7",2,"male single",guarantor,4,"life insurance",45,none,"for free",1,skilled,2,none,yes,good
"<0",24,"delayed previously","new car",4870,"<100","1<=X<4",3,"male single",none,4,"no known property",53,none,"for free",2,skilled,2,none,yes,bad
"no checking",36,"existing paid",education,9055,"no known savings","1<=X<4",2,"male single",none,4,"no known property",35,none,"for free",1,"unskilled resident",2,yes,yes,good

Lab Experiments
1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
From the German Credit Assessment case study given to us, the following attributes are found to be applicable for credit-risk assessment:

Total Valid Attributes:
1. checking_status
2. duration
3. credit_history
4. purpose
5. credit_amount
6. savings_status
7. employment
8. installment rate
9. personal status
10. debtors
11. residence_since
12. property
13. age
14. installment plans
15. housing
16. existing credits
17. job
18. num_dependents
19. telephone
20. foreign worker
Categorical or Nominal attributes (which take true/false or category values):

Real-valued attributes:
1. duration
2. credit_amount
3. installment_commitment
4. residence_since
5. age
6. existing_credits
7. num_dependents

1. checking_status
2. credit_history
3. purpose
4. savings_status
5. employment
6. personal_status
7. other_parties (debtors)
8. property_magnitude
9. other_payment_plans (installment plans)
10. housing
11. job
12. own_telephone
13. foreign_worker

2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.
According to me, the following attributes may be crucial in making the credit-risk assessment:
1. credit_history
2. employment
3. property_magnitude
4. job
5. duration
6. credit_amount
7. installment_commitment
8. existing_credits

Based on the above attributes, we can make a decision whether to give credit or not:

checking_status = no checking AND other_payment_plans = none AND credit_history = critical/other existing credit: good
checking_status = no checking AND existing_credits <= 1 AND other_payment_plans = none AND purpose = radio/tv: good

checking_status = no checking AND foreign_worker = yes AND employment = 4<=X<7: good
foreign_worker = no AND personal_status = male single: good
checking_status = no checking AND purpose = used car AND other_payment_plans = none: good
duration <= 15 AND other_parties = guarantor: good
duration <= 11 AND credit_history = critical/other existing credit: good
checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = car: good
checking_status = no checking AND property_magnitude = real estate AND other_payment_plans = none AND age > 23: good
savings_status = >=1000 AND property_magnitude = real estate: good
savings_status = 500<=X<1000 AND employment = >=7: good
credit_history = no credits/all paid AND housing = rent: bad
savings_status = no known savings AND checking_status = 0<=X<200 AND existing_credits > 1: good

checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = life insurance: good
installment_commitment <= 2 AND other_parties = co applicant AND existing_credits > 1: bad
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits > 1 AND residence_since > 1: good
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits <= 1: good
duration > 30 AND savings_status = 100<=X<500: bad
credit_history = all paid AND other_parties = none AND other_payment_plans = bank: bad
duration > 30 AND savings_status = no known savings AND num_dependents > 1: good
duration > 30 AND credit_history = delayed previously: bad
duration > 42 AND savings_status = <100 AND residence_since > 1: bad

". #ne type o$ model that you can create is a %ecision &ree - train a %ecision &ree using the complete dataset as the training data. 'eport the model obtained a$ter training.
A decision tree is a flow-chart-like tree structure where each internal node (non-leaf) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. Decision trees can be easily converted into classification rules. Examples of decision-tree algorithms are ID3, C4.5 and CART.

J48 pruned tree:
1. Using the WEKA tool, we can generate a decision tree by selecting the classify tab.
2. In the classify tab select the Choose option, where a list of different decision trees is available. From that list select J48.
3. Now under test options, select the 'Use training set' option.
4. The resulting window in WEKA is as follows:

5. To generate the decision tree, right-click on the result list and select the 'Visualize tree' option, by which the decision tree will be generated.

6. The obtained decision tree for credit-risk assessment is too large to fit on the screen.

The decision tree above is unclear due to the large number of attributes; a programmatic sketch of the same training run follows below.
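For reference, the same training run can be scripted with the WEKA Java API. This is a sketch under the assumption that the German credit data is stored as german_credit.arff with the class as the last attribute; printing the classifier reproduces the pruned-tree text shown by the Explorer.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;

public class TrainJ48 {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("german_credit.arff"); // assumed path
    data.setClassIndex(data.numAttributes() - 1);           // 'class' is last

    J48 tree = new J48();          // pruned C4.5 tree, WEKA defaults
    tree.buildClassifier(data);    // train on the complete dataset

    System.out.println(tree);      // prints the pruned tree as text
  }
}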

4. Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?
In the above model we trained on the complete dataset and then classified credit good/bad for each of the examples in the dataset. For example: IF purpose = vacation THEN credit = bad; ELSE IF purpose = business THEN credit = good. In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly, and the remaining 14.5% of the examples are incorrectly classified. We can't get 100% training accuracy because, out of the 20 attributes, some are unnecessary attributes which are also analyzed and trained on. Due to this the accuracy is affected, and hence we can't get 100% training accuracy.

5. Is testing on the training set as you did above a good idea? Why or why not?
It is a bad idea: if we take all the data into the training set, then how do we test whether the classification is correct or not? As a rule of thumb, for maximum reliability we should take 2/3 of the dataset as the training set and the remaining 1/3 as the test set. But here, in the above model, we have taken the complete dataset as the training set, which results in only 85.5% accuracy. This comes from analyzing and training on unnecessary attributes which do not play a crucial role in credit-risk assessment; the complexity increases, and finally it leads to lower accuracy. If some part of the dataset is used as a training set and the remainder as a test set, it leads to more accurate results and the time for computation will be less. This is why we prefer not to take the complete dataset as the training set.

Use Training Set result for the table GermanCreditData:

Correctly Classified Instances     855      85.5 %
Incorrectly Classified Instances   145      14.5 %
Kappa statistic                    0.6251
Mean absolute error                0.231
Root mean squared error            0.34
Relative absolute error            55.0377 %
Root relative squared error        74.2015 %
Total Number of Instances          1000
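Both evaluations discussed above can be reproduced with the WEKA API. The sketch below (same assumed ARFF file) first tests on the training data itself and then uses the 2/3-1/3 split described above; the held-out estimate is normally noticeably lower than the training-set figure.

import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;

public class TrainVsHoldout {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("german_credit.arff"); // assumed path
    data.setClassIndex(data.numAttributes() - 1);

    // (a) Testing on the training set: optimistic, about 85.5% here.
    J48 tree = new J48();
    tree.buildClassifier(data);
    Evaluation onTrain = new Evaluation(data);
    onTrain.evaluateModel(tree, data);
    System.out.println(onTrain.toSummaryString("--- On training set ---", false));

    // (b) 2/3 train, 1/3 test after shuffling: a more honest estimate.
    data.randomize(new Random(1));
    int trainSize = (int) Math.round(data.numInstances() * 2.0 / 3.0);
    Instances train = new Instances(data, 0, trainSize);
    Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);
    J48 tree2 = new J48();
    tree2.buildClassifier(train);
    Evaluation held = new Evaluation(train);
    held.evaluateModel(tree2, test);
    System.out.println(held.toSummaryString("--- On held-out third ---", false));
  }
}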

6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe briefly what cross-validation is. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase/decrease? Why?
Cross-validation: In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or 'folds' D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model.

That is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set in order to obtain the first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on.

1. Select the classify tab and the J48 decision tree, and under test options select the cross-validation radio button with the number of folds set to 10.
2. The number of folds indicates the number of partitions of the dataset.
3. A Kappa statistic nearing 1 indicates 100% accuracy, in which case all the errors would be zeroed out; but in reality there is no training set that gives 100% accuracy. A programmatic equivalent of this run is sketched below.
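The 10-fold run can be scripted as follows (file path assumed as before); Evaluation.crossValidateModel performs the partitioning, training and testing internally.

import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;

public class CrossValidateJ48 {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("german_credit.arff"); // assumed path
    data.setClassIndex(data.numAttributes() - 1);

    Evaluation eval = new Evaluation(data);
    // 10 folds: each of the 1000 instances is tested exactly once.
    eval.crossValidateModel(new J48(), data, 10, new Random(1));

    System.out.println(eval.toSummaryString());
    System.out.println("Kappa: " + eval.kappa());
  }
}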

Cross-Validation Result at folds: 10 for the table GermanCreditData:
Correctly Classified Instances     705      70.5 %
Incorrectly Classified Instances   295      29.5 %
Kappa statistic                    0.2467
Mean absolute error                0.3467
Root mean squared error            0.4796
Relative absolute error            82.5233 %
Root relative squared error        104.6565 %
Total Number of Instances          1000

Here there are 1000 instances, with 100 instances per partition.

Cross-Validation Result at folds: 20 for the table GermanCreditData:


Correctly Classified Instances     698      69.8 %
Incorrectly Classified Instances   302      30.2 %
Kappa statistic                    0.2264
Mean absolute error                0.3571
Root mean squared error            0.4883
Relative absolute error            85.0006 %
Root relative squared error        106.5538 %
Total Number of Instances          1000

Cross-Validation Result at folds: 50 for the table GermanCreditData:


Correctly Classified Instances     709      70.9 %
Incorrectly Classified Instances   291      29.1 %
Kappa statistic                    0.2538
Mean absolute error                0.3484
Root mean squared error            0.4825
Relative absolute error            82.9304 %
Root relative squared error        105.2826 %
Total Number of Instances          1000

Cross-Validation Result at folds: 100 for the table GermanCreditData:


Correctly Classified Instances     710      71 %
Incorrectly Classified Instances   290      29 %
Kappa statistic                    0.2587
Mean absolute error                0.3444
Root mean squared error            0.4771
Relative absolute error            81.959 %
Root relative squared error        104.1164 %
Total Number of Instances          1000

Percentage split does not allow 100%; it allows only up to 99.9%.

Percentage Split Result at 50%:


Correctly Classified Instances     362      72.4 %
Incorrectly Classified Instances   138      27.6 %
Kappa statistic                    0.2725
Mean absolute error                0.3225
Root mean squared error            0.4764
Relative absolute error            76.3523 %
Root relative squared error        106.4373 %
Total Number of Instances          500

Percentage Split Result at 99.9%:

Correctly Classified Instances     0        0 %
Incorrectly Classified Instances   1        100 %
Kappa statistic                    0
Mean absolute error                0.6667
Root mean squared error            0.6667
Relative absolute error            221.7054 %
Root relative squared error        221.7054 %
Total Number of Instances          1

7. Check to see if the data shows a bias against 'foreign workers' (attribute 20) or 'personal-status' (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full dataset case which you have already done. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss.
This increases the accuracy, because the two attributes 'foreign worker' and 'personal status' are not very important for training and analysis. By removing them, the training time is reduced to some extent, and the accuracy increases. The decision tree created from the full dataset is very large compared to the decision tree we have trained now; this is the main difference between the two decision trees.

After 'foreign worker' is removed, the accuracy increases to 85.9%.

If we also remove the 9th attribute ('personal status'), the accuracy further increases to 86.6%, which shows that these two attributes are not significant for training. (A scripted version of this experiment follows below.)
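The removal can be scripted as follows; a sketch assuming personal_status is attribute 9 and foreign_worker attribute 20 (1-based indices, as shown in the Explorer's preprocess tab).

import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DropAttributes {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("german_credit.arff"); // assumed path
    data.setClassIndex(data.numAttributes() - 1);

    Remove remove = new Remove();
    remove.setAttributeIndices("9,20");   // personal_status, foreign_worker
    remove.setInputFormat(data);
    Instances reduced = Filter.useFilter(data, remove);
    reduced.setClassIndex(reduced.numAttributes() - 1);

    // Re-evaluate on the reduced dataset to compare with the full one.
    Evaluation eval = new Evaluation(reduced);
    eval.crossValidateModel(new J48(), reduced, 10, new Random(1));
    System.out.println(eval.toSummaryString());
  }
}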

Cross-validation after removing the 9th attribute:

Percentage split after removing the 9th attribute:

After removing the 20th attribute, the cross-validation result is as above.

After removing the 20th attribute, the percentage-split result is as above.

8. Another question might be: do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You had removed two attributes in problem 7. Remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.)
Select attributes 2, 3, 5, 7, 10, 17 and 21 and click on 'Invert' to remove the remaining attributes.

Here the accuracy decreases. Select random attributes and then check the accuracy.

After removing attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the left-over attributes and visualize them.

After we remove 14 attributes, the accuracy decreases to 76.4%; hence we can further try random combinations of attributes to increase the accuracy. Cross-validation:

Percentage split:
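Instead of trying attribute combinations by hand, WEKA can also search for a promising subset automatically. The sketch below uses the supervised AttributeSelection filter with its defaults (CfsSubsetEval with BestFirst search); the file path is an assumption.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class AutoSelect {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("german_credit.arff"); // assumed path
    data.setClassIndex(data.numAttributes() - 1);

    // Defaults: CfsSubsetEval evaluator with BestFirst search.
    AttributeSelection select = new AttributeSelection();
    select.setInputFormat(data);
    Instances reduced = Filter.useFilter(data, select);

    System.out.println("Kept " + (reduced.numAttributes() - 1)
        + " of " + (data.numAttributes() - 1) + " predictor attributes:");
    for (int i = 0; i < reduced.numAttributes(); i++) {
      System.out.println("  " + reduced.attribute(i).name());
    }
  }
}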

9. Sometimes, the cost of rejecting an applicant who actually has good credit (case 1) might be higher than accepting an applicant who actually has bad credit (case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA. Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?
In problem 6 we used equal costs when training the decision tree. But here we consider two cases with different costs: let us take cost 5 in case 1 and cost 2 in case 2. When we give such costs in the two cases and train the decision tree again, we can observe that the tree is almost equal to the one obtained in problem 6.

                Case 1 (cost 5)   Case 2 (cost 2)
Total cost          3820              1705
Average cost        3.82              1.705

We do not find this cost factor in problem 6, as there we used equal costs. This is the major difference between the results of problem 6 and problem 9. The cost matrices we used here (rows = actual class, columns = predicted class):

Case 1:   0  5        Case 2:   0  2
          1  0                  1  0

1. Select the classify tab.
2. Select 'More options...' under test options.

3. Tick 'Cost-sensitive evaluation' and click on 'Set...'.

4. Set the number of classes to 2.
5. Click on 'Resize' and then we get the cost matrix.
6. Then change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
7. Then the confusion matrix will be generated, and you can find out the difference between the good and bad classes.
8. Check whether the accuracy changes or not. (A sketch that makes the training itself cost-sensitive follows below.)
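The steps above make the evaluation cost-sensitive; to make the training itself cost-sensitive, WEKA's CostSensitiveClassifier can wrap J48. The following is a sketch for case 1 (cost 5 for classifying an actually bad customer as good), assuming row = actual class and column = predicted class, with good as class 0; method names are as in WEKA 3.x and should be treated as an illustration rather than the exact run used above.

import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;

public class CostSensitive {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("german_credit.arff"); // assumed path
    data.setClassIndex(data.numAttributes() - 1);

    // Rows = actual, columns = predicted; class 0 = good, class 1 = bad.
    CostMatrix costs = new CostMatrix(2);
    costs.setCell(1, 0, 5.0);  // actual bad, predicted good: cost 5 (case 1)
    costs.setCell(0, 1, 1.0);  // actual good, predicted bad: cost 1

    CostSensitiveClassifier csc = new CostSensitiveClassifier();
    csc.setClassifier(new J48());
    csc.setCostMatrix(costs);

    // Cost-aware cross-validation: reports total and average cost.
    Evaluation eval = new Evaluation(data, costs);
    eval.crossValidateModel(csc, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
    System.out.println("Total cost:   " + eval.totalCost());
    System.out.println("Average cost: " + eval.avgCost());
  }
}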

10. Do you think it is a good idea to prefer simple decision trees instead of having long complex decision trees? How does the complexity of a Decision Tree relate to the bias of the model?
When we consider long complex decision trees, we will have many unnecessary attributes in the tree, which results in an increase of the bias of the model. Because of this, the accuracy of the model can also be affected. This problem can be reduced by considering a simple decision tree: the attributes will be fewer, which decreases the bias of the model, and due to this the result will be more accurate. So it is a good idea to prefer simple decision trees instead of long complex trees.

1. Open any existing ARFF file, e.g. labour.arff.
2. In the preprocess tab, select ALL to select all the attributes.
3. Go to the classify tab and then use the training set with the J48 algorithm.

4. To generate the decision tree, right-click on the result list and select the 'Visualize tree' option, by which the decision tree will be generated.

5. Right-click on the J48 algorithm to get the Generic Object Editor window.
6. In this, set the 'unpruned' option to true.
7. Then press OK and then Start. We find the tree becomes more complex if not pruned; a sketch that measures this follows below.
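The difference in complexity can also be measured directly; a sketch comparing the default pruned tree with the unpruned one (labour.arff path assumed; measureTreeSize is J48's built-in node-count measure).

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;

public class PrunedVsUnpruned {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("labour.arff"); // assumed path
    data.setClassIndex(data.numAttributes() - 1);

    J48 pruned = new J48();            // default: pruning on
    pruned.buildClassifier(data);

    J48 unpruned = new J48();
    unpruned.setUnpruned(true);        // same as the GUI option above
    unpruned.buildClassifier(data);

    System.out.println("Pruned tree size:   " + pruned.measureTreeSize());
    System.out.println("Unpruned tree size: " + unpruned.measureTreeSize());
  }
}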

Visualize tree:

8. The tree has become more complex.

11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning - explain this idea briefly. Try reduced error pruning for training your Decision Trees using cross-validation (you can do this in WEKA) and report the Decision Tree you obtain. Also, report your accuracy using the pruned model. Does your accuracy increase?
Reduced-error pruning: The idea of using a separate pruning set for pruning (which is applicable to decision trees as well as rule sets) is called reduced-error pruning. The variant described previously prunes a rule immediately after it has been grown and is called incremental reduced-error pruning. Another possibility is to build a full, unpruned rule set first, pruning it afterwards by discarding individual tests. However, this method is much slower. Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from other classes if it were the only rule in the theory, operating under the closed-world assumption. If it gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total of T instances altogether, then it gets p positive instances right. The instances that it does not cover include N - n negative ones, where n = t - p is the number of negative instances that the rule covers and N = T - P is the total number of negative instances. Thus the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated on the test set, has been used to evaluate the success of a rule when using reduced-error pruning.

1. Right-click on the J48 algorithm to get the Generic Object Editor window.
2. In this, set the 'reducedErrorPruning' option to true (leave 'unpruned' false, since an unpruned tree is not pruned at all).
3. Then press OK and then Start.
4. We find that the accuracy has been increased by selecting the reduced-error pruning option.
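In the API the same option is a single flag on J48; the sketch below (file path assumed) cross-validates the reduced-error-pruned tree so its accuracy can be compared with the default tree from problem 6.

import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;

public class ReducedErrorPruningRun {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("german_credit.arff"); // assumed path
    data.setClassIndex(data.numAttributes() - 1);

    J48 rep = new J48();
    rep.setReducedErrorPruning(true);  // hold back part of the training data
    rep.setNumFolds(3);                // default: one third kept for pruning

    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(rep, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
  }
}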

12. (Extra Credit): How can you convert a Decision Tree into 'if-then-else rules'?

Make up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules - one such classifier in WEKA is rules.PART; train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.
In WEKA, rules.PART is one of the classifiers which outputs the model as 'IF-THEN-ELSE' rules.

Converting decision trees into 'IF-THEN-ELSE' rules using the rules.PART classifier:

PART decision list
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules: 4

Yes, sometimes just one attribute can be good enough in making the decision. In this dataset (weather), the single attribute for making the decision is 'outlook':

outlook:
  sunny    -> no
  overcast -> yes
  rainy    -> yes
(10/14 instances correct)

With respect to time, the OneR classifier has the highest ranking, J48 is in 2nd place and PART gets 3rd place.

          TIME (sec)   RANK
J48       0.12         II
PART      0.14         III
OneR      0.04         I

But if you consider the accuracy, the J48 classifier has the highest ranking, PART gets second place and OneR gets last place:

          ACCURACY (%)
J48       70.5
PART      70.2
OneR      66.8

1. Open the existing file weather.nominal.arff.
2. Select ALL.

3. Go to classify.
4. Click Start.

Here the accuracy is 100%.

The tree corresponds to 'if-then-else' rules:

If outlook = overcast then play = yes
If outlook = sunny and humidity = high then play = no, else play = yes
If outlook = rainy and windy = true then play = no, else play = yes

To obtain the rules:

1. Go to Choose, then click on Rules and select PART.
2. Click on Save and Start.
3. Similarly for the OneR algorithm. (A programmatic version is sketched below.)
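The same two rule learners can be trained in code as well; a sketch on the weather data (file path assumed), printing each model's rule set.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;

public class RuleLearners {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("weather.nominal.arff"); // assumed path
    data.setClassIndex(data.numAttributes() - 1);

    PART part = new PART();            // rules from partial C4.5 trees
    part.buildClassifier(data);
    System.out.println(part);          // prints the decision list

    OneR oneR = new OneR();            // single-attribute rule
    oneR.buildClassifier(data);
    System.out.println(oneR);          // e.g. a rule on 'outlook'
  }
}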

If outlook = overcast then play = yes
If outlook = sunny and humidity = high then play = no
If outlook = sunny and humidity = low then play = yes
