
SYRACUSE UNIVERSITY

INTRODUCE DATA MINING WITH RAPIDMINER

Clustering, Classification and Association Rules
By Huang, Huaming; Wu, Ge

Introduce to Data Mining with RapidMiner, 2008, Syracuse University, EECS

MENU
Abstract
1. Introduction to RAPIDMINER
   1) Introduction
   2) Preparation
2. Clustering
   1) Clustering on iris datasets with class
   2) Clustering on iris datasets without class labels
3. Classification tree
   1) Classify using WJ48 operator in RapidMiner
   2) Classify using DecisionTree operator in RapidMiner
4. Association Rules
5. Reference Books

Author:
Chapter 1, 2: Huang, Huaming
Chapter 3, 4, Chart: Wu, Ge

Abstract:
In this project, we perform classification, clustering, and association rule mining in RapidMiner, and introduce how to use RapidMiner for these tasks. At the same time, we use the tool to analyze the iris and diabetes datasets.
Data mining is increasingly important in the information industry and in society. It affects almost all aspects of our lives, from market analysis, fraud detection, and customer retention to production control and science exploration.
Data mining refers to extracting or mining knowledge from large amounts of data. [Reference from Morgan Kaufmann, Data Mining: Concepts and Techniques, 2nd ed.]
There are also many DM tools for data mining, such as SAS, WEKA, MineSet and RapidMiner. We focus on RapidMiner in this report.


1. Introduction to RAPIDMINER
1) Introduction:
Introduction to RapidMiner (from www.rapidminer.com):

RapidMiner (formerly YALE) is the worldwide leading open-source data mining solution due to the combination of its leading-edge technologies and its functional range. Applications of RapidMiner cover a wide range of real-world data mining tasks.

Use RapidMiner and explore your data! Simplify the construction of experiments and the evaluation of different approaches. Try to find the best combination of preprocessing and learning steps or let RapidMiner do that automatically for you.

Feature
The modular operator concept of RapidMiner (formerly YALE) allows the design of complex nested operator chains for a huge number of learning problems in a very fast and efficient way (rapid prototyping). The data handling is transparent to the operators. They do not have to cope with the actual data format or different data views; the RapidMiner core takes care of all necessary transformations.

Operator Overview
RapidMiner (formerly YALE) and its plugins provide more than 400 operators for all aspects of data mining. Meta operators automatically optimize the experiment designs, so users no longer need to tune single steps or parameters. A large number of visualization techniques and the possibility to place breakpoints after each operator give insight into the success of your design, even online for running experiments.

RapidMiner download link: http://rapidi.com/content/view/26/82/
RapidMiner Installation Guide: http://rapidi.com/content/view/17/40/
RapidMiner Tutorial: http://sourceforge.net/project/downloading.php?groupname=yale&filename=rapidminer4.0tutorial.pdf&use_mirror=internap
RapidMiner GUI Manual: http://downloads.sourceforge.net/yale/rapidminer4.0guimanual.pdf

2) Preparation
RapidMiner supports database management systems like Oracle, SQL Server, PostgreSQL and MySQL. It also supports many file formats, including popular ones such as ARFF, Excel and CSV.
When we begin our process, the first step is to open the data file we need. After opening RapidMiner, we see the following interface:

RapidMiner has two different interfaces: one is the edit mode, used to configure the data mining process, and the other is the result mode, used to analyze the results. The user can switch between these two interfaces at any time with the two buttons below:

One button is for edit mode; the other is for result mode.

Now, let us begin our clustering process.
Click the new-project button in the main menu to create a new project. After the project is created, you see the following page.


We created the project but currently do not have any data source, so the next step is to link a data source to the project. On the left side, right-click the Root icon and choose the ArffExampleSource menu to select the input dataset, as below.

We need to set the data source location for ArffExampleSource: click the button to select the data file location on your computer.


RapidMiner displays the path of the data file on the parameters page of ArffExampleSource.

Please note that the label_attribute parameter below data_file is a special parameter: the data field you specify there is treated as the label, so RapidMiner ignores it as an input field.

2. Clustering
1) Clustering on iris datasets with class
Now, we leave the label_attribute parameter of ArffExampleSource blank, which makes the tool consider all data of the iris dataset, including the class labels, when clustering.
Right-click the Root menu on the left side and select the simple k-means method to do the clustering.

RapidMiner provides more than 17 categories of clustering methods for different kinds of clustering. We choose WSimpleKMeans here.

WSimpleKMeans has two parameters: N, the number of clusters (real type, default 2), and S, the random number seed (default 10).
After setting N=3 and S=10, click the run button at the top to run the clustering.
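To make the roles of N and S concrete, here is a minimal sketch of plain k-means in Python. It is not the WSimpleKMeans implementation; the function name `kmeans` and its arguments are hypothetical, with `n_clusters` standing in for N and `seed` for S, and it handles numeric attributes only.

```python
import random


def kmeans(points, n_clusters, seed, max_iter=100):
    """Plain k-means (Lloyd's algorithm) on a list of numeric tuples.

    n_clusters plays the role of N and seed the role of S (assumed mapping).
    """
    rng = random.Random(seed)
    # The seed only decides which points become the initial centroids.
    centroids = rng.sample(points, n_clusters)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(n_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


if __name__ == "__main__":
    demo = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    labels, centers = kmeans(demo, n_clusters=2, seed=10)
    print(labels, centers)
```

A different seed can change which cluster gets which number, and on harder data it can also change the grouping itself, which is why the seed is exposed as a parameter.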

ClusterModel Panel:

RapidMiner shows the clustering results after the process finishes. The ClusterModel panel provides all kinds of information about the cluster groups, including the detailed records of each cluster group and a graph view of the clusters.

RapidMiner shows summary information in the Text View of the ClusterModel panel, such as:

Cluster cluster1 [characterization: cluster1]: 50 items
Cluster cluster0 [characterization: cluster0]: 50 items
Cluster cluster2 [characterization: cluster2]: 50 items
Total number of items: 150

Evidently, because the class label is part of the clustered data, RapidMiner groups the records by class label first, so the three clusters of 50 items mirror the three classes.
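One plausible reading of why the class label dominates is sketched below. It assumes a mixed distance in which numeric attributes are scaled by their range and a nominal mismatch costs a full unit; this is a common convention, but the source does not state which distance WSimpleKMeans uses. The function `mixed_distance` and the `ranges` values are hypothetical and illustrative only.

```python
def mixed_distance(a, b, ranges):
    """Squared distance over mixed numeric and nominal attributes.

    ranges[i] is the (assumed) value range of a numeric attribute,
    or None to mark attribute i as nominal.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Nominal attribute: 0 if the values agree, 1 otherwise.
            total += 0.0 if x == y else 1.0
        else:
            # Numeric attribute: difference scaled by the attribute range.
            total += ((x - y) / r) ** 2
    return total


if __name__ == "__main__":
    # Illustrative ranges for four numeric fields plus one nominal class field.
    ranges = [3.6, 2.4, 5.9, 2.4, None]
    p = (6.0, 3.0, 4.8, 1.8, "Iris-versicolor")
    q = (6.0, 3.0, 4.9, 1.8, "Iris-virginica")
    print(mixed_distance(p, q, ranges))                  # class differs
    print(mixed_distance(p, q[:4] + (p[4],), ranges))    # class forced equal
```

Under this assumption two nearly identical flowers of different classes end up far apart, because the single class mismatch outweighs the small numeric differences.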

Users can check the detailed data records of each cluster group in the Folder View.

By selecting cluster1 in the Folder View, users can check the original records of that cluster group. Cluster1 is grouped by the class Iris-setosa.

The Graph View of the ClusterModel panel shows the whole picture of the clustering and visualizes the relationships among all clusters, as below:


Besides the tree view, RapidMiner also provides Balloon, KKLayout, FRLayout, ISOM, Circle and Spring views for the cluster results. Use the selector marked by the red circle above to choose a different view.

When a cluster is clicked in the tree view, all data records of that cluster appear on the right side of the panel.

ExampleSet Panel:
The ExampleSet panel provides a data view for users, which is different from the ClusterModel panel. The data view focuses on the data points of the clustering. It has three kinds of view: Meta Data View, Data View and Plot View.
First, the Meta Data View shows the data types, value types and statistics of the dataset.

The raw iris data has five fields: sepallength, sepalwidth, petallength, petalwidth and class. After clustering, RapidMiner adds one cluster label field named cluster.

Second, the Data View displays the raw data records of the data source. You can also filter the dataset:

The last view is the Plot View, which shows charts of the clustered data points. In the Plot View, you can choose different chart types such as Scatter, Scatter Matrix, Scatter 3D Color and Bubble. Each chart type has its own parameters. Different clusters usually have different colors or shapes in the plot view.


The ClusterModel and ExampleSet panels show all the aspects of the clustering that we care about. They are our basic material for further analysis. RapidMiner provides a flexible, easy-to-use, visual tool for data mining.

2) Clustering on iris datasets without class labels
Now, we set the label_attribute parameter of ArffExampleSource to class, which is the field name of the class label in our raw iris data. This setting lets RapidMiner ignore the original class label in the raw data when clustering.
Part of the iris ARFF dataset is shown below:

@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
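
As an illustration of what setting label_attribute to class amounts to, the following sketch reads a small dense ARFF text and separates the label column from the input fields. The function `parse_arff` is hypothetical and covers only the simple layout shown above (no quoting, no sparse rows); it is not how ArffExampleSource is implemented.

```python
def parse_arff(text, label_attribute=None):
    """Parse a small dense ARFF string.

    Returns (attribute names, feature rows, labels). When label_attribute
    is None, all columns stay in the feature rows and labels is None.
    """
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])  # second token is the attribute name
        elif lower.startswith("@data"):
            in_data = True
    if label_attribute is None:
        return names, rows, None
    idx = names.index(label_attribute)
    labels = [r[idx] for r in rows]
    features = [r[:idx] + r[idx + 1:] for r in rows]
    return names[:idx] + names[idx + 1:], features, labels


if __name__ == "__main__":
    sample = """@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,Iris-setosa
"""
    print(parse_arff(sample, label_attribute="class"))
```

With label_attribute left as None every column, including class, remains an input field, which corresponds to the first clustering run.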

When we do the clustering in this case, the results differ from those obtained with the class label included.

The clusters are formed from the combination of the other fields, excluding the class. This time, the cluster sizes are 61, 50 and 39, rather than 50, 50, 50 as before.

In the Data View, data points with the same class are not all grouped into the same cluster this time. See below for details:

The reason is that some points lie on the border between cluster0 and cluster2 and are hard to group based only on the values of sepallength, sepalwidth, petallength and petalwidth. That is why the plot views by class and by cluster above look different.
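
A simple way to check how classes and clusters overlap is a cross-tabulation of the two columns. The sketch below uses a hypothetical helper `crosstab` and made-up toy labels, not the actual iris result.

```python
from collections import Counter


def crosstab(classes, clusters):
    """Count how many points of each class fall into each cluster."""
    return Counter(zip(classes, clusters))


if __name__ == "__main__":
    # Toy labels for illustration only; they are not the iris output.
    classes = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]
    clusters = ["cluster1", "cluster1", "cluster0", "cluster2", "cluster2"]
    for (cls, clu), count in sorted(crosstab(classes, clusters).items()):
        print(f"{cls:12s} {clu:9s} {count}")
```

A class that is spread over more than one cluster in such a table corresponds to the border points described above.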


3. Classification tree
1) Classify using WJ48 operator in RapidMiner:
RapidMiner can build a classification tree like WEKA. We use the IRIS dataset to show how to build a classification tree in RapidMiner.

First of all, we create an ArffExampleSource operator to open the IRIS.arff file.
Secondly, we choose New Operator > Learner > Supervised > Weka > Trees > WJ48 to create a WJ48 operator.

Thirdly, we need to set the parameters for the WJ48 operator. The meaning of each parameter is shown in the following graph.


The meaning of each parameter is shown in the following picture.


The parameters are the same as what we saw in Weka.

Click the Run button to analyze our data and build the decision tree.

We get the decision tree of the IRIS dataset as follows.
There are two ways to show the result: Text View and Graph View. Choose the corresponding icon to display the result.


It represents the concept of the predicted class of an iris plant. That is, it predicts which class a plant is likely to belong to. In the graph, internal nodes are denoted by ovals, and leaf nodes are denoted by rectangles.
For example, Iris-virginica (46.0/1.0) means that 46 points in total fall into this branch, of which only 1 does not match (an error).
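
The leaf notation can be turned into an error rate with a few lines of Python. The sketch below uses a hypothetical helper `parse_leaf` and assumes leaves are written as a label followed by (covered/errors), with the error part omitted when a leaf has no errors.

```python
import re


def parse_leaf(text):
    """Split a leaf such as 'Iris-virginica (46.0/1.0)' into its parts.

    Returns (label, covered, errors); errors is 0.0 when no '/errors'
    part is present (assumed convention).
    """
    m = re.fullmatch(r"\s*(.+?)\s*\(([\d.]+)(?:/([\d.]+))?\)\s*", text)
    if m is None:
        raise ValueError(f"not a leaf annotation: {text!r}")
    label = m.group(1)
    covered = float(m.group(2))
    errors = float(m.group(3) or 0.0)
    return label, covered, errors


if __name__ == "__main__":
    label, covered, errors = parse_leaf("Iris-virginica (46.0/1.0)")
    # Share of the points in this branch that do not match the leaf label.
    print(label, errors / covered)
```

For the example leaf the error share is 1 out of 46, a little over 2%.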

The classification error or performance is shown by another operator of RapidMiner.

2) Classify using DecisionTree operator in RapidMiner:
We can choose New Operator > Learner > Supervised > Trees > DecisionTree to create a DecisionTree operator.


We need to set the parameters for the DecisionTree operator.
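
The parameters are not listed here, but the way a decision tree picks a split can be illustrated with information gain, one common split criterion. This is a sketch under that assumption only; the source does not say which criterion was selected, and the functions `entropy` and `information_gain` are hypothetical helpers.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a numeric field at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    # Weighted entropy of the two branches; empty branches contribute nothing.
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Toy values for illustration: a threshold of 2 separates the classes cleanly.
    values = [1, 2, 3, 4]
    labels = ["a", "a", "b", "b"]
    print(information_gain(values, labels, 2))
    print(information_gain(values, labels, 1))
```

A tree learner using this criterion would prefer the threshold with the larger gain, here the one that separates the two toy classes completely.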

Click the Run button to analyze our data and build the decision tree.
There are a Text View and a Graph View. Choose the corresponding icon to display the result.

In the Graph View mode, each color in a leaf node denotes a value of the label class. As shown above, green denotes Iris-versicolor, blue denotes Iris-setosa, and red denotes Iris-virginica.
If a node has two or three colors in it, the colors show the ratio of errors in that node.
For example, the node pictured means that 25% Iris-versicolor (green) points are classified as Iris-virginica (red) by mistake.

4. Association Rules
RapidMiner can mine association rules like WEKA and MineSet. Because numeric values cannot be used directly for association rules, we use the weather.nominal dataset instead of IRIS to show how to use association rules in RapidMiner.
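
For completeness, numeric fields can be made usable for association rules by turning them into nominal bins first. The sketch below shows equal-width binning with a hypothetical helper `equal_width_bins`; it is one simple option and not something done in this report, which switches datasets instead.

```python
def equal_width_bins(values, labels=("low", "medium", "high")):
    """Map numeric values to nominal labels using equal-width intervals."""
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 when all values are equal, to avoid dividing by zero.
    width = (hi - lo) / len(labels) or 1.0
    out = []
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        i = min(int((v - lo) / width), len(labels) - 1)
        out.append(labels[i])
    return out


if __name__ == "__main__":
    print(equal_width_bins([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))
```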

After creating an ArffExampleSource operator and opening the weather.nominal.arff file, we can choose New Operator > Learner > Unsupervised > Itemsets > Weka > WApriori to create the WApriori operator.

Select the WApriori operator to set its parameters.


Click the Run button to analyze our data. The result is shown as follows.

The result of the association rule analysis:

WApriori
Apriori
=======

Minimum support: 0.1 (1 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 18

Generated sets of large itemsets:

Size of set of large itemsets L(1): 7

Size of set of large itemsets L(2): 16

Size of set of large itemsets L(3): 13

Size of set of large itemsets L(4): 3

Best rules found:

 1. outlook=rainy play=no 2 ==> windy=TRUE 2    conf:(1)
 2. outlook=rainy windy=TRUE 2 ==> play=no 2    conf:(1)
 3. temperature=hot play=no 2 ==> humidity=high 2    conf:(1)
 4. outlook=overcast temperature=cool 1 ==> windy=TRUE 1    conf:(1)
 5. temperature=cool play=no 1 ==> outlook=rainy 1    conf:(1)
 6. temperature=hot windy=TRUE 1 ==> humidity=high 1    conf:(1)
 7. temperature=hot windy=TRUE 1 ==> play=no 1    conf:(1)
 8. temperature=cool play=no 1 ==> windy=TRUE 1    conf:(1)
 9. temperature=cool windy=TRUE play=no 1 ==> outlook=rainy 1    conf:(1)
10. outlook=rainy temperature=cool play=no 1 ==> windy=TRUE 1    conf:(1)

The result reveals elements that co-occur frequently within the Weather dataset and shows some rules, such as implications or correlations, that relate the co-occurring elements.
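
The counts and the confidence printed for each rule can be recomputed by hand. The sketch below does this in Python; the 14 rows are the weather.nominal data as commonly distributed with Weka, reproduced here as an assumption, and `matches` and `rule_stats` are hypothetical helper names.

```python
ATTRS = ("outlook", "temperature", "humidity", "windy", "play")

# Assumed contents of weather.nominal (14 instances).
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def matches(row, items):
    """True if the row has every attribute=value pair in `items`."""
    return all(row[ATTRS.index(k)] == v for k, v in items.items())


def rule_stats(rows, premise, consequence):
    """Return (premise count, premise-and-consequence count, confidence)."""
    premise_count = sum(matches(r, premise) for r in rows)
    both_count = sum(matches(r, {**premise, **consequence}) for r in rows)
    confidence = both_count / premise_count if premise_count else 0.0
    return premise_count, both_count, confidence


if __name__ == "__main__":
    # Rule 1 from the listing: outlook=rainy play=no ==> windy=TRUE
    print(rule_stats(WEATHER, {"outlook": "rainy", "play": "no"}, {"windy": "TRUE"}))
```

The two numbers printed after the premise and the consequence in each rule are these counts, and the confidence is their ratio; for rule 1 the sketch gives 2, 2 and a confidence of 1.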

5. Reference Books
RapidMiner Tutorial Guide
RapidMiner GUI Manual
