
Short manual for WEKA

This document is a short manual for installing the Machine Learning tool WEKA and using its
basic functionality. It covers only a small part of what WEKA can do. More information can be
found on the internet.

Installing and running WEKA
Download the appropriate WEKA software from the WEKA webpage and install it on
your laptop or desktop.

Mac users
If you use a Mac, you also have to copy the folder weka-X-X-XX to the
“Applications” folder.
You can run WEKA by (double) clicking the WEKA icon.

Figure 1: WEKA startup screen: the Weka GUI Chooser

WEKA data format
Before using WEKA one should know a little about arff, the data format used by WEKA.
An arff file is a comma-separated text file. The header defines the relation (the name of
the data) together with the attributes (also called features). The last attribute is
the class attribute and defines the class of each data sample. For an example see
Figure 2.
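To make the format concrete, the following is a minimal Python sketch of how such a file could be read. It is not WEKA's own parser: the function name parse_arff is hypothetical, the two data rows are illustrative, and the sketch assumes a simple file without quoted values, sparse rows, or missing values.

```python
def parse_arff(text):
    """Parse a simple ARFF document into (relation, attributes, rows).

    Simplified sketch: no quoted values, sparse rows, or missing values.
    """
    relation = None
    attributes = []   # (name, type) pairs in declaration order
    rows = []         # each row is a list of strings
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# Illustrative input: the iris header with two example data rows.
sample = """\
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""

relation, attributes, rows = parse_arff(sample)
class_name = attributes[-1][0]          # the last attribute is the class
labels = [row[-1] for row in rows]      # class label of each data sample
print(relation, class_name, labels)
```

The sketch follows the convention described above: the header is read until @DATA, and the last declared attribute is treated as the class.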

    @RELATION iris

    @ATTRIBUTE sepallength REAL
    @ATTRIBUTE sepalwidth REAL
    @ATTRIBUTE petallength REAL
    @ATTRIBUTE petalwidth REAL
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

    @DATA
    (data rows, one sample per line: four comma-separated numeric values followed by the class label, for example Iris-setosa)

Figure 2: WEKA arff format

Explorer
One of the basic applications to explore the data and classifiers is the Explorer. Click on the “Explorer” button in the “Weka GUI Chooser” window, see Figure 1. Then the following screen will appear:

Figure 3: WEKA Explorer window

Loading data
Click on the “Open file…” button and select the arff data file you want to analyze. For this manual we select the iris.arff file which can be found in the Weka data folder. On my Windows machine this is the folder “C:\Program Files (x86)\Weka-3-6\data”. After opening the file the following screen appears, see Figure 4.

Figure 4: Initial Explorer screen Iris data file

Exploring attributes (aka features)
In the bottom left window you see the different attributes. For the selected attribute sepalwidth you see on the right some statistics and the class labels for different values of the attribute. The visualization on the bottom right gives an indication of the discriminating power of the selected attribute. For an overview of all attributes click the “Visualize all” button.

Exercise 1
Which of the attributes (except class) has the most discriminating power?

Most of the time a single attribute does not have much discriminating power. To explore the discriminating power of two attributes click on the “Visualize” tab of the Explorer screen. The following window should appear.

Figure 5: Data visualization.

Clicking on a sub-window in the top window will enlarge it.

Exercise 2
Which two attributes together have the most discriminating power and why?

It follows from the data visualizations that the class Iris-setosa is very different (can be discriminated/separated very well) from the other two classes, but there is some mixing of (confusion between) the classes Iris-versicolor and Iris-virginica. You should keep this in mind when applying classifiers to this problem.

Exploring classifiers
One of the basic classifiers we consider in the course is the Decision Tree. In the Explorer window click on the “Classify” tab, see Figure 4. The following window should appear.

Figure 6: Exploring classifiers

Click on the “Choose” button and select the classifier you want to explore, in this case J48, which is in the “trees” subfolder. Under “Test options” you see different options to evaluate the performance of the decision tree classifier J48. These options will be explained in the Machine Learning lectures. For this manual we go for the standard option, 10-fold cross validation. Above the “Start” button the attribute which is the class attribute is selected, in this case the attribute “class”. Click on the “Start” button and WEKA starts to evaluate the decision tree classifier J48 on the data using the 10-fold cross validation evaluation strategy. When the classifier evaluation is finished the right window “Classifier output” is filled with evaluation metrics, see also Figure 7.
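For readers who want a feel for what 10-fold cross validation does before the lectures, here is a minimal Python sketch of the splitting step only. It is an illustration under simple assumptions, not WEKA's implementation: the function name k_fold_indices and the seed value are hypothetical, and the folds are plain random folds without balancing the classes per fold.

```python
import random


def k_fold_indices(n_samples, k=10, seed=1):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k folds of near-equal size
    for i in range(k):
        test_idx = folds[i]
        # All samples outside fold i are used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# The iris data has 150 samples: each sample is tested exactly once.
splits = list(k_fold_indices(150, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each of the 10 rounds trains on 135 samples and tests on the remaining 15. The reported performance is then computed over the predictions of all rounds together.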

Figure 7: Evaluation metrics

Important aspects of this evaluation are the performance (correctly classified instances), 96% in this case, the precision and recall for the different classes, and the confusion matrix. In this confusion matrix the rows are the actual classes and the columns are the classes given by the classifier. For example, 49 instances of Iris-setosa are classified as Iris-setosa and 1 instance of Iris-setosa is classified as Iris-versicolor.
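The relation between the confusion matrix and the other metrics can be sketched in a few lines of Python. Only the first row of the matrix below (49, 1, 0) is taken from the text. The other two rows are assumed values, chosen so that the overall performance matches the 96% mentioned above, and may differ from Figure 7.

```python
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows are the actual classes, columns the classes given by the classifier.
confusion = [
    [49, 1, 0],    # from the text
    [0, 47, 3],    # assumed for illustration
    [0, 2, 48],    # assumed for illustration
]


def accuracy(cm):
    """Fraction of correctly classified instances (the diagonal)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def precision(cm, c):
    """Of all instances classified as class c, the fraction that really is c."""
    predicted = sum(row[c] for row in cm)
    return cm[c][c] / predicted if predicted else 0.0


def recall(cm, c):
    """Of all instances that really are class c, the fraction classified as c."""
    actual = sum(cm[c])
    return cm[c][c] / actual if actual else 0.0


print(f"accuracy = {accuracy(confusion):.2f}")
for c, name in enumerate(labels):
    print(f"{name}: precision = {precision(confusion, c):.3f}, "
          f"recall = {recall(confusion, c):.3f}")
```

Precision is computed per column and recall per row, which is why the orientation of the confusion matrix matters when reading it.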

Exercise 3
Is the confusion in line with what you observed in the data visualization concerning the separation of the classes? Explain why!

Exercise 4
Apply a Naïve Bayes classifier (in the subfolder “bayes”) to the above data and give the performance and the confusion matrix. Is there a difference between the two classifiers? (Hint: look at the confusion matrix.) Which classifier is better, the decision tree J48 or the Naïve Bayes classifier? Explain why!

One problem with the WEKA Explorer is that you have to remember and repeat all the steps by selecting tabs and clicking buttons. Another option is to use the Knowledge flow functionality of WEKA.

Attribute selection
In some cases, for instance if one wants to classify emails or texts, one has a lot of features. In that case a classifier can benefit from attribute selection. This can be done by manually removing attributes in the “Preprocess” tab, see Figure 4: select the attributes you want to remove and click the “Remove” button. Another option is to apply an automatic feature selection method, which can be done in the “Select attributes” tab.

Figure 8: Attribute selection tab
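Automatic methods of this kind typically score each attribute by how much it tells us about the class. As a rough illustration (not WEKA's implementation), the following Python sketch computes two such scores, information gain and gain ratio, for nominal attributes. The toy data and attribute names are made up, and numeric attributes such as those in the iris data would first have to be turned into nominal ones, which the sketch does not do.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def info_gain(attribute, labels):
    """Reduction in class entropy after splitting on the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(attribute, labels):
    """Information gain divided by the entropy of the attribute itself."""
    split_info = entropy(attribute)
    return info_gain(attribute, labels) / split_info if split_info else 0.0


# Made-up toy data: "width" separates one class, "noise" separates nothing.
species = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
toy_attributes = {
    "width": ["narrow", "narrow", "wide", "wide", "wide", "wide"],
    "noise": ["a", "b", "a", "b", "a", "b"],
}

# Rank the attributes from most to least informative.
ranking = sorted(toy_attributes,
                 key=lambda name: gain_ratio(toy_attributes[name], species),
                 reverse=True)
for name in ranking:
    values = toy_attributes[name]
    print(name, round(info_gain(values, species), 3),
          round(gain_ratio(values, species), 3))
```

An attribute that separates the classes well gets a high score, while an attribute that is unrelated to the class scores close to zero.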

Choose “GainRatioAttributeEval” as Attribute Evaluator. It evaluates attributes based on their gain ratio, a normalized form of information gain. This will lead to using “Ranker” as search method. Pushing the “Start” button will lead to the following output:

Figure 9: Attributes ranked based on Information Gain

So attribute 4 “petalwidth” has the highest score, 0.871, and attribute 2 “sepalwidth” the lowest.

Exercise 5
Based on this ranking, what would be the best 2 attributes to select? Compare this with the decision tree of Figure 7. What do you observe?

There are many more options to select attributes but we will not discuss them in this manual.

More information
More information on WEKA and the Explorer can be found on the WekaMOOC:
http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
Especially Class 1 is an introduction to the Explorer.

Knowledge flow
In order to start the Knowledge flow click the “KnowledgeFlow” button in the Weka GUI Chooser. The following window should appear.

Figure 10: Weka Knowledge Flow window

With the Knowledge Flow one can design and store Machine Learning pipelines which can be used later on. The main part is the canvas on which the pipeline can be constructed by using “drag and drop” and mouse clicks.

Arff Loader
The first component of the ML pipeline is the “ArffLoader”. Double click on the “ArffLoader” button and afterwards click on the canvas. The “ArffLoader” will appear on the canvas. Right click on the “ArffLoader” icon on the canvas, select “Configure…”, and select the Iris Flower data file “iris.arff” from the Weka data folder.

Class assigner
Under the “Evaluation” tab select “ClassAssigner” and put it on the canvas, to the right of the “ArffLoader”. Right click on “ArffLoader” and select “dataSet”. Connect the arrow to the “ClassAssigner” by moving the cursor to the “ClassAssigner” and then left clicking. Next right click the “ClassAssigner”, select “Configure…” and select “(Nom)class” as class label. The canvas should look similar to the canvas in Figure 11.

Figure 11: Look of the canvas

Cross Validation
Next put the “CrossValidationFoldMaker” on the canvas. Right click “ClassAssigner”, select “dataSet” and connect it to the “CrossValidationFoldMaker”. By right clicking the “CrossValidationFoldMaker” and selecting “Configure…” one can set the folds and the seed. We leave them as they are.

Classification
Next select your favorite classifier from the “Classifiers” tab, for instance “J48”, and put it on the canvas. Connect the training set and the test set from the “CrossValidationFoldMaker” to the classifier.

Evaluation
From the “Evaluation” tab select “ClassifierPerformanceEvaluator”, put it on the canvas and connect the classifier to it by connecting “batchClassifier” to the “ClassifierPerformanceEvaluator”. From the “Visualization” tab select “TextViewer”, put it on the canvas and connect the “ClassifierPerformanceEvaluator” to it by selecting “text”. The complete pipeline should look similar to the pipeline in Figure 12.

Figure 12: Complete pipeline.

Now one can start the pipeline by right clicking the “ArffLoader” and selecting “Start loading”. When the pipeline is finished one can view the results by right clicking the “TextViewer” and selecting “Show results”.

Saving
The pipeline can be saved for later use by clicking the save button at the top left of the window.

Extensions
The pipeline can be extended in many ways, see for instance Figure 13.

Figure 13: Extended pipeline, which can be used to compare a Naïve Bayes classifier with a decision tree.