
ITB term paper

WEKA Techniques
ITB term Paper
Submitted by Swarnabha Shankar Ray [10BM60092]



Data mining is indispensable in today's world, where information plays an important role in shaping business success. But data mining isn't solely the domain of big companies and expensive software: WEKA, a free software package, can perform most data mining activities. WEKA (Waikato Environment for Knowledge Analysis) is a product of the University of Waikato (New Zealand) and was first implemented in its modern form in 1997. It is distributed under the GNU General Public License (GPL). The software is written in Java and contains a GUI for interacting with data files and producing visual results (think tables and curves). WEKA's preferred method for loading data is the Attribute-Relation File Format (ARFF), in which we define the type of data being loaded and then supply the data itself.

When we start WEKA, the GUI Chooser pops up as shown in the figure. It lets us choose among four ways to work with WEKA and our data:
o Explorer
o Experimenter
o Knowledge Flow
o Simple CLI

We click on Explorer. The sample data used in this case is the car data available from http://tunedit.org/search?q=arff in .arff format. An Attribute-Relation File Format (ARFF) file has two distinct sections. The first section is the Header information, which is followed by the Data information. The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types.

Data File



%
% Title: Car Evaluation Database
% Sources: Creator: Marko Bohanec
%
% Relevant Information Paragraph:
% Car Evaluation Database was derived from a simple hierarchical
% decision model originally. The model evaluates cars according to
% the following concept structure:
%
% CAR        car acceptability
% PRICE      overall price
% buying     buying price
% maint      price of the maintenance
% TECH       technical characteristics
% COMFORT    comfort
% doors      number of doors
% persons    capacity in terms of persons to carry
% lug_boot   the size of luggage boot
% safety     estimated safety of the car
%
% Number of Instances: 1728
% Number of Attributes: 6
%
% Attribute Values:
%   buying     v-high, high, med, low
%   maint      v-high, high, med, low
%   doors      2, 3, 4, 5-more
%   persons    2, 4, more
%   lug_boot   small, med, big
%   safety     low, med, high

@relation car
@attribute buying {vhigh,high,med,low}
@attribute maint {vhigh,high,med,low}
@attribute doors {2,3,4,5more}
@attribute persons {2,4,more}
@attribute lug_boot {small,med,big}
@attribute safety {low,med,high}
@attribute class {unacc,acc,good,vgood}
@data
% sample data
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
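To make the two-section layout concrete, here is a minimal Python sketch that splits an ARFF text into its header (relation name and attributes) and its data rows. The helper name `parse_arff` is hypothetical, and the sketch assumes nominal attributes only, as in the car file; it is not WEKA's own loader.

```python
import re

# A shortened copy of the car file: header lines first, then the data rows.
ARFF_TEXT = """\
% Car Evaluation sample (first three rows only)
@relation car
@attribute buying {vhigh,high,med,low}
@attribute maint {vhigh,high,med,low}
@attribute doors {2,3,4,5more}
@attribute persons {2,4,more}
@attribute lug_boot {small,med,big}
@attribute safety {low,med,high}
@attribute class {unacc,acc,good,vgood}
@data
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
"""


def parse_arff(text):
    """Hypothetical helper: parse an ARFF string with nominal attributes only."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append(line.split(","))
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            match = re.match(r"@attribute\s+(\S+)\s+\{(.*)\}", line, re.IGNORECASE)
            name = match.group(1)
            values = [v.strip() for v in match.group(2).split(",")]
            attributes.append((name, values))
        elif lower.startswith("@data"):
            in_data = True  # everything after this line is data
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), len(rows))
```

The header yields seven attributes (six predictors plus the class), and every data row supplies one value per attribute in the same order.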



Loading the file
Select Explorer from the WEKA GUI Chooser. In the Explorer window, select the Preprocess tab, click the Open file button, and browse to the location of the ARFF file to load it.

After selecting the file, the Explorer window looks as below:

The left section of the Explorer window outlines all of the columns in the data (Attributes) and the number of rows of data supplied (Instances). Selecting a column makes the right section show information about the data in that column. For example, selecting the buying column in the left section changes the right section to show additional statistical information about that column:

No.  Label  Count  Weight
1    vhigh  432    432.0
2    high   432    432.0
3    med    432    432.0
4    low    432    432.0

Finally, there's a visual way of examining the data, which can be viewed by clicking the Visualize All button.
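The equal counts of 432 can be checked with a little arithmetic. The sketch below assumes that the data set contains every combination of the six attribute values exactly once, which is consistent with the instance count (4 x 4 x 4 x 3 x 3 x 3 = 1728) but is an assumption rather than something the Explorer output states.

```python
from collections import Counter
from itertools import product

# Attribute domains copied from the ARFF header (class attribute omitted).
DOMAINS = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Assumption: one instance per combination of attribute values.
instances = list(product(*DOMAINS.values()))
buying_counts = Counter(row[0] for row in instances)

print(len(instances))        # 1728
print(dict(buying_counts))   # each buying label appears 432 times
```

Under that assumption each of the four buying labels covers a quarter of the instances, 1728 / 4 = 432, matching the table.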



To create the model, click on the Classify tab. The first step is to select the model we want to build, so WEKA knows how to work with the data and how to create the appropriate model:
o Click the Choose button, then expand the functions branch.
o Select the Logistic leaf.

Logistic*
Logistic regression (sometimes called the logistic model or logit model) is a type of regression analysis used for predicting the outcome of a binary dependent variable (a variable which can take only two possible outcomes, e.g. "yes" vs. "no" or "success" vs. "failure") based on one or more predictor variables.
* Source: http://en.wikipedia.org/wiki/Logistic_regression

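This tells WEKA that we want to build a Logistic model. Although the definition above is stated for a binary outcome, WEKA's Logistic classifier fits a multinomial logistic regression, so it can handle the four-valued class attribute of the car data.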
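As a rough illustration of how a logistic model turns predictor values into class probabilities when there are more than two classes, the following Python sketch uses the general softmax form. The weights, biases and feature encoding are hypothetical values chosen for illustration, and the parameterization is not claimed to match the one WEKA uses internally.

```python
import math

CLASSES = ["unacc", "acc", "good", "vgood"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(features, weights, biases):
    """One linear score per class, followed by softmax."""
    scores = [
        b + sum(w * x for w, x in zip(ws, features))
        for ws, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical model: two numeric features, one weight vector per class.
weights = [[1.5, -0.5], [0.2, 0.3], [-0.4, 0.8], [-1.0, 1.2]]
biases = [0.5, 0.0, -0.2, -0.3]

probs = predict_proba([1.0, 0.0], weights, biases)
predicted = CLASSES[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

The predicted class is simply the one with the highest probability; with only two classes the same computation reduces to the familiar binary logistic function.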

When we have selected the right model, the WEKA Explorer should look as above. Next, we choose one of the four test options. Use training set evaluates the model on the same data used to build it. The other three choices are Supplied test set, where we supply a separate set of data on which to evaluate the model; Cross-validation, where WEKA repeatedly builds the model on all but one subset (fold) of the data, tests it on the held-out fold, and averages the results; and Percentage split, where WEKA builds the model on a given percentage of the data and tests it on the remainder. With Logistic, we can simply choose Use training set.

Finally, the last step in creating our model is to choose the dependent variable (the column we are looking to predict). This should be class, the overall acceptability of the car, since that's what we're trying to determine when purchasing a car. Right below the test options, there's a combo box that lets us choose the dependent variable. The column class, being the last attribute, should be selected by default.

To create our model, click Start.
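The difference between the Percentage split and Cross-validation options is easiest to see in terms of which instances are used for building and which for testing. The Python sketch below partitions instance indices in both ways; the function names, the seed, and the values 66 and 10 are illustrative assumptions, and the folds here are plain random folds with no attempt to balance the classes.

```python
import random


def percentage_split(n_instances, train_percent, seed=1):
    """Shuffle the indices, then cut once into a training and a test part."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_percent / 100)
    return indices[:cut], indices[cut:]


def cross_validation_folds(n_instances, k, seed=1):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train, test = percentage_split(1728, 66)
print(len(train), len(test))  # 1140 588

folds = list(cross_validation_folds(1728, 10))
print(len(folds), sum(len(t) for _, t in folds))  # 10 1728
```

With Use training set, by contrast, all 1728 instances play both roles, which is why results on the training set tend to look more favourable than results on held-out data.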



The figure below shows the output window.

Interpretation of the result


=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        1634      94.5602 %
Incorrectly Classified Instances        94       5.4398 %
Kappa statistic                          0.8819
Mean absolute error                      0.0393
Root mean squared error                  0.1387
Relative absolute error                 17.1827 %
Root relative squared error             41.0335 %
Coverage of cases (0.95 level)          99.7106 %
Mean rel. region size (0.95 level)      30.9606 %
Total Number of Instances             1728

=== Detailed Accuracy By Class ===

Precision  Recall  F-Measure  Class
0.973      0.964   0.968      unacc
0.872      0.901   0.886      acc
0.922      0.855   0.887      good
0.913      0.969   0.94       vgood
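Several of the reported figures follow directly from one another. The short Python sketch below recomputes the accuracy percentages from the instance counts and the F-Measure of each class from its precision and recall, using only numbers taken from the output above.

```python
correct, incorrect = 1634, 94
total = correct + incorrect

accuracy_percent = 100 * correct / total
error_percent = 100 * incorrect / total
print(total, round(accuracy_percent, 4), round(error_percent, 4))


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the detailed accuracy table.
per_class = {
    "unacc": (0.973, 0.964),
    "acc": (0.872, 0.901),
    "good": (0.922, 0.855),
    "vgood": (0.913, 0.969),
}

f_scores = {label: round(f_measure(p, r), 3) for label, (p, r) in per_class.items()}
print(f_scores)
```

The recomputed values agree with the table: 94.5602 % and 5.4398 % for the two accuracy rows, and 0.968, 0.886, 0.887 and 0.94 for the four F-Measures.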

Regression


Regression is the easiest technique to use, but is also probably the least powerful. In effect, regression models all fit the same general pattern: a number of independent variables, taken together, produce a result, the dependent variable. The regression model is then used to predict the unknown value of the dependent variable, given the values of the independent variables. We will perform regression on the Smoking and Cancer data, relating the following attributes:
o Occupational Groups
o Smoking
o Mortality
To create our regression model, we follow the same process as for the previous model to load this data.
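The general pattern of fitting and then predicting can be sketched in a few lines of Python for the simplest case of one independent variable. The numbers below are hypothetical and are not taken from the Smoking and Cancer data; the sketch uses ordinary least squares and makes no claim about how WEKA fits its models internally.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict the dependent variable for x = 5
print(slope, intercept, prediction)
```

With several independent variables the model has one coefficient per variable plus a constant, which is the form of the Mortality equation WEKA reports.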

Selecting the Smoking column in the left section changes the right section to show additional statistical information about that column. As before, the data can also be examined visually by clicking the Visualize All button.

To create the model, click on the Classify tab. The first step is to select the model we want to build, so WEKA knows how to work with the data and how to create the appropriate model:
1. Click the Choose button, then expand the functions branch.
2. Select the LinearRegression leaf.
This tells WEKA that we want to build a regression model.



Next, we set the attribute selection method to No attribute selection, so that the contribution of every attribute to the regression is reported. The last step in creating our model is to choose the dependent variable, which must be a numeric attribute; here we choose Mortality.

Interpretation of the result


Mortality = ... + (10 * Painter & decorator) + (5 * Construction worker) + (2 * Labourers) + (9 * Furnace) + (0 * Smoking) - 51

Interpreting the patterns and conclusions that our model generated, we see that:
o Occupational group type affects smoking habit: WEKA tells us that occupational group affects the smoking tendency among people.
o Smoking affects the overall mortality.

=== Summary ===

Correlation coefficient          1
Relative absolute error          0 %
Root relative squared error      0 %
Total Number of Instances       25
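To show what the three summary figures measure, here is a minimal Python sketch that computes a correlation coefficient, a relative absolute error and a root relative squared error from actual and predicted values. The formulas are the usual textbook definitions, assumed here to correspond to WEKA's summary, and the five values are hypothetical rather than taken from the 25 instances.

```python
import math


def regression_summary(actual, predicted):
    """Return (correlation, relative absolute error %, root relative squared error %)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    correlation = cov / math.sqrt(var_a * var_p)
    # Errors are expressed relative to always predicting the mean of the actual values.
    rae = 100 * sum(abs(p - a) for a, p in zip(actual, predicted)) / sum(
        abs(a - mean_a) for a in actual
    )
    rrse = 100 * math.sqrt(
        sum((p - a) ** 2 for a, p in zip(actual, predicted)) / var_a
    )
    return correlation, rae, rrse


# Hypothetical values; the predictions reproduce the actual values exactly.
actual = [84.0, 116.0, 123.0, 128.0, 155.0]
predicted = list(actual)

correlation, rae, rrse = regression_summary(actual, predicted)
print(round(correlation, 4), rae, rrse)
```

A correlation of 1 together with 0 % for both relative errors therefore means that the model reproduces every value it was evaluated on exactly.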

Thus, WEKA is a very effective tool for data mining that supports several standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection.
