
Lab 12: Introduction to RapidMiner/WEKA

Objective
 Introduction to RapidMiner/WEKA.
 Create a new process in RapidMiner.

Current Lab Learning Outcomes (LLO)

 Familiarity with the RapidMiner software interface
 How to use RapidMiner
 Familiarity with the WEKA software interface
 How to use WEKA

Lab Assessment
The following exercises should be completed at the end of this lab:

1. Download RapidMiner. Do Exercise 1 shown in the description below.

2. Use any preexisting dataset from the WEKA library. Choose any classifier and perform all the
steps mentioned in the description and summarize the following:

o Classifier used
o Test Options used
o Confusion Matrix in the form of a table

Lab Requirements
Technical Requirements (Software: RapidMiner, WEKA).
Lab Description
This lab explains about the RapidMiner and WEKA software used for machine learning.
Students should get familiar with:
 How to use the RapidMiner/WEKA software
 How to execute a RapidMiner/WEKA program
 How to perform linear regression

Perspectives and Views:

 After starting RapidMiner Studio, the list in the center shows the typical actions that you will perform frequently.
 New Process: Opens the design perspective and creates a new analysis process.
 Open: Opens a repository browser in which you can choose and open an existing process in the design perspective.

Figure 01: Perspectives and Views of RapidMiner

Design Perspective:

This is the central RapidMiner Studio perspective where all analysis processes are created, edited
and managed.
Figure 1: Design Perspective

Operators View

All work steps (operators) available in RapidMiner Studio are presented in groups here, from which they can be included in the current process. You can navigate through the groups easily and browse the available operators to your heart's content.

Figure 2: Design Perspective

Repositories View:

The repository is a central component of RapidMiner Studio. It is used for managing and structuring your analysis processes into projects and, at the same time, serves as a source of both data and the associated meta data.

Process View:

The Process View shows the individual steps within the analysis process as well as their
interconnections. New steps can be added to the current process in several ways.

Figure 3: Process View of Rapidminer

Inserting Operators:

You can insert new operators into the process in different ways:
 Via drag & drop from the Operators View as described above,
 Via double click on an operator in the Operators View,

Figure 4: Inserting operators
Parameters View:

After an operator offering parameters has been selected in the Process View, its parameters are shown
in the Parameters View.

Figure 5: Parameter View


Help View:

Each time you select an operator in the Operators View or in the Process View, the help window
within the Help View shows a description of this operator.

Figure 6: Help View

Creating a New Process:

 Expand the group “Utility” in the Operators View.
 Then expand the group “Data Generation”.
 This group includes the operator “Generate Sales Data”.
 Drag this operator onto the white area, and connect the output port of the new operator with the result port.

Figure 7: Creating New Process

Transforming Meta Data:

 RapidMiner Studio can compute the meta data of an operator's or process's output beforehand, even at design time, i.e. without having to load the actual data or execute the process.
 This meta data is typically much less voluminous than the data itself and gives an excellent idea of the characteristics of a particular data set.
 In RapidMiner Studio the meta data is provided at the ports. Hover the cursor over the output port of the recently inserted operator and see what happens.
 If we look at the last two attributes, we see that the amount and the individual price of the objects are given within the transaction, but the associated total turnover is not.

Figure 8: Transforming Meta Data

 Therefore we want to generate a new attribute with the name “total price”, the values of which correspond to the product of amount and single price.
 Go to the group “Data Transformation” => “Attribute Set Reduction and Transformation” => “Generation”.
 Drag the operator in, connect the output port of the data generator with the input port of the new operator, and connect the output port of the latter with the result output of the total process.

Figure 9: Generating new attribute

 Go to the “function descriptions” parameter and enter the desired computation as shown in the figure.

Figure 10: Editing parameter list
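Outside RapidMiner, the same derived-attribute step can be sketched in Python with pandas. The column names and values below are hypothetical stand-ins for the generated sales data, not the actual output of “Generate Sales Data”:

```python
import pandas as pd

# Toy stand-in for the "Generate Sales Data" output (hypothetical values).
sales = pd.DataFrame({
    "amount": [3, 1, 5],
    "single_price": [10.0, 25.0, 4.0],
})

# Equivalent of the function description entered in the Generation operator:
# total price = amount * single price
sales["total_price"] = sales["amount"] * sales["single_price"]

print(sales)
```

In RapidMiner the same multiplication is entered once as a function description and applied to every example in the data set.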

 Open the group “Data Transformation” => “Attribute Set Reduction and Transformation” => “Selection” and drag in the operator named “Select Attributes”.
 Select the new operator and choose the option “subset” for its parameter “attribute filter type”.
 The parameters should look like those in the figure.

Figure 11: Selecting parameters

Execute the Process:

• To execute the process press the large play button in the toolbar of RapidMiner.

Figure 12: Executing the process

EX1: Linear Regression: Estimating/predicting the value of GBP from changes in USD
• Drag and drop “Read CSV” Operator on Process window.
• Use Import Configuration Wizard to import the data.

Figure 13: Reading CSV file

Step 1: Select the data set.


Step 2: As the file is comma separated, select the Comma radio button in the top right column.
Step 3: Show the data.
Step 4: Select the USD and GBP attributes, and set the role of GBP to label.
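The import steps above can be sketched with pandas. The exchange-rate values below are hypothetical; in the lab the data comes from the CSV file selected in the wizard:

```python
import io
import pandas as pd

# Hypothetical exchange-rate data in the comma-separated layout the wizard expects.
csv_text = "USD,GBP\n1.00,0.79\n1.05,0.81\n1.10,0.84\n"

# Comma is the default separator, matching the Comma radio button in Step 2.
rates = pd.read_csv(io.StringIO(csv_text))

# Step 4: keep USD as a regular attribute and treat GBP as the label (target).
X = rates[["USD"]]
y = rates["GBP"]
print(X.shape, y.shape)
```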

Figure 14: Linear regression operator selection

• Drag and drop the Linear Regression operator onto the Process window and connect the out port of the Read CSV operator with the input port of the Linear Regression operator.

• Connect the output port named mod of the Linear Regression operator with the res port of the Process window.

Figure 15: Linear Regression

• To create the model, simply click the blue play button in the toolbar.

Figure 16: Creating a model

• Now we can use the model to estimate/predict the value of GBP using the value of USD.
• Read the test data from .CSV file.
• Drag and drop Apply Model Operator which has two input ports:
o mod: provides the model input
o unl: provides the unlabelled data on which to perform the prediction.

Figure 17: Applying model operators

• Execute the process and you will get the predicted values.

Figure 18: Predicted values
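The whole exercise can be sketched with scikit-learn as a stand-in for the RapidMiner operators. All rate values below are made up for illustration (chosen so that GBP = 0.8 × USD exactly):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: USD rates and the corresponding GBP rates.
usd_train = np.array([[1.00], [1.05], [1.10], [1.20]])
gbp_train = np.array([0.80, 0.84, 0.88, 0.96])

# Counterpart of the Linear Regression operator: fit a model on labelled data.
model = LinearRegression().fit(usd_train, gbp_train)

# Counterpart of Apply Model: feed unlabelled test data into the fitted model.
usd_test = np.array([[1.15]])
print(model.predict(usd_test))  # close to 0.92 for this toy data
```

As in RapidMiner, the fitted model and the unlabelled data are two separate inputs to the prediction step.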

Introduction to WEKA

Introduction
WEKA stands for Waikato Environment for Knowledge Analysis. It was developed by the
University of Waikato, New Zealand. WEKA supports many data mining tasks such as data pre-
processing, classification, clustering, regression and feature selection, to name a few. The
workflow of WEKA is as follows:

Data Pre-processing → Data Mining → Knowledge

Getting started with WEKA:


Choose “WEKA 3.7.x” from Programs.

Figure 19: Weka Interface

Explorer: An environment for exploring data. It supports data preprocessing, attribute selection, learning and visualization.
Experimenter: An environment for performing experiments and conducting statistical tests between machine learning algorithms.
Knowledge Flow: Similar to the Explorer but with a drag-and-drop interface. It gives a visual design of the KDD process.
Simple CLI: Provides a simple command-line interface for executing WEKA commands.

WEKA TOOL:
• Preprocessing Filters: The data file needs to be loaded first; below is an example.

Figure 20: WEKA Tool

The supported data formats are ARFF, CSV, C4.5 and binary. Alternatively, you can also import data from a URL or an SQL database.

After loading the data, preprocessing filters can be used for adding/removing attributes, discretization, sampling, randomizing, etc.

Select attributes: WEKA has a very flexible combination of search and evaluation methods for the dataset's attributes. Search methods include Best-first, Ranker, Genetic-search, etc. Evaluation measures include Information Gain, Gain Ratio, etc.
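As a rough analogue (not the WEKA API), scikit-learn's mutual information score behaves like the Information Gain evaluator: an informative attribute scores well above a noise attribute. A sketch on synthetic data:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
# Synthetic data: feature 0 fully determines the class, feature 1 is noise.
X = rng.integers(0, 2, size=(200, 2))
y = X[:, 0]

# Mutual information between each feature and the class label (in nats).
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(scores)  # the informative feature scores far above the noise feature
```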

Classification: The predicted target must be categorical. WEKA includes methods such as Decision Trees, Naive Bayes and Neural Networks, to name a few. Evaluation methods include a separate test data set and cross-validation.

Clustering: The model is learned from the data by grouping similar samples into clusters. Methods include k-means, Cobweb and FarthestFirst.
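A minimal k-means sketch using scikit-learn on synthetic data, standing in for WEKA's k-means clusterer (the two blobs below are made up so the expected cluster centers are known):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs standing in for a WEKA dataset.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.2, size=(50, 2)),
                 rng.normal(5.0, 0.2, size=(50, 2))])

# Fit k-means with two clusters; each center should land near a blob.
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(pts)
print(np.sort(km.cluster_centers_[:, 0]))  # roughly [0.0, 5.0]
```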

Regression: The predicted target is continuous. Methods such as linear regression, Neural networks and Regression trees are included in the library.

Exercise 2 (Using built-in dataset)

1. Click Explorer on the first interface screen and load a dataset from the library. Given
here is an illustration for the dataset ‘weather.arff’.

Figure 22: Loading dataset

2. Click over each attribute to visualize the distribution of the samples for each of them. You
can also visualize all of them at the same time by clicking the ‘Visualize all’ on the right
pane.

3. Under the Classify tab, click ‘Choose’ and select a classifier from the drop-down menu. E.g.:
‘Decision Stump’

Figure 23: Applying classification

4. Once a classifier is chosen, select percentage split and leave it at its default values. The default ratio is 66% for training and 34% for testing.

5. Click ‘Start’ to train and test the classifier. The interface will now look like this:

Figure 24: Execution of classification algorithm

6. You could also try using the ‘Cross-validation’ method to train and test the data.

7. The right pane shows the results for training and testing. It also indicates the number of
correctly classified and misclassified samples.

8. You could right click on the model generated and do various operations. You could
also save the model if you wanted. Another performance measure is the ROC curve that can
be viewed as shown in the next picture. Select ‘no’ in the option to view the curve.

Figure 25: Different operations on the model

9. Click on the ‘Select Attributes’ tab to analyze the attributes. A number of ‘Attribute Evaluator’ and ‘Search methods’ can be combined to gain insight into the attributes. Given below is an example.

Figure 26: Analyzing the attributes

10. Click on the Visualize tab to see the pairwise relationships of the attributes.
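The percentage-split workflow of steps 4 and 5 can be sketched with scikit-learn. A depth-1 decision tree stands in for WEKA's Decision Stump, and the built-in Iris data stands in for weather.arff; these substitutions are assumptions, not the WEKA API:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# WEKA's default percentage split: 66% for training, 34% for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66, random_state=0)

# A depth-1 decision tree is a close analogue of WEKA's Decision Stump.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
print(round(stump.score(X_te, y_te), 2))  # test accuracy on the 34% split
```

The cross-validation option of step 6 corresponds to evaluating the same classifier with `cross_val_score` instead of a single split.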

Performance Analysis

Once the model has been trained and tested, we need to measure its performance.
For this purpose, we use three measures, namely precision, recall and accuracy.

Precision (P) = tp / (tp + fp)

Recall (R) = tp / (tp + fn)

Accuracy (A) = (tp + tn) / total # of samples

where tp, fp, tn and fn are the numbers of true positives, false positives, true negatives and false negatives, respectively.
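The three measures can be checked numerically; a short sketch with hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, tn, fn = 40, 10, 45, 5

precision = tp / (tp + fp)                  # P = tp / (tp + fp)
recall = tp / (tp + fn)                     # R = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + tn + fn)  # A = (tp + tn) / total samples

print(precision, recall, accuracy)
```

For these counts the classifier reaches a precision of 0.80, a recall of about 0.89 and an accuracy of 0.85.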

Extra (supplementary) Materials

Nil…
