
Idlib University

Faculty of Informatics

Fifth Academic Year

Data Mining - Practical Session 1


Data Mining with WEKA Workbench

Prepared by Eng. Osama AL Mustafa & Assoc. Prof. Dr. Mohamed Kurdi

Introduction
WEKA was developed at the University of Waikato in New Zealand; the name
stands for Waikato Environment for Knowledge Analysis, but we just call it Weka.

Weka is data mining software: a collection of machine learning algorithms for
data mining tasks. It contains tools for data preparation, classification, regression,
clustering, association rule mining, and visualization. WEKA is a comprehensive
workbench, and it's free, open-source software. It's written in Java and runs on
any computer: Linux, Windows, or Mac.

WEKA is available from http://www.cs.waikato.ac.nz/ml/weka. You can
download either a platform-specific installer or an executable Java jar file that you
run in the usual way, provided Java is installed.
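For example, assuming the downloaded file is named weka.jar (the exact file name
varies by version), the usual way to run an executable jar is:

    java -jar weka.jar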

Main window

There are several interfaces in Weka: the Explorer, which will be explained here;
the Experimenter, for large-scale performance comparisons of different machine
learning methods on different datasets; the KnowledgeFlow interface, a graphical
front end to the Weka tools; and a command-line interface.

Exploring the Explorer

In the Explorer there are six panels: the Preprocess panel; the Classify panel,
where you build classifiers for datasets; the Cluster panel (clustering is another
procedure Weka is good at); Association rules; Attribute selection; and
Visualization.


In this lecture we'll use mainly the Preprocess panel to open files and so on, the
Classify panel to experiment with classifiers, and the Visualize panel to visualize
our datasets.

Let's open the weather data.

You can find sample datasets in the data folder that comes with the Weka
installation. Open weather.nominal.arff; the following figure shows how to do that.
Weka's data files are in ARFF (Attribute-Relation File Format).


It's got 14 instances, 14 days, and for each of these days, we have recorded the
values of five attributes. Four are to do with the weather: Outlook, Temperature,
Humidity, and Windy. The fifth, Play, is whether or not we're going to play.
Actually, what we're going to do is predict the Play attribute from the other
attributes.

This is the weather data. If you select one of the attributes (outlook is selected in
the figure), we can see its values. The values for the outlook attribute are sunny,
overcast, and rainy, and we can see the number of times each appears in the dataset:
5 sunny days, 4 overcast days, and 3 rainy days, for a total of 14 days, 14 instances.
If we look at the temperature attribute, hot, mild, and cool are the possible values.
If we go to the play attribute, there are two values, yes and no.

Blue corresponds to yes, and red corresponds to no.

If you look at one of the other attributes, like outlook, you can see that when the
outlook is sunny - this is like a histogram - there are three "no" instances and two
"yes" instances. When the outlook is overcast, there are four "yes" instances and
zero "no" instances. These are like a histogram of the attribute values in terms of
the attribute we're trying to predict.


If you go to the Edit panel, you'll see the data in tabular form, with the 14 days
down the rows and the 5 attributes across the columns. This is another view of
the data, and you can actually change the dataset here.


Exploring datasets

The weather data has 14 days, or instances, and each instance is described by five
attributes: four to do with the weather, and the last attribute, which we call the
"class" value, the thing that we're trying to predict, whether or not to play a game.
This is called a classification problem: we're trying to predict the class value.

Here you can see the size of the dataset, the number of instances (14); you can
see the attributes, and you can click on any of them to get its values. At the bottom
you also get a histogram of the attribute values with respect to the different class
values. The class values are colored blue for "yes", play, and red for "no".

By default, the last attribute in Weka is the class value. You can change this if you
like: by changing it here you can decide to predict an attribute other than the last
one.
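If you ever drive Weka from its Java API rather than the Explorer, the same choice
is made explicitly. Here is a minimal sketch, assuming the file path and that the
class is the last attribute:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClassIndexDemo {
        public static void main(String[] args) throws Exception {
            // Load the weather data (path is an assumption; adjust to your installation).
            Instances data = DataSource.read("data/weather.nominal.arff");
            // Mimic the Explorer's default: the last attribute is the class.
            data.setClassIndex(data.numAttributes() - 1);
            System.out.println("Class attribute: " + data.classAttribute().name());
        }
    }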


As we said, this is a classification problem, sometimes called a supervised learning
problem: supervised because we know the class values of the training instances.
We take as input a dataset of classified examples: independent examples, each
with a class value attached.

The idea is to produce automatically some kind of model that can classify new
examples. That's a "classification" problem.


These attributes, or features, can be discrete or continuous. What we looked at in
the weather data were discrete attributes; we call them nominal attribute values
when they belong to a certain fixed set. Attributes can also take numeric, or
continuous, values. The class, too, can be discrete or continuous. We're looking at
a discrete class, "yes" or "no", in the case of the weather data. Another kind of
machine learning problem involves continuous classes, where we're trying to
predict a number. That's called a "regression" problem.

There is a dataset similar to the last one: the numeric weather data. Open it in
Weka: weather.numeric.arff.

It's very similar, almost identical in fact, with 14 instances and the same 5
attributes. If you look at this dataset in the Edit panel, you can see that two of the
attributes, temperature and humidity, are now numeric, whereas previously they
were nominal. So here there are numbers. When we look at the attribute values for
outlook, just as before we have sunny, overcast, and rainy. For temperature we
can't enumerate the values; there are too many numbers to enumerate. Instead we
get the minimum and maximum values, the mean, and the standard deviation.
That's what Weka gives for numeric attributes.
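The same statistics are available programmatically. A minimal sketch, assuming
the file path, using Weka's AttributeStats:

    import weka.core.AttributeStats;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NumericStatsDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/weather.numeric.arff"); // path is an assumption
            // API attribute indices are 0-based; temperature is the second attribute.
            AttributeStats stats = data.attributeStats(1);
            System.out.println("min    = " + stats.numericStats.min);
            System.out.println("max    = " + stats.numericStats.max);
            System.out.println("mean   = " + stats.numericStats.mean);
            System.out.println("stdDev = " + stats.numericStats.stdDev);
        }
    }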

Now open the glass dataset, which is a rather more extensive, real-world dataset.
We've got 214 instances and 10 attributes. Look at the class, by default the last
attribute shown: there are seven values for the class, and the labels of these values
give some indication of what this dataset is about. We have headlamps, tableware,
and containers; then we have building and vehicle windows, both float and
non-float. These are seven different kinds of glass.

Now consider the ARFF file format, taking the glass file as an example. It starts
with comments about the glass database; lines beginning with percent signs (%)
are comments. You can see the attributes: refractive index, sodium, magnesium,
and so on, and finally the type of glass. The relation has a name, and the attributes
are defined as real-valued, numeric attributes.

Then we have an '@data' line, and following that in the ARFF format are simply
the instances, one after another, with the attribute values all on one line, ending
with the class by default.
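For illustration, here is an abridged sketch of what such a file looks like, using the
weather data described earlier (the exact header of the shipped file may differ
slightly):

    % Comment lines start with a percent sign.
    @relation weather.symbolic

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature {hot, mild, cool}
    @attribute humidity {high, normal}
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,hot,high,FALSE,no
    sunny,hot,high,TRUE,no
    overcast,hot,high,FALSE,yes
    ...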

Build a classifier

Now we're going to actually build a classifier. We're going to use a system called
J48 to analyze the glass dataset that we looked at.

To build a classifier go to the Classify panel, choose a classifier. There are different
kinds of classifiers. Weka has Bayes classifiers, functions classifiers, lazy
classifiers, meta classifiers, and so on. We're going to use a tree classifier: J48 is a
tree classifier. Go to "trees" and click J48.


So, run it: just press Start, and you've got the classifier.

Let's take a look. There is some information about the dataset, the glass dataset:
the number of instances and attributes. Then Weka prints out a representation of a
tree. Note that this tree has 30 leaves and 59 nodes altogether. The overall accuracy
is 66.8%, so it's done pretty well.
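The same experiment can be run from Weka's Java API. A minimal sketch,
assuming the path to glass.arff; like the Classify panel's default test option, it uses
10-fold cross-validation:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48GlassDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/glass.arff"); // path is an assumption
            data.setClassIndex(data.numAttributes() - 1);        // class = last attribute

            J48 tree = new J48();                                // default, pruned J48

            // 10-fold cross-validation, the Classify panel's default test option.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());          // accuracy and other statistics
            System.out.println(eval.toMatrixString());           // the confusion matrix

            // Build on the full dataset to print the tree itself.
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }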

Down at the bottom, we've got a confusion matrix.


Remember there were seven different kinds of glass. For building windows made
of float glass, you can see that 50 instances have been classified as 'a', which is
correct. 15 have been classified as 'b', building window non-float glass, so those
are errors; 3 have been classified as 'c', and so on. Note that most of the weight is
on the main diagonal, which is what we like to see, because the diagonal indicates
correct classifications. Everything off the main diagonal indicates a
misclassification.

Let's investigate this a bit further by opening a configuration panel. Open it by
clicking on the classifier's name in the text field next to the Choose button:

Change the "unpruned" parameter to make it "true", and build an unpruned tree.

Run it again; now you have a different classifier, with 67% correct classification,
whereas before we got 66.8% accuracy with the pruned tree.

The numbers in brackets at each leaf are the number of instances that reach that
leaf. When there are two numbers, for example (6.0/1.0), the second is the number
of misclassified instances: five correctly classified instances and one incorrectly
classified instance got to this leaf.

Open the configuration panel again and change the "minNumObj" parameter.
What is that? It's the minimum number of instances per leaf. Change it from 2 up
to 15, to get larger leaves, and click Start.
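These panel settings correspond to simple setter calls (or command-line style
options) on the J48 object; a sketch:

    import weka.classifiers.trees.J48;

    public class J48OptionsDemo {
        public static void main(String[] args) throws Exception {
            J48 tree = new J48();
            tree.setUnpruned(true); // the "unpruned" parameter from the panel
            tree.setMinNumObj(15);  // minimum number of instances per leaf
            // The equivalent command-line style options would be: -U -M 15
            System.out.println(String.join(" ", tree.getOptions()));
        }
    }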


Now we've got a worse result, 61% correct classification, but a much smaller tree,
with only eight leaves. You can visualize this tree: right-click on the result to get a
little menu, and select "Visualize tree".

This is the decision tree. It says: first look at the barium (Ba) content. If it's large,
then it must be headlamps. If it's small, then look at magnesium (Mg). If that's
small, look at potassium (K), and if that's small too, we've got tableware.


This is a visualization of the tree, and the following is a different representation of
the same tree.

From the configuration panel, use the More button to get more information about
the classifier, here about J48. It's always useful to look at this to see where these
classifiers come from.

J48 is based on a famous system called C4.5, described in the book "C4.5:
Programs for Machine Learning" by the Australian computer scientist Ross
Quinlan. He started out with a system called ID3; it evolved up to C4.8, and then
he went commercial. Up until then, these were all open-source systems. The
WEKA developers took the latest open-source version of C4.5, which was C4.8,
and rewrote it. Weka is written in Java, so it's called J48.

Using a filter

Filters are one of the preprocessing tools, so they are usually applied before a
classifier.


We will use a filter to remove an attribute from the weather data. Open the weather
data; we'll remove the humidity attribute, attribute number 3. Just as you chose a
classifier with the Choose button on the Classify panel, you choose filters with the
Choose button in the Filter panel.

There are a lot of different filters. AllFilter and MultiFilter are ways of combining
filters. We have supervised and unsupervised filters: supervised filters use the
class value in their operation, and they are less common than unsupervised filters,
which don't. There are also attribute filters and instance filters. We want to remove
an attribute, so we're looking for an attribute filter. There are so many filters in
Weka that you just have to learn to look around and find what you want.

We're looking for a filter that removes an attribute; that filter is "Remove". By
clicking on its name in the Filter panel we can configure it. This is "a filter that
removes a range of attributes from the dataset". You can specify a range of
attributes; we just want to remove one, attribute number 3. (You could also invert
the selection, removing all the other attributes and leaving only attribute 3.) Click
OK, and watch humidity go when we apply the filter. Luckily, you can undo the
effect and put it back by pressing the Undo button.

Actually, there is a much easier way to remove an attribute: you don't need to use a
filter at all. If you just want to remove an attribute, you can select it and click the
"Remove" button at the bottom. It does the same job.


Filters are really useful, and can do much more complex things than that. For
example, let's remove not an attribute but all instances where humidity has the
value "high", that is, where attribute number 3 has its first value. We want to
remove instances, so it's going to be an instance filter.

How about RemoveWithValues? Select the RemoveWithValues filter and
configure it: set the attribute index to the third attribute (humidity), and select the
first value. (We could remove several different values; we'll just remove the first.)
Nothing happens until we apply the filter; watch what happens when we do.

We still have the humidity attribute, but there are zero instances with high
humidity; in fact, the dataset has been reduced to only 7 instances. You can save
the result.
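Here is the same operation as a minimal API sketch, assuming the file path:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.RemoveWithValues;

    public class RemoveWithValuesDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/weather.nominal.arff"); // path is an assumption

            RemoveWithValues rwv = new RemoveWithValues();
            rwv.setAttributeIndex("3");  // humidity (filter indices are 1-based)
            rwv.setNominalIndices("1");  // its first value, "high"
            rwv.setInputFormat(data);
            Instances filtered = Filter.useFilter(data, rwv);

            System.out.println(filtered.numInstances() + " instances remain"); // 7 for the weather data
        }
    }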

So we removed the instances where humidity is high. When looking for filters, we
have to think about whether we want a supervised or an unsupervised filter, and
whether we want an attribute filter or an instance filter; then just use common
sense to look down the list of filters and see which one we want.

Sometimes when you filter data you get much better classification.

Here's a really simple example. Open the glass dataset that we saw before and use
J48, as we did before. Press Start, and you get a classifier with an accuracy of
66.8%. Now let's remove Fe, that is, iron: remove this attribute, and you get a
smaller dataset. Run J48 again, and now we get an accuracy of 67.3%. So we've
improved the accuracy a little bit by removing that attribute.

Visualizing data

It is necessary to get close to your data, look at it in every possible way. We're
going to look at visualizing data.

We're going to use the Visualize panel. Open the Iris dataset. It has four numeric
attributes: sepallength, sepalwidth, petallength, and petalwidth. The class has three
values, three kinds of iris flower: Iris-setosa, Iris-versicolor, and Iris-virginica.

Let's go to the Visualize panel and visualize this data.


There is a matrix of two-dimensional plots, a five-by-five matrix. You can select
any one of them; for example, select the plot with sepalwidth on the x-axis and
petalwidth on the y-axis. The colors correspond to the three classes; you can
actually change the colors if you don't like them.

You can look at individual data points by clicking on them. Here the popup
describes instance number 86, with a sepallength of 6, a sepalwidth of 3.4, and so
on. That's a versicolor, which is why the spot is colored red. So we can look at
individual instances.


We can change the x- and y-axes using the drop-down menus at the top. Better
still, click on the little set of bars at the right, which represent the attributes: a
left-click changes the x-axis, and a right-click changes the y-axis. So you can
quickly browse around the different plots.

Sometimes points sit right on top of each other; you can use the Jitter slider to add
a little randomness to the x- and y-axes. With a little jitter, the darker spots
represent multiple instances. If you click on one of them, you can see that the point
represents three separate instances, all of class Iris-setosa, and all with the same
values: sepalwidth 3.0 and petallength 1.4 for each of the three instances.


Another thing we can do is select part of the dataset. Choose "Select rectangle"
here. If you draw a rectangle now, you select those points; if you then submit the
rectangle, all other points are excluded and just these points appear on the graph,
with the axes re-scaled appropriately. This can be a way of cleaning outliers out of
your data: select rectangles and save the new dataset. Then click the Reset button
to show the entire dataset again.

That's visualizing the dataset itself. What about visualizing the result of a
classifier?

Let's go back to the Preprocess panel. We're going to use a classifier: use J48 and
run it; then right-click on the result in the result list, and you can view the
classifier errors.


Here we've got the class plotted against the predicted class. The square boxes
represent errors. If you click on one of these boxes, you can see where the errors
are. There are two instances where the predicted class is virginica and the actual
class is versicolor.

We can see these in the confusion matrix.

There's a filter that allows you to add the classifications as a new attribute. Let's
go and have a look at that. We're going to add an attribute, and the filter is
supervised because it uses the class: the supervised attribute filter
AddClassification. In its configuration panel, choose the machine learning scheme
(choose J48), set outputClassification to "true", and apply it. It adds a new
attribute, and this attribute is the classification according to J48.
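As an API sketch (file path assumed), the filter is
weka.filters.supervised.attribute.AddClassification:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.AddClassification;

    public class AddClassificationDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("data/iris.arff"); // path is an assumption
            data.setClassIndex(data.numAttributes() - 1);

            AddClassification ac = new AddClassification();
            ac.setClassifier(new J48());      // the machine learning scheme chosen in the panel
            ac.setOutputClassification(true); // append the predicted class as a new attribute
            ac.setInputFormat(data);
            Instances withPredictions = Filter.useFilter(data, ac);

            System.out.println(withPredictions.numAttributes() + " attributes (one new)");
        }
    }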


References:
1. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning
Tools and Techniques", Eibe Frank, Mark A. Hall, and Ian H. Witten, 4th edition.
2. https://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
