Data Mining - Session #1
Faculty of Informatics
Prepared by Eng. Osama AL Mustafa & Assoc. Prof. Dr. Mohamed Kurdi
Data Mining with WEKA Workbench
Introduction
WEKA was developed at the University of Waikato in New Zealand; the name
stands for Waikato Environment for Knowledge Analysis. We just call it Weka.
Main window
Weka has several interfaces. There is the Explorer, which will be explained here;
the Experimenter, for large-scale performance comparisons of different machine
learning methods on different datasets; the KnowledgeFlow interface, which is a
graphical front end to the Weka tools; and a command-line interface.
In the Explorer, there are six panels: the Preprocess panel; the Classify panel,
where you build classifiers for datasets; the Cluster panel (clustering is another
procedure Weka is good at); Association rules; Attribute selection; and Visualization.
In this lecture we'll use mainly the Preprocess panel to open files and so on, the
Classify panel to experiment with classifiers, and the Visualize panel to visualize
our datasets.
Sample datasets come with WEKA, in the data folder of the installation.
Open weather.nominal.arff; the following figure shows how to do that. Weka's
native data files use the ARFF (Attribute-Relation File Format).
It's got 14 instances, 14 days, and for each of these days, we have recorded the
values of five attributes. Four are to do with the weather: Outlook, Temperature,
Humidity, and Windy. The fifth, Play, is whether or not we're going to play.
What we're going to do is predict the Play attribute from the other
attributes.
This is the weather data. If you select one of the attributes - outlook is selected in the
figure - we can see the values. The values for the outlook attribute are sunny, overcast,
and rainy. The number of times they appear in the dataset: 5 sunny days, 4
overcast days, and 3 rainy days, for a total of 14 days, 14 instances. If we look at
the temperature attribute, hot, mild, and cool are the possible values. If we go to the
play attribute, there are two values for play: yes and no.
If you look at one of the other attributes, like outlook, you can see that when the
outlook is sunny - this is like a histogram - there are three "no" instances and two
"yes" instances. When the outlook is overcast, there are four "yes" instances and
zero "no" instances. These are like a histogram of the attribute values in terms of
the attribute we're trying to predict.
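The histogram Weka draws is just a tally of class values for each attribute value. Here is a minimal stdlib sketch of that tally, using the (outlook, play) pairs of the nominal weather data described above:

```python
from collections import Counter

# (outlook, play) pairs for the 14 instances of the nominal weather data.
pairs = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

counts = Counter(pairs)
# Sunny: three "no" and two "yes" instances, as in the histogram.
print(counts[("sunny", "no")], counts[("sunny", "yes")])        # 3 2
# Overcast: four "yes" and zero "no" instances.
print(counts[("overcast", "yes")], counts[("overcast", "no")])  # 4 0
```

These are exactly the counts Weka colors blue ("yes") and red ("no") in the bottom-right histogram.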
If you go to the Edit panel, you'll see the data as a table, with the 14 days down
and the 5 attributes across. This is another view of the data, and you can actually
change the dataset here.
Exploring datasets
The weather data has 14 days, or instances, and each instance is described by five
attributes: four to do with the weather, and the last attribute, which we call the
"class" value - the thing that we're trying to predict, whether or not to play a game.
This is called a classification problem. We're trying to predict the class value.
Here you can see the size of the dataset, the number of instances (14), and the
attributes; you can click any of these attributes and get its values. At the bottom
you also get a histogram of the attribute values with respect to the different class
values: blue for "yes" (play) and red for "no".
By default, the last attribute in Weka is the class value. You can change this
if you like, and decide to predict an attribute other than the last one.
The idea is to produce automatically some kind of model that can classify new
examples. That's a "classification" problem.
There is a similar dataset to the last weather dataset: the numeric weather dataset.
Open it in Weka, weather.numeric.arff.
It's very similar, almost identical in fact: 14 instances and the same 5 attributes.
If you look at this dataset in the Edit panel, you can see that two of the
attributes - temperature and humidity - are now numeric, whereas previously
they were nominal. So here there are numbers. When we look at the attribute
values for outlook, just as before we have sunny, overcast, and rainy. For
temperature, we can't enumerate the values; there are too many numbers to
enumerate. Instead, we get the minimum and maximum values, the mean, and the
standard deviation. That's what Weka gives for numeric attributes.
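Those summary statistics are the standard ones. As a sketch with Python's stdlib, using the temperature column of weather.numeric.arff (values as distributed with Weka):

```python
import statistics

# Temperature values of the 14 instances in weather.numeric.arff.
temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

print(min(temps), max(temps))             # minimum and maximum
print(round(statistics.mean(temps), 3))   # mean, about 73.571
print(round(statistics.stdev(temps), 3))  # sample standard deviation, about 6.572
```

These match the Minimum, Maximum, Mean, and StdDev figures Weka shows when you select the temperature attribute.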
Now open the glass dataset, which is a rather more extensive dataset. It's a real
world dataset, with 214 instances and 10 attributes. Look at the class, by
default the last attribute shown: there are seven values for the class, and the labels
of these values give some indication of what this dataset is about. We have
headlamps, tableware, and containers, then we have building and vehicle windows,
both float and non-float. These are seven different kinds of glass.
Now look at the ARFF file format, taking the glass file as an example. It starts
with comments about the glass database; lines beginning with percentage signs (%)
are comments. You can see the attributes - refractive index, sodium, magnesium,
and so on, and the type of glass. The relation has a name, and the attributes are
defined; they are real-valued, numeric attributes.
Then we have an '@data' line, and following that in the ARFF format are simply
the instances, one after another, with the attribute values all on one line, ending
with the class by default.
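As a concrete example of the format, the nominal weather data opened earlier has roughly this shape (names as in the standard Weka distribution; only the first few data rows are shown):

```
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
```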
Build a classifier
Now we're going to actually build a classifier. We're going to use a system called
J48 to analyze the glass dataset that we looked at.
To build a classifier go to the Classify panel, choose a classifier. There are different
kinds of classifiers. Weka has Bayes classifiers, functions classifiers, lazy
classifiers, meta classifiers, and so on. We're going to use a tree classifier: J48 is a
tree classifier. Go to "trees" and click J48.
So run it: just press "Start", and you've got the classifier.
Let's take a look, there is some information about the dataset, the glass dataset: the
number of instances and attributes. Then it's printed out a representation of a tree.
Note that this tree has 30 leaves and 59 nodes altogether. The overall accuracy is
66.8%. So it's done pretty well.
Remember there were about seven different kinds of glass. Take building windows
made of float glass: you can see that 50 of these have been classified as 'a', which
is correct. 15 of them have been classified as 'b', which is building window,
non-float glass, so those are errors, and 3 have been classified as 'c', and so on.
Note that most of the weight is on the main diagonal, which is what we like to see,
because the main diagonal indicates correct classifications; everything off the main
diagonal is a misclassification.
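The overall accuracy is just the sum of the main diagonal divided by the total count. A small stdlib sketch with an illustrative three-class matrix (the counts are made up, not Weka's actual glass output):

```python
def accuracy(confusion):
    """Overall accuracy: sum of the main diagonal over the total count."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Rows are actual classes, columns are predicted classes (illustrative numbers).
m = [[50, 15, 3],
     [10, 40, 5],
     [2, 4, 30]]
print(round(accuracy(m), 3))  # 120 correct out of 159
```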
Let's investigate this a bit further by opening a configuration panel. Open it by
clicking on the text box next to the Choose button. Change the "unpruned"
parameter to "true" to build an unpruned tree, and run it again. Now you have a
different classifier, with 67% correct classification, compared with the 66.82%
accuracy we got before for the pruned tree.
The numbers in brackets at a leaf give the number of instances that reach that
leaf; when there are two numbers, the second is how many of those instances are
incorrectly classified.
If you instead make the tree simpler - for example by raising J48's minNumObj
parameter, the minimum number of instances per leaf - we get a worse result, 61%
correct classification, but a much smaller tree, with only eight leaves. Now you
can visualize this tree: right-click on the result to get a little menu, and select
"Visualize tree".
This is the decision tree. It says: first look at the Barium (Ba) content. If it's
large, then it must be headlamps. If it's small, then look at Magnesium (Mg); if
that's small, look at Potassium (K), and if that's small too, then we've got tableware.
From the configuration panel, use the "More" button to get more information
about the classifier, here about J48. It's always useful to look at that to see where
these classifiers have come from.
J48 is based on a famous system called C4.5, described in "C4.5: Programs for
Machine Learning" by the Australian computer scientist Ross Quinlan. He started
out with a system called ID3, which developed into C4.5 and went up to C4.8,
and then he went commercial. Up until then, these were all open source systems.
The WEKA developers took the latest open source version of C4.5, which was
C4.8, and rewrote it. Weka is written in Java, so it's called J48.
Using a filter
Filters are one of the preprocessing tools; they are usually applied before a
classifier.
We will use a filter to remove an attribute from the weather data. Open the
weather data; we'll remove the humidity attribute, which is attribute number 3.
Just as you chose a classifier with the Choose button on the Classify panel, you
choose filters with the Choose button in the Filter panel.
There are a lot of different filters. AllFilter and MultiFilter are ways of combining
filters. We have supervised and unsupervised filters: supervised filters use the
class value in their operation, and they are less common than unsupervised filters,
which don't use the class value. There are also attribute filters and instance
filters. We want to remove an attribute, so we're looking for an attribute filter.
There are so many filters in Weka that you just have to learn to look around and
find what you want.
We're looking for a filter that removes an attribute; the filter is "Remove". By
clicking on the Filter panel we can configure it. This is "A filter that removes a
range of attributes from the dataset". You can specify a range of attributes; we
just want to remove one, attribute number 3. (You could also invert the selection,
removing all the other attributes and leaving number 3.) Click OK, and watch
humidity go when we apply the filter. Luckily, you can undo the effect and put it
back by pressing the Undo button.
Actually, there is a much easier way to remove an attribute: you don't need to use a
filter at all. If you just want to remove an attribute, you can select it and click the
"Remove" button at the bottom. It does the same job.
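Conceptually, the Remove filter just drops one column from every instance. A stdlib sketch of that idea, using Weka's 1-based attribute indices:

```python
def remove_attribute(rows, index):
    """Drop the attribute at the given 1-based index from every row."""
    i = index - 1
    return [row[:i] + row[i + 1:] for row in rows]

# Two toy rows in the weather data's attribute order:
# outlook, temperature, humidity, windy, play
rows = [["sunny", "hot", "high", "FALSE", "no"],
        ["overcast", "hot", "high", "FALSE", "yes"]]
print(remove_attribute(rows, 3))  # humidity (attribute 3) is gone
```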
Filters are really useful, and can do much more complex things than that. For
example, imagine removing not an attribute but all instances where humidity has
the value "high" - that is, where attribute number 3 has its first value. Let's look
for a filter to do that. We want to remove instances, so it's going to be an instance
filter (Weka's unsupervised RemoveWithValues filter does this, given the attribute
index and the value to match).
After applying it, we still have the humidity attribute, but zero instances with
high humidity. In fact, the dataset has been reduced to only 7 instances. You can
save the result.
We removed the instances where humidity is high. When looking for filters, we
have to think about whether we want a supervised or an unsupervised filter, and
whether we want an attribute filter or an instance filter, and then just use common
sense to look down the list of filters and see which one we want.
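An instance filter works the same way, but on rows instead of columns: it keeps only the instances whose attribute does not match the given value. A stdlib sketch:

```python
def remove_with_value(rows, index, value):
    """Keep only the rows whose attribute at the 1-based index differs from value."""
    return [row for row in rows if row[index - 1] != value]

# Toy rows: outlook, temperature, humidity, windy, play
rows = [["sunny", "hot", "high", "FALSE", "no"],
        ["rainy", "cool", "normal", "FALSE", "yes"],
        ["overcast", "hot", "high", "FALSE", "yes"]]
kept = remove_with_value(rows, 3, "high")
print(len(kept))  # only the "normal" humidity instance remains: 1
```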
Sometimes when you filter data you get much better classification.
Here's a really simple example. Open the glass dataset that we saw before and
run J48 again: you will get a classifier with an accuracy of 66.8%. Now let's
remove Fe, that is, iron. Remove
this attribute, and you will get a smaller dataset. Go and run J48 again. Now we get
an accuracy of 67.3%. So we've improved the accuracy a little bit by removing that
attribute.
Visualizing data
It is important to get close to your data and look at it in every possible way.
We're going to look at visualizing data.
We're going to use the Visualize panel. Open the Iris dataset. It has four numeric
attributes - sepallength, sepalwidth, petallength, petalwidth - and the class has
three values, three kinds of iris flower: Iris-setosa, Iris-versicolor, and
Iris-virginica.
There is a matrix of two-dimensional plots: a five-by-five matrix. You can select
one of these plots - for example, sepalwidth on the x-axis and petalwidth on the
y-axis - and that's a plot of the data. The colors correspond to the three classes;
you can change the colors if you don't like them.
You can look at individual data points by clicking on them. Here, clicking shows
instance number 86, with a sepallength of 6, sepalwidth of 3.4, and so on. That's
a versicolor, which is why the spot is colored red. So we can look at individual
instances.
We can change the x- and y-axes using the menus at the top. Better still, the little
set of bars on the right represents the attributes: left-clicking one changes the
x-axis, and right-clicking changes the y-axis. So you can quickly browse around
the different plots.
Sometimes points sit right on top of each other; you can use Jitter to add a little
randomness to the x- and y-axes. With a little jitter, the darker spots represent
multiple instances. If you click on one of those, you can see that the point
represents three separate instances, all of class Iris-setosa, and all with the same
sepalwidth and petallength: 3.0 and 1.4, respectively, for each of the three
instances.
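Jitter simply adds a small random offset to each plotted point so that coincident points become visible. A sketch of the idea (the amount parameter is illustrative, not Weka's actual jitter scale):

```python
import random

def jitter(points, amount=0.05):
    """Offset each (x, y) point by uniform noise in [-amount, amount]."""
    return [(x + random.uniform(-amount, amount),
             y + random.uniform(-amount, amount))
            for x, y in points]

# Three coincident Iris-setosa points (sepalwidth 3.0, petallength 1.4)
# separate slightly once jittered.
pts = [(3.0, 1.4)] * 3
print(jitter(pts))
```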
Another thing we can do is select part of the dataset. Choose "Select rectangle".
If you now draw a rectangle, you can select those points; if you then submit the
rectangle, all other points are excluded and just these points appear on the graph,
with the axes re-scaled appropriately. This might be a way of cleaning up outliers
in your data: select rectangles and save the new dataset. Click the Reset button
to show the entire dataset again.
That's visualizing the dataset itself. What about visualizing the result of a
classifier?
Let's go back to the Preprocess panel. We're going to use a classifier: J48. Run it,
then right-click on the result in the log area and you can view the classifier errors.
Here we've got the class plotted against the predicted class. The square boxes
represent errors. If you click on one of these boxes, you can see where the errors
are. There are two instances where the predicted class is virginica and the actual
class is versicolor.
There's a filter that allows you to add the classifications as a new attribute. Let's
go and have a look at that. We're going to add an attribute, so it's an attribute
filter, and it's supervised because it uses the class: AddClassification. In the
configuration panel, choose the machine learning scheme - J48 - and set
outputClassification to "true", then apply it. It adds a new attribute, and this
attribute is the classification according to J48.
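In effect, AddClassification runs the chosen classifier over each instance and appends its prediction as a new attribute. A stdlib sketch, with a toy stand-in for J48 (the rule below is made up for illustration):

```python
def add_classification(rows, predict):
    """Append the model's predicted class as an extra attribute on every row."""
    return [row + [predict(row)] for row in rows]

# Toy stand-in for a trained classifier: "no" when outlook is sunny, else "yes".
def toy_model(row):
    return "no" if row[0] == "sunny" else "yes"

rows = [["sunny", "hot", "high", "FALSE", "no"],
        ["overcast", "hot", "high", "FALSE", "yes"]]
print(add_classification(rows, toy_model))
```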
References:
1. The WEKA Workbench, Online Appendix for "Data Mining: Practical Machine Learning
Tools and Techniques", Eibe Frank, Mark A. Hall, and Ian H. Witten, 4th edition
2. https://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/