
An Introduction to WEKA Explorer

In part from: Yizhou Sun


2008
What is WEKA?
—  Waikato Environment for Knowledge Analysis
—  It’s a data mining/machine learning tool developed by the
Department of Computer Science, University of Waikato, New
Zealand.
—  Weka is also a bird found only on the islands of New Zealand.

2 22/01/16
Download and Install WEKA
—  Website:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
—  Supports multiple platforms (written in Java):
—  Windows, Mac OS X and Linux

Main Features
—  49 data preprocessing tools
—  76 classification/regression algorithms
—  8 clustering algorithms
—  3 algorithms for finding association rules
—  15 attribute/subset evaluators + 10 search algorithms
for feature selection

Main GUI
—  Three graphical user interfaces
—  “The Explorer” (exploratory data analysis)
—  “The Experimenter” (experimental
environment)
—  “The KnowledgeFlow” (new process model
inspired interface)
—  Simple CLI: lets users execute commands from a
terminal window, for those who prefer to work without a
graphical interface

Explorer
—  The Explorer:
—  Preprocess data
—  Classification
—  Clustering
—  Association Rules
—  Attribute Selection
—  Data Visualization
—  References and Resources

Explorer: pre-processing the data
—  Data can be imported from a file in various formats: ARFF,
CSV, C4.5, binary
—  Data can also be read from a URL or from an SQL database
(using JDBC)
—  Pre-processing tools in WEKA are called “filters”
—  WEKA contains filters for:
—  Discretization, normalization, resampling, attribute selection,
transforming and combining attributes, …

WEKA only deals with “flat” files
@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
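The ARFF header and data sections above are plain text and easy to read programmatically. The sketch below is a minimal, illustrative parser only; WEKA's own loaders additionally handle quoting, comments, sparse instances, dates, and attribute types:

```python
# Minimal ARFF reader sketch (illustrative only; WEKA's own loaders
# are far more complete).
def parse_arff(text):
    attributes, data = [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):   # '%' starts a comment
            continue
        low = line.lower()
        if low.startswith('@attribute'):
            # second token is the attribute name
            attributes.append(line.split(None, 2)[1])
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            # '?' marks a missing value in ARFF
            data.append([None if v.strip() == '?' else v.strip()
                         for v in line.split(',')])
    return attributes, data

sample = """@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@data
63,male
38,female
?,female
"""
attrs, rows = parse_arff(sample)
```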

IRIS dataset
—  5 attributes, one of which is the class label
—  3 classes: setosa, versicolor, virginica
Attribute data
—  Min, max and average value of attributes
—  Distribution of values: the number of items for which
a_i = v_j, where a_i ∈ A and v_j ∈ V
—  Class: distribution of attribute values within each class
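These per-attribute statistics are simple to compute; a minimal sketch, using made-up example values for illustration:

```python
from collections import Counter

# Hypothetical numeric attribute values (e.g. a sepal length column)
values = [5.1, 4.9, 6.3, 5.8, 7.1]
stats = {"min": min(values), "max": max(values),
         "mean": sum(values) / len(values)}

# Distribution of a nominal attribute: count of items with a_i = v_j
labels = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
distribution = Counter(labels)
```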


Filtering attributes
—  Once the initial data has been selected and loaded, the user
can refine the experimental data.
—  The preprocess window offers optional filters to apply, and
the user can select or remove attributes of the data set as
needed to isolate specific information.
—  The user can modify the attribute selection and change the
relationships among attributes by deselecting choices from
the original data set.
—  Many filtering options are available in the preprocessing
window; the user chooses among them based on need and
the type of data present.
Discretizes in 10 bins of equal frequency
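Equal-frequency discretization puts (roughly) the same number of instances in each bin, rather than making the bins equally wide. The sketch below illustrates the idea only; it is not WEKA's Discretize filter, which works on cut points rather than returning the groups themselves:

```python
def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins groups of (near-)equal size."""
    s = sorted(values)
    size, rem = divmod(len(s), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # earlier bins absorb the remainder, one extra item each
        end = start + size + (1 if i < rem else 0)
        bins.append(s[start:end])
        start = end
    return bins

bins = equal_frequency_bins(range(20), 10)
```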
Explorer: building “classifiers”
—  “Classifiers” in WEKA are machine learning algorithms for
predicting nominal or numeric quantities
—  Implemented learning algorithms include:
—  Conjunctive rules, decision trees and lists, instance-based
classifiers, support vector machines, multi-layer perceptrons,
logistic regression, Bayes’ nets, …

Explore the Conjunctive Rules learner
—  We need a simple dataset with few attributes: let’s select the weather dataset
—  Select a classifier
—  Select a training method
—  Right-click to select parameters
—  numAntds = number of antecedents; -1 = empty rule
—  Select numAntds = 10
—  Results are shown in the right window (can be scrolled)
—  Can change the right-hand-side variable
—  Performance data
Decision Trees with WEKA
Explorer: clustering data
—  WEKA contains “clusterers” for finding groups of similar
instances in a dataset
—  Implemented schemes are:
—  k-Means, EM, Cobweb, X-means, FarthestFirst
—  Clusters can be visualized and compared to “true” clusters
(if given)
—  Evaluation is based on log-likelihood if the clustering scheme
produces a probability distribution

The K-Means Clustering Method
—  Given k, the k-means algorithm is implemented in four steps:
—  Partition objects into k nonempty subsets
—  Compute seed points as the centroids of the clusters of the
current partition (the centroid is the center, i.e., mean point, of
the cluster)
—  Assign each object to the cluster with the nearest seed point
—  Go back to Step 2; stop when assignments no longer change
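A minimal one-dimensional sketch of these four steps (illustrative only: WEKA's SimpleKMeans chooses seeds randomly, whereas this sketch simply takes the first k points as initial centroids):

```python
def kmeans_1d(points, k, max_iter=100):
    # Step 1: take the first k points as initial seeds (a real
    # implementation would pick them randomly).
    centroids = points[:k]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Step 2: recompute centroids as the mean of each cluster
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop when no assignment (hence no centroid) changes
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], 2)
```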



Right-click: visualize cluster assignment
Explorer: finding associations
—  WEKA contains an implementation of the Apriori algorithm
for learning association rules
—  Works only with discrete data
—  Can identify statistical dependencies between groups of
attributes:
—  milk, butter ⇒ bread, eggs (with confidence 0.9 and support
2000)
—  Apriori can compute all rules that have a given minimum
support and exceed a given confidence

73 22/01/16
Basic Concepts: Frequent Patterns
Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

—  itemset: a set of one or more items
—  k-itemset X = {x1, …, xk}
—  (absolute) support, or support count, of X: the frequency or
number of occurrences of the itemset X
—  (relative) support, s: the fraction of transactions that contain
X (i.e., the probability that a transaction contains X)
—  An itemset X is frequent if X’s support is no less than a
minsup threshold
Basic Concepts: Association Rules
Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

—  Find all the rules X ⇒ Y with minimum support and confidence
—  support, s: probability that a transaction contains X ∪ Y
—  confidence, c: conditional probability that a transaction
having X also contains Y

Let minsup = 50%, minconf = 50%
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more exist):
—  Beer ⇒ Diaper (support 60%, confidence 100%)
—  Diaper ⇒ Beer (support 60%, confidence 75%)
1. adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 ==> Class=democrat 219 conf:(1)
2. adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-to-nicaraguan-contras=y 198 ==> Class=democrat 198 conf:(1)
3. physician-fee-freeze=n aid-to-nicaraguan-contras=y 211 ==> Class=democrat 210 conf:(1)

etc.
Explorer: attribute selection
—  Panel that can be used to investigate which (subsets of)
attributes are the most predictive ones
—  Attribute selection methods contain two parts:
—  A search method: best-first, forward selection, random,
exhaustive, genetic algorithm, ranking
—  An evaluation method: correlation-based, wrapper, information
gain, chi-squared, …
—  Very flexible: WEKA allows (almost) arbitrary combinations
of these two
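One of the evaluation methods listed, information gain, scores an attribute by how much it reduces the entropy of the class. A minimal sketch on hypothetical toy data (not WEKA's InfoGainAttributeEval itself):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Class entropy minus expected entropy after splitting on the attribute."""
    n = len(labels)
    split = {}
    for v, y in zip(attr_values, labels):
        split.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

# Toy data: a perfectly predictive attribute gains 1 bit,
# an uninformative one gains 0.
gain_perfect = information_gain(["a", "a", "b", "b"],
                                ["yes", "yes", "no", "no"])
gain_useless = information_gain(["a", "b", "a", "b"],
                                ["yes", "yes", "no", "no"])
```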

Explorer: data visualization
—  Visualization is very useful in practice: e.g., it helps to
determine the difficulty of the learning problem
—  WEKA can visualize single attributes (1-d) and pairs of
attributes (2-d)
—  To do: rotating 3-d visualizations (Xgobi-style)
—  Color-coded class values
—  “Jitter” option to deal with nominal attributes (and to detect
“hidden” data points)
—  “Zoom-in” function
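The "jitter" option works by adding a small random offset to each plotted coordinate, so that points that coincide exactly (as nominal values do) become individually visible. A minimal sketch of the idea:

```python
import random

def jitter(values, amount=0.1, seed=42):
    """Offset each coordinate by small uniform noise so that
    overlapping points separate visually; purely cosmetic."""
    rng = random.Random(seed)   # seeded for reproducibility
    return [v + rng.uniform(-amount, amount) for v in values]

# Three points that would be drawn on top of each other
pts = jitter([1.0, 1.0, 1.0])
```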

Click on a cell

References and Resources
—  References:
—  WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
—  WEKA Tutorial:
—  Machine Learning with WEKA: a presentation demonstrating all
graphical user interfaces (GUIs) in Weka
—  A presentation which explains how to use Weka for exploratory
data mining
—  WEKA Data Mining Book:
—  Ian H. Witten and Eibe Frank, Data Mining: Practical Machine
Learning Tools and Techniques (Second Edition)
—  WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page
—  Others:
—  Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques, 2nd ed.
