
An Introduction to

WEKA
WSU-MSc-IT-Class
2020

1 Compiled by Aklilu E. (MSc in IT) 01/23/21


Content
What is WEKA?
The Explorer:
Preprocess data
Classification
Clustering
Association Rules
Attribute Selection
Data Visualization
References and Resources



What is WEKA?
WEKA stands for Waikato Environment for Knowledge Analysis.
It is a data mining/machine learning tool developed by the
Department of Computer Science, University of
Waikato, New Zealand.
The weka is also a bird found only on the islands of New
Zealand.



What is WEKA?
Weka is a collection of machine learning algorithms
for data mining tasks. It contains tools for data
preparation, classification, regression, clustering,
association rule mining, and visualization.
Found only on the islands of New Zealand, the weka
is a flightless bird with an inquisitive nature.
Weka is open source software issued under the GNU
General Public License.



Download and Install WEKA
Website:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
Supports multiple platforms (written in Java):
Windows, Mac OS X and Linux



Main Features
49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
3 algorithms for finding association rules
15 attribute/subset evaluators + 10 search
algorithms for feature selection



Main GUI
Three graphical user interfaces:
• “The Explorer” (exploratory data analysis)
• “The Experimenter” (experimental environment)
• “The KnowledgeFlow” (new process-model-inspired interface)



Explorer: pre-processing the data
Data can be imported from a file in various formats: ARFF
(Attribute-Relation File Format), CSV (comma-separated values), C4.5, binary
• The C4.5 format is named after C4.5, an algorithm developed by Ross Quinlan that generates decision trees.
• C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees it generates
can be used for classification, and for this reason C4.5 is often referred to as a
statistical classifier.
Data can also be read from a URL or from an SQL database
(using JDBC)
Pre-processing tools in WEKA are called “filters”
WEKA contains filters for:
Discretization, normalization, resampling, attribute selection,
transforming and combining attributes, …
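
To give a rough idea of what two of these filters do, here is a minimal Python sketch of min-max normalization and equal-width discretization. It is not WEKA's own implementation; the function names `normalize` and `discretize` and the sample readings are assumptions made for this illustration.

```python
def normalize(values):
    """Rescale numeric values linearly into the range [0, 1] (min-max)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Map each numeric value to an equal-width bin index in 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would fall into bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Hypothetical cholesterol readings, used only for illustration.
cholesterol = [233, 286, 229, 250]
scaled = normalize(cholesterol)
binned = discretize(cholesterol, bins=3)
print(scaled)
print(binned)
```

After discretization the numeric column becomes a small set of discrete values, which is the kind of input that schemes restricted to discrete data expect.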



WEKA only deals with “flat” files
@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
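
The layout of the file above can be read with very little code. The following Python sketch parses only the subset of ARFF shown on this slide (a relation name, numeric and nominal attributes, comma-separated data rows, and "?" for a missing value). The function name `parse_arff` is an assumption for this example, and the sketch is not a replacement for WEKA's own loader.

```python
SAMPLE = """@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""


def parse_arff(text):
    """Return (relation, attributes, rows) for a simple, dense ARFF text.

    Quoted values, sparse rows and other ARFF features are not handled.
    """
    relation = None
    attributes = []  # (name, "numeric") or (name, [nominal labels])
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and whole-line comments
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row.append(None)  # missing value
                elif kind == "numeric":
                    row.append(float(value))
                else:
                    row.append(value)
            rows.append(row)
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                labels = [s.strip() for s in spec.strip("{}").split(",")]
                attributes.append((name, labels))
            else:
                attributes.append((name, spec.lower()))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), len(rows))
```

Each row is one flat record with one value per declared attribute, which is what "flat" means here.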



How to Get Sample Data?
• You can get sample .csv data for analysis in the WEKA
environment from
https://www.stats.govt.nz/large-datasets/csv-files-for-download/
• The data on this site are grouped by topic: Business,
Economy, Government finance, Health, Industries,
Labour market, Population, and Society
Information about these files:
What are CSV files?
Find variable names in Infoshare



How to Get Sample Data?
You can get the full material from
https://slideplayer.com/slide/12774221/
You can get sample .ARFF training data for WEKA
from
https://www.cs.ubc.ca/labs/beta/Projects/autoweka/datasets/
You can get sample .CSV data with up to 1.5 million sales
records from
http://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

Explorer: building “classifiers”
Classifiers in WEKA are models for predicting
nominal or numeric quantities
Implemented learning schemes include:
Decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons,
logistic regression, Bayes’ nets, …



Decision Tree Induction: Training Dataset
This follows an example of Quinlan’s ID3 (Playing Tennis).

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
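
ID3 chooses the attribute to split on by information gain. The Python sketch below computes the gain of each attribute on the 14 training rows above and picks the best one for the root. The helper names `entropy` and `information_gain` are assumptions for this illustration, and the sketch covers only the root split, not a full tree learner.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# The 14 training rows from the slide; the last field is buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, index):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    labels = [row[-1] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[index], []).append(row[-1])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
root = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("root attribute:", root)
```

On this data the attribute with the highest gain is age, so it is the one tested at the root of the tree.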

Output: A Decision Tree for “buys_computer”

age?
  <=30:    student?
             no:   no
             yes:  yes
  31..40:  yes
  >40:     credit rating?
             excellent:  no
             fair:       yes

Explorer: finding associations
WEKA contains an implementation of the Apriori
algorithm for learning association rules
Works only with discrete data
Can identify statistical dependencies between groups
of attributes:
milk, butter → bread, eggs (with confidence 0.9 and
support 2000)
Apriori can compute all rules that have a given
minimum support and exceed a given confidence



Basic Concepts: Frequent Patterns

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (absolute) support, or support count, of X: the frequency or number of occurrences of itemset X
• (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold

[Figure: Venn diagram of “Customer buys beer”, “Customer buys diaper”, and the overlap “Customer buys both”]
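
The definitions above can be checked directly on the five transactions. The Python sketch below counts support by brute force over every candidate itemset. That is fine for five transactions, but it is not the Apriori algorithm, which prunes candidates. The minsup value of 50% and the function names are assumptions for this illustration.

```python
from itertools import combinations

# The five transactions from the slide, keyed by transaction id (Tid).
TRANSACTIONS = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}


def support_count(itemset, transactions):
    """Absolute support: number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def frequent_itemsets(transactions, minsup):
    """Return {itemset: support count} for itemsets whose relative support is >= minsup."""
    all_items = sorted(set().union(*transactions.values()))
    result = {}
    for k in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, k):
            count = support_count(set(candidate), transactions)
            if count / len(transactions) >= minsup:
                result[candidate] = count
    return result


frequent = frequent_itemsets(TRANSACTIONS, minsup=0.5)
for itemset, count in frequent.items():
    print(itemset, count)
```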

Basic Concepts: Association Rules

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
  • support, s: probability that a transaction contains X ∪ Y
  • confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
• Association rules: (many more!)
  • Beer → Diaper (60%, 100%)
  • Diaper → Beer (60%, 75%)

[Figure: Venn diagram of “Customer buys beer”, “Customer buys diaper”, and the overlap “Customer buys both”]
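
The two rules listed on this slide can be recomputed from the transaction table. In the Python sketch below, support is the fraction of transactions containing both sides of the rule, and confidence is that count divided by the count of transactions containing the left-hand side. The function names are assumptions for this illustration.

```python
# The five transactions from the slide.
TRANSACTIONS = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions if itemset <= items)


def support(lhs, rhs, transactions):
    """Fraction of transactions that contain both sides of the rule lhs -> rhs."""
    return count(lhs | rhs, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs, transactions) / count(lhs, transactions)


for lhs, rhs in [({"Beer"}, {"Diaper"}), ({"Diaper"}, {"Beer"})]:
    s = support(lhs, rhs, TRANSACTIONS)
    c = confidence(lhs, rhs, TRANSACTIONS)
    print(f"{sorted(lhs)} -> {sorted(rhs)}: support {s:.0%}, confidence {c:.0%}")
```

Both rules share the same support because they are built from the same itemset {Beer, Diaper}. Only the confidence depends on which side is the antecedent.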
Explorer: attribute selection
Panel that can be used to investigate which (subsets of)
attributes are the most predictive ones
Attribute selection methods contain two parts:
A search method: best-first, forward selection, random,
exhaustive, genetic algorithm, ranking
An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
Very flexible: WEKA allows (almost) arbitrary
combinations of these two
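
The split into a search method and an evaluation method can be sketched in a few lines of Python. Below, one evaluation function (training-set accuracy of a majority-vote lookup table, a crude wrapper-style score) is plugged into two different search functions. The toy data, the attribute names "signal" and "noise", and all function names are assumptions for this illustration, not WEKA's API.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: "signal" determines the class, "noise" does not.
ATTRIBUTES = ["signal", "noise"]
ROWS = [
    {"signal": 0, "noise": 0, "class": "no"},
    {"signal": 0, "noise": 1, "class": "no"},
    {"signal": 1, "noise": 0, "class": "yes"},
    {"signal": 1, "noise": 1, "class": "yes"},
]


def lookup_accuracy(subset, rows=ROWS):
    """Evaluation method: accuracy of predicting the majority class per value combination."""
    groups = {}
    for row in rows:
        key = tuple(row[a] for a in subset)
        groups.setdefault(key, Counter())[row["class"]] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)


def forward_selection(attributes, evaluate):
    """Search method: greedily add the attribute that improves the score the most."""
    selected, best = [], evaluate([])
    while True:
        scored = [(evaluate(selected + [a]), a) for a in attributes if a not in selected]
        if not scored:
            return selected
        score, attr = max(scored, key=lambda pair: pair[0])
        if score <= best:
            return selected  # no strict improvement: stop
        selected.append(attr)
        best = score


def exhaustive_search(attributes, evaluate):
    """Search method: score every non-empty subset, preferring smaller subsets on ties."""
    best_subset, best = [], evaluate([])
    for k in range(1, len(attributes) + 1):
        for subset in combinations(attributes, k):
            score = evaluate(list(subset))
            if score > best:
                best_subset, best = list(subset), score
    return best_subset


print(forward_selection(ATTRIBUTES, lookup_accuracy))
print(exhaustive_search(ATTRIBUTES, lookup_accuracy))
```

Because the search functions only call `evaluate`, any scoring function with the same signature could be substituted, which is the flexibility this slide describes.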

Explorer: data visualization
Visualization is very useful in practice: e.g., it helps to
determine the difficulty of the learning problem
WEKA can visualize single attributes (1-d) and pairs
of attributes (2-d)
To do: rotating 3-d visualizations (Xgobi-style)
Color-coded class values
“Jitter” option to deal with nominal attributes (and to
detect “hidden” data points)
“Zoom-in” function

References and Resources
• References:
  • WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
  • WEKA Tutorial:
    • Machine Learning with WEKA: a presentation demonstrating all graphical user interfaces (GUI) in Weka.
    • A presentation which explains how to use Weka for exploratory data mining.
  • WEKA Data Mining Book:
    • Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
  • WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page
• Others:
  • Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed.
