
An Introduction to the WEKA Data Mining System

Zdravko Markov
Central Connecticut State University
markovz@ccsu.edu
Ingrid Russell
University of Hartford
irussell@hartford.edu

Data Mining

"Drowning in Data yet Starving for Knowledge"


???

"Computers have promised us a fountain of wisdom but delivered a flood of data"


William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus

Data Mining: "The non trivial extraction of implicit, previously unknown, and potentially
useful information from data"
William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus

Data mining finds valuable information hidden in large volumes of data.

Data mining is the analysis of data and the use of software techniques for finding
patterns and regularities in sets of data.

Data Mining is an interdisciplinary field involving:


Databases
Statistics
Machine Learning
High Performance Computing
Visualization
Mathematics

Data Mining Software


KDnuggets Poll (May 2005): Data mining/analytic tools you used in 2005
[376 voters, 860 votes total]
Enterprise-level (US $10,000 and more):
Fair Isaac, IBM, Insightful, KXEN, Oracle, SAS, and SPSS
Department-level ($1,000 to $9,999):
Angoss, CART/MARS/TreeNet/Random Forests, Equbits, GhostMiner, Gornik, Mineset,
MATLAB, Megaputer, Microsoft SQL Server, Statsoft Statistica, ThinkAnalytics
Personal-level ($1 to $999):
Excel, See5
Free:
C4.5, R, Weka, Xelopes

Weka Data Mining Software


KDnuggets : News : 2005 : n13 : item2
The SIGKDD Service Award is the highest service award in the field of data mining and knowledge discovery. It is given
to one individual or one group who has performed significant service to the data mining and knowledge discovery
field, including professional volunteer services in disseminating technical information to the field, education, and
research funding.
The 2005 ACM SIGKDD Service Award is presented to the Weka team for their development of the freely-available
Weka Data Mining Software, including the accompanying book Data Mining: Practical Machine Learning Tools and
Techniques (now in second edition) and much other documentation.
The Weka team includes Ian H. Witten and Eibe Frank, and the following major contributors (in alphabetical order of
last names): Remco R. Bouckaert, John G. Cleary, Sally Jo Cunningham, Andrew Donkin, Dale Fletcher, Steve
Garner, Mark A. Hall, Geoffrey Holmes, Matt Humphrey, Lyn Hunt, Stuart Inglis, Ashraf M. Kibriya, Richard
Kirkby, Brent Martin, Bob McQueen, Craig G. Nevill-Manning, Bernhard Pfahringer, Peter Reutemann, Gabi
Schmidberger, Lloyd A. Smith, Tony C. Smith, Kai Ming Ting, Leonard E. Trigg, Yong Wang, Malcolm Ware, and
Xin Xu.
The Weka team has put a tremendous amount of effort into continuously developing and maintaining the system since
1994. The development of Weka was funded by a grant from the New Zealand Government's Foundation for
Research, Science and Technology.

The key features responsible for Weka's success are:


it provides many different algorithms for data mining and machine learning
it is open source and freely available
it is platform-independent
it is easily usable by people who are not data mining specialists
it provides flexible facilities for scripting experiments
it has kept up to date, with new algorithms being added as they appear in the research literature.

Weka Data Mining Software


KDnuggets : News : 2005 : n13 : item2 (cont.)
The Weka Data Mining Software has been downloaded 200,000 times since it was put on SourceForge in April
2000, and is currently downloaded at a rate of 10,000/month. The Weka mailing list has over 1100
subscribers in 50 countries, including subscribers from many major companies.
There are 15 well-documented substantial projects that incorporate, wrap, or extend Weka, and no doubt many
more that have not been reported on SourceForge.
Ian H. Witten and Eibe Frank also wrote a very popular book, "Data Mining: Practical Machine Learning
Tools and Techniques" (now in its second edition), that seamlessly integrates the Weka system into the teaching
of data mining and machine learning. In addition, they provided excellent teaching material on the book
website.
This book became one of the most popular textbooks for data mining and machine learning, and is very
frequently cited in scientific publications.
Weka is a landmark system in the history of the data mining and machine learning research communities,
because it is the only toolkit that has gained such widespread adoption and survived for an extended period
of time (the first version of Weka was released 11 years ago). The other data mining and machine learning
systems that have achieved this are individual programs, such as C4.5, not toolkits.
Since Weka is freely available for download and offers many powerful features (sometimes not found in
commercial data mining software), it has become one of the most widely used data mining systems. Weka
also became one of the favorite vehicles for data mining research and helped to advance it by making many
powerful features available to all.
In sum, the Weka team has made an outstanding contribution to the data mining field.

Using Weka to teach Machine Learning, Data and Web Mining


http://uhaweb.hartford.edu/compsci/ccli/

Machine Learning, Data and Web Mining by Example

(a learning-by-doing approach)

Data preprocessing and visualization


Attribute selection
Classification (OneR, Decision trees)
Prediction (Nearest neighbor)
Model evaluation
Clustering (K-means, Cobweb)
Association rules

Data preprocessing and visualization


Initial Data Preparation
(Weka data input)
Raw data (Japanese loan data)
Web/Text documents (Department data)

Data preprocessing and visualization


Japanese loan data (a sample from a loan history database of a Japanese bank)
Clients: s1,..., s20

Approved loan: s1, s2, s4, s5, s6, s7, s8, s9, s14, s15, s17, s18, s19
Rejected loan: s3, s10, s11, s12, s13, s16, s20

Client data:

unemployed clients: s3, s10, s12


loan is to buy a personal computer: s1, s2, s3, s4, s5, s6, s7, s8, s9, s10
loan is to buy a car: s11, s12, s13, s14, s15, s16, s17, s18, s19, s20
male clients: s6, s7, s8, s9, s10, s16, s17, s18, s19, s20
not married: s1, s2, s5, s6, s7, s11, s13, s14, s16, s18
live in problematic area: s3, s5
age: s1=18, s2=20, s3=25, s4=40, s5=50, s6=18, s7=22, s8=28, s9=40, s10=50, s11=18, s12=20,
s13=25, s14=38, s15=50, s16=19, s17=21, s18=25, s19=38, s20=50
money in a bank (x10000 yen): s1=20, s2=10, s3=5, s4=5, s5=5, s6=10, s7=10, s8=15, s9=20, s10=5,
s11=50, s12=50, s13=50, s14=150, s15=50, s16=50, s17=150, s18=150, s19=100, s20=50
monthly pay (x10000 yen): s1=2, s2=2, s3=4, s4=7, s5=4, s6=5, s7=3, s8=4, s9=2, s10=4, s11=8,
s12=10, s13=5, s14=10, s15=15, s16=7, s17=3, s18=10, s19=10, s20=10
months for the loan: s1=15, s2=20, s3=12, s4=12, s5=12, s6=8, s7=8, s8=10, s9=20, s10=12, s11=20,
s12=20, s13=20, s14=20, s15=20, s16=20, s17=20, s18=20, s19=20, s20=30
years with the last employer: s1=1, s2=2, s3=0, s4=2, s5=25, s6=1, s7=4, s8=5, s9=15, s10=0, s11=1,
s12=2, s13=5, s14=15, s15=8, s16=2, s17=3, s18=2, s19=15, s20=2

Data preprocessing and visualization


Relations, attributes, tuples (instances)
Loan data in CSV format
(LoanData.csv)

Data preprocessing and visualization


Attribute-Relation File Format (ARFF) - http://www.cs.waikato.ac.nz/~ml/weka/arff.html
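
For orientation, here is a minimal sketch of how the loan data could be encoded in ARFF. The attribute names are made up for this illustration (the actual file may use different ones); the two data rows correspond to clients s1 and s3:

% Japanese loan data (illustrative sketch; attribute names are invented)
@relation loan

@attribute unemployed {yes, no}
@attribute purpose {computer, car}
@attribute sex {male, female}
@attribute married {yes, no}
@attribute problematic_area {yes, no}
@attribute age numeric
@attribute money_in_bank numeric
@attribute monthly_pay numeric
@attribute months_for_loan numeric
@attribute years_with_employer numeric
@attribute approved {yes, no}

@data
no,computer,female,no,no,18,20,2,15,1,yes
yes,computer,female,yes,yes,25,5,4,12,0,no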

Data preprocessing and visualization


Download and install Weka - http://www.cs.waikato.ac.nz/~ml/weka/

Data preprocessing and visualization


Run Weka and select the Explorer

Data preprocessing and visualization


Load data into Weka in ARFF or CSV format (click on Open file)

Data preprocessing and visualization


Converting data formats through Weka (click on Save)
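
The same conversion can also be scripted with Weka's Java API. A minimal sketch, assuming weka.jar is on the classpath and using placeholder file names:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the CSV file (placeholder name)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("LoanData.csv"));
        Instances data = loader.getDataSet();

        // Save the same dataset back out in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("LoanData.arff"));
        saver.writeBatch();
    }
}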

Data preprocessing and visualization


Editing data in Weka (click on Edit)

Data preprocessing and visualization


Examining data
Attribute type and properties
Class (last attribute) distribution

Data preprocessing and visualization


Click on Visualize All

Data preprocessing and visualization


Web/Text documents - Department data

http://www.cs.ccsu.edu/~markov/
Download Ch1, DMW Book
Download datasets

Data preprocessing and visualization


Convert HTML to Text

Data preprocessing and visualization


Loading text data in Weka
String format for ID and content
One document per line
Add class (nominal) if needed

Data preprocessing and visualization


Converting a string attribute into nominal
Choose filters/unsupervised/attribute/StringToNominal and set the index to 1

Data preprocessing and visualization


Converting a string attribute into nominal
Click on Apply; document_name is now nominal
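
The same filter can be applied programmatically. A sketch using recent versions of the Weka API (the range-setting method may differ in older releases; the file name is a placeholder):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToNominal;

public class MakeNominal {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("departments.arff");
        StringToNominal filter = new StringToNominal();
        filter.setAttributeRange("1");        // convert the first (string) attribute
        filter.setInputFormat(data);
        Instances out = Filter.useFilter(data, filter);
        System.out.println(out.attribute(0)); // document_name is now nominal
    }
}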

Data preprocessing and visualization


Converting text data into TFIDF (Term Frequency / Inverse Document Frequency) attribute format

Choose filters/unsupervised/attribute/StringToWordVector
Set the parameters as needed (see More)
Click on Apply
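
Programmatically, the same step might look like this (a sketch; the parameter choices and file name are illustrative, not the ones used in the screenshots):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ToTfIdf {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("departments.arff");
        StringToWordVector filter = new StringToWordVector();
        filter.setTFTransform(true);     // use log(1 + term frequency)
        filter.setIDFTransform(true);    // scale by inverse document frequency
        filter.setLowerCaseTokens(true); // normalize case before counting words
        filter.setInputFormat(data);
        Instances vectors = Filter.useFilter(data, filter);
        System.out.println(vectors.numAttributes() + " word attributes created");
    }
}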

Data preprocessing and visualization


Make the class attribute last
Choose filters/unsupervised/attribute/Copy
Set the index to 2 and click on Apply
Remove attribute 2

Data preprocessing and visualization


Change the attributes to nominal (use the NumericToBinary filter)
Save data on a file for further use

Data preprocessing and visualization


ARFF file representing the department data in binary format (non-sparse)

Note the format (see the SparseToNonSparse instance filter)

Attribute Selection
Finding a minimal set of attributes that preserves the class distribution
Attribute relevance with respect to the class: an irrelevant attribute (accounting)

IF accounting=1 THEN class=A (Error = 0, Coverage = 1 instance: overfitting)

IF accounting=0 THEN class=B (Error = 10/19, Coverage = 19 instances: low accuracy)

Attribute Selection
Attribute relevance with respect to the class: a relevant attribute (science)

IF science=1 THEN class=A (Error = 0, Coverage = 7 instances)

IF science=0 THEN class=B (Error = 4/13, Coverage = 13 instances)

Attribute Selection (with document_name)

Attribute Selection (without document_name)

Attribute Selection (ranking)

Attribute Selection (explanation of ranking)

Attribute Selection (using filters)


Choose filters/supervised/attribute/AttributeSelection
Set parameters to InfoGainAttributeEval and Ranker
Click on Apply and see the attribute ordering
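
The same evaluator/search combination is available through the API. A minimal sketch (the file name is a placeholder for the department data prepared earlier):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("departments-binary.arff");
        data.setClassIndex(data.numAttributes() - 1);
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // information gain w.r.t. the class
        selector.setSearch(new Ranker());                   // order all attributes by merit
        selector.SelectAttributes(data);
        System.out.println(selector.toResultsString());     // the attribute ranking
    }
}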

Attribute Selection (using filters)

Classification: creating models (hypotheses)


Mapping (independent attributes -> class)
Inferring rudimentary rules - OneR
Weather data (weather.nominal.arff)

Attribute      Rules              Errors   Total errors
outlook        sunny -> no        2/5      4/14
               overcast -> yes    0/4
               rainy -> yes       2/5
temperature    hot -> no          2/4      5/14
               mild -> yes        2/6
               cool -> yes        1/4
humidity       high -> no         3/7      4/14
               normal -> yes      1/7
windy          false -> yes       2/8      5/14
               true -> no         3/5
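
OneR keeps the rule set of the attribute with the fewest total errors (here outlook and humidity tie at 4/14). A sketch of running it through the API:

import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1); // play is the class
        OneR oneR = new OneR();
        oneR.buildClassifier(data);
        System.out.println(oneR); // prints the single-attribute rule set chosen
    }
}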

Classification: OneR

Classification: decision tree


Right click on the highlighted line in Result list and choose Visualize tree

Classification: decision tree


Top-down induction of decision trees (TDIDT, an old approach known
from pattern recognition):
Select an attribute for root node and create a branch for each
possible attribute value.
Split the instances into subsets (one for each branch
extending from the node).
Repeat the procedure recursively for each branch, using only
instances that reach the branch (those that satisfy the
conditions along the path from the root to the branch).
Stop if all instances have the same class.
ID3, C4.5, J48 (Weka): Select the attribute that minimizes the
class entropy in the split.
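
A minimal sketch of inducing such a tree with Weka's J48 (the C4.5 implementation), using the weather data from the earlier slides:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TreeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();     // default settings: pruned C4.5 tree
        tree.buildClassifier(data);
        System.out.println(tree); // text form of the induced tree
    }
}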

Classification: numeric attributes


weather.arff

Classification: predicting class


Click on Set

Click on Open file

Classification: predicting class


Right click on the highlighted line in Result list and choose Visualize classifier errors
Click on the square

Classification: predicting class


Click on Save

Prediction (no model, lazy learning)


test: (sunny, cool, high, TRUE, ?)

K-nearest neighbor (KNN, IBk)


Take the class of the nearest neighbor
or the majority class among K neighbors
K=1 -> no
K=3 -> no
K=5 -> yes
K=14 -> yes (Majority predictor, ZeroR)

Weighted K-nearest neighbor


K=5 -> undecided (votes weighted by 1/distance):
no = 1/1 + 1/2 = 1.5
yes = 1/2 + 1/2 + 1/2 = 1.5

(Table: the training instances ordered by Distance(test, X), together with the play class of each.)
Distance is calculated as the number of differing attribute values
(Euclidean distance is used for numeric attributes)
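
Both plain and distance-weighted K-nearest neighbor are available via Weka's IBk. A sketch (the weighting constants shown exist in recent Weka versions):

import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk(5); // K = 5
        // Weight each neighbor's vote by 1/distance, as in the example above
        knn.setDistanceWeighting(
            new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));
        knn.buildClassifier(data); // lazy learning: this just stores the instances
    }
}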

Prediction (no model, lazy learning)

Prediction (no model, lazy learning)


Departments-binary-test.arff

Departments-binary-training

Prediction (no model, lazy learning)

Model evaluation: holdout (percentage split)


Click on More options

Model evaluation: cross-validation

Model evaluation: leave-one-out cross-validation
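
A sketch of both evaluations through the API: 10-fold cross-validation, and leave-one-out as the special case where the number of folds equals the number of instances:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvalDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // confusion matrix (next slide)

        // Leave-one-out: as many folds as instances
        Evaluation loo = new Evaluation(data);
        loo.crossValidateModel(new J48(), data, data.numInstances(), new Random(1));
        System.out.println(loo.toSummaryString());
    }
}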

Model evaluation: confusion (contingency) matrix

(Matrix: rows show the actual class, columns the predicted class.)

Clustering: k-means
Click on Ignore attributes
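
Programmatically, ignoring the class attribute corresponds to removing it before clustering. A sketch using SimpleKMeans on the weather data:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");

        // Drop the class attribute, mirroring "Ignore attributes" in the Explorer
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2);
        kmeans.buildClusterer(noClass);
        System.out.println(kmeans); // cluster centroids and sizes
    }
}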

Hierarchical Clustering: Cobweb

Association Rules (A => B)


Confidence (accuracy): P(B|A) = (# of tuples containing both A and B) / (# of tuples containing A)
Support (coverage): P(A,B) = (# of tuples containing both A and B) / (total # of tuples)
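
For example, if 4 of 14 tuples contain both A and B, and 6 tuples contain A, then confidence = 4/6 ≈ 0.67 and support = 4/14 ≈ 0.29. A sketch of mining such rules with Weka's Apriori (which requires nominal attributes):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        Apriori apriori = new Apriori();
        apriori.setNumRules(10); // report the 10 best rules found
        apriori.buildAssociations(data);
        System.out.println(apriori); // rules with their support and confidence
    }
}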

Association Rules

And many more

Thank you!
