
Appendix: The WEKA Data Mining Software


WEKA: Introduction

- WEKA (Waikato Environment for Knowledge Analysis) was developed by Waikato University, New Zealand.
- History: 1st version (version 2.1, 1996); Version 2.3, 1998; Version 3.0, 1999; Version 3.4, 2003; Version 3.6, 2008.
- WEKA provides a collection of data mining and machine learning algorithms and preprocessing tools.
- It includes algorithms for regression, classification, clustering, association rule mining and attribute selection.
- It also has data visualization facilities.
- WEKA is an environment for comparing learning algorithms.
- With WEKA, researchers can implement new data mining algorithms and add them to WEKA.
- WEKA is the best-known open-source data mining software.

WEKA: Introduction (cont.)

- WEKA was written in Java.
- WEKA 3.4 consists of 271477 lines of code. WEKA 3.6 consists of 509903 lines of code.
- It runs on Windows, Linux and Macintosh.
- Users can access its components through Java programming or through a command-line interface.
- It consists of three main graphical user interfaces: Explorer, Experimenter and Knowledge Flow.
- The easiest way to use WEKA is through Explorer, the main graphical user interface.
- Data can be loaded from various sources, including files, URLs and databases.
- Database access is provided through Java Database Connectivity.

WEKA data format

- WEKA stores data in flat files (ARFF format).
- It is easy to transform an Excel file to ARFF format.
- An ARFF file consists of a list of instances.
- We can create an ARFF file by using Notepad or Word.
  - The name of the dataset is given with @relation.
  - Attribute information is given with @attribute.
  - The data is given with @data.
- Besides the ARFF format, WEKA allows CSV, LibSVM, and C4.5's format.
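The CSV-to-ARFF transformation mentioned above can be sketched in a few lines of Python. This is an illustrative sketch, not WEKA's own converter: the function name `csv_to_arff` and the type-guessing rule (a column is numeric if every value parses as a number, otherwise nominal) are assumptions made for this example.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row + data rows) to ARFF text.

    Assumption: a column is numeric if every value parses as a float;
    otherwise it is nominal and its distinct values are listed in order
    of first appearance.
    """
    rows = list(csv.reader(io.StringIO(csv_text), skipinitialspace=True))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(is_number(v) for v in column):
            lines.append(f"@attribute {name} real")
        else:
            values = list(dict.fromkeys(column))  # distinct, order kept
            lines.append(f"@attribute {name} {{{', '.join(values)}}}")
    lines += ["", "@data"]
    lines += [", ".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Three rows of the weather data, as they might be exported from a spreadsheet.
sample = """outlook,temperature,humidity,windy,play
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
"""
arff_text = csv_to_arff(sample, "weather")
print(arff_text)
```

A nominal attribute inferred this way lists only the values that occur in the data, so values that never appear in the CSV have to be added to the @attribute line by hand.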

WEKA ARFF format

    @relation weather
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}
    @data
    sunny, 85, 85, FALSE, no
    sunny, 80, 90, TRUE, no
    overcast, 83, 86, FALSE, yes
    rainy, 70, 96, FALSE, yes
    rainy, 68, 80, FALSE, yes
    ……
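To make the three ARFF sections concrete, here is a minimal reader for the subset of ARFF used in the weather example. It is a sketch under stated assumptions: `parse_arff` is a hypothetical helper, it treats every non-nominal attribute as numeric, and it ignores quoting, sparse rows, and the string and date types that full ARFF supports.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    Handles only @relation, @attribute with a numeric type or a
    {a, b, c} nominal list, and comma-separated @data rows.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = "numeric"  # assumption: real/numeric/integer only
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


WEATHER = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
overcast, 83, 86, FALSE, yes
rainy, 70, 96, FALSE, yes
rainy, 68, 80, FALSE, yes
"""
relation, attributes, rows = parse_arff(WEATHER)
print(relation, len(attributes), len(rows))
```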

Explorer GUI

- Consists of 6 panels, one for each data mining task:
  - Preprocess
  - Classify
  - Cluster
  - Associate
  - Select Attributes
  - Visualize
- Preprocess: use WEKA's data preprocessing tools (called "filters") to transform the dataset in several ways.
  - WEKA contains filters for discretization, normalization, resampling, attribute selection, transforming and combining attributes, …
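Two of the filter types listed above, normalization and discretization, can be illustrated with a short Python sketch. These are generic textbook versions (min-max scaling and equal-width binning), chosen for this example; they are not claimed to match the behavior or options of WEKA's own filters.

```python
def normalize(values):
    """Min-max scale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Equal-width binning: map each value to a bin index 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


humidity = [85, 90, 86, 96, 80]  # humidity column of the weather data
print(normalize(humidity))
print(discretize(humidity, 2))
```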

Explorer (cont.)

- Classify:
  - Regression techniques (predictors of "continuous classes")
    - Linear regression
    - Logistic regression
    - Neural network
    - Support vector machine
  - Classification algorithms
    - Decision trees – ID3, C4.5 (called J48)
    - Naïve Bayes, Bayes network
    - k-nearest-neighbors
    - Rule learners: Ripper, Prism
    - Lazy rule learners
    - Meta learners (bagging, boosting)
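As an illustration of one classifier family from the list above, the following sketch implements k-nearest-neighbors in plain Python. It is a generic version written for this example (Euclidean distance, unweighted majority vote), not WEKA's implementation, and the query points are made up.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# (temperature, humidity) -> play, taken from the weather data
train = [
    ((85, 85), "no"),
    ((80, 90), "no"),
    ((83, 86), "yes"),
    ((70, 96), "yes"),
    ((68, 80), "yes"),
]
print(knn_predict(train, (69, 85), k=3))
```

In practice the attributes would usually be normalized first, since an attribute with a larger numeric range otherwise dominates the distance.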

Explorer (cont.)

- Clustering
  - Clustering algorithms:
    - K-Means, X-Means, FarthestFirst
    - Likelihood-based clustering: EM (Expectation-Maximization)
    - Cobweb (incremental clustering algorithm)
  - Clusters can be visualized and compared to "true" clusters (if given).
- Attribute Selection: this provides access to various methods for measuring the utility of attributes and identifying the most important attributes in a dataset.
  - Filter method: the attribute set is filtered to produce the most promising subset before learning begins.
  - A wide range of filtering criteria, including correlation-based feature selection, the chi-square statistic, gain ratio, information gain, and a support-vector-machine-based criterion.
  - A variety of search methods: forward and backward selection, best-first search, genetic search and random search.
  - PCA (principal component analysis) to reduce the dimensionality of a problem.
  - Discretizing numeric attributes.
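One of the filtering criteria named above, information gain, can be computed directly from class counts. The sketch below is a generic textbook version for nominal attributes, with function names chosen for this example; it ranks two attributes of the five weather instances by how much they reduce the entropy of the class.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy after splitting on a nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [lab for v, lab in zip(values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Nominal columns of the five weather instances
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy"]
windy = ["FALSE", "TRUE", "FALSE", "FALSE", "FALSE"]
play = ["no", "no", "yes", "yes", "yes"]

ranking = sorted(
    {"outlook": outlook, "windy": windy}.items(),
    key=lambda item: information_gain(item[1], play),
    reverse=True,
)
print([name for name, _ in ranking])
```

On these five instances every outlook value maps to a single class, so outlook recovers the full class entropy and ranks above windy.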

Explorer (cont.)

- Association rule mining
  - Apriori algorithm
  - Works only with discrete data
- Visualization
  - Scatter plots, ROC curves, trees, graphs
  - WEKA can visualize single attributes (1-d) and pairs of attributes (2-d).
  - Color-coded class values
  - "Zoom-in" function
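The frequent-itemset phase of Apriori can be sketched as a level-wise search. This is a simplified illustration written for this example, not WEKA's Apriori: it stops at frequent itemsets and does not generate rules, and each discrete attribute value is encoded as an `attribute=value` item, which is one common way to feed nominal data to the algorithm.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets.

    Level-wise search: candidates of size k+1 are built only from frequent
    itemsets of size k, since every subset of a frequent itemset is frequent.
    """
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: union pairs of frequent k-itemsets into (k+1)-itemsets.
        size = len(next(iter(level))) + 1 if level else 0
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: keep candidates whose k-subsets are all frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


# Nominal attributes of the five weather instances as attribute=value items
transactions = [
    {"outlook=sunny", "windy=FALSE", "play=no"},
    {"outlook=sunny", "windy=TRUE", "play=no"},
    {"outlook=overcast", "windy=FALSE", "play=yes"},
    {"outlook=rainy", "windy=FALSE", "play=yes"},
    {"outlook=rainy", "windy=FALSE", "play=yes"},
]
frequent = apriori(transactions, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```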


[Figure: Explorer GUI (Classify)]

WEKA Experimenter

- This interface is designed to facilitate experimental comparisons of the performance of algorithms based on many different evaluation criteria.
- Experiments can involve many algorithms that are run on multiple datasets.
- Experiments can also iterate over different parameter settings.
- Experiments can also be distributed across different computer nodes in a network.
- Once an experiment has been set up, it can be saved in either XML or binary form, so that it can be revisited.
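The core idea of such an experiment, running every algorithm on every dataset under the same evaluation protocol, can be sketched as a small grid loop. Everything here is an assumption made for illustration: the two toy datasets are invented, the two classifiers are deliberately trivial, and accuracy under 3-fold cross-validation stands in for the Experimenter's many evaluation criteria.

```python
import math
from collections import Counter


def majority(train, query):
    """Baseline: always predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def nearest_neighbor(train, query):
    """Predict the label of the closest training point (Euclidean)."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]


def cross_validate(classifier, data, folds):
    """Accuracy pooled over `folds` folds; instance i goes to fold i % folds."""
    correct = 0
    for fold in range(folds):
        train = [d for i, d in enumerate(data) if i % folds != fold]
        test = [d for i, d in enumerate(data) if i % folds == fold]
        correct += sum(classifier(train, x) == y for x, y in test)
    return correct / len(data)


# Hypothetical toy datasets, made up for this sketch.
datasets = {
    "separable": [((0,), "a"), ((10,), "b"), ((1,), "a"), ((11,), "b"),
                  ((2,), "a"), ((12,), "b")],
    "skewed": [((0,), "a"), ((1,), "a"), ((2,), "a"), ((3,), "a"),
               ((4,), "a"), ((9,), "b")],
}
algorithms = {"majority": majority, "1-NN": nearest_neighbor}

# One result per (dataset, algorithm) cell of the experiment grid.
results = {
    (data_name, algo_name): cross_validate(algo, data, folds=3)
    for data_name, data in datasets.items()
    for algo_name, algo in algorithms.items()
}
for key, accuracy in sorted(results.items()):
    print(key, round(accuracy, 2))
```

Iterating over parameter settings fits the same pattern: each setting becomes another entry in the `algorithms` dictionary.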


Knowledge Flow Interface

- The Explorer is designed for batch-based data processing: training data is loaded into memory and then processed.
- However, WEKA has implemented some incremental algorithms.
- The Knowledge Flow interface can handle incremental updates. It can load and preprocess individual instances before feeding them into incremental learning algorithms.
- Knowledge Flow also provides nodes for visualization and evaluation.
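The difference between batch and incremental learning can be shown with a classifier whose model is just a set of counts. The sketch below is a hypothetical `IncrementalNaiveBayes` class written for this example (nominal attributes, Laplace smoothing); it is not a WEKA class, but it shows how a model can be updated one instance at a time without holding the training data in memory.

```python
from collections import defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over nominal attributes, trained one instance at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        # (attribute index, value, label) -> count
        self.value_counts = defaultdict(int)
        self.values_seen = defaultdict(set)  # attribute index -> values

    def update(self, instance, label):
        """Fold a single labelled instance into the counts."""
        self.class_counts[label] += 1
        for i, value in enumerate(instance):
            self.value_counts[(i, value, label)] += 1
            self.values_seen[i].add(value)

    def predict(self, instance):
        """Return the label with the highest Laplace-smoothed score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -1.0
        for label, n in self.class_counts.items():
            score = n / total
            for i, value in enumerate(instance):
                count = self.value_counts.get((i, value, label), 0)
                score *= (count + 1) / (n + len(self.values_seen[i]))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# (outlook, windy) -> play, streamed from the weather data
stream = [
    (("sunny", "FALSE"), "no"),
    (("sunny", "TRUE"), "no"),
    (("overcast", "FALSE"), "yes"),
    (("rainy", "FALSE"), "yes"),
    (("rainy", "FALSE"), "yes"),
]
model = IncrementalNaiveBayes()
for instance, label in stream:
    model.update(instance, label)  # each instance can be discarded afterwards
print(model.predict(("rainy", "FALSE")))
```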


Conclusions

- Comparison to R: WEKA is weaker in classical statistics but stronger in machine learning (data mining) algorithms.
- WEKA has developed a set of extensions covering diverse areas, such as text mining, visualization and bioinformatics.
- WEKA 3.6 includes support for importing PMML (Predictive Modeling Markup Language) models. PMML is an XML-based standard for expressing statistical and data mining models.
- WEKA 3.6 can read and write data in the format used by the well-known LibSVM and SVM-Light support vector machine implementations.
- WEKA has 2 limitations:
  - Most of the algorithms require all the data to be stored in main memory, which restricts application to small or medium-sized datasets.
  - The Java implementation is somewhat slower than an equivalent in C/C++.
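For readers unfamiliar with the LibSVM data format mentioned above, each line holds a label followed by sparse `index:value` pairs with 1-based indices. The helpers below are a minimal sketch written for this example (the names `to_libsvm` and `from_libsvm` are hypothetical, and the sample values are made up); they are not part of WEKA or LibSVM.

```python
def to_libsvm(label, features):
    """Format one instance as '<label> <index>:<value> ...'.

    Indices are 1-based and zero-valued features are omitted (sparse format).
    The :g format keeps values compact (about 6 significant digits).
    """
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


def from_libsvm(line, num_features):
    """Parse one LibSVM line back into (label, dense feature list)."""
    label, *pairs = line.split()
    features = [0.0] * num_features
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index) - 1] = float(value)
    return label, features


line = to_libsvm("+1", [85, 0, 0.5])
print(line)
print(from_libsvm(line, 3))
```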
