
Introduction:

WEKA  Waikato Environment for Knowledge Analysis


The WEKA GUI Chooser provides a starting point for launching WEKA's main GUI
applications and supporting tools. The GUI Chooser consists of four buttons, one for each of the
four major WEKA applications, and four menus.
The buttons can be used to start the following applications:

1. Explorer: An environment for exploring data with WEKA.


2. Experimenter: An environment for performing experiments and conducting
statistical tests between learning schemes.
3. Knowledge Flow: It supports essentially the same functions as the Explorer, but with
a drag-and-drop interface. One advantage is that it supports incremental learning.
4. Simple CLI: Provides a simple command-line interface that allows direct execution of
WEKA commands, for operating systems that do not provide their own command-line interface.

The menu bar consists of four menus: Program, Tools, Visualization, and Help.

EXPLORER:
It is a user interface which contains a group of tabs just below the title bar. The tabs are as
follows:

1. Preprocess
2. Classify
3. Cluster
4. Associate
5. Select Attributes
6. Visualize
The bottom of the window contains the status box, the Log button, and the WEKA status bird.

1. PREPROCESSING:
LOADING DATA
The first four buttons at the top of the Preprocess section enable you to load
data into WEKA:

1. Open file: It shows a dialog box allowing you to browse for the data file on the local file
system.
2. Open URL: Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB: Reads data from a database.
4. Generate: It is used to generate artificial data from a variety of Data Generators.

Using the Open file button we can read files in a variety of formats, such as WEKA's ARFF
format and CSV format. Typically, ARFF files have the .arff extension and CSV files the .csv extension.
THE CURRENT RELATION
The Current relation box describes the currently loaded data, which is interpreted as a
single relational table in database terminology. It has three entries:

1. Relation: The name of the relation, as given in the file from which it was
loaded. Filters (described below) modify the name of a relation.
2. Instances: The number of instances (data points/records) in the data.
3. Attributes: The number of attributes (features) in the data.
ATTRIBUTES
It is located below the Current relation box and contains four buttons:

1. All: ticks all boxes.
2. None: clears all boxes.
3. Invert: makes ticked boxes unticked and vice versa.
4. Pattern: selects attributes matching a regular expression, e.g. a.* selects all attributes that begin with a.
SELECTED ATTRIBUTE:
It is located beside the Current relation box and contains the following:

1. Name: The name of the attribute, the same as in the attribute list.
2. Type: The type of the attribute, most commonly Nominal or Numeric.
3. Missing: The number of instances in the data for which this attribute is missing.
4. Distinct: The number of different values that the data contains for this attribute.
5. Unique: The number of instances in the data having a value for this attribute that no other instances have.

FILTERS
By clicking the Choose button at the left of the Filter box, it is possible to select one
of WEKA's filters. Once a filter has been selected, its name and options are shown in the
field next to the Choose button. Clicking on this field with the left mouse button brings up a
GenericObjectEditor dialog box, which is used to configure the filter.
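Filters can likewise be configured and applied programmatically. A minimal sketch using the Remove filter (the attribute index and file name are illustrative assumptions):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ApplyFilter {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet(); // hypothetical file
        Remove remove = new Remove();
        remove.setAttributeIndices("1"); // remove the first attribute (illustrative)
        remove.setInputFormat(data);     // must be called before the filter is used
        Instances newData = Filter.useFilter(data, remove);
        System.out.println(newData.numAttributes() + " attributes remain");
    }
}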

2. CLASSIFICATION

The Classify panel has a text box that gives the name of the currently selected classifier
and its options. Clicking it with the left mouse button brings up a GenericObjectEditor dialog
box, the same as for filters, which is used to configure the options of the current classifier.
TEST OPTIONS
The result of applying the chosen classifier will be tested according to the options
that are set by clicking in the Test options box. There are four test modes:
1. Use training set.
2. Supplied test set.
3. Cross-validation.
4. Percentage split.

Once the classifier, test options and class have all been set, the learning process is
started by clicking on the Start button. We can stop the training process at any time by
clicking on the Stop button.
The Classifier output area to the right of the display is filled with text describing the
results of training and testing.
After training several classifiers, the Result list will contain several entries, through
which we can browse the various results that have been generated. Pressing Delete removes
a selected entry from the results.
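The same train-and-evaluate cycle can be reproduced with the Java API. A minimal sketch, assuming the J48 classifier, 10-fold cross-validation, and a class attribute in the last position:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainAndEvaluate {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet(); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1); // assume the last attribute is the class
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1)); // 10-fold cross-validation
        System.out.println(eval.toSummaryString()); // text like that shown in Classifier output
    }
}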

3. CLUSTERING

By clicking the text box beside the Choose button in the Clusterer box, a dialog box is
shown that is used to choose a new clustering scheme.
The Cluster mode box is used to choose what to cluster and how to evaluate the
results. The first three options are the same as in classification: Use training set,
Supplied test set and Percentage split. The fourth option is Classes to clusters evaluation.
An additional option in the Cluster mode box, Store clusters for visualization,
determines whether or not it will be possible to visualize the clusters once training is
complete.
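Clustering can be driven from the Java API as well. A minimal sketch with SimpleKMeans (the cluster count and file name are illustrative; note that no class index is set, since clusterers work on unlabeled data):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Clustering {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet(); // hypothetical file
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3); // illustrative number of clusters
        km.setSeed(10);       // seed for the random initial assignments
        km.buildClusterer(data);
        System.out.println(km); // prints centroids and cluster sizes
    }
}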

Ignore Attributes: When clustering, some attributes in the data may need to be ignored.
The Ignore attributes button brings up a small window that allows you to select which
attributes are ignored.

4. ASSOCIATING

The Associate panel contains schemes for learning association rules, and the learners are
chosen and configured in the same way as the clusterers, filters, and classifiers in the other
panels.

5. SELECTING ATTRIBUTES
Attribute selection involves searching through all possible combinations of attributes in the
data to find which subset of attributes works best for prediction. To do this, two objects must be set
up: an attribute evaluator and a search method. The evaluator determines what method is used to
assign a worth to each subset of attributes. The search method determines what style of search is
performed.
The Attribute Selection Mode box has two options:

1. Use full training set: The worth of the attribute subset is determined using the full
set of training data.
2. Cross-validation: The worth of the attribute subset is determined by a process of
cross-validation. The Fold and Seed fields set the number of folds to use and the random seed used
when shuffling the data.
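Programmatically, the evaluator and search method map directly onto two objects in the Java API. A minimal sketch, assuming the commonly used CfsSubsetEval evaluator and BestFirst search:

import java.util.Arrays;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet(); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);
        AttributeSelection attsel = new AttributeSelection();
        attsel.setEvaluator(new CfsSubsetEval()); // assigns a worth to each attribute subset
        attsel.setSearch(new BestFirst());        // the style of search to perform
        attsel.SelectAttributes(data);            // note the capitalized method name in this API
        System.out.println("Selected indices: " + Arrays.toString(attsel.selectedAttributes()));
    }
}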
6. VISUALIZING

WEKA’s visualization section allows you to visualize 2D plots of the current relation.

EXPERIMENTER: The WEKA Experiment Environment enables the user to create, run, modify,
and analyse experiments in a more convenient manner. It can also be run from the command line
using the Simple CLI.
New Experiment:
After clicking New, default parameters for an experiment are defined.

We can set up the experiment in two different modes: 1) Simple and 2) Advanced.
Result Destination:
By default, an ARFF file is the destination for the results output, but we can also
choose a CSV file. The advantage of ARFF or CSV files is that
they can be created without any additional classes. The drawback is the lack of ability to
resume an interrupted experiment.
Experiment type:
The user can choose between the following three different types:

1. Cross-validation: The default type; it performs stratified cross-validation with
the given number of folds.
2. Train/Test Percentage Split (data randomized): Splits a dataset according to the given
percentage into a train and a test file, after the order of the data has been randomized and stratified.
3. Train/Test Percentage Split (order preserved): Since it is impossible to specify an explicit
train/test file pair, one can abuse this type to un-merge a previously merged train and test file
into the two original files.
Additionally, one can choose between Classification and Regression, depending on
the datasets and classifiers one uses.

Data Sets:
One can add dataset files either with an absolute path or with a relative path.

Iteration control:
1. Number of repetitions: In order to get statistically meaningful results, the default
number of iterations is 10.
2. Data sets first/Algorithms first: As soon as one has more than one dataset and algorithm,
it can be useful to switch from datasets being iterated over first to algorithms.
Algorithms: New algorithms can be added via the Add new... button.

When this dialog is opened for the first time, ZeroR is presented as the default.


By clicking on the Choose button one can choose another classifier, as shown in the
diagram below:

The Filter... button enables us to highlight classifiers that can handle certain
attribute and class types. With the Remove filter button one can clear the highlighting
applied earlier.

With the Load options... and Save options... buttons one can load and save the setup
of a selected classifier from and to XML.
Running an Experiment:
To run the current experiment, click the Run tab at the top of the Experiment Environment
window. The current experiment performs 10 runs of 10-fold stratified cross-validation.
After clicking the Run tab, a window with Start and Stop buttons is shown. Clicking the
Start button runs the experiment; clicking the Stop button halts it.

If the experiment was defined correctly, the 3 messages shown above will be displayed in
the Log panel.
Advanced
Defining an experiment: When the Experimenter is started in Advanced mode, the Setup tab
is displayed. Now click New to initialize an experiment.
To define the dataset to be processed by a scheme, first select Use relative paths in the
Datasets panel of the Setup tab and then click the Add new... button.

Saving Results of the experiment


To identify a dataset to which the results are to be sent, click on the
InstancesResultListener entry in the Destination panel, which opens a dialog box with a
field labeled "output file".

Now give the name of the output file and click the OK button. The dataset name is now
displayed in the Datasets panel of the Setup tab, as shown in the following figure:
Now we can run the experiment by clicking the Run tab at the top of the experiment environment
window. The current experiment performs 10 randomized train and test runs.

To change from random train and test experiments to cross-validation experiments, click on
the Result generator entry.

Using the Analyse tab in the Experiment Environment window, one can analyse the results
of experiments.

KNOWLEDGE FLOW:
The Knowledge Flow provides an alternative to the Explorer as a graphical front end to WEKA’s core
algorithms. It is represented as shown in the following figure. The Knowledge Flow presents a
dataflow inspired interface to WEKA.
The Knowledge Flow offers the following features:

1. Intuitive data flow style layout.


2. Process data in batches or incrementally.
3. Process multiple batches or streams in parallel (each separate flow executes in its own
thread).
4. Chain filters together.
5. View models produced by classifiers for each fold in a cross validation.
6. Visualize the performance of incremental classifiers during processing.
7. Plug-in facility for allowing easy addition of new components to the Knowledge Flow.
Components:

1. Data Sources: All WEKA loaders are available.


2. Data Sinks: All WEKA savers are available.
3. Filters: All WEKA’s filters are available.
4. Classifiers: All WEKA classifiers are available.
5. Clusterers: All WEKA clusterers are available.
6. Evaluation: It contains different kinds of techniques like TrainingSetMaker, TestSetMaker,
CrossValidationFoldMaker, TrainTestSplitMaker, ClassAssigner, ClassValuePicker,
ClassifierPerformanceEvaluator, IncrementalClassifierEvaluator,
ClustererPerformanceEvaluator, and PredictionAppender.
7. Visualization: It contains different components like DataVisualizer, ScatterPlotMatrix,
AttributeSummarizer, ModelPerformanceChart, TextViewer, GraphViewer, and
StripChart.
Plug-in Facility:
The Knowledge Flow offers the ability to easily add new components via a plug-in
mechanism.

SIMPLE CLI:
The Simple CLI provides full access to all Weka classes like classifiers, filters, clusterers, etc.,
but without the hassle of the CLASSPATH.
It offers a simple Weka shell with separated command line and output.

The simple command line interface is represented as shown in the following figure:

The following commands are available in the Simple CLI:

1. java <classname> [<args>]: Invokes a Java class with the given arguments (if any).
2. break: Stops the current thread, e.g. a running classifier, in a friendly manner.
3. kill: Stops the current thread in an unfriendly fashion.
4. cls: Clears the output area.
5. exit: Exits the Simple CLI.
6. help [<command>]: Provides information about the given command. It provides an
overview of all available commands if no command is specified as an argument.

The only way to invoke a WEKA class is to prefix the class with "java". This
command tells the Simple CLI to load the class and execute it with any given parameters.
For example: java weka.classifiers.trees.J48 -t c:/temp/iris.arff, which results in the
following output.
Using the Simple CLI we can also perform command redirection using the ">" operator. For
example: java weka.classifiers.trees.J48 -t test.arff > j48.txt

EX.no 1: Creation of a Data Warehouse


Aim:

This experiment illustrates some of the basic data pre-processing
operations that can be performed using the WEKA Explorer. The sample
dataset used for this example is the student data, available in ARFF format.

ALGORITHM:

Step 1: Loading the data. We can load the dataset into WEKA by
clicking on the Open file button in the Preprocess interface and selecting the
appropriate file.

Step 2: Once the data is loaded, WEKA will recognize the attributes,
and during the scan of the data it will compute some basic statistics
on each attribute. The left panel in the above figure shows the list of
recognized attributes, while the top panel indicates the names of the base
relation (or table) and the current working relation (which are the same
initially).
Step 3: Clicking on an attribute in the left panel will show its basic statistics.
For categorical attributes, the frequency of each attribute value is
shown, while for continuous attributes we can obtain the min, max, mean, standard
deviation, etc.

Step 4: The visualization in the right-hand panel shows a cross-tabulation
across two attributes.

Note: we can select another attribute using the drop-down list.

Step 5: Selecting or filtering attributes.

Removing an attribute: When we need to remove an attribute, we can do this by using
the attribute filters in WEKA. In the Filter panel, click on the Choose button. This will
show a popup window with a list of available filters.

Scroll down the list and select the weka.filters.unsupervised.attribute.Remove filter.

Step 6: a) Next, click the text box immediately to the right of the Choose
button. In the resulting dialog box, enter the index of the attribute to be filtered out.

b) Make sure that the invertSelection option is set to False, then click OK.
Now, in the filter box, you will see "Remove -R 7".

c) Click the Apply button to apply the filter to this data. This will remove
the attribute and create a new working relation.

d) Save the new working relation as an ARFF file by clicking the Save button
on the top panel (student.arff).

Discretization

1) Sometimes association rule mining can only be performed on categorical
data. This requires performing discretization on numeric or continuous attributes. In
the following example, let us discretize the age attribute.

Let us divide the values of the age attribute into three bins (intervals).

First, load the dataset into WEKA (student.arff).

Select the age attribute.

Activate the filter dialog box and select
weka.filters.unsupervised.attribute.Discretize from the list.

To change the defaults for the filter, click on the box immediately to the
right of the Choose button.

We enter the index of the attribute to be discretized. In this case the
attribute is age, so we must enter '1', corresponding to the age attribute.

Enter '3' as the number of bins. Leave the remaining field values as they are.

Click the OK button.

Click Apply in the filter panel. This will result in a new working relation
with the selected attribute partitioned into 3 bins.

Save the new working relation in a file called student-data-discretized.arff.
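For reference, the same discretization can be scripted through the Java API. A minimal sketch, assuming the attribute index and bin count from the steps above (note that Discretize applies only to numeric attributes, so this assumes a numeric version of age; in the listing below age is already nominal):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeAge {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet(); // hypothetical file
        Discretize disc = new Discretize();
        disc.setAttributeIndices("1"); // index 1: the age attribute, as above
        disc.setBins(3);               // three bins (intervals)
        disc.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, disc);
        System.out.println(discretized.attribute(0)); // inspect the binned attribute
    }
}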


Dataset student.arff

@relation student

@attribute age {<30,30-40,>40}

@attribute income {low, medium, high}

@attribute student {yes, no}

@attribute credit-rating {fair, excellent}

@attribute buyspc {yes, no}

@data

<30, high, no, fair, no

<30, high, no, excellent, no

30-40, high, no, fair, yes

>40, medium, no, fair, yes

>40, low, yes, fair, yes

>40, low, yes, excellent, no

30-40, low, yes, excellent, yes

<30, medium, no, fair, no

<30, low, yes, fair, no

>40, medium, yes, fair, yes

<30, medium, yes, excellent, yes

30-40, medium, no, excellent, yes

30-40, high, yes, fair, yes

>40, medium, no, excellent, no

%
The following screenshot shows the effect of discretization.
Ex NO:2 Apriori Algorithm.

Aim:

This experiment illustrates some of the basic elements of association rule
mining using WEKA. The sample dataset used for this example is test.arff.

ALGORITHM:

Step 1: Open the data file in the WEKA Explorer. It is presumed that the required
data fields have been discretized. In this example it is the age attribute.

Step 2: Clicking on the Associate tab will bring up the interface for the association
rule algorithms.

Step 3: We will use the Apriori algorithm. This is the default algorithm.

Step 4: In order to change the parameters for the run (e.g. support,
confidence, etc.), we click on the text box immediately to the right of the Choose button.
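A minimal Java sketch of the same Apriori run; the support and confidence values are illustrative assumptions, not the values used in the screenshot:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunApriori {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("test.arff").getDataSet();
        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.1); // minimum support (illustrative)
        apriori.setMinMetric(0.9);            // minimum confidence (illustrative)
        apriori.buildAssociations(data);
        System.out.println(apriori);          // prints the generated rules
    }
}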
Dataset test.arff

@relation test

@attribute admissionyear {2005,2006,2007,2008,2009,2010}

@attribute course {cse,mech,it,ece}

@data

2005, cse

2005, it

2005, cse

2006, mech

2006, it

2006, ece

2007, it

2007, cse

2008, it

2008, cse

2009, it

2009, ece

%
The following screenshot shows the association rules that were generated when
the Apriori algorithm is applied to the given dataset.
EX.NO:3 Bayesian Classification.

Aim:
This experiment illustrates the use of the J48 classifier in WEKA. The sample data
set used in this experiment is the "student" data, available in ARFF format. This document
assumes that appropriate data pre-processing has been performed.
ALGORITHM:
Step 1: We begin the experiment by loading the data (student.arff) into WEKA.

Step 2: Next we select the Classify tab and click the Choose button to select
the J48 classifier.

Step 3: Now we specify the various parameters. These can be specified by
clicking in the text box to the right of the Choose button. In this example, we accept the
default values. The default version does perform some pruning but does not perform
error pruning.

Step 4: Under the Test options in the main panel, we select 10-fold cross-validation
as our evaluation approach. Since we do not have a separate evaluation data
set, this is necessary to get a reasonable idea of the accuracy of the generated model.

Step 5: We now click Start to generate the model. The ASCII version of the
tree as well as the evaluation statistics will appear in the right panel when the model
construction is complete.

Step 6: Note that the classification accuracy of the model is about 69%. This
indicates that we may need more work (either in preprocessing or in selecting the
parameters for the classification).

Step 7: WEKA also lets us view a graphical version of the classification
tree. This can be done by right-clicking the last result set and selecting "Visualize
tree" from the pop-up menu.

Step 8: We will use our model to classify new instances.

Step 9: In the main panel, under Test options, click the "Supplied test set"
radio button and then click the "Set..." button. This will pop up a window which will
allow you to open the file containing the test instances.
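The experiment heading names Bayesian classification, while the steps above use J48; WEKA's corresponding Bayesian classifier is NaiveBayes (weka.classifiers.bayes.NaiveBayes), which can be substituted in Step 2. A minimal sketch:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BayesStudent {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // buyspc is the class attribute
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1)); // as in Step 4
        System.out.println(eval.toSummaryString());
    }
}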
Dataset student.arff

@relation student

@attribute age {<30,30-40,>40}

@attribute income {low, medium, high}

@attribute student {yes, no}

@attribute credit-rating {fair, excellent}

@attribute buyspc {yes, no}

@data

<30, high, no, fair, no

<30, high, no, excellent, no

30-40, high, no, fair, yes

>40, medium, no, fair, yes

>40, low, yes, fair, yes

>40, low, yes, excellent, no

30-40, low, yes, excellent, yes

<30, medium, no, fair, no

<30, low, yes, fair, no

>40, medium, yes, fair, yes

<30, medium, yes, excellent, yes

30-40, medium, no, excellent, yes

30-40, high, yes, fair, yes

>40, medium, no, excellent, no

%
The following screenshot shows the classification rules that were generated when the J48
algorithm is applied to the given dataset.
EX.NO:4 K-means clustering
Aim:

This experiment illustrates the use of simple k-means clustering with the WEKA
Explorer. The sample data set used for this example is based on the student data
available in ARFF format. This document assumes that appropriate preprocessing
has been performed. The student dataset includes 14 instances.
Algorithm:

Step 1: Run the WEKA Explorer and load the data file student.arff in the
Preprocess interface.

Step 2: In order to perform clustering, select the Cluster tab in the Explorer
and click on the Choose button. This step results in a drop-down list of available
clustering algorithms.

Step 3: In this case we select SimpleKMeans.

Step 4: Next, click on the text box to the right of the Choose button to get the pop-up
window shown in the screenshots. In this window we enter 6 as the number of
clusters and leave the value of the seed as it is. The seed value is used in
generating random numbers, which are used for making the initial assignment of
instances to clusters.

Step 5: Once the options have been specified, we run the clustering
algorithm. In the Cluster mode panel we must make sure that the "Use training set"
option is selected, and then we click the Start button. This process and the
resulting window are shown in the following screenshots.

Step 6: The result window shows the centroid of each cluster as well as
statistics on the number and percentage of instances assigned to the different clusters.
Each cluster centroid is the mean vector of the instances in that cluster, and can be
used to characterize the cluster.

Step 7: Another way of understanding the characteristics of each cluster is through
visualization. We can do this by right-clicking the result set in the Result list panel
and selecting "Visualize cluster assignments".

Interpretation of the above visualization

From the above visualization, we can understand the distribution of age and instance
number in each cluster. By changing the color dimension to other attributes, we can
see their distribution within each of the clusters.

Step 8: We can save the resulting dataset, which includes each instance
along with its assigned cluster. To do so, we click the Save button in the visualization
window and save the result as student-k-mean. The top portion of this file is shown in the
following figure.
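Step 8 (saving the dataset together with the assigned clusters) can be sketched with the Java API using the AddCluster filter; the output file name follows the one used above and is otherwise an assumption:

import java.io.File;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class SaveAssignments {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet();
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(6);              // as entered in Step 4
        AddCluster add = new AddCluster(); // appends a 'cluster' attribute to each instance
        add.setClusterer(km);
        add.setInputFormat(data);
        Instances withClusters = Filter.useFilter(data, add);
        ArffSaver saver = new ArffSaver(); // write the augmented dataset back to ARFF
        saver.setInstances(withClusters);
        saver.setFile(new File("student-k-mean.arff"));
        saver.writeBatch();
    }
}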
Dataset student.arff

@relation student

@attribute age {<30,30-40,>40}

@attribute income {low,medium,high}

@attribute student {yes,no}

@attribute credit-rating {fair,excellent}

@attribute buyspc {yes,no}

@data

<30, high, no, fair, no

<30, high, no, excellent, no

30-40, high, no, fair, yes

>40, medium, no, fair, yes

>40, low, yes, fair, yes

>40, low, yes, excellent, no

30-40, low, yes, excellent, yes

<30, medium, no, fair, no

<30, low, yes, fair, no

>40, medium, yes, fair, yes

<30, medium, yes, excellent, yes

30-40, medium, no, excellent, yes

30-40, high, yes, fair, yes

>40, medium, no, excellent, no

%
The following screenshot shows the clusters that were generated when the simple
k-means algorithm is applied to the given dataset.
EX.NO:5 Decision Tree.

Aim:
This experiment illustrates the use of the J48 classifier in WEKA. The sample data
set used in this experiment is the "student" data, available in ARFF format. This document
assumes that appropriate data pre-processing has been performed.
ALGORITHM:
Step 1: We begin the experiment by loading the data (student.arff) into WEKA.

Step 2: Next we select the Classify tab and click the Choose button to select
the J48 classifier.

Step 3: Now we specify the various parameters. These can be specified by
clicking in the text box to the right of the Choose button. In this example, we accept the
default values. The default version does perform some pruning but does not perform
error pruning.

Step 4: Under the Test options in the main panel, we select 10-fold cross-validation
as our evaluation approach. Since we do not have a separate evaluation data
set, this is necessary to get a reasonable idea of the accuracy of the generated model.

Step 5: We now click Start to generate the model. The ASCII version of the
tree as well as the evaluation statistics will appear in the right panel when the model
construction is complete.

Step 6: Note that the classification accuracy of the model is about 69%. This
indicates that we may need more work (either in preprocessing or in selecting the
parameters for the classification).

Step 7: WEKA also lets us view a graphical version of the classification
tree. This can be done by right-clicking the last result set and selecting "Visualize
tree" from the pop-up menu.

Step 8: We will use our model to classify new instances.

Step 9: In the main panel, under Test options, click the "Supplied test set"
radio button and then click the "Set..." button. This will pop up a window which will
allow you to open the file containing the test instances.
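Steps 8 and 9 (classifying new instances from a supplied test set) can be sketched with the Java API; the test file name is a hypothetical placeholder:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifyNewInstances {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("student.arff").getDataSet();
        Instances test = new DataSource("student-test.arff").getDataSet(); // hypothetical test file
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);
        J48 tree = new J48();
        tree.buildClassifier(train);
        for (int i = 0; i < test.numInstances(); i++) {
            double pred = tree.classifyInstance(test.instance(i));
            System.out.println("instance " + i + " -> " + test.classAttribute().value((int) pred));
        }
    }
}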
Dataset student.arff

@relation student

@attribute age {<30,30-40,>40}

@attribute income {low, medium, high}

@attribute student {yes, no}

@attribute credit-rating {fair, excellent}

@attribute buyspc {yes, no}

@data

<30, high, no, fair, no

<30, high, no, excellent, no

30-40, high, no, fair, yes

>40, medium, no, fair, yes

>40, low, yes, fair, yes

>40, low, yes, excellent, no

30-40, low, yes, excellent, yes

<30, medium, no, fair, no

<30, low, yes, fair, no

>40, medium, yes, fair, yes

<30, medium, yes, excellent, yes

30-40, medium, no, excellent, yes

30-40, high, yes, fair, yes

>40, medium, no, excellent, no

%
EX.NO: 6 FP-Growth Algorithm
Aim:

This experiment illustrates some of the basic elements of association rule
mining using WEKA. The sample dataset used for this example is test.arff.

ALGORITHM:

Step 1: Open the data file in the WEKA Explorer. It is presumed that the required
data fields have been discretized. In this example it is the age attribute.

Step 2: Clicking on the Associate tab will bring up the interface for the association
rule algorithms.

Step 3: We choose the FPGrowth algorithm via the Choose button (Apriori is the
default).

Step 4: In order to change the parameters for the run (e.g. support,
confidence, etc.), we click on the text box immediately to the right of the Choose button.
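A minimal Java sketch for FP-Growth. WEKA's FPGrowth implementation operates on binary, market-basket style data, so this assumes the dataset has already been converted accordingly; the file name is hypothetical:

import weka.associations.FPGrowth;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunFPGrowth {
    public static void main(String[] args) throws Exception {
        // FPGrowth expects binary attributes (e.g. after a NominalToBinary conversion)
        Instances data = new DataSource("test-binary.arff").getDataSet(); // hypothetical file
        FPGrowth fp = new FPGrowth();
        fp.buildAssociations(data);
        System.out.println(fp); // prints the generated rules
    }
}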
Ex.no:7 Support Vector Machines.

Aim:
This experiment illustrates the use of the J48 classifier in WEKA. The sample
data set used in this experiment is the "student" data, available in ARFF
format. This document assumes that appropriate data pre-processing
has been performed.
ALGORITHM:
Step 1: We begin the experiment by loading the data (student.arff) into
WEKA.
Step 2: Next we select the Classify tab and click the Choose button to
select the J48 classifier.
Step 3: Now we specify the various parameters. These can be specified
by clicking in the text box to the right of the Choose button. In this
example, we accept the default values. The default version does
perform some pruning but does not perform error pruning.
Step 4: Under the Test options in the main panel, we select 10-fold
cross-validation as our evaluation approach. Since we do not have a
separate evaluation data set, this is necessary to get a reasonable idea
of the accuracy of the generated model.
Step 5: We now click Start to generate the model. The ASCII version
of the tree as well as the evaluation statistics will appear in the right panel
when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This
indicates that we may need more work (either in preprocessing or in
selecting the parameters for the classification).
Step 7: WEKA also lets us view a graphical version of the
classification tree. This can be done by right-clicking the last result set
and selecting "Visualize tree" from the pop-up menu.
Step 8: We will use our model to classify new instances.
Step 9: In the main panel, under Test options, click the "Supplied test
set" radio button and then click the "Set..." button. This will pop up a
window which will allow you to open the file containing the test instances.
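The experiment heading names support vector machines; in WEKA the corresponding classifier is SMO (weka.classifiers.functions.SMO), which can be substituted for J48 in the steps above. A minimal sketch:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunSMO {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        SMO smo = new SMO(); // an SVM trained with sequential minimal optimization
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(smo, data, 10, new Random(1)); // 10-fold CV as in Step 4
        System.out.println(eval.toSummaryString());
    }
}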

Dataset student.arff


@relation student
@attribute age {<30,30-40,>40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}
@data
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%
Ex.no:8 One Hierarchical clustering algorithm

Aim:

This experiment illustrates the use of simple k-means clustering with the WEKA
Explorer. The sample data set used for this example is based on the student data
available in ARFF format. This document assumes that appropriate preprocessing
has been performed. The student dataset includes 14 instances.
Algorithm:

Step 1: Run the WEKA Explorer and load the data file student.arff in the
Preprocess interface.

Step 2: In order to perform clustering, select the Cluster tab in the Explorer
and click on the Choose button. This step results in a drop-down list of available
clustering algorithms.

Step 3: In this case we select SimpleKMeans.

Step 4: Next, click on the text box to the right of the Choose button to get the pop-up
window shown in the screenshots. In this window we enter 6 as the number of
clusters and leave the value of the seed as it is. The seed value is used in
generating random numbers, which are used for making the initial assignment of
instances to clusters.

Step 5: Once the options have been specified, we run the clustering
algorithm. In the Cluster mode panel we must make sure that the "Use training set"
option is selected, and then we click the Start button. This process and the
resulting window are shown in the following screenshots.

Step 6: The result window shows the centroid of each cluster as well as
statistics on the number and percentage of instances assigned to the different clusters.
Each cluster centroid is the mean vector of the instances in that cluster, and can be
used to characterize the cluster.

Step 7: Another way of understanding the characteristics of each cluster is through
visualization. We can do this by right-clicking the result set in the Result list panel
and selecting "Visualize cluster assignments".

Interpretation of the above visualization

From the above visualization, we can understand the distribution of age and instance
number in each cluster. By changing the color dimension to other attributes, we can
see their distribution within each of the clusters.

Step 8: We can save the resulting dataset, which includes each instance
along with its assigned cluster. To do so, we click the Save button in the visualization
window and save the result as student-k-mean. The top portion of this file is shown in the
following figure.
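The heading names hierarchical clustering, while the steps above use SimpleKMeans; WEKA also provides weka.clusterers.HierarchicalClusterer, which can be selected in the same way. A minimal sketch (the cluster count is illustrative):

import weka.clusterers.HierarchicalClusterer;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunHierarchical {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet();
        HierarchicalClusterer hc = new HierarchicalClusterer();
        hc.setNumClusters(3); // illustrative; agglomeration stops when 3 clusters remain
        hc.buildClusterer(data);
        for (int i = 0; i < data.numInstances(); i++) {
            System.out.println("instance " + i + " -> cluster "
                    + hc.clusterInstance(data.instance(i)));
        }
    }
}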
Dataset student.arff

@relation student

@attribute age {<30,30-40,>40}

@attribute income {low,medium,high}

@attribute student {yes,no}

@attribute credit-rating {fair,excellent}

@attribute buyspc {yes,no}

@data

<30, high, no, fair, no

<30, high, no, excellent, no

30-40, high, no, fair, yes

>40, medium, no, fair, yes

>40, low, yes, fair, yes

>40, low, yes, excellent, no

30-40, low, yes, excellent, yes

<30, medium, no, fair, no

<30, low, yes, fair, no

>40, medium, yes, fair, yes

<30, medium, yes, excellent, yes

30-40, medium, no, excellent, yes

30-40, high, yes, fair, yes

>40, medium, no, excellent, no

%
EX.NO:9 Applications of classification for web mining.

Aim:

This experiment illustrates some of the basic data pre-processing
operations that can be performed using the WEKA Explorer. The sample
dataset used for this example is the student data, available in ARFF format.

ALGORITHM:

Step 1: Loading the data. We can load the dataset into WEKA by
clicking on the Open file button in the Preprocess interface and selecting the
appropriate file.

Step 2: Once the data is loaded, WEKA will recognize the attributes,
and during the scan of the data it will compute some basic statistics
on each attribute. The left panel in the above figure shows the list of
recognized attributes, while the top panel indicates the names of the base
relation (or table) and the current working relation (which are the same
initially).

Step 3: Clicking on an attribute in the left panel will show its basic statistics.
For categorical attributes, the frequency of each attribute value is
shown, while for continuous attributes we can obtain the min, max, mean, standard
deviation, etc.

Step 4: The visualization in the right-hand panel shows a cross-tabulation
across two attributes.

Note: we can select another attribute using the drop-down list.

Step 5: Selecting or filtering attributes.

Removing an attribute: When we need to remove an attribute, we can do this by using
the attribute filters in WEKA. In the Filter panel, click on the Choose button. This will
show a popup window with a list of available filters.

Scroll down the list and select the weka.filters.unsupervised.attribute.Remove filter.

Step 6: a) Next, click the text box immediately to the right of the Choose
button. In the resulting dialog box, enter the index of the attribute to be filtered out.

b) Make sure that the invertSelection option is set to False, then click OK.
Now, in the filter box, you will see "Remove -R 7".

c) Click the Apply button to apply the filter to this data. This will remove
the attribute and create a new working relation.

d) Save the new working relation as an ARFF file by clicking the Save button
on the top panel (student.arff).

Discretization

1) Sometimes association rule mining can only be performed on categorical
data. This requires performing discretization on numeric or continuous attributes. In
the following example, let us discretize the age attribute.

Let us divide the values of the age attribute into three bins (intervals).

First, load the dataset into WEKA (student.arff).

Select the age attribute.

Activate the filter dialog box and select
weka.filters.unsupervised.attribute.Discretize from the list.

To change the defaults for the filter, click on the box immediately to the
right of the Choose button.

We enter the index of the attribute to be discretized. In this case the
attribute is age, so we must enter '1', corresponding to the age attribute.

Enter '3' as the number of bins. Leave the remaining field values as they are.

Click the OK button.

Click Apply in the filter panel. This will result in a new working relation
with the selected attribute partitioned into 3 bins.

Save the new working relation in a file called student-data-discretized.arff.

Dataset student.arff

@relation student

@attribute age {<30,30-40,>40}

@attribute income {low, medium, high}

@attribute student {yes, no}

@attribute credit-rating {fair, excellent}

@attribute buyspc {yes, no}

@data

<30, high, no, fair, no

<30, high, no, excellent, no

30-40, high, no, fair, yes

>40, medium, no, fair, yes

>40, low, yes, fair, yes


>40, low, yes, excellent, no

30-40, low, yes, excellent, yes

<30, medium, no, fair, no

<30, low, yes, fair, no

>40, medium, yes, fair, yes

<30, medium, yes, excellent, yes

30-40, medium, no, excellent, yes

30-40, high, yes, fair, yes

>40, medium, no, excellent, no

%
EX.NO: 10 Case Study on Text Mining or any commercial
application.

Aim:

This experiment illustrates some of the basic operations that can be
performed in a case study on text mining or any commercial application. The sample
dataset used for this example is the student data, available in ARFF format.
