
Machine Learning Lab Assignment Using Weka

NAME:

Submitted to: Dr. V. Anitha


Machine Learning Individual Assignment on Weka

QUESTION 1: - Create your own data set and find out how to convert it to an ARFF file

Step 1: Create a dataset using an Excel sheet.

Step 2: Before you apply an algorithm to your data, you need to convert your data from comma-separated (CSV) format into ARFF format (.arff). To save your data in comma-separated format, select the ‘Save As…’ menu item from Excel’s ‘File’ pull-down menu. In the ensuing dialog box, select ‘CSV (Comma Delimited)’ from the file type pop-up menu. Then enter a name for the file and click the ‘Save’ button. Ignore any messages that appear by clicking ‘OK’.


Step 3: Open the file with Notepad. You need to change the first line, which holds the attribute names, into the header structure that makes up the beginning of an ARFF file: add a @relation tag with the dataset’s name, an @attribute tag for each attribute, and a @data tag.

Step 4: Save the file with the .arff file extension.
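As an illustration, a minimal ARFF file produced by Steps 3 and 4 might look like the sketch below. The relation name, attribute names, and values here are hypothetical, chosen to resemble the weather-style attributes (Temperature, Condition) mentioned later in this assignment:

```
@relation weather

@attribute Temperature numeric
@attribute Condition {sunny, rainy, cloudy}
@attribute play {yes, no}

@data
30, sunny, yes
18, rainy, no
25, cloudy, yes
```

The @relation line names the dataset, each @attribute line declares one column and its type (numeric, or a nominal value set in braces), and the rows after @data are the former CSV rows.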


Step 5: Open the Weka Explorer, then click on the ‘Open file…’ button. This brings up a dialog box that allows you to browse for the data file on the local file system.

Step 6: Choose the “LidiyaAssign.arff” file from the Desktop and open it.


QUESTION 2:- Know about the metrics ROC, MCC, and Kappa Statistics

I. ROC (Receiver Operating Characteristic)


A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the
diagnostic ability of a binary classifier system as its discrimination threshold is varied. The
method was originally developed for operators of military radar receivers, which is why it is
so named.
The ROC curve is created by plotting the true positive rate (TPR) against the false positive
rate (FPR) at various threshold settings. The true-positive rate is also known
as sensitivity, recall or probability of detection in machine learning. The false-positive rate is
also known as probability of false alarm and can be calculated as (1 − specificity). The curve can also
be thought of as a plot of the statistical power as a function of the Type I error of the decision rule
(when the performance is calculated from just a sample of the population, these quantities are
estimates).
ROC analysis provides tools to select possibly optimal models and to discard suboptimal
ones independently from (and prior to specifying) the cost context or the class distribution.
ROC analysis is related in a direct and natural way to cost/benefit analysis of
diagnostic decision making.
The ROC curve was first developed by electrical engineers and radar engineers during World
War II for detecting enemy objects in battlefields and was soon introduced to psychology to
account for perceptual detection of stimuli. ROC analysis has since been used
in medicine, radiology, biometrics, forecasting of natural hazards, meteorology, model
performance assessment, and other areas for many decades, and is increasingly used
in machine learning and data mining research.
The ROC is also known as a relative operating characteristic curve, because it is a
comparison of two operating characteristics (TPR and FPR) as the criterion changes.
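As an illustration of how the curve is built, here is a minimal pure-Python sketch (not part of Weka) that sweeps the decision threshold over classifier scores and collects the (FPR, TPR) points; the labels and scores are made up:

```python
# Sketch: compute ROC points (FPR, TPR) by sweeping the decision threshold.
# 'labels' are the true classes (1 = positive), 'scores' are classifier scores.
def roc_points(labels, scores):
    p = sum(labels)                  # number of positives
    n = len(labels) - p              # number of negatives
    points = [(0.0, 0.0)]            # threshold above every score: nothing flagged
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / n, tp / p))
    return points

# A perfect ranking walks straight up to (0, 1) before moving right:
print(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.2]))
# [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```

A classifier that ranks all positives above all negatives traces the upper-left corner, while random scoring stays near the diagonal from (0, 0) to (1, 1).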


II. MCC (Matthews correlation coefficient)


 The Matthews correlation coefficient (MCC) or phi coefficient is used in machine
learning as a measure of the quality of binary (two-class) classifications, introduced
by biochemist Brian W. Matthews in 1975. 
 The MCC is defined identically to Pearson's phi coefficient, introduced by Karl
Pearson, and is also known as the Yule phi coefficient from its introduction by Udny Yule in
1912. Despite these antecedents, which predate Matthews's use by several decades, the
term MCC is widely used in the fields of bioinformatics and machine learning.
 The coefficient takes into account true and false positives and negatives and is
generally regarded as a balanced measure which can be used even if the classes are of
very different sizes.
 The MCC is in essence a correlation coefficient between the observed and predicted
binary classifications; it returns a value between −1 and +1. A coefficient of +1
represents a perfect prediction, 0 no better than random prediction and −1 indicates
total disagreement between prediction and observation.
 MCC is closely related to the chi-square statistic for a 2×2 contingency table

|MCC| = SQRT(χ²/n)
where χ² is the chi-square statistic of the 2×2 table and n is the total number of observations.
While there is no perfect way of describing the confusion matrix of true and false positives
and negatives by a single number, the Matthews correlation coefficient is generally regarded
as being one of the best such measures. Other measures, such as the proportion of correct
predictions (also termed accuracy), are not useful when the two classes are of very different
sizes. For example, assigning every object to the larger set achieves a high proportion of
correct predictions, but is not generally a useful classification.
The MCC can be calculated directly from the confusion matrix using the formula:
MCC = (TP*TN - FP*FN) / SQRT((TP+FP)(TP+FN)(TN+FP)(TN+FN))

Example
Given a sample of 13 pictures, 8 of cats and 5 of dogs, where cats belong to class 1 and dogs
belong to class 0,

actual = [ 1,1,1,1,1,1,1,1,0,0,0,0,0],
assume that a classifier that distinguishes between cats and dogs is trained, and we take the
13 pictures and run them through the classifier, and the classifier makes 8 accurate
predictions and misses 5: 3 cats wrongly predicted as dogs (first 3 predictions) and 2 dogs
wrongly predicted as cats (last 2 predictions).
prediction = [0,0,0,1,1,1,1,1,0,0,0,1,1]
With these two labelled sets (actual and predictions) we can create a confusion matrix that
will summarize the results of testing the classifier:
                 Actual class
                 Cat   Dog
Predicted  Cat    5     2
class      Dog    3     3

In this confusion matrix, of the 8 cat pictures, the system judged that 3 were dogs, and of the
5 dog pictures, it predicted that 2 were cats. All correct predictions are located on the diagonal
of the table, so it is easy to visually inspect the table for prediction
errors, as they will be represented by values outside the diagonal.
In abstract terms, the confusion matrix is as follows:

                 Actual class
                  P    N
Predicted  P     TP   FP
class      N     FN   TN

where P = Positive; N = Negative; TP = True Positive; FP = False Positive; TN = True Negative; FN = False Negative.
Plugging in the numbers from the confusion matrix:
MCC = [(5*3) - (2*3)] / SQRT[(5+2)*(5+3)*(3+2)*(3+3)] = 9 / SQRT[1680] ≈ 0.219
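As a check on the arithmetic above, here is a short pure-Python sketch (not Weka's code) that recomputes the MCC from the actual and prediction lists of the cats-and-dogs example:

```python
import math

# The labelled sets from the cats-and-dogs example (1 = cat, 0 = dog)
actual     = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
prediction = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

# Count the four cells of the confusion matrix
tp = sum(1 for a, p in zip(actual, prediction) if a == 1 and p == 1)  # 5
tn = sum(1 for a, p in zip(actual, prediction) if a == 0 and p == 0)  # 3
fp = sum(1 for a, p in zip(actual, prediction) if a == 0 and p == 1)  # 2
fn = sum(1 for a, p in zip(actual, prediction) if a == 1 and p == 0)  # 3

mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(round(mcc, 3))  # 0.22
```

The result agrees with the hand calculation, 9/SQRT(1680) ≈ 0.219.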

III. Kappa Statistics


The Kappa statistic (or value) is a metric that compares an Observed Accuracy with
an Expected Accuracy (random chance). It is used not only to evaluate a single classifier, but
also to evaluate classifiers amongst themselves. In addition, it takes into account random
chance (agreement with a random classifier), which generally means it is less misleading than
simply using accuracy as a metric (an Observed Accuracy of 80% is a lot less impressive
with an Expected Accuracy of 75% versus an Expected Accuracy of 50%). Computation
of Observed Accuracy and Expected Accuracy is integral to comprehension of the kappa
statistic, and is most easily illustrated through use of a confusion matrix.
Example 1: -Let’s begin with a simple confusion matrix from a simple binary classification
of Cats and Dogs:

        Cats  Dogs
Cats     10     7
Dogs      5     8

Assume that a model was built using supervised machine learning on labelled data. This
doesn't always have to be the case; the kappa statistic is often used as a measure of
reliability between two human raters. Regardless, columns correspond to one "rater" while
rows correspond to another "rater". In supervised machine learning, one "rater" reflects
ground truth (the actual values of each instance to be classified), obtained from labeled
data, and the other "rater" is the machine learning


classifier used to perform the classification. Ultimately it doesn't matter which is which to


compute the kappa statistic, but for clarity's sake let’s say that the columns reflect ground
truth and the rows reflect the machine learning classifier classifications.
From the confusion matrix we can see there are 30 instances total (10 + 7 + 5 + 8 = 30).
 According to the first column 15 were labelled as Cats (10 + 5 = 15), and
 According to the second column 15 were labelled as Dogs (7 + 8 = 15).
 We can also see that the model classified 17 instances as Cats (10 + 7 = 17)
and 13 instances as Dogs (5 + 8 = 13).
Observed Accuracy
 It is simply the proportion of instances that were classified correctly throughout the
entire confusion matrix, i.e. the instances that were labelled
as Cats via ground truth and then classified as Cats by the machine learning classifier,
or labelled as Dogs via ground truth and then classified as Dogs by the machine
learning classifier
 To calculate Observed Accuracy, we simply add the number of instances that
the machine learning classifier agreed with the ground truth label, and divide by the
total number of instances.
Observed Accuracy= (10 + 8) / 30 = 0.60
Expected Accuracy
 This value is defined as the accuracy that any random classifier would be expected to
achieve based on the confusion matrix.
 It is directly related to the number of instances of each class (Cats and Dogs), along
with the number of instances that the machine learning classifier agreed with
the ground truth label.
 To calculate Expected Accuracy for our confusion matrix, first multiply the marginal
frequency of Cats for one "rater" by the marginal frequency of Cats for the second
"rater", and divide by the total number of instances.
 The marginal frequency for a certain class by a certain "rater" is just the sum
of all instances the "rater" indicated were that class.
For the first class:
Instances labelled as Cats according to ground truth = 15 (10 + 5), and
Instances classified as Cats by the machine learning classifier = 17 (10 + 7).
The result will be (15 * 17) / 30 = 8.5
For the second class:
Instances labelled as Dogs according to ground truth = 15 (7 + 8), and
Instances classified as Dogs by the machine learning classifier = 13 (8 + 5).
This result will be (15 * 13) / 30 = 6.5
 The final step is to add all these values together, and finally divide again by the total
number of instances
Expected Accuracy = (8.5 + 6.5) / 30 = 0.50.
Kappa = (Observed Accuracy - Expected Accuracy) / (1 - Expected Accuracy)
Kappa = (0.60 - 0.50) / (1 - 0.50) = 0.20
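The computation above can be sketched in a few lines of Python (illustrative only; Weka reports the kappa statistic automatically in its classifier output):

```python
# Sketch: kappa for the 2x2 confusion matrix of Example 1
# (rows = classifier, columns = ground truth).
matrix = [[10, 7],
          [5,  8]]

total    = sum(sum(row) for row in matrix)              # 30 instances
observed = sum(matrix[i][i] for i in range(2)) / total  # 18/30 = 0.60

# Marginal frequencies: column sums (ground truth) and row sums (classifier)
col = [matrix[0][c] + matrix[1][c] for c in range(2)]   # [15, 15]
row = [sum(r) for r in matrix]                          # [17, 13]
expected = sum(col[k] * row[k] / total for k in range(2)) / total  # (8.5+6.5)/30 = 0.50

kappa = (observed - expected) / (1 - expected)
print(round(kappa, 2))  # 0.2
```

Swapping in a different matrix reproduces any of the other examples in this section.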


Example 2: - Here is a less balanced confusion matrix and the corresponding calculations:

        Cats  Dogs
Cats     22     9
Dogs      7    13
Ground truth: Cats (29), Dogs (22)
Machine Learning Classifier: Cats (31), Dogs (20)
Total: (51)
Observed Accuracy: ((22 + 13) / 51) = 0.69
Expected Accuracy: ((29 * 31 / 51) + (22 * 20 / 51)) / 51 = 0.51
Kappa: (0.69 - 0.51) / (1 - 0.51) ≈ 0.37 (≈ 0.35 without intermediate rounding)

QUESTION 3:- Try to perform hierarchical clustering on your own data set
Clustering is a data mining (machine learning) technique that finds similarities between data
according to the characteristics found in the data and groups similar data objects into one cluster.

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that


groups similar objects into groups called clusters. The endpoint is a set of clusters, where
each cluster is distinct from each other cluster, and the objects within each cluster are broadly
similar to each other.
A dendrogram is a tree-like diagram that records the sequences of merges or splits. It is used
to visualize hierarchical clustering.
The hierarchical clustering technique is divided into two types:
1. Agglomerative
2. Divisive
To perform Hierarchical clustering:
Step 1: Create a dataset

Step 2: Open the Weka Explorer and click on ‘Open file…’


Step 3: Open the file named MyCluster.arff, then click on the Cluster tab and choose
HierarchicalClusterer via the Choose button

Here you can set how many clusters you want, the link type, and other options by clicking on the
name of the clusterer

Step 4: Click start to begin clustering
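Behind Step 4, an agglomerative clusterer repeatedly merges the two closest clusters until the requested number remains. The following is a minimal pure-Python sketch of single-linkage agglomerative clustering on a few made-up 2-D points (an illustration of the idea, not Weka's implementation):

```python
import math

# Sketch: agglomerative (bottom-up) hierarchical clustering with single
# linkage, stopping when k clusters remain. Points are hypothetical.
def single_linkage(points, k):
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []                                   # the dendrogram's merge order

    def dist(a, b):  # single linkage: distance of the closest pair between clusters
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i][:], clusters[j][:]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = single_linkage(points, k=3)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

The recorded merge order is exactly what a dendrogram visualizes: the first merges happen between the closest points, and later merges join whole groups.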


QUESTION 4: - Explore more about the Select Attributes tab in Weka

Select Attributes Tab


Select Attributes allows you to select features based on multiple algorithms such as
ClassifierSubsetEval, Principal Components, etc.
Many feature selection techniques are supported in Weka. A good place to get started
exploring feature selection in Weka is in the Weka Explorer.
1. Open the Weka GUI Chooser.
2. Click the “Explorer” button to launch the Explorer.
3. Open the dataset you want (for example, the dataset LidiyaAssign.arff).
4. Click the “Select attributes” tab to access the feature selection methods.

Feature selection is divided into two parts:


 Attribute Evaluator


 Search Method.
Each section has multiple techniques from which to choose.
The attribute evaluator is the technique by which each attribute in your dataset (also called a
column or feature) is evaluated in the context of the output variable (e.g. the class). The
search method is the technique by which different combinations of attributes in the dataset
are tried or navigated in order to arrive at a short list of chosen features.
Some Attribute Evaluator techniques require the use of specific Search Methods. For
example, the CorrelationAttributeEval technique can only be used with a Ranker Search
Method, which evaluates each attribute and lists the results in rank order. When selecting
different Attribute Evaluators, the interface may ask you to change the Search Method to
something compatible with the chosen technique.

Both the Attribute Evaluator and Search Method techniques can be configured. Once chosen,
click on the name of the technique to get access to its configuration details.

Click the “More” button to get more documentation on the feature selection technique and its
configuration parameters. Hover your mouse cursor over a configuration parameter to get a
tooltip containing more details.


How to use some popular methods on our chosen standard dataset


1. Correlation Based Feature Selection
A popular technique for selecting the most relevant attributes in your dataset is to use
correlation. It is more formally referred to as Pearson’s correlation coefficient in statistics.
You can calculate the correlation between each attribute and the output variable and select
only those attributes that have a moderate-to-high positive or negative correlation (close to -1
or 1) and drop those attributes with a low correlation (value close to zero).
Weka supports correlation based feature selection with the CorrelationAttributeEval
technique, which requires use of a Ranker search method.
Running this on the LidiyaAssign dataset suggests that one attribute (Temperature) has the highest
correlation with the output class. It also suggests an attribute with some modest
correlation (Condition).
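The per-attribute computation behind CorrelationAttributeEval is just Pearson's correlation coefficient. Here is a small pure-Python sketch with made-up Temperature values and 0/1 class labels (illustrative values, not the actual assignment data):

```python
# Sketch: Pearson's correlation between a numeric attribute and a 0/1 class,
# evaluated per attribute the way CorrelationAttributeEval does.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

temperature = [30, 27, 18, 15, 25, 12]  # hypothetical attribute values
play        = [1, 1, 0, 0, 1, 0]        # hypothetical class labels
print(round(pearson(temperature, play), 2))  # 0.94
```

A value near +1 or −1 marks a strongly correlated attribute worth keeping; values near zero suggest the attribute can be dropped.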

2. Information Gain Based Feature Selection


Another popular feature selection technique is to calculate the information gain. You can
calculate the information gain (a reduction in entropy) for each attribute with respect to the
output variable. Values vary from 0 (no information) to 1 (maximum information, for a binary
class). Those attributes that contribute more information will have a higher information gain
value and can be selected, whereas those that do not add much information will have a lower
score and can be removed.
Weka supports feature selection via information gain using the InfoGainAttributeEval
Attribute Evaluator. Like the correlation technique above, the Ranker Search Method must be
used.
Running this technique on the LidiyaAssign dataset, we can see that one attribute
(Temperature) contributes more information than all of the others.
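The quantity InfoGainAttributeEval ranks by can be sketched in pure Python: the entropy of the class, minus the weighted entropy of the class within each attribute value (the attribute and class values below are made up for illustration):

```python
import math
from collections import Counter

# Entropy of a list of class labels, in bits
def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# Information gain of a nominal attribute with respect to the class
def info_gain(attribute, labels):
    n = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

condition = ['sunny', 'sunny', 'rainy', 'rainy']  # hypothetical attribute
play      = ['yes',   'yes',   'no',    'no']     # hypothetical class
print(info_gain(condition, play))  # 1.0 (the attribute fully determines the class)
```

An attribute whose values split the classes perfectly scores the full class entropy; an attribute independent of the class scores 0.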


3. Learner Based Feature Selection


A popular feature selection technique is to use a generic but powerful learning algorithm and
evaluate the performance of the algorithm on the dataset with different subsets of attributes
selected. The subset that results in the best performance is taken as the selected subset. The
algorithm used to evaluate the subsets does not have to be the algorithm that you intend to
use to model your problem, but it should be generally quick to train and powerful, like a
decision tree method.
In Weka this type of feature selection is supported by the WrapperSubsetEval technique and
must use a GreedyStepwise or BestFirst Search Method. The latter, BestFirst, is preferred if
you can spare the compute time.
1. First select the “WrapperSubsetEval” technique.
2. Click on the name “WrapperSubsetEval” to open the configuration for the method.
3. Click the “Choose” button for the “classifier” and change it to J48 under “trees”.
4. Click “OK” to accept the configuration.
5. Change the “Search Method” to “BestFirst”.
6. Click the “Start” button to evaluate the features.


Running this feature selection technique on the LidiyaAssign dataset selects 1 of the 3 input
variables: Temperature.
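The wrapper idea can be sketched outside Weka as well. In the toy example below, every attribute subset is scored by leave-one-out 1-nearest-neighbour accuracy, with the simple learner standing in for the J48 tree Weka would use (the data, learner, and exhaustive search are all illustrative, not Weka's actual procedure):

```python
from itertools import combinations

# Leave-one-out accuracy of a 1-nearest-neighbour classifier
# restricted to the feature indices in 'subset'.
def loo_accuracy(X, y, subset):
    correct = 0
    for i in range(len(X)):
        # nearest other row, using only the chosen features
        j = min((j for j in range(len(X)) if j != i),
                key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in subset))
        correct += y[j] == y[i]
    return correct / len(X)

# Wrapper selection: score every subset with the learner, keep the best.
def wrapper_select(X, y):
    n_feats = len(X[0])
    all_subsets = [s for r in range(1, n_feats + 1)
                   for s in combinations(range(n_feats), r)]
    return max(all_subsets, key=lambda s: loo_accuracy(X, y, s))

# Feature 0 separates the classes; feature 1 is noise.
X = [(0.0, 5.0), (0.1, 1.0), (1.0, 5.0), (0.9, 1.0)]
y = [0, 0, 1, 1]
print(wrapper_select(X, y))  # (0,)
```

The wrapper correctly keeps only the informative feature, mirroring how WrapperSubsetEval settled on Temperature above; real wrappers use a guided search such as BestFirst instead of enumerating every subset.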

