
BAHIR DAR UNIVERSITY

BAHIR DAR INSTITUTE OF TECHNOLOGY

FACULTY OF ELECTRICAL AND COMPUTER ENGINEERING

POSTGRADUATE PROGRAM IN COMPUTER ENGINEERING

MACHINE LEARNING

PROJECT REPORT: CLASSIFICATION.

Name: ABRHAM ADUGNA
ID: BDU1402379

Submitted to: Beakal G.(Ph.D)


Submission date: 11/06/2014 E.C
Write a report of at least two pages, based on the given breast cancer data and voting data,
containing the steps you took to solve the problem for each task and a graph of the accuracy,
precision, recall, F-measure, and confusion matrix.
Precision: a metric that quantifies how many of the positive predictions are correct; that is,
out of all instances predicted positive, what fraction is truly positive.

Precision describes how precise the model is on the instances it predicted positive. It is
calculated as the number of correctly predicted positive instances divided by the total number of
instances that were predicted positive.

Precision = true positive / total predicted positives

= true positive / (true positive + false positive)

Recall: a metric that quantifies the number of correct positive predictions made out of all
actual positive instances. It measures how many of the actual positives the model captures by
labeling them as positive. It can be calculated as:

Recall = true positive / total actual positives

= true positive / (true positive + false negative)

Note: maximizing precision will minimize the number of false positives.

Maximizing recall will minimize the number of false negatives.

Therefore, precision calculates the accuracy for the minority class.

F-score / F-measure: the harmonic mean of precision and recall. It can be a better measure to use
when we need to balance precision against recall, and it is calculated as:

F1 score = 2 * (precision * recall) / (precision + recall)
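
The three definitions above can be sketched in a few lines of Python. This is only an illustration of the formulas: the function name precision_recall_f1 and the example counts are assumptions made here and are not taken from the report's datasets.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    A zero denominator yields 0.0 instead of raising an error.
    """
    predicted_pos = tp + fp          # everything the model labeled positive
    actual_pos = tp + fn             # everything that really is positive
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only (hypothetical, not from the report's data).
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f)
```

Precision uses the false positives in its denominator and recall uses the false negatives, which is the only difference between the two ratios.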

Tasks:

1. Estimate the accuracy of the Naive Bayes classifier, Decision tree, and SVM on the breast
cancer data set using 5-fold cross-validation. The breast cancer dataset has numeric values.
You can use Weka's filter to discretize the data.
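
The idea behind 5-fold cross-validation can be sketched as follows: the data is split into five parts, and each part is used once for testing while the other four are used for training. This is a minimal sketch in plain Python under stated assumptions; the function name k_fold_indices, the seed, and the sample count of 23 are hypothetical. The tool's own procedure may differ in detail, for example in how it shuffles or stratifies the folds.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Hypothetical example: 23 samples split into 5 folds.
folds = list(k_fold_indices(n_samples=23, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Every sample appears in exactly one test fold, so the metrics are computed over the whole dataset once all five rounds are combined.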
a) Naive Bayes classifier

Step-1: First, convert breast_cancer_data.txt to the Weka file format
(breast_cancer_data.arff).

Step-2: Then import breast_cancer_data.arff into the Weka working area.

Step-3: Third, set the attribute filter to Numeric to Nominal.

Step-4: The performance metrics for the given dataset are:

Note: the confusion matrix at the lower-left corner of the Weka output gives the counts used to
compute the metrics:

Precision = 443/448 = 0.9888 ≈ 0.989

Recall = 443/458 = 0.967

F-measure = 2 * 0.989 * 0.967 / (0.989 + 0.967) = 0.9778 ≈ 0.978

b) The Decision tree

Step-1: Change the classifier from Naive Bayes to Decision tree in the Classify tab.

Note: the 2x2 confusion matrix at the lower-left corner of the Weka output gives the counts used
to compute the metrics:

Precision = 432/455 = 0.949

Recall = 432/458 = 0.943

F-measure = 2*0.949*0.943 / (0.949+0.943) = 0.946

Step-2: the decision tree graph is as shown.

c) Support Vector Machine (SVM)

Step-1: Change the classifier from Decision tree to SVM in the Classify tab.

Note: the 2x2 confusion matrix at the lower-left corner of the Weka output gives the counts used
to compute the metrics:

Precision = 442/454 = 0.974

Recall = 442/458 = 0.965

F-measure = 2 * 0.974 * 0.965 / (0.974 + 0.965) = 0.969
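
The three breast cancer results can be cross-checked side by side with a short script. The counts are the numerators and denominators of the ratios reported above. The names results and summarize are assumptions introduced for this sketch only.

```python
# (true positives, total predicted positives, total actual positives),
# read off the precision and recall ratios reported for the breast cancer data.
results = {
    "Naive Bayes": (443, 448, 458),
    "Decision tree": (432, 455, 458),
    "SVM": (442, 454, 458),
}


def summarize(results):
    """Return {classifier: (precision, recall, F-measure)} rounded to 3 decimals."""
    rows = {}
    for name, (tp, predicted_pos, actual_pos) in results.items():
        precision = tp / predicted_pos
        recall = tp / actual_pos
        f1 = 2 * precision * recall / (precision + recall)
        rows[name] = (round(precision, 3), round(recall, 3), round(f1, 3))
    return rows


summary = summarize(results)
for name, (p, r, f) in summary.items():
    print(f"{name:<14} precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```

On these counts Naive Bayes has the highest F-measure of the three, followed by SVM and then the decision tree.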

2. Estimate the accuracy of the Naive Bayes, Decision tree, and SVM using 5-fold cross-
validation on the voting data.
a. Naive Bayes
Step-1: Convert vote_data.txt to the Weka file format (vote_data.arff).
Step-2: Then import vote_data.arff into the Weka working area.


Step-3: The performance metrics for the given dataset are as follows.

Note: the 2x2 confusion matrix at the lower-left corner of the Weka output gives the counts used
to compute the metrics:
Precision = 154/183 = 0.842
Recall = 154 / 168 = 0.917
F-measure = 2 * 0.842 * 0.917 / (0.842 + 0.917) = 0.8775

b. The Decision tree

Step-1: Change the classifier from Naive Bayes to Decision tree in the Classify tab.

Note: the 2x2 confusion matrix at the lower-left corner of the Weka output gives the counts used
to compute the metrics:
Precision = 162 / 171 = 0.947
Recall = 162 / 168 = 0.964
F-measure = 2 * 0.947 * 0.964 / (0.947 + 0.964) = 0.956

Step-2: the decision tree graph is as shown.

c. Support Vector Machine (SVM)

Step-1: Change the classifier from Decision tree to SVM in the Classify tab.

Note: the 2x2 confusion matrix at the lower-left corner of the Weka output gives the counts used
to compute the metrics:

Precision = 162/172 = 0.942

Recall = 162 / 168 = 0.964

F-measure = 2 * 0.942 * 0.964 / (0.942 + 0.964) = 0.953
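
The F-measure can also be computed directly from the confusion-matrix counts, using the equivalent form F = 2TP / (2TP + FP + FN). This avoids rounding precision and recall first, so the last digit may differ slightly from a calculation based on the rounded values. The sketch below applies it to the voting results, using the counts from the ratios above. The names votes and f_measure_from_counts are assumptions made for illustration.

```python
# (true positives, total predicted positives, total actual positives),
# read off the precision and recall ratios reported for the voting data.
votes = {
    "Naive Bayes": (154, 183, 168),
    "Decision tree": (162, 171, 168),
    "SVM": (162, 172, 168),
}


def f_measure_from_counts(tp, predicted_pos, actual_pos):
    """F-measure from raw counts, with no intermediate rounding."""
    fp = predicted_pos - tp   # predicted positive but actually negative
    fn = actual_pos - tp      # actually positive but predicted negative
    return 2 * tp / (2 * tp + fp + fn)


f_scores = {name: f_measure_from_counts(*counts) for name, counts in votes.items()}
best = max(f_scores, key=f_scores.get)

for name, score in f_scores.items():
    print(f"{name:<14} F-measure={score:.4f}")
print("Highest F-measure:", best)
```

On these counts the decision tree has the highest F-measure on the voting data, with SVM close behind.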
