
HW #3

Kyohyeon KOO - G01229056

PART 1

1. Dataset ‘a’ has three attributes: x, y, and class. The x and y attributes are numeric, while class is a nominal attribute. Dataset ‘a’ has 4063 instances.

Dataset ‘b’ is almost the same as ‘a’: it also has three attributes, with x and y numeric and class nominal. The number of instances is different, however; ‘b’ has 7981 instances.

Dataset ‘c’ has 100 attributes, named a0001, a0002, …, a0099, plus the class attribute, which is the label we want to predict. The first four attributes (a0001, a0002, a0003, and a0004) are nominal; the rest are numeric.
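
As a rough sketch of how these attribute summaries could be reproduced outside the Weka GUI, the snippet below loads an ARFF file with scipy and prints the number of instances and each attribute's name and type. The file name 'a.arff' is only an assumed placeholder for the actual dataset file.

```python
# Sketch only: listing attributes, types, and instance count of an ARFF file.
# 'a.arff' is an assumed placeholder name for the dataset described above.
from scipy.io import arff

data, meta = arff.loadarff("a.arff")

print("instances:", len(data))
for name, kind in zip(meta.names(), meta.types()):
    # 'kind' is reported as 'numeric' or 'nominal' for each attribute
    print(f"{name}: {kind}")
```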

2. Dataset ‘a’ has two class labels. The class distribution is not balanced: Label 1 has a weight of 4023, which is about 99%, while Label 2 has a weight of 40, which is only about 1%.

Dataset ‘b’ also has two class labels, but the ratio between Label 1 and Label 2 is different from dataset ‘a’: the class distribution is almost balanced. Label 1 has a weight of 3981, a little less than 50%, while Label 2 has a weight of 4000, a little more than Label 1 and also close to 50%.

Dataset ‘c’ has the same number of class labels. The class distribution between Label 1 and Label 2 is exactly the same; each accounts for 50%.
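
These class distributions can also be checked with a short script. This is only a sketch: it assumes the class attribute is literally named 'class', as in the report, and that the file is available as 'a.arff'.

```python
# Sketch: counting the class distribution of one dataset.
# Assumes the class attribute is named 'class' and the file is 'a.arff'.
from collections import Counter

from scipy.io import arff

data, _ = arff.loadarff("a.arff")
counts = Counter(data["class"])        # nominal values are returned as bytes

total = sum(counts.values())
for label, weight in sorted(counts.items()):
    print(label, weight, f"({100.0 * weight / total:.1f}%)")
```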

3. There are no notable qualitative characteristics in datasets ‘a’ and ‘b’. In dataset ‘c’, however, each of the attributes a0001, a0002, a0003, and a0004 has two labels, and both labels have a weight of 500, which is exactly equal. These can be seen as the qualitative attributes of the dataset.
PART 2

[DecisionStump]

[J48 (pruned)]

[J48 (unpruned)]

                 Accuracy (%)   RMSE
DecisionStump    52.0           0.5004
J48 (pruned)     91.0           0.2932
J48 (unpruned)   69.6           0.5417

When running DecisionStump, only attribute a0079 was used to make the decision. Because a0079 is the root of the tree, the classification is decided entirely by this root: if a0079 has a value greater than 0.875, the instance is assigned to the first class; if the value is less than or equal to 0.875, it is classified into the second class. When running J48 (pruned), two attributes, a0001 and a0002, make the decision, with two cases: a0001 = 0 and a0001 = 1. When running J48 (unpruned), however, many more attributes are used than in DecisionStump or J48 (pruned). Comparing the pruned and unpruned trees, the pruned one has better performance because it has a smaller tree size and fewer leaves. The unpruned tree is larger and has more leaves, which can lead the model to overfitting and introduce errors.
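
The pruned-versus-unpruned comparison can also be sketched outside Weka. The snippet below is a scikit-learn stand-in rather than the actual DecisionStump/J48 run: a depth-1 tree plays the role of the decision stump, cost-complexity pruning (ccp_alpha) stands in for J48's pruning, and a synthetic X, y pair replaces dataset ‘c’.

```python
# Rough scikit-learn stand-in for the comparison above (a sketch, not the
# actual Weka DecisionStump/J48 run). X and y are synthetic placeholders for
# the features and class labels of dataset 'c'.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "stump (max_depth=1)": DecisionTreeClassifier(max_depth=1, random_state=0),
    "pruned tree (ccp_alpha)": DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),
    "unpruned tree": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    # 10-fold cross-validated accuracy, comparable in spirit to the table above
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

With real data the accuracy numbers will of course differ from the table above; the point is only that tree size and pruning are the knobs being compared.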

Dataset ‘a’

[J48 (pruned)]

[NaiveBayes]

[k-NN (k=1)]

[k-NN (k=21)]
Dataset ‘b’

[J48 (pruned)]

[NaiveBayes]

[k-NN (k=1)]

[k-NN (k=21)]

By compiling these data, the following table can be made.

               F-Measure
               Dataset ‘a’   Dataset ‘b’
J48 (pruned)   0.994         1.000
NaiveBayes     ?             0.744
k-NN (k=1)     1.000         1.000
k-NN (k=21)    ?             1.000

From the above results, the peak performance according to the F-Measure goes to k-NN with k = 1, in particular for dataset "a". However, because the first class of dataset "a" dominates the entire data set, the other two methods produce no usable result (shown as "?").

The performance results on dataset "b" appear better than on dataset "a" because the class labels in that dataset have the same ratio, i.e., the classes are balanced.
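
The "?" entries are the kind of behaviour that heavily imbalanced data can produce. The sketch below (an illustration, not the Weka output above) shows how a classifier that always predicts the majority class of a 99:1 split gets a high F-Measure on the majority class but an undefined or zero F-Measure on the minority class.

```python
# Illustration only (not the Weka output above): on a 99:1 class split, a
# classifier that always predicts the majority class still looks good on the
# majority class but has an undefined/zero F-Measure on the minority class.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1] * 99 + [2] * 1)   # roughly the imbalance of dataset 'a'
y_pred = np.ones_like(y_true)           # always predict the majority class

print("F1 for the majority class:", f1_score(y_true, y_pred, pos_label=1))
# No instance is ever predicted as class 2, so precision is undefined;
# scikit-learn reports 0.0 and raises an UndefinedMetricWarning here.
print("F1 for the minority class:", f1_score(y_true, y_pred, pos_label=2))
```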

Dataset ‘c’

[NaiveBayes]

[k-NN (k=1)]

[k-NN (k=10)]

By compiling these data, the following table can be made.

              F-Measure
NaiveBayes    0.472
k-NN (k=1)    0.971
k-NN (k=10)   1.000

The above results show the performance of the three classifiers on dataset "c". They indicate that the best performance goes to k-NN with k = 10: it shows no generalization error when the model is applied to the test set, so it predicts the test set perfectly. In contrast, k-NN with k = 1 produces some Incorrectly Classified Instances, at a rate of 2.9%. So, the higher the value of k, the more general the model tends to be. Moreover, the NaiveBayes model has more Incorrectly Classified Instances than Correctly Classified Instances. For these reasons, there is no doubt that k-NN with k = 10 is the best choice.
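
As a small illustration of the effect of k (again a scikit-learn sketch with synthetic data, not the Weka IBk run above), the snippet below evaluates k-NN for several values of k using a cross-validated weighted F-Measure. Larger k averages over more neighbours and therefore smooths out noisy points, which is the generalization effect described above.

```python
# Sketch only (scikit-learn stand-in, not the Weka IBk run above): how the
# choice of k affects k-NN, measured with a cross-validated weighted F-Measure.
# X and y are synthetic placeholders for dataset 'c'.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for k in (1, 10, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=10, scoring="f1_weighted")
    print(f"k = {k}: mean F-Measure = {scores.mean():.3f}")
```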
