
19CSE304 - FOUNDATIONS OF DATA SCIENCE

LAB ASSIGNMENT-1
INTRODUCTION TO WEKA EXPLORER

CH SURYA UMA SHANKAR


CH.EN.U4CSE19101

1. Load diabetes.arff dataset.

a. What is the smallest, largest and average age of the patients in the
diabetes dataset?
The smallest age is 21, the largest is 81, and the average is 33.241.

b. Pre-process the dataset to remove instances more than 50 years old.
Report what filters were used to achieve this. Show screenshot of how the data
changed at each step.
Use the RemoveWithValues filter: under Filter, choose filters > unsupervised >
instance > RemoveWithValues. Set attributeIndex to 8 (age), set invertSelection to
True, and change splitPoint to 51. Click Apply to remove all instances with an age
above 50.
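
The effect of this filter setting can be sketched outside Weka. The snippet below is only an illustration in plain Python, not Weka's implementation: the function name remove_at_or_above and the sample ages are made up for the example. It assumes that, with invertSelection enabled, instances whose value is at or above the split point are the ones dropped.

```python
def remove_at_or_above(rows, attribute, split_point):
    """Keep only rows whose numeric attribute is strictly below split_point."""
    return [row for row in rows if row[attribute] < split_point]


# Hypothetical sample ages, not taken from diabetes.arff.
patients = [{"age": 21}, {"age": 50}, {"age": 51}, {"age": 81}]

kept = remove_at_or_above(patients, "age", 51)
print([p["age"] for p in kept])  # prints [21, 50]
```

With a split point of 51, an age of exactly 50 is kept and 51 is removed, which matches "more than 50 years old" for whole-number ages.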
c. Compare the number of values for age attribute before and after applying
the filters.
Before applying the filter, the age attribute has 52 distinct values; after
applying it, 29 distinct values remain.

2. Load credit-g.arff dataset.

a. Run J48 classifier with default properties on this dataset and report the
accuracy. Comment about the misclassification and the confusion matrix.

Accuracy is 70.5%. For class good, 588 instances were classified correctly and
112 incorrectly. For class bad, more instances were classified incorrectly (183)
than correctly (117), which is a poor result.

➢ Classification of class bad is unreliable: only 117 of its 300 instances are
identified correctly.
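
As a cross-check, the accuracy can be recomputed from the four counts reported above. This is a small sketch in plain Python; the dictionary layout (actual class, then predicted class) is an assumption made for the example and is not how Weka stores the matrix.

```python
# Counts reported for the default J48 run on credit-g.
# Outer key: actual class. Inner key: predicted class.
confusion = {
    "good": {"good": 588, "bad": 112},
    "bad": {"good": 183, "bad": 117},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[c][c] for c in confusion)
accuracy = correct / total

# Per-class recall: share of each actual class that was predicted correctly.
recall = {c: confusion[c][c] / sum(confusion[c].values()) for c in confusion}

print(f"accuracy     = {accuracy:.3f}")        # 0.705
print(f"recall(good) = {recall['good']:.3f}")  # 0.840
print(f"recall(bad)  = {recall['bad']:.3f}")   # 0.390
```

The overall accuracy hides the imbalance: 84% of the good instances are recognised, but only 39% of the bad ones.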

b. Run the 'InfoGainAttribute' evaluator with the 'Ranker' search method and
find out which are the most irrelevant attributes.
Open the Select attributes tab, choose InfoGainAttributeEval as the Attribute
Evaluator, and select Ranker as the Search Method.

➢ num_dependents, installment_commitment, residence_since, and
existing_credits are the most irrelevant attributes (they receive the lowest ranks).
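
To show what the ranking measures, here is a minimal sketch of information gain for nominal attributes in plain Python. It is not Weka's code, and the four-row dataset and the attribute names "informative" and "irrelevant" are invented for the example. An attribute that separates the classes perfectly gets the full class entropy as its gain, while an attribute unrelated to the class gets a gain of zero and lands at the bottom of the ranking.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(values, labels):
    """Class entropy minus the weighted entropy after splitting on values."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: four instances, two nominal attributes.
labels = ["good", "good", "bad", "bad"]
attributes = {
    "informative": ["x", "x", "y", "y"],  # splits the classes perfectly
    "irrelevant": ["p", "q", "p", "q"],   # carries no class information
}

gains = {name: info_gain(vals, labels) for name, vals in attributes.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
for name in ranking:
    print(f"{gains[name]:.3f}  {name}")
```

Here "informative" scores 1.000 bit and "irrelevant" scores 0.000, so the latter would be the first candidate for removal.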

c. After removing the most irrelevant attributes, run J48 classifier again and
comment on the results obtained.
Accuracy has increased to 72%, compared with 70.5% before. For class good,
586 instances were classified correctly and 114 incorrectly; for class bad,
166 instances were classified correctly and 134 incorrectly, which is much better
than the previous result. The overall misclassification rate has decreased.

3. Load weather.nominal dataset. Remove instances in the dataset where
humidity attribute has high value.
After removing the instances where humidity is high, the number of instances
dropped from 14 to 7.
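
The drop from 14 to 7 can be reproduced with a short plain-Python sketch. The rows below are typed in from the weather.nominal data as distributed with Weka and should be checked against your own copy; the tuple layout and variable names are choices made for this example only.

```python
# Columns: outlook, temperature, humidity, windy, play
weather = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]

HUMIDITY = 2  # zero-based column position of the humidity attribute

# Drop every instance whose humidity value is "high".
kept = [row for row in weather if row[HUMIDITY] != "high"]

print(len(weather), "->", len(kept))  # prints 14 -> 7
```

Only the instances with humidity = normal remain after the filter.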
4. Run J48 classifier on glass dataset.

a. What is the top node on the decision tree?
The top (root) node is Ba.

b. How many instances of the tableware class were misclassified? What are
the misclassified instances?
2 instances were misclassified: instances 171 and 192.
c. Comment about true positives and false positives of 'build wind float' class.
The 'build wind float' class has 50 true positives and 25 false positives.
TP rate = 0.714 and FP rate = 0.174
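
The two rates follow from the counts once the class sizes are known. The sketch below assumes that glass.arff holds 214 instances, 70 of which belong to 'build wind float'; these two figures are not stated in this report and should be checked in the Preprocess tab, although they agree with the rates reported above.

```python
# Reported for the 'build wind float' class.
tp = 50  # true positives
fp = 25  # false positives

# Assumed dataset sizes (see the note above).
total_instances = 214
actual_positives = 70

fn = actual_positives - tp                    # positives that were missed
tn = total_instances - actual_positives - fp  # negatives correctly rejected

tp_rate = tp / (tp + fn)  # share of the actual class that was found
fp_rate = fp / (fp + tn)  # share of the other classes wrongly labelled
precision = tp / (tp + fp)

print(f"TP rate   = {tp_rate:.3f}")    # 0.714
print(f"FP rate   = {fp_rate:.3f}")    # 0.174
print(f"precision = {precision:.3f}")  # 0.667
```

So about 71% of the 'build wind float' instances are recovered, while roughly one in three instances predicted as this class actually belongs to another class.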

d. From the decision tree, come up with a rule to identify headlamps.
If Ba > 0.27 and Si > 70.16, classify the instance as headlamps.
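
Written as a predicate, the rule looks like the sketch below. The function name is_headlamp and the two sample instances are hypothetical; only the thresholds come from the rule above.

```python
def is_headlamp(instance):
    """Apply the rule read off the tree: Ba > 0.27 and Si > 70.16."""
    return instance["Ba"] > 0.27 and instance["Si"] > 70.16


# Hypothetical attribute values, not taken from glass.arff.
print(is_headlamp({"Ba": 0.50, "Si": 72.00}))  # prints True
print(is_headlamp({"Ba": 0.00, "Si": 73.00}))  # prints False
```

The rule describes a single path of the tree, so an instance that fails it is not necessarily assigned to another class by that test alone.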

e. From the dataset, remove the instances that were misclassified.
Under Filter, choose filters > unsupervised > instance > RemoveMisclassified.
Open the filter's properties, choose J48 as the classifier, and click OK. Then
click Apply.

After applying the filter, the number of instances is reduced to 194.
