
IS 675: INTRODUCTION TO DATA MINING
SUMMER 2011
DUE: JUNE 19, 2011, MIDNIGHT EST

Mini project 1: Discretization
Use WEKA Explorer for this project.

(1) Load the dataset iris.arff, found in the "Data" folder in WEKA and also available in the "Assignments" folder.
Answer: Shown in Figure 1.

(2) Create a 3-bin discretization of the first 4 attributes (all except the "class" attribute) using equal-frequency binning.
Answer: Shown in Figure 2.

(3) Visualize the class distribution for each discretized attribute from the "Preprocess" tab.
Answer: Shown in Figure 3.

(4) Explain briefly which discretized attribute does the best job and which does the worst job in separating the three classes: Iris-setosa, Iris-versicolor, Iris-virginica.
Answer: The question is about the discriminating power of each attribute in determining the class value. In other words, how much can we ascertain about the class of an iris flower if we know the value of one attribute? From Figure 3 we can see that petal length discriminates the classes best: each bin contains mostly one type (class) of iris flower. Sepal width, on the other hand, does the worst job because multiple classes are represented in each bin.

(5) For each step, include relevant screenshots in a Word file and submit it using the course website.
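The reasoning in steps (2) and (4) can also be sketched outside WEKA. The Python snippet below is a minimal illustration, not WEKA's own implementation: it computes equal-frequency cut points for 3 bins and then scores each attribute by "purity", the fraction of instances that fall in the majority class of their bin. Purity is used here as a simple numeric stand-in for the visual inspection in Figure 3. The function names are hypothetical, and the nine measurements are made up for illustration; they are not taken from iris.arff.

```python
from collections import Counter


def equal_frequency_cut_points(values, n_bins=3):
    """Return n_bins - 1 cut points that split the sorted values into
    groups of roughly equal size (illustrative, not WEKA's exact rule)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[k * n // n_bins] for k in range(1, n_bins)]


def assign_bins(values, cut_points):
    """Map each value to a bin index: the number of cut points it reaches."""
    return [sum(v >= c for c in cut_points) for v in values]


def purity(bins, labels):
    """Fraction of instances that belong to the majority class of their bin."""
    per_bin = {}
    for b, label in zip(bins, labels):
        per_bin.setdefault(b, Counter())[label] += 1
    return sum(max(c.values()) for c in per_bin.values()) / len(labels)


# Made-up measurements for illustration only (NOT taken from iris.arff).
labels = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3
petal_like = [1.4, 1.5, 1.3, 4.5, 4.7, 4.0, 6.0, 5.5, 5.9]
sepal_like = [3.5, 3.0, 3.2, 3.2, 2.8, 2.3, 3.3, 2.7, 3.0]

petal_bins = assign_bins(petal_like, equal_frequency_cut_points(petal_like))
sepal_bins = assign_bins(sepal_like, equal_frequency_cut_points(sepal_like))

petal_purity = purity(petal_bins, labels)
sepal_purity = purity(sepal_bins, labels)
print(petal_purity, sepal_purity)
```

In this toy sample every bin of the petal-like attribute holds a single class, while the bins of the sepal-like attribute mix classes, which mirrors the argument in the answer to (4). Because values equal to a cut point always land in the same bin, tied values can make the bins slightly unequal in size.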


Figure 1. Load data.

Figure 2. 3-bin discretization.

Figure 3. Visualize.