
Data Mining - Fall 2003 Instructor: Craig A. Struble, Ph.D.

Lab 3: Other Classification Techniques

Assigned: September 17, 2003 Due: September 24, 2003

In this lab, you will investigate other classification techniques with the labor contract data set used in the previous labs. These techniques apply statistical, distance-based, and neural network approaches for classification. You will compare these models to each other and to the decision tree models from lab 2.

By completing this lab, students should:

- Understand how to build naïve Bayesian models for classification;
- Understand how to build a feed-forward, back-propagation neural network for classification;
- Understand how to build a distance-based model for classification;
- Understand how to assess the accuracy of several models using cross-validation;
- Understand how to assess the interpretability and size of different types of models for classification.

Preparatory Reading
- Chapters 3 and 4 in Dunham.
- Chapters 4 and 6 in Witten and Frank (available on reserve).
- Sections 7.4, 7.5, and 7.7 in Han and Kamber (available on reserve).
- Documentation on the various classifier implementations, available via http://www.cs.waikato.

The following materials will be used in this lab:

- Weka, which is available on studsys;
- The data file /home/cstruble/class/mscs228/data/UCI/labor.arff, located on studsys and available from the course web site.

Pre-lab Questions
These questions should be answered before you perform the lab assignment. Record your answers in the introduction section of your lab assignment in your lab notebook.

1. Using a naïve Bayesian model with maximum likelihood estimates for classification with the credit risk data below, what would the data item D = (good, high, none, $15-35k) be classified as?

   Customer  Credit History  Debt  Collateral  Income   Credit Risk?
   1         bad             high  none        $0-15k   high
   2         unknown         high  none        $15-35k  high
   3         unknown         low   none        $15-35k  moderate
   4         unknown         low   none        $0-15k   high
   5         unknown         low   none        >$35k    low
   6         unknown         low   adequate    >$35k    low
   7         bad             low   none        $0-15k   high
   8         bad             low   adequate    >$35k    moderate
   9         good            low   none        >$35k    low
   10        good            high  adequate    >$35k    low
   11        good            high  none        $0-15k   high
   12        good            high  none        $15-35k  moderate
   13        good            high  none        >$35k    low
   14        bad             high  none        $15-35k  high

2. What would a 3-NN model classify the data item D as?

3. Draw a feed-forward neural network architecture to classify the credit risk data above.

4. Hypothesize whether your best decision tree model from lab 2 will be better, the same, or worse than the best naïve Bayesian, K-NN, or neural network model that you create in this assignment. Provide reasoning for your hypotheses.
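The maximum-likelihood scoring asked for in pre-lab question 1 can be sketched in plain Python (this is my own hand-calculation aid, not part of the lab's Weka workflow; it uses raw relative-frequency estimates with no smoothing, matching the question's maximum-likelihood assumption, and my own tuple encoding of the table):

```python
# Sketch: maximum-likelihood naive Bayes on the credit risk table.
# Attribute order: credit history, debt, collateral, income; class last.
from collections import Counter

rows = [
    ("bad",     "high", "none",     "$0-15k",  "high"),
    ("unknown", "high", "none",     "$15-35k", "high"),
    ("unknown", "low",  "none",     "$15-35k", "moderate"),
    ("unknown", "low",  "none",     "$0-15k",  "high"),
    ("unknown", "low",  "none",     ">$35k",   "low"),
    ("unknown", "low",  "adequate", ">$35k",   "low"),
    ("bad",     "low",  "none",     "$0-15k",  "high"),
    ("bad",     "low",  "adequate", ">$35k",   "moderate"),
    ("good",    "low",  "none",     ">$35k",   "low"),
    ("good",    "high", "adequate", ">$35k",   "low"),
    ("good",    "high", "none",     "$0-15k",  "high"),
    ("good",    "high", "none",     "$15-35k", "moderate"),
    ("good",    "high", "none",     ">$35k",   "low"),
    ("bad",     "high", "none",     "$15-35k", "high"),
]

def nb_classify(item):
    """Score each class by P(c) * prod_i P(a_i | c), with raw
    relative-frequency (maximum-likelihood) estimates, no smoothing."""
    classes = Counter(r[-1] for r in rows)
    scores = {}
    for c, n_c in classes.items():
        c_rows = [r for r in rows if r[-1] == c]
        score = n_c / len(rows)                  # prior P(c)
        for i, value in enumerate(item):
            match = sum(1 for r in c_rows if r[i] == value)
            score *= match / n_c                 # likelihood P(a_i | c)
        scores[c] = score
    return max(scores, key=scores.get), scores

label, scores = nb_classify(("good", "high", "none", "$15-35k"))
print(label)   # the arg-max class for data item D
```

Work the arithmetic by hand first and use a sketch like this only to check yourself; note that a zero frequency for any attribute value zeroes out that class's entire score, which is exactly the weakness kernel/smoothing methods address.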

Laboratory Assignment

This section provides the steps to take for this lab assignment. As you carry out each step, record observations you make in your lab notebook. Your notes do not have to be polished writing, but you'll use them to generate your final lab report.

When you work on a data mining problem, it is important to record any steps you take along with observations you make at each step. Remember, it is often the goal of a data mining project to produce a final report. That report includes a summary of the steps you took to achieve your final results.

For all of the models below, use 10-fold cross-validation to evaluate the accuracy of your models. I will not provide explicit instructions for doing so.
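The bookkeeping behind 10-fold cross-validation is simple enough to sketch. This is a stdlib-only illustration of the idea, not Weka's implementation; the trivial majority-class "model", the function name `cross_validate`, and the toy data are all my own stand-ins:

```python
# Sketch of k-fold cross-validation: partition the data into k folds,
# train on k-1 folds, test on the held-out fold, average the per-fold
# accuracies. The "model" is a deliberately trivial majority-class
# predictor, just to make the fold bookkeeping concrete.
import random

def cross_validate(data, k=10, seed=0):
    data = data[:]                          # shuffle a copy, keep the original
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i, test in enumerate(folds):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        # "Training": pick the most common class label in the train split.
        labels = [label for _, label in train]
        majority = max(set(labels), key=labels.count)
        correct = sum(1 for _, label in test if label == majority)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# 60 toy items: features ignored by this model, labels 2:1 "yes"/"no".
toy = [(x, "yes" if x % 3 else "no") for x in range(60)]
print(round(cross_validate(toy), 2))   # prints 0.67, the majority-class rate
```

The point to notice is that every item is used for testing exactly once, so the averaged accuracy is a less optimistic estimate than testing on the training data.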

Build Na Bayesian Models ve

In this section, you will build two naïve Bayesian models of the labor contract data.

1. Start Weka. On studsys, this can be accomplished with the command:

   java -jar /home/cstruble/class/mscs228/weka-3-2-3/weka.jar

2. Open Weka's Explorer interface.

3. Open /home/cstruble/class/mscs228/data/UCI/labor.arff from the Explorer interface. You may also want to open this file in a text editor, or by using the command:

   more /home/cstruble/class/mscs228/data/UCI/labor.arff

4. Typically, at this point, you would record a summary of your data. Since we are using the same data set from lab 1, you can simply refer to that summary.

5. Click on the Classify tab in the Weka window.

6. Select the NaiveBayes classifier. Make sure not to use the NaiveBayesSimple classifier.

7. Create and evaluate a naïve Bayesian model with and without using kernel estimators for estimating the probabilities of numeric data attributes. Remember to right-click on the result list for additional evaluation options.
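Step 7 contrasts two ways of estimating the class-conditional density of a numeric attribute. As a rough illustration of the idea (my own sketch, not Weka's code): without kernel estimators, each numeric attribute is summarized per class by a single Gaussian fit from the mean and standard deviation; with kernel estimators, a small Gaussian is centered on every training value and the results are averaged, which can represent multi-modal attributes:

```python
# Sketch: single-Gaussian fit versus a Gaussian kernel density estimate
# for a numeric attribute. Illustrative only; the bandwidth choice and
# data are made up, and this is not Weka's exact implementation.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def single_gaussian(values):
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values)) or 1e-6
    return lambda x: gaussian_pdf(x, mu, sigma)

def kernel_estimate(values, bandwidth=0.5):
    # One kernel per observed value, averaged: preserves multiple modes.
    return lambda x: sum(gaussian_pdf(x, v, bandwidth) for v in values) / len(values)

# A bimodal wage-like attribute: the single Gaussian smears the two
# clusters together, the kernel estimate keeps both modes distinct.
wages = [2.0, 2.1, 2.2, 4.9, 5.0, 5.1]
g = single_gaussian(wages)
k = kernel_estimate(wages)
print(round(g(3.5), 3), round(k(3.5), 3))   # density in the gap between modes
```

When you compare the two NaiveBayes runs, this is the behavior to look for: kernel estimators tend to help when a numeric attribute is clustered rather than bell-shaped.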

Creating K-NN Models

In this section, you will build and evaluate a series of K-NN models for the labor contract data.

1. Select the IBk classifier, which is a K-NN classifier.

2. Create models with 3, 5, 7, and 9 nearest neighbors by changing the KNN parameter.

3. Of the four previous models, select the KNN value that performs the best. Using that value, create another model with normalization turned off. The other parameters are more meaningful for numeric prediction, not classification.
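The idea behind a K-NN classifier on categorical attributes can be sketched as follows. This is a toy illustration with made-up weather-style data, not Weka's IBk implementation (IBk additionally handles numeric attributes, where the normalization option in step 3 matters); the overlap distance here simply counts mismatched attribute values:

```python
# Sketch of K-NN for categorical data: distance = number of mismatched
# attribute values (overlap distance), then a majority vote among the
# k closest training rows. Data and names are invented for illustration.
from collections import Counter

def knn_classify(train, item, k=3):
    def dist(row):
        return sum(1 for a, b in zip(row[:-1], item) if a != b)
    nearest = sorted(train, key=dist)[:k]    # ties broken by list order
    votes = Counter(r[-1] for r in nearest)
    return votes.most_common(1)[0][0]

train = [
    ("sunny", "hot",  "high",   "no"),
    ("sunny", "hot",  "high",   "no"),
    ("rainy", "mild", "high",   "yes"),
    ("rainy", "cool", "normal", "yes"),
    ("sunny", "mild", "normal", "yes"),
]
print(knn_classify(train, ("sunny", "hot", "normal")))
```

Note there is no training phase at all: the work happens at classification time, which is worth remembering for the post-lab question on computational effort.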

Creating Neural Network Models

In this section, you will build and evaluate several feed-forward, back-propagation neural networks.

1. Select the NeuralNetwork classifier.

2. Read the About information available by clicking the More button.

3. I recommend using autoBuild for all your networks in this exercise. You can investigate building your own networks with the GUI later, if you like.

4. Build and evaluate neural networks with the hiddenLayers parameter set to a, i, o, and t. When evaluating, make sure to right-click and view the neural network with the final learned weights.

5. Select the best-performing neural network, and build neural networks without normalization (change both at once), and with decay (with other parameters set to their defaults). Again, make sure to view the neural network with the learned weights.
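To make "back propagation" concrete before you run Weka, here is a minimal sketch of one forward pass and repeated gradient updates in a tiny 2-2-1 sigmoid network. It is my own toy example, not Weka's NeuralNetwork code; the learning rate, weight ranges, and single training pattern are arbitrary choices:

```python
# Sketch: forward pass and backpropagation in a 2-input, 2-hidden-unit,
# 1-output sigmoid network, trained on one pattern to show the error
# signal flowing backward from the output to the hidden layer.
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

rng = random.Random(0)
# w_h[j] = [weight from input 0, weight from input 1, bias] for hidden unit j;
# w_o = [weight from hidden 0, weight from hidden 1, bias] for the output.
w_h = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_o = [rng.uniform(-0.5, 0.5) for _ in range(3)]

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_h]
    o = sigmoid(w_o[0] * h[0] + w_o[1] * h[1] + w_o[2])
    return h, o

def backprop_step(x, target, lr=0.5):
    h, o = forward(x)
    delta_o = (o - target) * o * (1 - o)           # output error signal
    # Hidden error signals, computed before any weight is changed.
    delta_h = [delta_o * w_o[j] * h[j] * (1 - h[j]) for j in range(2)]
    for j in range(2):
        w_o[j] -= lr * delta_o * h[j]              # hidden -> output weight
        for i in range(2):
            w_h[j][i] -= lr * delta_h[j] * x[i]    # input -> hidden weight
        w_h[j][2] -= lr * delta_h[j]               # hidden bias
    w_o[2] -= lr * delta_o                         # output bias
    return (o - target) ** 2                       # squared error before update

x, target = (1.0, 0.0), 1.0
errors = [backprop_step(x, target) for _ in range(200)]
print(errors[0] > errors[-1])                      # error shrinks over updates
```

Weka's hiddenLayers wildcards in step 4 just pick the hidden-layer width automatically; the weight-update machinery you view in the learned-weights display is this same idea at a larger scale.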

Post Laboratory Questions

1. What was your best model? What does this suggest about your data, and in particular how the classes are organized in the data space?

2. For each of the classes of models, briefly discuss how easy it is to interpret the results of the models. Compare to the results provided by the decision tree.

3. For each class of model, and for the decision tree class of models in the previous lab, discuss the computational effort required to build and evaluate the models.

4. Discuss the effects of normalization. Did it help, hurt, or have no effect on performance?