
Introduction to Data Mining

Hands-On Class Assignment 1


Data preprocessing is a key step when you build data mining models. In real-world business settings, a great proportion of the time you
spend in a data mining project will be devoted to tasks like data consolidation, data cleaning, feature construction/transformation,
feature selection, etc. More importantly, the quality of the resulting predictive models will largely depend on your ability to adequately
preprocess the raw data and to create meaningful features from it. In a recent cross-selling application, experiments conducted in a mailing
campaign in the publishing industry showed that about 50-70% of the accuracy of the predictive models built in these experiments can be
explained, at least indirectly, by data preprocessing decisions (sampling, coding of categorical variables, scaling, etc.).

The purpose of this assignment is to familiarize you with some of the preprocessing tools you may need for your projects. You should
already have Weka installed. For this assignment, you will need to download the TRAIN2.arff and TRAIN2.csv datasets from the course
website.

PART I: FEATURE CONSTRUCTION

Open the TRAIN2.arff file in Weka.

You should see 9 attributes in the attributes section on the Preprocess tab. Click on each attribute one by one. You should notice that the
statistics in the selected attribute section change according to the attribute you select. You should be able to see information about the
number of missing values, the attribute type (Nominal vs. Numeric), the number of unique values, and so on.

Transformation

Now click on pgift. You should see from the selected attribute window that pgift is a numeric attribute. You should also see that the
distribution is skewed. Let’s transform this attribute. Go to the filter section and click on the Choose button. Go to the folder
filters.unsupervised.attribute. Click on the word NumericTransform in the white text box. In the filter section, click on the box right next
to the Choose button, as indicated in the figure:

In the popup (see the figure below), change the method name to log to take the logarithm of the values in attribute pgift. Set the invert
selection flag to False. Put the number 4 for attribute 4 in the attributeIndices text box. Click on the More button to see how you might
transform multiple attributes at a time. Click the OK button. Now click Apply.

Click on pgift again. You should see that the distribution is now much less skewed and closer to a normal shape.
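To see why a log transform helps with a skewed attribute, here is a minimal Python sketch (not part of Weka). The gift amounts are made-up illustrative values, not taken from TRAIN2.arff, and the skewness helper is a hypothetical name. It assumes all values are positive, since the logarithm is undefined at zero and below.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5


# Hypothetical right-skewed gift amounts (illustrative only).
gifts = [1, 2, 2, 3, 3, 4, 5, 8, 15, 50, 200]

# Natural logarithm of every value, the same idea as the "log" method name.
log_gifts = [math.log(g) for g in gifts]

print("skewness before:", round(skewness(gifts), 2))
print("skewness after: ", round(skewness(log_gifts), 2))
```

The skewness after the transform is smaller than before because the logarithm compresses the few very large values much more than the many small ones.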

Nominal to Binary

Now click on attribute rfa_2f. In the selected attribute window, you should see that rfa_2f is a nominal attribute and it has four possible
values. Go to the NominalToBinary filter the same way you went to the NumericTransform filter above. Set the parameters as follows.

Apply the filter.

Question 1: How many attributes do you now see in the attributes window? What possible values do the new attributes take?

Now click the Undo button to roll back the change. Go back to the NominalToBinary filter and set the binaryAttributesNominal flag to
True. Apply the filter again.

Question 2: Now, how many attributes do you see in the attributes window? What possible values do the new attributes take?
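The general idea behind converting a nominal attribute to binary attributes (indicator or "dummy" coding) can be sketched in Python as follows. This is only an illustration of the concept, not Weka's implementation; the attribute name "size", its values, and the function name are hypothetical.

```python
def nominal_to_binary(name, values, categories):
    """Return one 0/1 indicator column per category of a nominal attribute."""
    return {
        f"{name}={category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


# Hypothetical nominal attribute with three possible values.
sizes = ["S", "M", "L", "M"]
columns = nominal_to_binary("size", sizes, ["S", "M", "L"])

for column_name, indicator in columns.items():
    print(column_name, indicator)
```

Each row has exactly one indicator set to 1, namely the one for the category that row originally held.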

Discretize

Now click on attribute firstdate. You will see that its type is numeric. Let’s discretize this attribute using the Discretize filter. Set
the parameters as follows:

Click on the More button to learn more about the parameters that you may set. For right now, we will leave the default bins setting at 10.

Question 3: Select the attribute and look at the ‘selected attribute’ box. What “type” of attribute do you now have? What is the label
for the first category? What is the category with the least number of observations?
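Equal-width binning, which is one common way to discretize a numeric attribute, can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the function name and sample values are hypothetical, and the handling of values that fall exactly on a bin boundary may differ from what Weka does.

```python
def equal_width_bins(values, bins=10):
    """Map each numeric value to a bin index in 0..bins-1 using equal-width intervals."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / bins

    def bin_index(value):
        if width == 0:  # all values identical: everything goes in the first bin
            return 0
        # The maximum value would land in bin number `bins`, so clamp it to the last bin.
        return min(int((value - lowest) / width), bins - 1)

    return [bin_index(v) for v in values]


# Hypothetical numeric values spanning 0 to 100, cut into 10 bins of width 10.
print(equal_width_bins([0, 5, 10, 95, 100], bins=10))
```

After discretization, the attribute no longer holds numbers but the name of the interval each number fell into.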

PART II: SAMPLING

Unsupervised Sampling

Select the Resample instance filter in the filters.unsupervised.instance folder. Notice that the current relation section shows that you have
9541 instances. For the Resample filter, set the parameters as indicated in the figure:

Select OK and Apply.

Question 4: How many instances does Weka show in your dataset after sampling?
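Conceptually, unsupervised resampling draws a random subset of the rows without looking at the class label. A minimal Python sketch follows. The function name, the 200-row dataset, and the 25% sample size are hypothetical, and the way the sample size is rounded is an assumption that may not match Weka exactly.

```python
import random


def resample(rows, percent, seed=1, replacement=False):
    """Draw a random sample containing roughly `percent` percent of the rows."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample_size = round(len(rows) * percent / 100)
    if replacement:
        # The same row may be drawn more than once.
        return [rng.choice(rows) for _ in range(sample_size)]
    # Every row is drawn at most once.
    return rng.sample(rows, sample_size)


rows = list(range(200))        # stand-in for 200 instances
subset = resample(rows, 25)    # keep 25% of them
print(len(subset))
```

Using the same seed twice gives the same sample, which is useful when you need to repeat an experiment.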

Remove an attribute

Click on the check box next to target_d. Now, click the Remove button at the bottom of the window.

Supervised Sampling

Weka assumes that the last column in your data is the target variable (NOTE: you can change this to another attribute when running
classification and feature selection methods). Our data has a uniform class distribution (I already sampled from a larger data set so that
we would have approximately the same number of ones and zeros). However, if your dataset were skewed with respect to your class
label, you could perform supervised sampling to bias your sample to a uniform class.

You’ll perform supervised sampling. Select the Resample instance filter in the filters.supervised.instance folder, and use the parameter
settings as indicated in the figure:

Select OK and Apply. Save your updated .arff file now. Click on the Save button to save the .arff file as TRAIN2new.arff.
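The idea of biasing a sample toward a uniform class distribution can be sketched in Python as follows. This is a simplified illustration, not Weka's algorithm: it draws the same number of rows from each class with replacement. The function name, the toy rows, and the sample size are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(rows, label_of, size, seed=1):
    """Sample `size` rows so that every class contributes the same number of rows."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    per_class = size // len(by_class)
    sample = []
    for label in sorted(by_class):
        # Draw with replacement so that a rare class can still fill its quota.
        sample.extend(rng.choice(by_class[label]) for _ in range(per_class))
    rng.shuffle(sample)
    return sample


# Hypothetical skewed dataset: 5 positive rows and 95 negative rows.
rows = [("a", 1)] * 5 + [("b", 0)] * 95
sample = balanced_sample(rows, label_of=lambda row: row[1], size=40)
print(Counter(label for _, label in sample))
```

Because the rare class is drawn with replacement, its rows appear several times in the sample. That is the price of a balanced class distribution when one class is scarce.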

PART III: FEATURE SELECTION

Now click on the Select Attributes tab in Weka. The default evaluator is CfsSubsetEval and the default Search Method is BestFirst.
Change these to InfoGainAttributeEval and Ranker respectively. Click Start. The attributes in your data set are ranked by information
gain with respect to the class.

Question 5: What are the first three attributes ranked by information gain?
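For reference, information gain is the entropy of the class minus the weighted entropy of the class within each value of the attribute. A minimal Python sketch for nominal attributes is shown below. The function names and toy data are hypothetical, and the sketch does not cover how numeric attributes are handled.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def info_gain(values, labels):
    """Information gain of a nominal attribute with respect to the class labels."""
    n = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [label for v, label in zip(values, labels) if v == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


labels = [1, 1, 0, 0]
print(info_gain(["A", "A", "B", "B"], labels))  # attribute separates the classes perfectly
print(info_gain(["A", "B", "A", "B"], labels))  # attribute tells us nothing about the class
```

An attribute that splits the classes perfectly gains the full class entropy, while an attribute that is unrelated to the class gains nothing. Ranking attributes by this number is what the Ranker search method reports.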

You may also try Principal Components (PC) as the evaluator. Note that PC creates dummies for all of the attribute/value pairs before
performing the analysis.

Go back to the default settings of CfsSubsetEval and BestFirst. (These settings will perform the forward selection method we discussed
in class). Press Start.

Question 6: Which attributes were selected?

Now go back to the Preprocess tab and select the attributes that you found in the step above. You will need to check the check boxes
next to the attributes as well as the check box next to your target variable (target_b). Now, click on the Invert button at the top of the
attributes section. Click the Remove button. Feel free to go to the Classify tab and play around with some of the classification methods we
will discuss in class (Decision Trees, Naïve Bayes, MultiLayerPerceptrons, LogisticRegression, K-Nearest Neighbor, etc.) or some of the
unsupervised methods like Clustering. For fun, click through the folders to explore the algorithms that are part of the Weka package.

Question 7: (NOT A QUESTION BUT REQUIRES ACTION) Open the TRAIN2new.arff file you created in a text editor (e.g. MS
Word). Cut and paste the first 20 lines of the file to your homework assignment.
PART IV: DATA CLEANING

So, if you haven’t already noticed, Weka uses .arff data files. If you open the TRAIN2.arff data file in a text editor, you will see that it
has the following header:

@relation learn-weka.filters.unsupervised.attribute.Remove-R10-weka.filters.supervised.instance.Resample-B1.0-S1-Z10.0

@attribute Income {0,1,2,3,4,5,6,7}
@attribute Firstdate numeric
@attribute Lastdate numeric
@attribute pgift numeric
@attribute rfa_2f {1,2,3,4}
@attribute rfa_2a {A,B,C,D,E,F,G}
@attribute pepstrfl {X,0}
@attribute target_b {1,0}
@attribute target_d numeric

@data

In .arff files, the first line must start with @relation, followed by a line for each attribute. Each attribute line begins with @attribute
followed by the name of the attribute and then either the word numeric if numeric or a set of attribute values separated by commas
enclosed in curly brackets if nominal. After the attributes are declared, a line with @data follows indicating the end of the header.
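The header layout just described can be read with a few lines of Python. The sketch below is a simplified illustration with a hypothetical function name. It assumes attribute names contain no spaces or quotes and only handles numeric and nominal attributes, so it is not a complete .arff reader.

```python
def parse_arff_header(text):
    """Return (relation_name, [(attribute_name, type_or_values), ...]) from .arff text."""
    relation, attributes = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        keyword = line.lower()
        if keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: comma-separated values inside curly brackets.
                attributes.append((name, [v.strip() for v in spec.strip("{}").split(",")]))
            else:
                attributes.append((name, spec.lower()))
        elif keyword.startswith("@data"):
            break  # end of the header
    return relation, attributes


sample = """@relation demo
@attribute pgift numeric
@attribute rfa_2f {1,2,3,4}
@data
12.5,3
"""
print(parse_arff_header(sample))
```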

Following the header, you will see a comma delimited set of rows. You may be familiar with comma separated files from Excel. If not,
.csv is a filetype that you may use to both save and read files in Excel.

You can save comma separated files (csv) using Excel and then read them into Weka easily. The nice thing is that Weka will
automatically detect most nominal attributes and their corresponding values. Once you read a .csv file into Weka, you can save it in .arff
format and edit the heading according to your needs in a text editor.

This PART addresses the top 5 things that will stump you (in Weka) when working with new dirty data. The error descriptions can be a bit
cryptic at times (after all, the software is free). But here are some things to be aware of.

1. Records of different length
2. Missing values not set to question mark (All missing values must be denoted by a question mark as opposed to a space). For
example, a row with 5 columns and 2 missing values like 4,A,,,B must be formatted to 4,A,?,?,B for Weka.
3. Non-alphanumeric characters must be removed
4. Non-nominal target variable. For classification, you want your target value (the attribute you are trying to predict) to be of type
Nominal. If Weka detects your target attribute to be numeric, you can discretize the attribute into two bins. However, you can
also make sure that the values are detected as non-numeric from your .csv file by giving the values text names. For example,
you can call the positive examples pos and the negative examples neg instead of assigning them values of ones and zeros
respectively.
5. Incompatible training and test sets. You make transformations to attributes in your training set and forget to make the same
transformations to the attributes in your test set (We won’t deal with this problem just yet).

Open TRAIN2.csv in Weka

Weka will complain. Open TRAIN2.csv in Excel and inspect the data for errors:

1. Make sure all records have the same length
2. Make sure there are no blank cells (You may need to find blanks and replace them with ?)
3. Make sure there are no bothersome characters (“,*,@, etc.). ALSO, for future reference, note that numeric values containing commas
cause major trouble!
4. In the last column, target_b, replace all ones with pos and all zeros with neg.
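If you prefer a script to Excel, some of the checks above can be sketched in Python with the standard csv module. This is a minimal sketch with a hypothetical function name and made-up sample data. It assumes the first row is the header and the target is the last column. It covers record length, blank cells, and target renaming, but not the removal of bothersome characters.

```python
import csv
import io


def clean_rows(rows, target_index=-1):
    """Return (cleaned_rows, line_numbers_of_records_with_the_wrong_length)."""
    header, *records = rows
    cleaned, bad_lengths = [header], []
    for line_number, record in enumerate(records, start=2):
        if len(record) != len(header):
            bad_lengths.append(line_number)  # report it; fixing it needs a human
            continue
        # Blank cells become the question mark Weka expects for missing values.
        record = [cell.strip() or "?" for cell in record]
        # Give the target text names so that it is detected as nominal.
        target = record[target_index]
        record[target_index] = {"1": "pos", "0": "neg"}.get(target, target)
        cleaned.append(record)
    return cleaned, bad_lengths


# Made-up sample data: one blank cell and one record that is too short.
raw = "a,b,target_b\n4,,1\n5,X,0\n6,Y\n"
cleaned, bad_lengths = clean_rows(list(csv.reader(io.StringIO(raw))))
print(cleaned)
print("records with wrong length at lines:", bad_lengths)
```

Records of the wrong length are only reported, not repaired, because deciding which field is missing usually requires looking at the data.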

Question 8: (NOT A QUESTION BUT REQUIRES ACTION) Once the data are clean, open the file in Weka and save the file as
a TRAIN2new2.arff file. Open the file in a text editor and cut and paste the heading plus the first 4 lines of data to your
homework assignment.

These exercises were meant to get you familiar with Weka (not to cause you data cleaning pain). Feel free to play around with additional
filters and feature selection methods. Next week we will actually start building some models!
