VNRVJIET, IT Department
DATA MINING LABORATORY

LAB RECORD

Name: XXXXXXXXX    Roll No: 18075A1222

Date of Execution: XX-XX-XXXX

EXP #1:
EXPLORE CONTACT LENS DATA SET.
contact-lens.arff

The contact-lens.arff dataset is a database for fitting contact lenses. It was donated by Benoit Julien in 1990.

Database: This database is complete and noise-free. It has 24 instances and 4 attributes.

Attributes: All four attributes are nominal. There are no missing attribute values. 


The four attributes are as follows:

#1) Age of the patient: The attribute age can take the values:
• young
• pre-presbyopic
• presbyopic

#2) Spectacle prescription: This attribute can take the values:
• myope
• hypermetrope

#3) Astigmatic: This attribute can take the values:
• no
• yes

#4) Tear production rate: The values can be:
• reduced
• normal

Class: Three class labels are defined here. These are:
• the patient should be fitted with hard contact lenses
• the patient should be fitted with soft contact lenses
• the patient should not be fitted with contact lenses

Class Distribution: The instances classified into each class label are listed below:

Class Label               No. of Instances
1. Hard contact lenses     4
2. Soft contact lenses     5
3. No contact lenses      15
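
This summary can be reproduced programmatically with the Weka Java API. Below is a minimal sketch, assuming contact-lens.arff is in the working directory (the class name ContactLensSummary is ours):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ContactLensSummary {
    public static void main(String[] args) throws Exception {
        // Load the sample file (path is an assumption; it ships in Weka's data directory)
        Instances data = DataSource.read("contact-lens.arff");
        data.setClassIndex(data.numAttributes() - 1); // last attribute is the class

        System.out.println("Instances:  " + data.numInstances()); // 24
        System.out.println("Attributes: " + (data.numAttributes() - 1) + " plus the class");

        // Count the instances per class label (hard, soft, none)
        int[] counts = new int[data.numClasses()];
        for (int i = 0; i < data.numInstances(); i++) {
            counts[(int) data.instance(i).classValue()]++;
        }
        for (int c = 0; c < data.numClasses(); c++) {
            System.out.println(data.classAttribute().value(c) + ": " + counts[c]);
        }
    }
}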


Date of Execution: XX-XX-XXXX

EXP #2:
EXPLORE THE IRIS DATASET WITH THE WEKA TOOL.
iris.arff

The iris.arff dataset is the Iris Plants database, donated in 1988 by Michael Marshall.


Database: This database is used for pattern recognition. The data set contains 3 classes of 50 instances each, where each class represents a type of iris plant. One class is linearly separable from the other 2, but the latter are not linearly separable from each other. The task is to predict to which of the 3 iris species an observation belongs, which makes this a multi-class classification dataset.

Attributes: It has 4 numeric, predictive attributes, and the class. There are no missing attribute values.

The attributes are:

• sepal length in cm
• sepal width in cm
• petal length in cm
• petal width in cm
• class:
  - Iris Setosa
  - Iris Versicolour
  - Iris Virginica

Summary Statistics:

Attribute      Min  Max  Mean  SD    Class Correlation
sepal length   4.3  7.9  5.84  0.83   0.7826
sepal width    2.0  4.4  3.05  0.43  -0.4194
petal length   1.0  6.9  3.76  1.76   0.9490 (high!)
petal width    0.1  2.5  1.20  0.76   0.9565 (high!)

Class Distribution: 33.3% for each of the 3 classes.
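
These per-attribute statistics can also be computed with the Weka Java API. Below is a minimal sketch, assuming iris.arff is in the working directory (the class name IrisStats is ours):

import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IrisStats {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);  // class is the last attribute

        // Min/Max/Mean/SD for each numeric attribute, as in the table above
        for (int i = 0; i < data.numAttributes() - 1; i++) {
            AttributeStats stats = data.attributeStats(i);
            System.out.printf("%-14s min=%.1f max=%.1f mean=%.2f stdDev=%.2f%n",
                    data.attribute(i).name(),
                    stats.numericStats.min, stats.numericStats.max,
                    stats.numericStats.mean, stats.numericStats.stdDev);
        }
    }
}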


Date of Execution: XX-XX-XXXX

EXP #3:
Explore the CREDIT dataset with respect to the Weka tool. Answer the
following questions.

a) Steps for importing the dataset into Weka.
b) How the dataset is analysed with the Weka tool.
c) Show the visualizations of all attributes.
d) Explore how filters are used in Weka in order to manage the instances of a dataset.

Answer:

Follow the steps listed below to use WEKA to identify the real-valued and nominal attributes in the dataset.

#1) Open WEKA and select “Explorer” under ‘Applications’.


#2) Select the “Preprocess” tab. Click on “Open File”. As a WEKA user, you can access the sample files that ship with WEKA.

#3) Select the input file from the WEKA 3.8 folder stored on the local system. Select the predefined .arff file “credit-g.arff” and click on “Open”.

#4) An attribute list will open in the left panel. Statistics for the selected attribute will be shown in the right panel along with its histogram.

Analysis of the dataset:

In the left panel, the current relation section shows:

• Relation: german_credit, the name of the sample file.
• Instances: 1000, the number of data rows in the dataset.
• Attributes: 21, the number of attributes in the dataset.

The panel below the current relation shows the names of the attributes.

In the right panel, the statistics for the selected attribute are displayed. Select the attribute “checking_status”.

It shows:

• Name: The name of the attribute.
• Missing: Any missing values of the attribute in the dataset; 0% in this case.
• Distinct: The attribute has 4 distinct values.
• Type: The attribute is of the nominal type, that is, it does not take any numeric value.
• Count: Among the 1000 instances, the count of each distinct value is given in the count column.
• Histogram: It displays the distribution of the output class label for the attribute. The class label in this dataset is either good or bad. There are 700 instances of good (marked in blue) and 300 instances of bad (marked in red).
  - For the label < 0, the numbers of instances with good and bad are almost the same.
  - For the label 0 <= X < 200, the instances with decision good outnumber those with bad.
  - Similarly, for the label >= 200 and for the no checking label, most instances have decision good.
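
The same Missing/Distinct/Count figures can be read programmatically. Below is a minimal sketch, assuming credit-g.arff is in the working directory (the class name CheckingStatusCounts is ours):

import weka.core.Attribute;
import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CheckingStatusCounts {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path is an assumption
        Attribute att = data.attribute("checking_status");

        AttributeStats stats = data.attributeStats(att.index());
        System.out.println("Missing:  " + stats.missingCount);  // 0
        System.out.println("Distinct: " + stats.distinctCount); // 4

        // Count per distinct value, as shown in the Count column
        for (int v = 0; v < att.numValues(); v++) {
            System.out.println(att.value(v) + ": " + stats.nominalCounts[v]);
        }
    }
}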


Next, select the attribute “duration”.

The right panel shows:

• Name: The name of the attribute.
• Type: The type of the attribute is numeric.
• Missing: The attribute does not have any missing values.
• Distinct: It has 33 distinct values among the 1000 instances.
• Unique: It has 5 unique values, i.e., values that occur in only one instance.
• Minimum value: The min value of the attribute is 4.
• Maximum value: The max value of the attribute is 72.
• Mean: The sum of all the values divided by the number of instances.
• Standard Deviation: The standard deviation of the attribute duration.
• Histogram: For a duration of 4 units, the most instances occur for the good class. As the duration increases towards 38 units, the number of instances with the good class label falls. A duration of 72 units has only one instance, which is classified as bad.


The class is the classification feature of the nominal type. It has two distinct values: good
and bad. The good class label has 700 instances and the bad class label has 300 instances.

To visualize all the attributes of the dataset, click on “Visualize All”.


#5) To keep only the numeric attributes, click on the Filter button. From there, click on Choose -> weka -> filters -> unsupervised -> attribute -> RemoveType.

WEKA filters offer many functionalities for transforming the attribute values of a dataset to make it suitable for the algorithms, for example, the numeric transformation of attributes.

Filtering the nominal and real-valued attributes of a dataset is another example of using WEKA filters.

#6) Click on RemoveType in the filter tab. An object editor window will open. Set attributeType to “Delete nominal attributes” and click on OK.


#7) Apply the filter. Only the numeric attributes will remain.

The class attribute is of the nominal type. Because it classifies the output it is never removed by the filter, and so it remains alongside the numeric attributes.
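
The same filtering can be scripted. Below is a minimal sketch, assuming credit-g.arff is in the working directory and using RemoveType’s -T option to delete the nominal attributes (the class name KeepNumericOnly is ours):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.RemoveType;

public class KeepNumericOnly {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1); // the nominal class attribute

        RemoveType filter = new RemoveType();
        filter.setOptions(new String[]{"-T", "nominal"}); // delete nominal attributes
        filter.setInputFormat(data);

        Instances numericOnly = Filter.useFilter(data, filter);
        // The class attribute is skipped by the filter, so it survives
        System.out.println("Remaining attributes: " + numericOnly.numAttributes());
    }
}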

Output:

The real-valued and nominal attributes in the dataset are identified. Visualization against the class label is seen in the form of histograms.


EXP #4:
DEMONSTRATE THE WEKA DECISION TREE
CLASSIFICATION ALGORITHMS FOR
WEATHER.NOMINAL DATASET.
Now we will see how to implement decision tree classification on the WEATHER.NOMINAL.ARFF dataset using the J48 classifier.

WEATHER.NOMINAL.ARFF

It is a sample dataset present in the data directory of WEKA. This dataset predicts whether the weather is suitable for playing cricket. The dataset has 5 attributes and 14 instances. The class label “play” classifies the output as “yes” or “no”.

What Is a Decision Tree?

A decision tree is a classification technique consisting of three components: the root node, branches (edges or links), and leaf nodes. The root node represents the test condition on an attribute, a branch represents a possible outcome of that test, and a leaf node holds the label of the class to which an instance belongs. The root node is at the start, or top, of the tree.

J48 Classifier

J48 is Weka’s implementation of the C4.5 algorithm (an extension of ID3) for generating decision trees. It is also known as a statistical classifier. For decision tree classification, we need a dataset.

Steps include:

#1) Open WEKA explorer.

#2) Select the weather.nominal.arff file via “Open file” under the Preprocess tab.


#3) Go to the “Classify” tab for classifying the unclassified data. Click on the “Choose” button. From this, select “trees -> J48”. Let us also take a quick look at the other groups under the Choose button:

• bayes: Bayesian classifiers such as NaiveBayes.
• functions: Function-based classifiers such as logistic regression and support vector machines.
• lazy: Instance-based (lazy) learners such as IBk.
• meta: Meta-classifiers such as bagging and boosting that wrap other classifiers.
• rules: Rule learners.
• trees: Decision tree classifiers.


#4) Click on the Start button. The classifier output will be seen on the right-hand panel. It shows the run information as:

• Scheme: The classification algorithm used.
• Instances: The number of data rows in the dataset.
• Attributes: The dataset has 5 attributes.
• Number of leaves and size of the tree, which describe the decision tree.
• Time taken to build the model: The time needed to produce the output.
• The full J48 pruned tree with the attributes and the number of instances at each leaf.
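
The same run can be reproduced programmatically with the Weka Java API. Below is a minimal sketch, assuming weather.nominal.arff is in the working directory (the class name WeatherJ48 is ours); printing the classifier reproduces the pruned tree, the number of leaves, and the size of the tree listed in the run information:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WeatherJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1); // "play" is the class

        J48 tree = new J48(); // C4.5-style pruned tree with default settings
        tree.buildClassifier(data);
        System.out.println(tree); // prints the pruned tree, leaves and size

        // 10-fold cross-validation, as the Classify tab does by default
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}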


#5) To visualize the tree, right-click on the result in the Result list and select “Visualize tree”.

Output:

The output is in the form of a decision tree. The root attribute is “outlook”.

If the outlook is sunny, the tree further analyzes the humidity: if humidity is high, the class label play = “no”; if humidity is normal, play = “yes”.

If the outlook is overcast, the class label play is “yes”. The number of instances that obey this classification is 4.

If the outlook is rainy, further classification takes place on the attribute “windy”. If windy = true, then play = “no”; the number of instances that obey the classification for outlook = rainy and windy = true is 2. If windy = false, play = “yes”.

Conclusion

WEKA offers a wide range of sample datasets on which to apply machine learning algorithms. Users can perform machine learning tasks such as classification, regression, attribute selection, and association on these sample datasets, and can also learn the tool using them.

The WEKA Explorer is used for performing several functions, starting from preprocessing. Preprocessing takes a .arff file as input, processes it, and gives an output that can be used by other computer programs. In WEKA, the output of preprocessing lists the attributes present in the dataset, which can be further used for statistical analysis and comparison with class labels.


WEKA also offers many classification algorithms for decision trees. J48 is one of the popular classification algorithms, and it outputs a decision tree. Using the Classify tab, the user can visualize the decision tree. If the decision tree is too populated, it can be simplified from the Preprocess tab by removing the attributes that are not required and starting the classification process again.


EXP #5:
DEMONSTRATE THE KNN CLASSIFIER FOR THE IONOSPHERE
DATASET USING WEKA.
Ionosphere Dataset

Let’s start out by selecting the dataset.

1. In the “Datasets” section, click the “Add new…” button.

2. Open the “data” directory and choose the “ionosphere.arff” dataset.

The Ionosphere Dataset is a classic machine learning dataset. The problem is to predict the presence (or absence) of free-electron structure in the ionosphere given radar signals. It comprises 17 pairs of real-valued radar signals (34 attributes) and a single class attribute with two values: good and bad radar returns.

You can read more about this problem on the UCI Machine Learning Repository page for the
Ionosphere dataset.

Tuning k-Nearest Neighbour

In this experiment we are interested in tuning the k-nearest neighbour algorithm (kNN) on the dataset. In Weka this algorithm is called IBk (Instance-Based learner).

The IBk algorithm does not build a model; instead, it generates a prediction for each test instance just-in-time. It uses a distance measure to locate the k “closest” instances in the training data for each test instance and uses those selected instances to make the prediction.

In this experiment, we want to find out which distance measure to use in the IBk algorithm on the Ionosphere dataset. We will add 3 versions of this algorithm to our experiment:

Euclidean Distance

1. Click “Add new…” in the “Algorithms” section.

2. Click the “Choose” button.

3. Click “IBk” under the “lazy” selection.

4. Click the “OK” button on the “IBk” configuration.

This will add the IBk algorithm with Euclidean distance, the default distance measure.

Manhattan Distance


1. Click “Add new…” in the “Algorithms” section.

2. Click the “Choose” button.

3. Click “IBk” under the “lazy” selection.

4. Click on the name of the “nearestNeighbourSearchAlgorithm” in the configuration for IBk.

5. Click the “Choose” button for the “distanceFunction” and select “ManhattanDistance”.

6. Click the “OK” button on the “nearestNeighbourSearchAlgorithm” configuration.

7. Click the “OK” button on the “IBk” configuration.

Select a distance measure for IBk

This will add the IBk algorithm with Manhattan Distance, also known as city block distance.

Chebyshev Distance

1. Click “Add new…” in the “Algorithms” section.

2. Click the “Choose” button.

3. Click “IBk” under the “lazy” selection.

4. Click on the name of the “nearestNeighbourSearchAlgorithm” in the configuration for IBk.

5. Click the “Choose” button for the “distanceFunction” and select “ChebyshevDistance”.


6. Click the “OK” button on the “nearestNeighbourSearchAlgorithm” configuration.

7. Click the “OK” button on the “IBk” configuration.

This will add the IBk algorithm with Chebyshev Distance, also known as chessboard distance.
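
The same three-way comparison can be scripted with the Weka Java API. The sketch below is illustrative only: the Experimenter repeats 10-fold cross-validation 10 times, whereas this sketch performs a single 10-fold run, and the file path and class name are our assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.ChebyshevDistance;
import weka.core.DistanceFunction;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.ManhattanDistance;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.neighboursearch.LinearNNSearch;

public class IonosphereKnn {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ionosphere.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1); // good/bad class

        DistanceFunction[] distances = {
                new EuclideanDistance(), new ManhattanDistance(), new ChebyshevDistance()
        };
        for (DistanceFunction d : distances) {
            IBk knn = new IBk(1); // k = 1, the IBk default
            LinearNNSearch search = new LinearNNSearch();
            search.setDistanceFunction(d); // the "distanceFunction" set in the GUI
            knn.setNearestNeighbourSearchAlgorithm(search);

            // Single run of 10-fold cross-validation
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 10, new Random(1));
            System.out.printf("%-18s accuracy = %.2f%%%n",
                    d.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}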

4. Run Experiment

Click the “Run” tab at the top of the screen.

This tab is the control panel for running the currently configured experiment.

Click the big “Start” button to start the experiment and watch the “Log” and “Status” sections
to keep an eye on how it is doing.

5. Review Results

Click the “Analyse” tab at the top of the screen.

This will open up the experiment results analysis panel.

Algorithm Rank

The first thing we want to know is which algorithm was the best. We can do that by ranking
the algorithms by the number of times a given algorithm beat the other algorithms.

1. Click the “Select” button for the “Test base” and choose “Ranking“.

2. Now Click the “Perform test” button.

The ranking table shows the number of statistically significant wins each algorithm has had against all other algorithms on the dataset. A win means an accuracy that is better than the accuracy of another algorithm, where the difference was statistically significant.


Algorithm ranking in the Weka explorer for the Ionosphere dataset

We can see that the Manhattan Distance variation is ranked at the top and that the Euclidean Distance variation is ranked at the bottom. This is encouraging: it looks like we have found a configuration that is better than the algorithm’s default for this problem.

Algorithm Accuracy

Next we want to know what scores the algorithms achieved.

1. Click the “Select” button for the “Test base”, choose the “IBk” algorithm with “Manhattan Distance” in the list, and click the “Select” button.

2. Click the check-box next to “Show std. deviations”.

3. Now click the “Perform test” button.

In the “Test output” we can see a table with the results for the 3 variations of the IBk algorithm. Each algorithm was run 10 times on the dataset, and the accuracy reported is the mean and the standard deviation (in brackets) of those 10 runs.


Table of algorithm classification accuracy on the Ionosphere dataset in the Weka Explorer

We can see that IBk with Manhattan Distance achieved an accuracy of 90.74% (+/- 4.57%)
which was better than the default of Euclidean Distance that had an accuracy of 87.10% (+/-
5.12%).

The little “*” next to the result for IBk with Euclidean Distance tells us that the accuracy results for the Manhattan Distance and Euclidean Distance variations of IBk were drawn from different populations, i.e. that the difference in the results is statistically significant.

We can also see that there is no “*” for the results of IBk with Chebyshev Distance, indicating that the difference in the results between the Manhattan Distance and Chebyshev Distance variations of IBk was not statistically significant.

Summary

In this experiment you saw how to configure a machine learning experiment with one dataset and three variations of an algorithm in Weka, and how to use the Weka Experimenter to tune the parameters of a machine learning algorithm on a dataset and analyse the results.

EXP #6:
DEMONSTRATE CLUSTERING ALGORITHMS FOR THE
IRIS DATASET USING WEKA.
A clustering algorithm finds groups of similar instances in the entire dataset. WEKA
supports several clustering algorithms such as EM, FilteredClusterer, HierarchicalClusterer,
SimpleKMeans and so on. You should understand these algorithms completely to fully
exploit the WEKA capabilities.

As in the case of classification, WEKA allows you to visualize the detected clusters
graphically. To demonstrate the clustering, we will use the provided iris database. The data
set contains three classes of 50 instances each. Each class refers to a type of iris plant.

Loading Data

In the WEKA explorer, select the Preprocess tab. Click on the Open file... option and select the iris.arff file in the file selection dialog. When you load the data, the screen looks as shown below −


You can observe that there are 150 instances and 5 attributes. The names of attributes are
listed as sepallength, sepalwidth, petallength, petalwidth and class. The first four
attributes are of numeric type while the class is a nominal type with 3 distinct values.
Examine each attribute to understand the features of the database. We will not do any
preprocessing on this data and straight-away proceed to model building.

Clustering

Click on the Cluster TAB to apply the clustering algorithms to our loaded data. Click on
the Choose button. You will see the following screen −


Now, select EM as the clustering algorithm. In the Cluster mode sub-window, select the Classes to clusters evaluation option as shown in the screenshot below −


Click on the Start button to process the data. After a while, the results will be presented on
the screen.
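
The same classes-to-clusters run can be reproduced with the Weka Java API. Below is a minimal sketch, assuming iris.arff is in the working directory (the class name IrisEmClustering is ours); the class attribute is hidden from the clusterer during training, which is what the Classes to clusters evaluation mode does internally.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class IrisEmClustering {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        // EM must not see the class attribute while building the clusters
        Remove remove = new Remove();
        remove.setAttributeIndices("" + (data.classIndex() + 1));
        remove.setInputFormat(data);
        Instances dataNoClass = Filter.useFilter(data, remove);

        EM em = new EM(); // number of clusters chosen by cross-validation by default
        em.buildClusterer(dataNoClass);

        // Classes-to-clusters evaluation: map clusters back onto the class labels
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(em);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}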

Next, let us study the results.

Examining Output

The output of the data processing is shown in the screen below −


From the output screen, you can observe that −

• The algorithm detected 5 clusters in the database.
• Cluster 0 represents setosa, Cluster 1 represents virginica, and Cluster 2 represents versicolor, while the last two clusters do not have any class associated with them.

If you scroll up the output window, you will also see some statistics that give the mean and standard deviation of each attribute in the various detected clusters. This is shown in the screenshot given below −


Next, we will look at the visual representation of the clusters.

Visualizing Clusters

To visualize the clusters, right click on the EM result in the Result list. You will see the
following options −


Select Visualize cluster assignments. You will see the following output −


As in the case of classification, you will notice the distinction between the correctly and
incorrectly identified instances. You can play around by changing the X and Y axes to
analyze the results. You may use jittering as in the case of classification to find out the
concentration of correctly identified instances. The operations in visualization plot are
similar to the one you studied in the case of classification.

Applying Hierarchical Clusterer

To demonstrate the power of WEKA, let us now look into an application of another clustering algorithm. In the WEKA explorer, select the HierarchicalClusterer as your ML algorithm as shown in the screenshot below −


Set the Cluster mode selection to Classes to clusters evaluation, and click on the Start button. You will see the following output −


Notice that in the Result list there are two results listed: the first one is the EM result and the second one is the current HierarchicalClusterer result. Likewise, you can apply multiple ML algorithms to the same dataset and quickly compare their results.
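
As a sketch of how this second run could also be scripted, the snippet below applies HierarchicalClusterer with the same classes-to-clusters evaluation; fixing the number of clusters at 3 (one per iris species) and the file path are our assumptions.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.HierarchicalClusterer;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class IrisHierarchical {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        // Hide the class attribute from the clusterer, as in the EM sketch
        Remove remove = new Remove();
        remove.setAttributeIndices("" + (data.classIndex() + 1));
        remove.setInputFormat(data);
        Instances dataNoClass = Filter.useFilter(data, remove);

        HierarchicalClusterer hc = new HierarchicalClusterer();
        hc.setNumClusters(3); // one cluster per iris species (our choice)
        hc.buildClusterer(dataNoClass);

        // Classes-to-clusters evaluation against the true species labels
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(hc);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}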

If you examine the tree produced by this algorithm, you will see the following output −


EXP #7:
EXPLAIN THE PROCESS OF DATA PREPROCESSING IN
WEKA.
Step 1: Data Pre-Processing or Cleaning

1. Launch Weka -> click on Explorer.

2. Load a dataset (click on “Open File” and locate the data file).

3. Click on the Preprocess tab, then in the drop-down at the lower right-hand side of the window choose “No Class”.

4. Click on the “Edit” tab; a new window opens up that shows the loaded data file. By looking at your dataset you can also find out whether or not there are missing values in it. Also note the attribute types on the column headers: each will be either ‘nominal’ or ‘numeric’.


1) If your data has missing values then it’s best to clean it first before you apply any form of mining algorithm to it. Please look below at Figure 1: the highlighted fields are blank, which means the data at hand is dirty and first needs to be cleaned.

Figure 1

2) Data Cleaning: To clean the data, you apply “Filters” to it. Generally the data will have missing values, so the filter to apply is “ReplaceMissingWithUserConstant” (the filter choice may vary according to your need; for more information please consult the resources). Click on the Choose button below Filters -> unsupervised -> attribute -> ReplaceMissingWithUserConstant.

Please refer below to Figure 2 to see how to edit the filter values.


Figure: 2

A good choice for replacing missing numeric values is a constant such as -1 or 0, and for string values it could be NULL. Refer to Figure 3.

Figure: 3
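
This cleaning step can also be scripted. The sketch below is illustrative and assumption-laden: the file name is a placeholder, and the setter names mirror the numericReplacementValue and nominalStringReplacementValue fields shown in the filter’s object editor (see Figure 3), so verify them against your Weka version’s Javadoc.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingWithUserConstant;

public class CleanMissing {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dirty-data.arff"); // placeholder file name

        ReplaceMissingWithUserConstant filter = new ReplaceMissingWithUserConstant();
        // Setter names assumed from the object editor fields shown in Figure 3
        filter.setNumericReplacementValue("-1");         // missing numerics become -1
        filter.setNominalStringReplacementValue("NULL"); // missing nominal/string values become NULL
        filter.setInputFormat(data);

        Instances cleaned = Filter.useFilter(data, filter);
        System.out.println("Cleaned instances: " + cleaned.numInstances());
    }
}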

It’s worthwhile to also know how to check the total number of data values or instances in
your dataset.

Refer to Figure: 4.


Figure: 4

As you can see in Figure 4, the number of instances is 345446. The reason you should know this is that later, when we apply clustering to this data, Weka may crash with an “OutOfMemory” problem.

It follows that we need to partition or sample the dataset so that we have a smaller subset that Weka can process. For this we again use the Filter option.

4.3 Sampling the Dataset: Click Filters -> unsupervised -> instance, and then you can choose any of the following options:

1. RemovePercentage – removes a given percentage from the dataset

2. RemoveRange – removes a given range of instances from the dataset


3. RemoveWithValues

4. Resample

5. ReservoirSample

To learn about each of these, place your mouse cursor on its name and a tool-tip will appear explaining it.

For this dataset I’m using the “ReservoirSample” filter. In my experiments I have found that Weka is unable to handle sizes equal to or greater than 999999, so when sampling your data I suggest choosing a sample size less than or equal to 9999. The default value of the sample size is 100. Change it to 9999 as shown below in Figure 5, and then click on the Apply button to apply the filter to the dataset. Once the filter has been applied, look at the Instances value (also shown in Figure 6): you will see that the sample size is now 9999, compared to the previous complete instance count of 345446.

Figure: 5


Figure: 6
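
The sampling step can likewise be scripted. Below is a minimal sketch with a placeholder file name; sampleSize mirrors the field edited in Figure 5.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.ReservoirSample;

public class SampleDown {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("big-data.arff"); // placeholder file name

        ReservoirSample sampler = new ReservoirSample();
        sampler.setSampleSize(9999); // down from e.g. 345446 instances
        sampler.setRandomSeed(1);    // reproducible sample
        sampler.setInputFormat(data);

        Instances sample = Filter.useFilter(data, sampler);
        System.out.println("Sampled instances: " + sample.numInstances()); // 9999
    }
}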

If you now click on the “Edit” tab at the top of the Explorer screen, you will see the cleaned dataset: all missing values have been replaced with your user-specified constants. Please see Figure 7 below. Congratulations! Step 1 of data pre-processing or cleaning has been completed.

Figure: 7

It’s always a good idea to save the cleaned dataset. To do so, click on the save button as shown below in Figure 8.

Figure: 8
