
VAAGDEVI COLLEGE OF ENGINEERING

(AUTONOMOUS)

(B18CS31) DATA MINING AND SE LAB MANUAL

B.Tech : VI SEMESTER    L T P C : 0 0 3 1.5

Pre-Requisites:
A course on “Computer Programming and Data Structures”
A course on “Object Oriented Programming Through Java”

Course Objectives:
This practical course is designed to enable students to:

• Design a data warehouse and implement OLAP operations.
• Gain exposure to applications of data warehousing.
• Perform data mining functionalities such as association rule mining, classification and
clustering.
• Gain hands-on experience in developing a software project by applying software
engineering principles and methods in each of the phases of software development.

Syllabus Content

Part A

Week 1: Design a data warehouse for auto sales analysis.

Week 2: Perform OLAP operations on auto sales data warehouse.

Week 3: Perform Data Preprocessing:

a) Data Selection and Loading.
b) Handling Missing Values.
c) Creating an ARFF file.
Week 4: a) Introduction to WEKA Explorer.
b) Implement Apriori Algorithm using supermarket data.

Week 5: Implement FP-Growth Algorithm using Super market data.

Week 6: Implement the following Tree based classification Algorithms on sample dataset:
a) ID3
b) C4.5

Week 7: Implement Naive Bayesian Classification Algorithm on sample dataset.

Week 8: Implement the following Clustering Algorithms on sample data set:


a) K-Means
b) DBSCAN
Part B

Week 9-15:

LIST OF EXPERIMENTS

Do the following 7 exercises for any two projects given in the list of sample projects
or any other projects:
1. Development of problem statement.
2. Preparation of Software Requirement Specification Document, Design Documents
and Testing Phase related documents.
3. Preparation of Software Configuration Management and Risk Management related
documents.
4. Study and usage of any Design phase CASE tool
5. Performing the Design by using any Design phase CASE tools.
6. Develop test cases for unit testing and integration testing
7. Develop test cases for various white box and black box testing

Sample Projects:
1. Passport automation System
2. Book Bank
3. Online Exam Registration
4. Stock Maintenance System
5. Online course reservation system
6. E-ticketing
7. Software Personnel Management System
8. Credit Card Processing
9. E-book management System.
10. Recruitment system

Course Outcomes:

• Develop a design of data warehouse and implement OLAP operations.


• Explore WEKA for data mining tasks such as association rule mining, classification and
clustering using a few algorithms from the respective task.
• Explore text mining using WEKA and apply classification using the Naive Bayes technique.
• Will have experience and/or awareness of testing problems and will be able to develop
a simple testing report.

TEXT BOOKS:
1. Data Mining – Concepts and Techniques, Jiawei Han & Micheline Kamber, Morgan
Kaufmann Publishers, Elsevier, 2nd Edition, 2006.
2. Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach and Vipin Kumar,
Pearson Education.
3. Software Engineering, A Practitioner's Approach, Roger S. Pressman, 6th Edition, McGraw
Hill International Edition.
4. Software Engineering, Sommerville, 7th Edition, Pearson Education.
5. The Unified Modeling Language User Guide, Grady Booch, James Rumbaugh, Ivar Jacobson,
Pearson Education.
6. Ilene Burnstein, "Practical Software Testing", Springer International Edition, 2003.

REFERENCE BOOKS:

1. Data Mining Techniques, Arun K Pujari, 2nd Edition, Universities Press.
2. Data Warehousing in the Real World, Sam Anahory & Dennis Murray, Pearson Education Asia.
3. Insight into Data Mining, K.P. Soman, S. Diwakar, V. Ajay, PHI, 2008.
4. Data Warehousing Fundamentals, Paulraj Ponniah, Wiley Student Edition.
Week 1: Design a Data Warehouse for auto sales analysis

Solution:

Steps to design a data warehouse for auto sales analysis are as follows :

Step 1: Open SQL Server Management Studio 2008 and click on Connect.
Step 2: Now a connection is established with the Database Engine.
Step 3: Right click on Databases and select New Database (to create our own database).

Step 4: Type any database name (which we want to create) and select OK.
Step 5: Our own database is created. Right click on the database which was created and select
New Query.

Step 6: Type the following SQL script in the New Query editor.

USE DemoDW
GO

CREATE TABLE DimProduct
(ProductKey int Identity NOT NULL PRIMARY KEY,
 ProductAltKey nvarchar(10) NOT NULL,
 ProductName nvarchar(50) NULL,
 ProductDescription nvarchar(100) NULL,
 ProductCategoryName nvarchar(50))
GO

CREATE TABLE DimCustomer
(CustomerKey int Identity NOT NULL PRIMARY KEY,
 CustomerAltKey nvarchar(10) NOT NULL,
 CustomerName nvarchar(50) NULL,
 CustomerEmail nvarchar(50) NULL,
 CustomerGeographyKey int NULL)
GO

CREATE TABLE DimSalesperson
(SalespersonKey int Identity NOT NULL PRIMARY KEY,
 SalesPersonAltKey nvarchar(10) NOT NULL,
 SalespersonName nvarchar(50) NULL,
 StoreName nvarchar(50) NULL,
 StoreGeographyKey int NULL)
GO

CREATE TABLE DimDate
(DateKey int Identity NOT NULL PRIMARY KEY,
 DateAltKey datetime NOT NULL,
 CalenderYear int NOT NULL,
 CalenderQuarter int NOT NULL,
 MonthOfYear int NOT NULL,
 [MonthName] nvarchar(15) NOT NULL,
 [DayOfMonth] int NOT NULL,
 [DayOfWeek] int NOT NULL,
 [DayName] nvarchar(15) NOT NULL,
 FiscalYear int NOT NULL,
 FiscalQuarter int NOT NULL)
GO

CREATE TABLE FactSalesOrder
(ProductKey int NOT NULL REFERENCES DimProduct(ProductKey),
 CustomerKey int NOT NULL REFERENCES DimCustomer(CustomerKey),
 SalespersonKey int NOT NULL REFERENCES DimSalesperson(SalespersonKey),
 OrderDateKey int NOT NULL REFERENCES DimDate(DateKey),
 OrderNo int NOT NULL,
 ItemNo int NOT NULL,
 Quantity int NOT NULL,
 SalesAmount money NOT NULL,
 Cost money NOT NULL,
 CONSTRAINT [PK_FactSalesOrder] PRIMARY KEY
 ([ProductKey],[CustomerKey],[SalespersonKey],[OrderDateKey],[OrderNo],[ItemNo])
)
GO
Once finished typing the SQL Script click on Execute button.
Step 7 : After successful execution, select database which we have created -> Database
Diagrams->New Database Diagram.

Step 8 : Select the tables which we want to add to the data warehouse design and select Add
and then close.
Step 9: The following Auto Sales Analysis Data warehouse design will be obtained.
Week 2: Perform OLAP operations:

Solution :

Developing an OLAP Cube

For creation of OLAP Cube in Microsoft BIDS Environment, follow the 10 steps given below.

Step 1: Start BIDS Environment

Click on Start Menu -> Microsoft SQL Server 2008 R2 -> Click SQL Server Business
Intelligence Development Studio.
Step 2: Start Analysis Services Project

Click File -> New -> Project ->Business Intelligence Projects ->select Analysis Services
Project-> Assign Project Name -> Click OK
Step 3: Creating New Data Source

3.1 In Solution Explorer, Right click on Data Source -> Click New Data Source
3.2 Click on Next
3.3 Click on New Button
3.4 Creating New connection

1. Specify Your SQL Server Name where your Data Warehouse was created
2. Select Radio Button according to your SQL Server Authentication mode
3. Specify your Credentials using which you can connect to your SQL Server
4. Select database Sales_DW.
5. Click on Test Connection and verify for its success
6. Click OK.
3.5 Select Connection created in Data Connections-> Click Next

3.6 Select Option Inherit


3.7 Assign Data Source Name -> Click Finish
Step 4: Creating New Data Source View

4.1 In the Solution Explorer, Right Click on Data Source View -> Click on New Data Source
View
4.2 Click Next

4.3 Select Relational Data Source we have created previously (Sales_DW)-> Click Next
4.4 First move your Fact Table to the right side to include in object list.

Select FactProductSales Table -> Click on Arrow Button to move the selected object to Right
Pane.
4.5 Now to add the dimensions which are related to your Fact Table, follow the given steps: Select the
Fact Table in the Right Pane (FactProductSales) -> Click on Add Related Tables

4.6 It will add all associated dimensions to your Fact table as per relationship specified in your
SQL DW (Sales_DW).

Click Next.
4.7 Assign Name (SalesDW DSV)-> Click Finish

4.8 Now Data Source View is ready to use.


Step 5: Creating New Cube

5.1 In Solution Explorer -> Right Click on Cube-> Click New Cube
5.2 Click Next

5.3 Select Option Use existing Tables -> Click Next


5.4 Select Fact Table Name from Measure Group Tables (FactProductSales) -> Click Next

5.5 Choose Measures from the List which you want to place in your Cube --> Click Next
5.6 Select All Dimensions here which are associated with your Fact Table-> Click Next

5.7 Assign Cube Name (SalesAnalyticalCube) -> Click Finish


5.8 Now your Cube is ready, you can see the newly created cube and dimensions added in your
solution explorer.
Step 6: Dimension Modification

In Solution Explorer, double click on dimension Dim Product -> Drag and Drop Product Name
from Table in Data Source View and Add in Attribute Pane at left side.
Step 7: Creating Attribute Hierarchy In Date Dimension

Double click On Dim Date dimension -> Drag and Drop Fields from Table shown in Data
Source View to Attributes-> Drag and Drop attributes from leftmost pane of attributes to middle
pane of Hierarchy.

Drag fields in sequence from Attributes to Hierarchy window (Year, Quarter Name, Month
Name, Week of the Month, Full Date UK),
Step 8: Deploy the Cube

8.1 In Solution Explorer, right click on Project Name (SalesDataAnalysis) -- > Click Properties
8.2 Set Deployment Properties First

In Configuration Properties, select Deployment -> assign your SQL Server instance name
where Analysis Services is installed (in the form MachineName\InstanceName) ->
choose Deployment Mode "Deploy All" for now -> select Processing Option "Do Not Process" ->
click OK
8.3 In Solution Explorer, right click on Project Name (SalesDataAnalysis) -- > Click Deploy
8.4 Once the deployment finishes, you can see the message Deployment Completed in the
deployment properties.
Step 9: Process the Cube

9.1 In Solution Explorer, right click on Project Name (SalesDataAnalysis) -- > Click Process
9.2 Click on Run button to process the Cube
9.3 Once processing is complete, you can see Status as Process Succeeded -->Click Close to
close both the open windows for processing one after the other.
Step 10: Browse the Cube for Analysis

10.1 In Solution Explorer, right click on Cube Name (SalesDataAnalysisCube) -- > Click
Browse

10.2 Drag and drop measures into the Detail field, and drag and drop dimension attributes into the
Row or Column fields.

Now, to browse our cube:

1. Drag & drop Product Name into the Column field.
2. Drag & drop Full Date UK into the Row field.
3. Drop the FactProductSalesCount measure into the Detail area.
Week 3: Perform Data Preprocessing: a) Data Selection and Loading b) Handling Missing
Values c) Creating an ARFF file.

(a) Data Selection and Loading Data

The first four buttons at the top of the preprocess section (Figure 1) enable you to
load data into WEKA:

1. Open file: Brings up a dialog box allowing you to browse for the data file on the
local file system.

2. Open URL: Asks for a Uniform Resource Locator address for where the data is
stored.

3. Open DB: Reads data from a database.

4. Generate: Enables you to generate artificial data from a variety of Data Generators.

Using the Open file... button you can read files in a variety of formats:
WEKA's ARFF format, CSV.

Figure 1

Loading Data from the Local File System (using the "Open file" option):

Click on the Open file... button (Figure 1). A directory browser window will open as shown in the
following screen.

Now navigate to the folder where your data files are stored. The WEKA installation offers many
sample datasets for you to experiment with. These are available in the data directory of the WEKA
installation. For training, select any data file in this folder. The content of the file will be loaded
into the WEKA environment.
Loading data from the web (using the "Open URL" option):

Once you click on the Open URL... button (Figure 2), you can see a window like the following.

We will open a file from a public URL. Type the following URL in the pop-up box:

https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.nominal.arff

We can specify any other URL where your data is stored. The Explorer will load the data from the
remote site into its environment.

Loading data from a database:

Data can also be read from an SQL database using JDBC. Click on the 'Open DB…' button and the
'GenericObjectEditor' appears on the screen.
To read data from a database, click on the 'Open' button and select the database from the file system.
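Whatever the source, the loaded data ends up as a set of instances inside WEKA. The same load can also be scripted with WEKA's Java API. This is a minimal sketch, assuming weka.jar is on the classpath; the file name is illustrative and should point to a local copy of a sample dataset.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArffDemo {
    public static void main(String[] args) throws Exception {
        // DataSource picks the right converter (ARFF, CSV, ...) from the file extension
        DataSource source = new DataSource("weather.nominal.arff"); // illustrative file name
        Instances data = source.getDataSet();
        // Print a short summary: relation name, number of instances and attributes
        System.out.println(data.relationName() + ": " + data.numInstances()
                + " instances, " + data.numAttributes() + " attributes");
    }
}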

(b) Handling missing values

Data is rarely clean and often you can have corrupt or missing values. It is important to identify,
mark and handle missing data in order to get the very best performance.

Methods for handling missing values are:

i. Mark Missing Values

ii. Remove Missing Data

iii. Impute Missing Values

To demonstrate the above methods, the “Pima Indians onset of diabetes” dataset is used, which can
be accessed using the following link:
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv
Description of the dataset:
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names

i. Mark Missing Values

The Pima Indians dataset is a good basis for exploring missing data.

Some attributes such as blood pressure (pres) and Body Mass Index (mass) have values of zero,
which are impossible. These are examples of corrupt or missing data that must be marked
manually.

We can mark missing values in Weka using the NumericCleaner filter. The steps below show
how to use this filter to mark the 11 missing values on the Body Mass Index (mass) attribute.

1. Open the Weka Explorer.

2. Load the Pima Indians onset of diabetes dataset.

3. Click the “Choose” button for the Filter and select NumericCleaner; it is under
unsupervised.attribute.NumericCleaner.
(Screenshot: selecting the NumericCleaner data filter)

4. Click on the filter to configure it.

5. Set attributeIndices to 6, the index of the mass attribute.

6. Set minThreshold to 0.1E-8 (close to zero), which is the minimum value allowed for the
attribute.

7. Set minDefault to NaN, which is unknown and will replace values below the threshold.

8. Click the “OK” button on the filter configuration.

9. Click the “Apply” button to apply the filter.

Click “mass” in the “attributes” pane and review the details of the “selected attribute”. Notice that
the 11 attribute values that were formerly set to 0 are now marked as Missing.
(Screenshot: missing data marked in Weka)

In this example we marked values below a threshold as missing.

We could just as easily mark them with a specific numerical value. We could also mark values
missing between an upper and lower range of values.
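The same marking step can be scripted with WEKA's Java API. This is a minimal sketch, not a required part of the exercise; it assumes a local copy of the diabetes CSV and the standard setter names derived from the NumericCleaner options (attributeIndices, minThreshold, minDefault).

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericCleaner;

public class MarkMissingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("pima-indians-diabetes.csv").getDataSet();
        NumericCleaner cleaner = new NumericCleaner();
        cleaner.setAttributeIndices("6");   // the mass attribute (setter names assumed from the -R/-min/-min-default options)
        cleaner.setMinThreshold(0.1E-8);    // values below this threshold are replaced
        cleaner.setMinDefault(Double.NaN);  // NaN marks the replaced value as missing
        cleaner.setInputFormat(data);
        Instances marked = Filter.useFilter(data, cleaner);
        // mass is attribute index 5 when counting from 0
        System.out.println(marked.attributeStats(5).missingCount + " missing values in mass");
    }
}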
ii. Remove Missing Data

Now that we know how to mark missing values in our data, we need to learn how to handle them.

A simple way to handle missing data is to remove those instances that have one or more missing
values.

We can do this in Weka using the RemoveWithValues filter.

We can remove missing values as follows:

1. Click the “Choose” button for the Filter and select RemoveWithValues; it is under
unsupervised.instance.RemoveWithValues.

(Screenshot: selecting the RemoveWithValues data filter)


2. Click on the filter to configure it.

3. Set attributeIndex to 6, the index of the mass attribute.

4. Set matchMissingValues to “True”.

5. Click the “OK” button to use the configuration for the filter.

6. Click the “Apply” button to apply the filter.

Click “mass” in the “attributes” section and review the details of the “selected attribute”.

Notice that the 11 attribute values that were marked Missing have been removed from the
dataset.

(Screenshot: missing values removed in Weka)

Note: We can undo this operation by clicking the “Undo” button.
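A scripted equivalent is sketched below; it assumes the previous marking step has already been applied and saved (the file name pima-marked.arff is an assumption), and that the setter names mirror the filter's -C and -M options.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class RemoveMissingDemo {
    public static void main(String[] args) throws Exception {
        // Dataset with missing values already marked (assumed file name)
        Instances data = new DataSource("pima-marked.arff").getDataSet();
        RemoveWithValues remove = new RemoveWithValues();
        remove.setAttributeIndex("6");        // the mass attribute
        remove.setMatchMissingValues(true);   // instances with a missing mass value are matched and removed
        remove.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, remove);
        System.out.println((data.numInstances() - cleaned.numInstances()) + " instances removed");
    }
}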

iii. Impute Missing Values


Instances with missing values do not have to be removed; we can replace the missing
values with some other value.

This is called imputing missing values.

It is common to impute missing values with the mean of the numerical distribution. You
can do this easily in Weka using the ReplaceMissingValues filter.

We can impute the missing values as follows:

1. Click the “Choose” button for the Filter and select ReplaceMissingValues; it is under
unsupervised.attribute.ReplaceMissingValues.

2. Click the “Apply” button to apply the filter to your dataset.

Click “mass” in the “attributes” section and review the details of the “selected attribute”.
Notice that the 11 attribute values that were marked Missing have been set to the mean value of
the distribution.

(Screenshot: imputed values in Weka)
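The imputation can likewise be scripted. A minimal sketch is shown below; the input file name is an assumption (a dataset in which the missing values have already been marked), and ReplaceMissingValues substitutes the mean for numeric attributes and the mode for nominal ones.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class ImputeMissingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("pima-marked.arff").getDataSet(); // assumed file name
        ReplaceMissingValues impute = new ReplaceMissingValues();
        impute.setInputFormat(data);
        Instances imputed = Filter.useFilter(data, impute);
        // mass is attribute index 5 when counting from 0; all its missing values should now be filled
        System.out.println(imputed.attributeStats(5).missingCount + " missing values left in mass");
    }
}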

(c) Creating an ARFF File:


An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a
list of instances sharing a set of attributes. ARFF files were developed by the Machine
Learning Project at the Department of Computer Science of The University of
Waikato

ARFF files have two distinct sections. The first section is the Header information,
which is followed by the Data information.

The Header of the ARFF file contains the name of the relation, a list of the attributes
(the columns in the data), and their types. An example header on the standard IRIS
dataset looks like this:

% 1. Title: Iris Plants Database


%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris

@ATTRIBUTE sepallength NUMERIC


@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

The Data of the ARFF file looks like the following:

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa

Lines that begin with a % are comments.


The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.

The ARFF Header Section


The ARFF Header section of the file contains the relation declaration and attributes
declarations.

The @relation Declaration


The relation name is defined as the first line in the ARFF file. The format is:

@relation <relation-name>

Where <relation-name> is a string. The string must be quoted if the name includes spaces.

The @attribute Declarations


Attribute declarations take the form of an ordered sequence of @attribute statements.
Each attribute in the data set has its own @attribute statement which uniquely defines
the name of that attribute and its data type. The order in which the attributes are declared
indicates the column position in the data section of the file. For example, if an attribute
is the third one declared then Weka expects that all of that attribute's values will be found
in the third comma-delimited column.

The format for the @attribute statement is:

@attribute <attribute-name> <datatype>

Where the <attribute-name> must start with an alphabetic character. If spaces are to be included
in the name then the entire name must be quoted.

The <datatype> can be any of the four types currently supported by Weka:

• numeric
• <nominal-specification>
• string
• date [<date-format>]

Where <nominal-specification> and <date-format> are defined below. The
keywords numeric, string and date are case insensitive.

Numeric attributes: Numeric attributes can be real or integer numbers.

Nominal attributes: Nominal values are defined by providing a <nominal-specification>
listing the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
For example, the class value of the Iris dataset can be defined as follows:

@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Values that contain spaces must be quoted.

String attributes: String attributes allow us to create attributes containing arbitrary
textual values. This is very useful in text-mining applications, as we can create datasets
with string attributes, then write Weka Filters to manipulate strings (like
StringToWordVectorFilter). String attributes are declared as follows:

@ATTRIBUTE LCC string

Date attributes: Date attribute declarations take the form:

@attribute <name> date [<date-format>]

where <name> is the name for the attribute and <date-format> is an optional string
specifying how date values should be parsed and printed (this is the same format used
by SimpleDateFormat). The default format string accepts the ISO-8601 combined date
and time format: "yyyy-MM-dd'T'HH:mm:ss".

Dates must be specified in the data section as the corresponding string representations
of the date/time.

ARFF Data Section

The ARFF Data section of the file contains the data declaration line and the actual
instance lines.

The @data Declaration


The @data declaration is a single line denoting the start of the data segment in the
file. The format is:

@data

The instance data


Each instance is represented on a single line, with carriage returns denoting the end of
the instance.
Attribute values for each instance are delimited by commas. They must appear in the
order that they were declared in the header section (i.e. the data corresponding to the
nth @attribute declaration is always the nth field of the attribute).

Missing values are represented by a single question mark, as in:

@data
4.4,?,1.5,?,Iris-setosa

Values of string and nominal attributes are case sensitive, and any that contain space
must be quoted, as follows:

@relation LCCvsLCSH

@attribute LCC string


@attribute LCSH string

@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'

Dates must be specified in the data section using the string representation specified in
the attribute declaration. For example:

@RELATION Timestamps

@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"

@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"

Sparse ARFF files

Sparse ARFF files are very similar to ARFF files, but data with value 0 are not
explicitly represented.

Sparse ARFF files have the same header (i.e @relation and @attribute tags) but the
data section is different. Instead of representing each value in order, like this:

@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

The non-zero attributes are explicitly identified by attribute number and their value
stated, like this:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}

Each instance is surrounded by curly braces, and the format for each entry is: <index>
<space> <value> where index is the attribute index (starting from 0).

Note that the omitted values in a sparse instance are 0; they are not "missing" values! If
a value is unknown, you must explicitly represent it with a question mark (?).
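ARFF files are plain text and are usually written by hand or exported from a tool, but they can also be generated programmatically. A minimal sketch using WEKA's Java API (version 3.7 or later) is given below; the relation, attributes, values and output file name are illustrative.

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.converters.ArffSaver;

public class CreateArffDemo {
    public static void main(String[] args) throws Exception {
        // Header: two numeric attributes and one nominal class attribute
        ArrayList<Attribute> attrs = new ArrayList<Attribute>();
        attrs.add(new Attribute("sepallength"));
        attrs.add(new Attribute("sepalwidth"));
        attrs.add(new Attribute("class", Arrays.asList("Iris-setosa", "Iris-versicolor")));
        Instances data = new Instances("iris-mini", attrs, 0);
        data.setClassIndex(data.numAttributes() - 1);

        // One instance: values appear in the order the attributes were declared
        double[] vals = {5.1, 3.5, data.attribute("class").indexOfValue("Iris-setosa")};
        data.add(new DenseInstance(1.0, vals));

        // Write the @relation/@attribute header and the @data section to disk
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("iris-mini.arff"));
        saver.writeBatch();
    }
}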

Week 4: (a) Introduction to Weka Explorer

b) Implement Apriori Algorithm using supermarket data.


Solution :

(a) Introduction to WEKA Explorer:

WEKA is open-source Java software created by researchers at the University of Waikato in New
Zealand. It provides many different machine learning algorithms, including the following
classifiers:

• Decision tree (J48, an implementation of C4.5)
• MLP, aka multilayer perceptron (a type of neural net)
• Naïve Bayes
• Rule induction algorithms such as JRip
• Support vector machine
And many more…

The GUI WEKA


The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching
Weka’s main GUI applications and supporting tools. The GUI Chooser consists of four buttons—
one for each of the four major Weka applications—and four menus.

The buttons can be used to start the following applications:


• Explorer : An environment for exploring data with WEKA (the rest of this documentation deals
with this application in more detail).
• Experimenter: An environment for performing experiments and conducting statistical tests
between learning schemes.
• KnowledgeFlow: This environment supports essentially the same functions as the Explorer but
with a drag-and-drop interface. One advantage is that it supports incremental learning.
• SimpleCLI: Provides a simple command-line interface that allows direct execution of WEKA
commands, for operating systems that do not provide their own command-line interface.

WEKA Explorer

Section Tabs :
At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first
started only the first tab is active; the others are grayed out. This is because it is necessary to open
(and potentially pre-process) a data set before starting to explore the data.
The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.

2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the
respective actions can be performed. The bottom area of the window (including the status box,
the log button, and the Weka bird) stays visible regardless of which section you are in. The
Explorer can be easily extended with custom tabs.

Demonstration of preprocessing on dataset student.arff

Aim: This experiment illustrates some of the basic data preprocessing operations that can be
performed using WEKA Explorer. The sample dataset used for this example is the student data
available in ARFF format.

Step 1: Loading the data. We can load the dataset into WEKA by clicking on the Open file button in the
preprocessing interface and selecting the appropriate file.

Step 2: Once the data is loaded, WEKA will recognize the attributes, and during the scan of the data
WEKA will compute some basic statistics on each attribute.

Step 3: Clicking on an attribute in the left panel will show the basic statistics on that attribute: for
the categorical attributes the frequency of each attribute value is shown, while for continuous
attributes we can obtain the min, max, mean, standard deviation, etc.

Step 4: The visualization in the bottom-right panel shows a cross-tabulation across two
attributes.

Step 5: Selecting or filtering attributes

Removing an attribute
- When we need to remove an attribute, we can do this by using the attribute filters in WEKA. In the
Filter panel, click on the Choose button. This will show a popup window with a list of available
filters. Scroll down the list and select the "weka.filters.unsupervised.attribute.Remove" filter.

Step 6:
a) Next click the textbox immediately to the right of the Choose button. In the resulting dialog box
enter the index of the attribute to be filtered out.
b) Make sure that the invertSelection option is set to false. Then click OK. Now in the filter box you will
see "Remove -R 7".
c) Click the Apply button to apply the filter to this data. This will remove the attribute and create a new
working relation.
d) Save the new working relation as an ARFF file by clicking the Save button on the
top (buttons) panel. (student.arff)
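The attribute removal described in Steps 5 and 6 can also be done outside the Explorer with WEKA's Java API. The sketch below is illustrative; the file name and the attribute index to drop are assumptions to be adapted to the dataset in use.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttributeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet();
        Remove remove = new Remove();
        remove.setAttributeIndices("3");   // 1-based index of the attribute to drop (illustrative)
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        System.out.println("Attributes before: " + data.numAttributes()
                + ", after: " + reduced.numAttributes());
    }
}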

Discretization
1) Sometimes association rule mining can only be performed on categorical data. This requires
performing discretization on numeric or continuous attributes. In the following example let us
discretize the age attribute.

• Let us divide the values of the age attribute into three bins (intervals).

• First load the dataset into WEKA (student.arff).

• Select the age attribute.

• Activate the filter dialog box and select "weka.filters.unsupervised.attribute.Discretize" from
the list.

• To change the defaults for the filter, click on the box immediately to the right of the Choose
button.

• We enter the index of the attribute to be discretized. In this case the attribute is age, so we
must enter '1', corresponding to the age attribute.

• Enter '3' as the number of bins. Leave the remaining field values as they are.

• Click the OK button.

• Click Apply in the filter panel. This will result in a new working relation with the selected
attribute partitioned into 3 bins.

• Save the new working relation in a file called student-data-discretized.arff


Dataset student.arff
@relation student
@attribute age {<30,30-40,>40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}
@data
%
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%

The following screenshot shows the effect of discretization.


Week 4 (b): Implementation of Apriori algorithm using supermarket data.

Solution:
Aim: To implement the Apriori algorithm to mine association rules using the WEKA tool.

Step 1: Load the file (in ARFF format) into the WEKA Explorer using the Open file tab.

Step 2: View the contents of the file using the “Edit” tab (shown in Figure 1).

Step 3: Select the Associate tab.

Step 4: Select Choose -> weka -> associations -> Apriori -> close.

Step 5: Select the Start button.

Step 6: Association rules are generated, which are shown in the “run information” area (shown in
Figure 2).

Step 7: In order to change the parameters for the run (for example support, confidence, etc.), we click
on the text box immediately to the right of the Choose button (shown in Figure 3).

Datafile :

@relation supermarket

@attribute bread{1,0}

@attribute cheese{1,0}

@attribute milk{1,0}

@attribute juice{1,0}

@attribute eggs{1,0}

@attribute yogurt{1,0}

@data

1,0,1,0,1,0

0,0,1,1,1,1
1,1,0,0,0,1

1,0,1,0,1,0

1,1,0,0,1,1

0,1,1,0,0,0

1,0,1,0,1,0

1,1,0,0,1,1

0,1,1,0,0,0

1,1,1,1,1,0
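The same run can be reproduced outside the Explorer with WEKA's Java API. This is a minimal sketch rather than part of the lab steps; the support and confidence values are illustrative, and it assumes the data file above has been saved as supermarket.arff.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("supermarket.arff").getDataSet();
        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.3);  // minimum support (illustrative value)
        apriori.setMinMetric(0.8);             // minimum confidence (illustrative value)
        apriori.setNumRules(10);               // number of rules to report
        apriori.buildAssociations(data);
        System.out.println(apriori);           // prints the best association rules found
    }
}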

Figure 1 :
Figure 2 :

Figure 3:
Week 5: Implementation of FP-Growth algorithm using supermarket data.

Aim: To implement the FP-Growth algorithm to mine association rules using the WEKA tool.

Step 1: Load the file (in ARFF format) into the WEKA Explorer using the Open file tab.
Step 2: View the contents of the file using the Edit tab (shown in Figure 1).

Step 3: Select the Associate tab.

Step 4: Select Choose -> weka -> associations -> FPGrowth -> close.

Step 5: Select the Start button.

Step 6: Association rules are generated, which are shown in the “run information” area (shown in
Figure 2).

Step 7: In order to change the parameters for the run (for example support, confidence, etc.), we click
on the text box immediately to the right of the Choose button (shown in Figure 3).

fpgrowth.arff

@relation supermarket

@attribute bread {yes,no}
@attribute butter {yes,no}
@attribute cheese {yes,no}
@attribute milk {yes,no}

@data
yes,yes,no,yes
no,yes,no,yes
no,yes,yes,no
yes,no,yes,yes
no,yes,yes,no
yes,yes,no,no
yes,yes,no,yes
no,yes,yes,no
yes,yes,no,yes
no,yes,yes,no
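As with Apriori, the run can be scripted. A minimal sketch is shown below; it assumes the data file above is saved as fpgrowth.arff, and the support and confidence values are illustrative. Note that WEKA's FPGrowth expects binary (two-valued) attributes, which the yes/no attributes above satisfy.

import weka.associations.FPGrowth;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FPGrowthDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("fpgrowth.arff").getDataSet();
        FPGrowth fp = new FPGrowth();
        fp.setLowerBoundMinSupport(0.3);  // minimum support (illustrative value)
        fp.setMinMetric(0.8);             // minimum confidence (illustrative value)
        fp.buildAssociations(data);
        System.out.println(fp);           // prints the rules found from the FP-tree
    }
}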

Figure 1 :

Figure 2 :
Figure 3 :
Week 6: Implement the following Tree based classification Algorithms on sample dataset:
a) ID3 b) C4.5
(a) Implementation of ID3

AIM: To implement the ID3 algorithm to generate a decision tree. The sample data used in this
experiment is the 'student' data available in ARFF format.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data into WEKA (Figure 1).

Step 2: Next we select the "Classify" tab and click the "Choose" button to select the "Id3" classifier.

Step 3: Now we specify various parameters. These can be specified by clicking in the text box to
the right of the Choose button.

Step 4: Under the "Test options" in the main panel we select "Use training set" as our
evaluation approach. Since we don't have a separate evaluation dataset, this is necessary to get a
reasonable idea of the accuracy of the generated model.

Step 5: We now click 'Start' to generate the model. The ASCII version of the tree as well as the
evaluation statistics will appear in the right panel when the model construction is
complete (Figure 2).

Step 6: We will use our model to classify new instances.

Dataset id3.arff

@relation id3

@attribute age{<30,30-40,>40}

@attribute income {low,medium,high}

@attribute student {yes,no}

@attribute credit_rating{fair,excellent}
@attribute buyspc {yes,no}

@data

<30,high,no,fair,no

<30,high,no,excellent,no

30-40,high,no,fair,yes

>40,medium,no,fair,yes

>40,low,yes,excellent,no

30-40,low,yes,excellent,yes

<30,medium,no,fair,no

<30,low,yes,fair,no

>40,medium,yes,fair,yes

<30,medium,yes,excellent,yes

30-40,medium,no,excellent,yes

30-40,high,yes,fair,yes

>40,medium,no,excellent,no

%
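The same experiment can be scripted with WEKA's Java API. Note that the Id3 classifier ships with older WEKA releases and, in recent versions, is provided by the optional simpleEducationalLearningSchemes package; the sketch below assumes it is available on the classpath and that the data above is saved as id3.arff.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.Id3;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Id3Demo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("id3.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);  // buyspc is the class attribute
        Id3 tree = new Id3();
        tree.buildClassifier(data);
        System.out.println(tree);                      // ASCII version of the decision tree

        Evaluation eval = new Evaluation(data);        // evaluate on the training set, as in Step 4
        eval.evaluateModel(tree, data);
        System.out.println(eval.toSummaryString());
    }
}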
Figure 1 :
Figure 2:
Week 6 b) Implementation of C4.5

AIM: This experiment illustrates the use of the C4.5 classifier in WEKA. The sample data used in
this experiment is the 'student' data available in ARFF format.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data into WEKA (Figure 1).

Step 2: Next we select the "Classify" tab and click the "Choose" button to select the "J48" classifier
(WEKA's implementation of C4.5).

Step 3: Now we specify various parameters. These can be specified by clicking in the text box to
the right of the Choose button.

Step 4: Under the "Test options" in the main panel we select "Use training set" as our evaluation
approach. Since we don't have a separate evaluation dataset, this is necessary to get a reasonable idea
of the accuracy of the generated model.

Step 5: We now click 'Start' to generate the model. The ASCII version of the tree as well as the
evaluation statistics will appear in the right panel when the model construction is
complete (Figure 2).

Step 6: We will use our model to classify new instances.

Dataset C4.5.arff

@relation j48

@attribute age{<30,30-40,>40}

@attribute income {low,medium,high}


@attribute student {yes,no}

@attribute credit_rating{fair,excellent}

@attribute buyspc {yes,no}

@data

<30,high,no,fair,no

<30,high,no,excellent,no

30-40,high,no,fair,yes

>40,medium,no,fair,yes

>40,low,yes,excellent,no

30-40,low,yes,excellent,yes

<30,medium,no,fair,no

<30,low,yes,fair,no

>40,medium,yes,fair,yes

<30,medium,yes,excellent,yes

30-40,medium,no,excellent,yes

30-40,high,yes,fair,yes

>40,medium,no,excellent,no

%
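In WEKA, C4.5 is provided as the J48 classifier (weka.classifiers.trees.J48). A minimal scripted equivalent of the steps above is sketched below, assuming the data above is saved as c45.arff (the file name is illustrative).

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("c45.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);  // buyspc is the class attribute
        J48 tree = new J48();
        tree.setUnpruned(false);                       // build a pruned tree (the default)
        tree.buildClassifier(data);
        System.out.println(tree);                      // ASCII version of the tree

        Evaluation eval = new Evaluation(data);        // evaluate on the training set, as in Step 4
        eval.evaluateModel(tree, data);
        System.out.println(eval.toSummaryString());
    }
}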
Figure 1:

Figure 2:
Week 7: Implement Naive Bayesian Classification Algorithm on sample dataset.

AIM: To implement the naive Bayesian algorithm to classify the given dataset using the WEKA tool.

Step 1: Load the file (in .arff format) into the WEKA Explorer using the Open file tab.

Step 2: View the contents of the file using the Edit tab (Figure 1).

Step 3: Select Classify -> Choose -> bayes -> NaiveBayes -> close.

Step 4: Select Test options -> Use training set.

Step 5: The output is shown in the "run information" area (Figure 2).

naive.arff

@relation loan_status

@attribute homeowner {yes,no}
@attribute marital_status {single,married,divorced}
@attribute annual_income numeric
@attribute loan {yes,no}

@data
yes,single,125000,no
no,married,100000,no
no,single,10000,no
yes,married,120000,no
no,divorced,950000,yes
no,married,60000,yes
yes,divorced,220000,no
no,single,85000,yes
no,married,75000,no
no,single,90000,yes
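The classifier run can also be scripted. A minimal sketch is given below, assuming the data above is saved as naive.arff.

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("naive.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);  // loan is the class attribute
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb);                        // per-attribute conditional probability estimates

        Evaluation eval = new Evaluation(data);        // evaluate on the training set, as in Step 4
        eval.evaluateModel(nb, data);
        System.out.println(eval.toSummaryString());
    }
}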

Figure1 :
Figure2 :
Week 8: Implement the following Clustering Algorithms on sample data set: a) K-Means
b) DBSCAN

(a) Implementation of k-means clustering algorithm using sample dataset

Aim: This experiment illustrates the use of simple k-means clustering with the Weka explorer. The
sample data set used for this example is based on the student data available in ARFF format. This
document assumes that appropriate preprocessing has been performed. This student dataset
includes 14 instances.

Steps involved in this Experiment

Step 1: Run the Weka explorer and load the data file student.arff in the preprocessing interface.
Step 2: In order to perform clustering, select the 'Cluster' tab in the explorer and click on the Choose
button. This step results in a dropdown list of available clustering algorithms.
Step 3: In this case we select 'SimpleKMeans'.
Step 4: Next click the text box to the right of the Choose button to get the popup window shown in
the screenshots. In this window we set the number of clusters to six and leave the value of the seed
as it is. The seed value is used in generating a random number, which in turn is used for making the
initial assignment of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm. In the 'Cluster mode'
panel we make sure that the 'Use training set' option is selected, and then we click the 'Start' button.
This process and the resulting window are shown in the following screenshots (Figure 2).
Step 6: The result window shows the centroid of each cluster as well as statistics on the number
and the percentage of instances assigned to the different clusters. Here each cluster centroid is the
mean vector of that cluster; these centroids can be used to characterize the clusters.
Step 7: Another way of understanding the characteristics of each cluster is through visualization. To
do this, right-click the result set in the result list panel and select 'Visualize cluster assignments'
(Figure 3).
Dataset student .arff
@relation student
@attribute age {<30,30-40,>40}
@attribute income {low,medium,high}
@attribute student {yes,no}
@attribute credit-rating {fair,excellent}
@attribute buyspc {yes,no}
@data
%
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%
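The clustering can also be run from WEKA's Java API. The sketch below is illustrative and assumes the student data above is saved as student.arff; the number of clusters mirrors the value entered in the Explorer steps, and the seed is the default.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet();
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(6);   // number of clusters, as entered in the Explorer steps above
        kmeans.setSeed(10);         // seed for the random initial assignment (default value)
        kmeans.buildClusterer(data);
        System.out.println(kmeans); // cluster centroids and sizes

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}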

The following screenshots show the clusters that were generated when the simple k-means
algorithm was applied to the given dataset.
Figure 1 :
Figure 2:

Figure 3:
(b) Implementation of DBSCAN clustering algorithm using sample dataset

Aim: This experiment illustrates the use of DBSCAN clustering with the Weka explorer. The sample
data set used for this example is based on the student data available in ARFF format. This
document assumes that appropriate preprocessing has been performed. This student dataset
includes 14 instances.

Steps involved in this Experiment

Step 1: Run the Weka explorer and load the data file student.arff in the preprocessing interface.
Step 2: In order to perform clustering, select the 'Cluster' tab in the explorer and click on the Choose
button. This step results in a dropdown list of available clustering algorithms.
Step 3: In this case we select 'DBSCAN'.
Step 4: Next click the text box to the right of the Choose button to get the popup window shown in
the screenshots. In this window we set the DBSCAN parameters (epsilon, the neighbourhood radius,
and minPoints, the minimum number of points per dense region) and leave the remaining values as
they are.
Step 5: Once the options have been specified, we run the clustering algorithm. In the 'Cluster mode'
panel we make sure that the 'Use training set' option is selected, and then we click the 'Start' button.
Step 6: The result window shows statistics on the number and the percentage of instances assigned
to the different clusters; instances in sparse regions may be reported as noise (unclustered).
Step 7: Another way of understanding the characteristics of each cluster is through visualization. To
do this, right-click the result set in the result list panel and select 'Visualize cluster assignments'.
Dataset student .arff
@relation student
@attribute age {<30,30-40,>40}
@attribute income {low,medium,high}
@attribute student {yes,no}
@attribute credit-rating {fair,excellent}
@attribute buyspc {yes,no}
@data
%
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%
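DBSCAN is not part of the core WEKA distribution in recent releases; it is available through the optional optics_dbScan package (in older 3.6 releases the class was weka.clusterers.DBScan). The sketch below is therefore only illustrative: the import, epsilon and minPoints values are assumptions to be adapted to the installed version, and it assumes the student data above is saved as student.arff.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.DBSCAN;  // class/package name depends on the installed WEKA version (assumption)
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DBSCANDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("student.arff").getDataSet();
        DBSCAN dbscan = new DBSCAN();
        dbscan.setEpsilon(0.9);    // neighbourhood radius (illustrative value)
        dbscan.setMinPoints(2);    // minimum points per dense region (illustrative value)
        dbscan.buildClusterer(data);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(dbscan);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString()); // cluster assignments and unclustered (noise) instances
    }
}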
