(AUTONOMOUS)
B.Tech : VI SEMESTER L T P C
0 0 3 1.5
Pre-Requisites:
A course on “Computer Programming and Data Structures”
A course on “Object Oriented Programming Through Java”
Course Objectives:
This practical course is designed to enable students to:
• Design a data warehouse and implement OLAP operations.
• Gain exposure to applications of data warehousing.
• Perform data mining functionalities such as association rule mining, classification and clustering.
• Gain hands-on experience in developing a software project by applying software engineering principles and methods in each of the phases of software development.
Syllabus Content
Part A
Week 6: Implement the following Tree based classification Algorithms on sample dataset:
a) ID3
b) C4.5
Week 9-15:
LIST OF EXPERIMENTS
Do the following 7 exercises for any two projects given in the list of sample projects
or any other projects:
1. Development of problem statement.
2. Preparation of Software Requirement Specification Document, Design Documents
and Testing Phase related documents.
3. Preparation of Software Configuration Management and Risk Management related
documents.
4. Study and usage of any Design phase CASE tool
5. Performing the Design by using any Design phase CASE tools.
6. Develop test cases for unit testing and integration testing
7. Develop test cases for various white box and black box testing
Sample Projects:
1. Passport automation System
2. Book Bank
3. Online Exam Registration
4. Stock Maintenance System
5. Online course reservation system
6. E-ticketing
7. Software Personnel Management System
8. Credit Card Processing
9. E-book management System.
10. Recruitment system
Course Outcomes:
TEXT BOOKS:
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publishers / Elsevier, 2006.
2. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, Pearson Education.
3. Roger S. Pressman, Software Engineering: A Practitioner's Approach, 6th Edition, McGraw-Hill International Edition.
4. Ian Sommerville, Software Engineering, 7th Edition, Pearson Education.
5. Grady Booch, James Rumbaugh and Ivar Jacobson, The Unified Modeling Language User Guide, Pearson Education.
6. Ilene Burnstein, Practical Software Testing, Springer International Edition, 2003.
REFERENCE BOOKS:
Week 1: Design a data warehouse for Auto Sales Analysis.
Solution:
The steps to design a data warehouse for auto sales analysis are as follows:
Step 1: Open SQL Server Management Studio 2008 and click on Connect.
Step 2: Now a connection is established with the Database Engine.
Step 3: Right-click on Databases and select New Database (to create our own database).
Step 4: Type any database name (which we want to create) and select OK.
Step 5: Our own database is created. Right-click on the database which was created and select New Query.
Step 6: Type the following SQL script in the New Query editor.
USE DemoDW
GO
Step 8 : Select the tables which we want to add to the data warehouse design and select Add
and then close.
Step 9: The following Auto Sales Analysis Data warehouse design will be obtained.
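The schema assembled in the designer above can also be sketched in code. The following Python snippet uses the built-in sqlite3 module to build a minimal star schema (one fact table, two dimensions); the table and column names mirror the FactProductSales / DimProduct / DimDate naming used later in this manual, and all data values are purely illustrative:

```python
import sqlite3

# In-memory database standing in for the warehouse (illustrative names and data).
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables: descriptive attributes the cube will slice on.
cur.execute("""CREATE TABLE DimProduct (
    ProductKey INTEGER PRIMARY KEY,
    ProductName TEXT)""")
cur.execute("""CREATE TABLE DimDate (
    DateKey INTEGER PRIMARY KEY,
    FullDate TEXT, Year INTEGER, Quarter INTEGER, Month INTEGER)""")

# Fact table: foreign keys into each dimension plus numeric measures.
cur.execute("""CREATE TABLE FactProductSales (
    SalesKey INTEGER PRIMARY KEY,
    ProductKey INTEGER REFERENCES DimProduct(ProductKey),
    DateKey INTEGER REFERENCES DimDate(DateKey),
    Quantity INTEGER, SalesAmount REAL)""")

cur.execute("INSERT INTO DimProduct VALUES (1, 'Sedan')")
cur.execute("INSERT INTO DimDate VALUES (20240105, '2024-01-05', 2024, 1, 1)")
cur.execute("INSERT INTO FactProductSales VALUES (1, 1, 20240105, 2, 50000.0)")

# A typical warehouse query: total sales per product.
row = cur.execute("""
    SELECT p.ProductName, SUM(f.SalesAmount)
    FROM FactProductSales f JOIN DimProduct p ON f.ProductKey = p.ProductKey
    GROUP BY p.ProductName""").fetchone()
print(row)  # ('Sedan', 50000.0)
```

The GROUP BY query at the end is the pattern every warehouse report follows: join the fact table to a dimension and aggregate a measure.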
Week 2: Perform OLAP operations
Solution:
For the creation of an OLAP cube in the Microsoft BIDS environment, follow the ten steps given below.
Step 1: Start SQL Server Business Intelligence Development Studio
Click on Start Menu -> Microsoft SQL Server 2008 R2 -> Click SQL Server Business Intelligence Development Studio.
Step 2: Start Analysis Services Project
Click File -> New -> Project ->Business Intelligence Projects ->select Analysis Services
Project-> Assign Project Name -> Click OK
Step 3: Creating New Data Source
3.1 In Solution Explorer, Right click on Data Source -> Click New Data Source
3.2 Click on Next
3.3 Click on New Button
3.4 Creating New connection
1. Specify Your SQL Server Name where your Data Warehouse was created
2. Select Radio Button according to your SQL Server Authentication mode
3. Specify your Credentials using which you can connect to your SQL Server
4. Select database Sales_DW.
5. Click on Test Connection and verify for its success
6. Click OK.
3.5 Select Connection created in Data Connections-> Click Next
Step 4: Creating New Data Source View
4.1 In the Solution Explorer, Right Click on Data Source View -> Click on New Data Source View
4.2 Click Next
4.3 Select Relational Data Source we have created previously (Sales_DW)-> Click Next
4.4 First move your Fact Table to the right side to include in object list.
Select FactProductSales Table -> Click on Arrow Button to move the selected object to Right
Pane.
4.5 Now to add dimensions which are related to your Fact Table, follow the given steps:Select
Fact Table in Right Pane (Fact product Sales) -> Click On Add Related Tables
4.6 It will add all associated dimensions to your Fact table as per relationship specified in your
SQL DW (Sales_DW).
Click Next.
4.7 Assign Name (SalesDW DSV)-> Click Finish
Step 5: Creating New Cube
5.1 In Solution Explorer -> Right Click on Cube -> Click New Cube
5.2 Click Next
5.5 Choose Measures from the List which you want to place in your Cube --> Click Next
5.6 Select All Dimensions here which are associated with your Fact Table-> Click Next
Step 6: Creating Attributes in the Product Dimension
In Solution Explorer, double click on the dimension Dim Product -> Drag and drop Product Name from the table in the Data Source View and add it to the Attribute pane at the left side.
Step 7: Creating Attribute Hierarchy In Date Dimension
Double click On Dim Date dimension -> Drag and Drop Fields from Table shown in Data
Source View to Attributes-> Drag and Drop attributes from leftmost pane of attributes to middle
pane of Hierarchy.
Drag fields in sequence from Attributes to the Hierarchy window (Year, Quarter Name, Month Name, Week of the Month, Full Date UK).
Step 8: Deploy the Cube
8.1 In Solution Explorer, right click on Project Name (SalesDataAnalysis) -- > Click Properties
8.2 Set Deployment Properties First
In Configuration Properties, Select Deployment-> Assign Your SQL Server Instance Name
Where Analysis Services Is Installed (mubin-pc\fairy) (Machine Name\Instance Name) ->
Choose Deployment Mode Deploy All as of now ->Select Processing Option Do Not Process ->
Click OK
8.3 In Solution Explorer, right click on Project Name (SalesDataAnalysis) -- > Click Deploy
8.4 Once deployment finishes, you can see the message Deployment Completed in the deployment properties.
Step 9: Process the Cube
9.1 In Solution Explorer, right click on Project Name (SalesDataAnalysis) -- > Click Process
9.2 Click on Run button to process the Cube
9.3 Once processing is complete, you can see Status as Process Succeeded -->Click Close to
close both the open windows for processing one after the other.
Step 10: Browse the Cube for Analysis
10.1 In Solution Explorer, right click on Cube Name (SalesDataAnalysisCube) -- > Click
Browse
10.2 Drag and drop measures in to Detail fields, & Drag and Drop Dimension Attributes in Row
Field or Column fields.
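Browsing the cube amounts to applying the classic OLAP operations: slice (fix one dimension), dice (restrict several dimensions), roll-up (aggregate to a coarser level) and drill-down (the inverse of roll-up). A minimal Python sketch over a toy set of cube cells, with illustrative names and numbers not taken from Sales_DW, makes the operations concrete:

```python
# Tiny in-memory "cube": each record is a cell at (year, quarter, product).
cells = [
    {"year": 2023, "quarter": 1, "product": "Sedan", "sales": 120},
    {"year": 2023, "quarter": 2, "product": "Sedan", "sales": 150},
    {"year": 2023, "quarter": 1, "product": "SUV",   "sales": 200},
    {"year": 2024, "quarter": 1, "product": "Sedan", "sales": 180},
]

def slice_(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    return [c for c in cells if c[dim] == value]

def dice(cells, **ranges):
    """Dice: keep cells whose dimensions fall in the given value sets."""
    return [c for c in cells if all(c[d] in vals for d, vals in ranges.items())]

def rollup(cells, dim):
    """Roll-up: aggregate the sales measure upward, keeping only `dim`."""
    totals = {}
    for c in cells:
        totals[c[dim]] = totals.get(c[dim], 0) + c["sales"]
    return totals

print(slice_(cells, "year", 2024))                   # one 2024 cell
print(dice(cells, year={2023}, product={"Sedan"}))   # two cells
print(rollup(cells, "product"))                      # {'Sedan': 450, 'SUV': 200}
```

Drill-down is simply the reverse direction: starting from the rolled-up totals and returning to the finer-grained cells.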
The first four buttons at the top of the Preprocess section (Figure 1) enable you to load data into WEKA:
1. Open file: Brings up a dialog box allowing you to browse for the data file on the
local file system.
2. Open URL: Asks for a Uniform Resource Locator address for where the data is
stored.
Figure1
Loading data from the local file system (using the "Open file" option):
Click on the Open file... button (Figure 1). A directory browser window will open as shown in the following screen.
Now navigate to the folder where your data files are stored. The WEKA installation offers many sample datasets for you to experiment with; these are available in the data directory of the WEKA installation. For training, select any data file in this folder. The content of the file will be loaded into the WEKA environment.
Loading data from the web (using the "Open URL" option):
Once you click on the Open URL... button (Figure 2), you can see a window like the following.
We will open the file from a public URL. Type the URL in the pop-up box. We can specify any other URL where your data is stored; the Explorer will load the data from the remote site into its environment.
Data can also be read from an SQL database using JDBC. Click on ‘Open DB…’ button,
‘GenericObjectEditor’ appears on the screen.
To read data from a database, click on ‘Open’ button and select the database from a filesystem.
Data is rarely clean, and you can often have corrupt or missing values. It is important to identify, mark and handle missing data in order to get the best performance.
To demonstrate the methods above, the "Pima Indians onset of diabetes" dataset is used, which can be accessed using the following link:
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv
Description of dataset:
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-
diabetes.names
The Pima Indians dataset is a good basis for exploring missing data.
Some attributes such as blood pressure (pres) and Body Mass Index (mass) have values of zero,
which are impossible. These are examples of corrupt or missing data that must be marked
manually.
We can mark missing values in Weka using the NumericCleaner filter. The steps below show how to use this filter to mark the 11 missing values on the Body Mass Index (mass) attribute.
3. Click the "Choose" button for the Filter and select NumericCleaner; it is under unsupervised.attribute.NumericCleaner.
Weka Select NumericCleaner Data Filter
6. Set minThreshold to 0.1E-8 (close to zero), which is the minimum value allowed for the
attribute.
7. Set minDefault to NaN, which is unknown and will replace values below the threshold.
Click "mass" in the "attributes" pane and review the details of the "selected attribute". Notice that the 11 attribute values that were formerly set to 0 are now marked as Missing.
Weka Missing Data Marked
We could just as easily mark them with a specific numerical value. We could also mark as missing any values between an upper and lower bound.
ii. Remove Missing Data
Now that we know how to mark missing values in our data, we need to learn how to handle them.
A simple way to handle missing data is to remove those instances that have one or more missing
values.
1. Click the "Choose" button for the Filter and select RemoveWithValues; it is under unsupervised.instance.RemoveWithValues.
5. Click the “OK” button to use the configuration for the filter.
Click “mass” in the “attributes” section and review the details of the “selected attribute”.
Notice that the 11 attribute values that were marked Missing have been removed from the
dataset.
It is common to impute missing values with the mean of the numerical distribution. You
can do this easily in Weka using the ReplaceMissingValues filter.
1. Click the "Choose" button for the Filter and select ReplaceMissingValues; it is under unsupervised.attribute.ReplaceMissingValues.
Click “mass” in the “attributes” section and review the details of the “selected attribute”.
Notice that the 11 attribute values that were marked Missing have been set to the mean value of
the distribution.
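The filters used above (mark, then impute) are easy to mimic outside Weka. This Python sketch marks near-zero values as missing and imputes them with the column mean; the numbers are illustrative, not the actual Pima Indians values:

```python
# Mirror Weka's NumericCleaner + ReplaceMissingValues filters in plain Python.
mass = [33.6, 26.6, 0.0, 28.1, 0.0, 43.1]  # illustrative BMI column with zeros

# Step 1: mark values below a small threshold as missing.
marked = [v if v > 1e-8 else None for v in mass]

# Step 2: impute each missing value with the mean of the observed values.
observed = [v for v in marked if v is not None]
mean = sum(observed) / len(observed)
imputed = [v if v is not None else round(mean, 2) for v in marked]

print(marked)   # [33.6, 26.6, None, 28.1, None, 43.1]
print(imputed)  # [33.6, 26.6, 32.85, 28.1, 32.85, 43.1]
```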
ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.
The Header of the ARFF file contains the name of the relation, a list of the attributes
(the columns in the data), and their types. An example header on the standard IRIS
dataset looks like this:
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

The Data section of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
The relation declaration takes the form:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes spaces.
Attribute declarations take the form:
@attribute <attribute-name> <datatype>
Where the <attribute-name> must start with an alphabetic character. If spaces are to be included
in the name then the entire name must be quoted.
The <datatype> can be any of the four types currently supported by Weka:
• numeric
• <nominal-specification>
• string
• date [<date-format>]
where <name> is the name for the attribute and <date-format> is an optional string
specifying how date values should be parsed and printed (this is the same format used
by SimpleDateFormat). The default format string accepts the ISO-8601 combined date
and time format: "yyyy-MM-dd'T'HH:mm:ss".
Dates must be specified in the data section as the corresponding string representations
of the date/time.
The ARFF Data section of the file contains the data declaration line and the actual
instance lines.
@data
Missing values are represented by a single question mark, as in:
@data
4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any that contain space
must be quoted, as follows:
@relation LCCvsLCSH
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'
Dates must be specified in the data section using the string representation specified in
the attribute declaration. For example:
@RELATION Timestamps
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"
Sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly represented.
Sparse ARFF files have the same header (i.e @relation and @attribute tags) but the
data section is different. Instead of representing each value in order, like this:
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
The non-zero attributes are explicitly identified by attribute number and their value
stated, like this:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
Each instance is surrounded by curly braces, and the format for each entry is: <index>
<space> <value> where index is the attribute index (starting from 0).
Note that the omitted values in a sparse instance are 0; they are not "missing" values! If
a value is unknown, you must explicitly represent it with a question mark (?).
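The sparse format is mechanical to convert back to the dense one. A small Python sketch (the function name is mine, not part of any Weka API) expands a sparse instance line into a dense value list, filling omitted indices with 0:

```python
def sparse_to_dense(line, n_attributes, default="0"):
    """Expand one sparse ARFF instance like '{1 X, 3 Y, 4 "class A"}'
    into a dense value list. Omitted indices become 0, never '?'."""
    dense = [default] * n_attributes
    for entry in line.strip().strip("{}").split(","):
        idx, value = entry.strip().split(" ", 1)
        dense[int(idx)] = value.strip().strip('"')
    return dense

print(sparse_to_dense('{1 X, 3 Y, 4 "class A"}', 5))  # ['0', 'X', '0', 'Y', 'class A']
print(sparse_to_dense('{2 W, 4 "class B"}', 5))       # ['0', '0', 'W', '0', 'class B']
```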
WEKA is open-source Java software created by researchers at the University of Waikato in New Zealand. It provides many different machine learning algorithms for classification, clustering, association rule mining and attribute selection.
WEKA Explorer
Section Tabs :
At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first
started only the first tab is active; the others are grayed out. This is because it is necessary to open
(and potentially pre-process) a data set before starting to explore the data.
The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the
respective actions can be performed. The bottom area of the window (including the status box,
the log button, and the Weka bird) stays visible regardless of which section you are in. The
Explorer can be easily extended with custom tabs.
Aim: This experiment illustrates some of the basic data preprocessing operations that can be performed using the WEKA Explorer. The sample dataset used for this example is the student data, available in ARFF format.
Step 1: Loading the data. We can load the dataset into WEKA by clicking on the Open file button in the preprocessing interface and selecting the appropriate file.
Step 2: Once the data is loaded, WEKA will recognize the attributes, and during the scan of the data WEKA will compute some basic statistics on each attribute.
Step 3: Clicking on an attribute in the left panel will show the basic statistics on that attribute. For categorical attributes, the frequency of each attribute value is shown; for continuous attributes, we can obtain the min, max, mean, standard deviation, etc.
Step 4: The visualization in the right-hand panel shows a cross-tabulation across two attributes.
Step 6:
a) Next, click the text box immediately to the right of the Choose button. In the resulting dialog box, enter the index of the attribute to be filtered out.
b) Make sure that the invertSelection option is set to false. Then click OK in the filter box; you will see "Remove-R-7".
c) Click the Apply button to apply the filter to this data. This will remove the attribute and create a new working relation.
d) Save the new working relation as an ARFF file by clicking the Save button on the top panel. (student.arff)
Discretization
1) Sometimes association rule mining can only be performed on categorical data. This requires performing discretization on numeric or continuous attributes. In the following example, let us discretize the age attribute.
• To change the defaults for the filters, click on the box immediately to the right of the Choose button.
• We enter the index of the attribute to be discretized. In this case the attribute is age, so we must enter '1', corresponding to the age attribute.
• Enter '3' as the number of bins. Leave the remaining field values as they are.
• Click the OK button.
• Click Apply in the filter panel. This will result in a new working relation with the selected attribute partitioned into 3 bins.
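The unsupervised Discretize filter used above performs equal-width binning by default. A minimal Python sketch of the same idea, on illustrative age values:

```python
def discretize(values, bins=3):
    """Equal-width binning: split [min, max] into `bins` intervals of equal
    width and label each value with its bin index (0-based)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        b = min(int((v - lo) / width), bins - 1)  # clamp the max value into the last bin
        labels.append(b)
    return labels

ages = [21, 25, 30, 35, 42, 58, 63]
print(discretize(ages, bins=3))  # [0, 0, 0, 1, 1, 2, 2]
```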
Solution:
Aim: To implement Apriori algorithm to mine association rules using weka tool.
Step 1: Load the file (in ARFF format) into the WEKA Explorer using the Open file tab.
Step 2: View the contents of the file using the Edit tab (shown in Figure 1).
Step 5: Association rules are generated, which are shown in the "run information" area (shown in Figure 2).
Step 6: In order to change the parameters for the run (e.g. support, confidence, etc.), we click on the text box immediately to the right of the Choose button (shown in Figure 3).
Datafile :
@relation supermarket
@attribute bread{1,0}
@attribute cheese{1,0}
@attribute milk{1,0}
@attribute juice{1,0}
@attribute eggs{1,0}
@attribute yogurt{1,0}
@data
1,0,1,0,1,0
0,0,1,1,1,1
1,1,0,0,0,1
1,0,1,0,1,0
1,1,0,0,1,1
0,1,1,0,0,0
1,0,1,0,1,0
1,1,0,0,1,1
0,1,1,0,0,0
1,1,1,1,1,0
Figure 1 :
Figure 2 :
Figure 3:
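The Apriori procedure that Weka runs above can be sketched in plain Python on the same ten supermarket transactions. This is a minimal level-wise implementation for frequent itemsets only (rule generation is omitted), not Weka's code:

```python
from itertools import combinations

# The ten transactions from the supermarket ARFF file above, as item sets.
items = ["bread", "cheese", "milk", "juice", "eggs", "yogurt"]
rows = [
    "101010", "001111", "110001", "101010", "110011",
    "011000", "101010", "110011", "011000", "111110",
]
transactions = [{i for i, bit in zip(items, r) if bit == "1"} for r in rows]

def apriori(transactions, min_support=0.5):
    """Return all itemsets with support >= min_support, found level by level."""
    n = len(transactions)
    frequent, k = {}, 1
    current = [{i} for i in items]            # level-1 candidates
    while current:
        counted = {}
        for cand in current:
            s = sum(1 for t in transactions if cand <= t) / n
            if s >= min_support:
                counted[frozenset(cand)] = s
        frequent.update(counted)
        # Candidate generation: union pairs of frequent k-itemsets into (k+1)-itemsets.
        keys = list(counted)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

freq = apriori(transactions, min_support=0.5)
print(sorted((sorted(s), round(sup, 1)) for s, sup in freq.items()))
```

With min_support = 0.5 the run finds four frequent single items (bread, cheese, milk, eggs) and two frequent pairs: {bread, eggs} at support 0.6 and {milk, eggs} at support 0.5.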
Week 5: Implementation of FP Growth algorithm using supermarket data.
Aim: To implement FPGrowth algorithm to mine association rules using weka tool.
Step 1: Load the file (in ARFF format) into the WEKA Explorer using the Open file tab.
Step 2: View the contents of the file using the Edit tab (shown in Figure 1).
Step 5: Association rules are generated, which are shown in the "run information" area (shown in Figure 2).
Step 6: In order to change the parameters for the run (e.g. support, confidence, etc.), we click on the text box immediately to the right of the Choose button (shown in Figure 3).
Fpgrowth.arff
@relation supermarket
@data
Figure 1 :
Figure 2 :
Figure 3 :
Week 6: Implement the following Tree based classification Algorithms on sample dataset:
a) ID3 b) C4.5
(a) Implementation of id3
AIM: To implement the ID3 algorithm to generate a decision tree. The sample data used in this experiment is the 'student' data, available in ARFF format.
Step 1: We begin the experiment by loading the data into WEKA (Figure 1).
Step 2: Next we select the "Classify" tab and click the "Choose" button to select the "Id3" classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button.
Step 4: Under the "Test options" in the main panel, we select "Use training set" as our evaluation approach. Since we do not have a separate evaluation dataset, this is necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: We now click Start to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the right panel when the model construction is complete (Figure 2).
Dataset id3.arff
@relation id3
@attribute age{<30,30-40,>40}
@attribute income{high,medium,low}
@attribute student{yes,no}
@attribute credit_rating{fair,excellent}
@attribute buys_pc{yes,no}
@data
<30,high,no,fair,no
<30,high,no,excellent,no
30-40,high,no,fair,yes
>40,medium,no,fair,yes
>40,low,yes,excellent,no
30-40,low,yes,excellent,yes
<30,medium,no,fair,no
<30,low,yes,fair,no
>40,medium,yes,fair,yes
<30,medium,yes,excellent,yes
30-40,medium,no,excellent,yes
30-40,high,yes,fair,yes
>40,medium,no,excellent,no
%
Figure 1 :
Figure 2:
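ID3 chooses its splits by information gain. The sketch below recomputes the gain for each attribute of the id3.arff data in plain Python and confirms that age is the best root split; it is an illustration of the criterion, not Weka's Id3 code:

```python
from math import log2

# The thirteen instances from id3.arff: (age, income, student, credit_rating, buys_pc).
rows = [
    ("<30","high","no","fair","no"), ("<30","high","no","excellent","no"),
    ("30-40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","excellent","no"), ("30-40","low","yes","excellent","yes"),
    ("<30","medium","no","fair","no"), ("<30","low","yes","fair","no"),
    (">40","medium","yes","fair","yes"), ("<30","medium","yes","excellent","yes"),
    ("30-40","medium","no","excellent","yes"), ("30-40","high","yes","fair","yes"),
    (">40","medium","no","excellent","no"),
]

def entropy(labels):
    """Shannon entropy of the class label distribution."""
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(rows, attr_index):
    """ID3's criterion: entropy reduction from splitting on one attribute."""
    base = entropy([r[-1] for r in rows])
    splits = {}
    for r in rows:
        splits.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in splits.values())
    return base - remainder

# ID3 picks the attribute with the highest gain as the root of the tree.
gains = {name: info_gain(rows, i)
         for i, name in enumerate(["age", "income", "student", "credit_rating"])}
print(max(gains, key=gains.get))  # age
```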
Week 6 b) Implementation of C4.5.
AIM: This experiment illustrates the use of the C4.5 classifier (implemented in WEKA as J48). The sample data used in this experiment is the 'student' data, available in ARFF format.
Step 2: Next we select the "Classify" tab and click the "Choose" button to select the "J48" classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button.
Step 4: Under the "Test options" in the main panel, we select "Use training set" as our evaluation approach. Since we do not have a separate evaluation dataset, this is necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: We now click Start to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the right panel when the model construction is complete (Figure 2).
Dataset C4.5.arff
@relation j48
@attribute age{<30,30-40,>40}
@attribute income{high,medium,low}
@attribute student{yes,no}
@attribute credit_rating{fair,excellent}
@attribute buys_pc{yes,no}
@data
<30,high,no,fair,no
<30,high,no,excellent,no
30-40,high,no,fair,yes
>40,medium,no,fair,yes
>40,low,yes,excellent,no
30-40,low,yes,excellent,yes
<30,medium,no,fair,no
<30,low,yes,fair,no
>40,medium,yes,fair,yes
<30,medium,yes,excellent,yes
30-40,medium,no,excellent,yes
30-40,high,yes,fair,yes
>40,medium,no,excellent,no
%
Figure 1:
Figure 2:
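What distinguishes C4.5 from ID3 is its splitting criterion: gain ratio, which divides information gain by the split information (the entropy of the branch sizes) to penalise many-valued attributes. A minimal sketch, where the gain value 0.41 for age is an assumed input for illustration rather than recomputed here:

```python
from math import log2

def split_info(branch_sizes):
    """C4.5's normalizer: entropy of the split itself (how many branches, how even)."""
    total = sum(branch_sizes)
    return -sum(n / total * log2(n / total) for n in branch_sizes)

# Splitting the 13 instances on 'age' gives branches of size 5, 4 and 4.
gain_age = 0.41                 # assumed information gain for 'age' (illustrative)
si = split_info([5, 4, 4])
print(round(gain_age / si, 3))  # the gain ratio C4.5 would compare across attributes
```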
Week 7: Implement the Naive Bayesian classification algorithm on a sample dataset.
AIM: To implement the naive Bayesian algorithm to classify the given dataset using the WEKA tool.
Step 1: Load the file (in .arff format) into the WEKA Explorer using the Open file tab.
Naive.arff
@relation loan_status
@attribute homeowner{yes,no}
@attribute marital_status{single,married,divorced}
@attribute annual_income numeric
@attribute loan{yes,no}
@data
yes,single,125000,no
no,married,100000,no
no,single,10000,no
yes,married,120000,no
no,divorced,950000,yes
no,married,60000,yes
yes,divorced,220000,no
no,single,85000,yes
no,married,75000,no
no,single,90000,yes
Figure1 :
Figure2 :
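The classifier's arithmetic is simple enough to reproduce by hand. The sketch below runs categorical Naive Bayes on the ten loan_status rows using only the two nominal attributes; it ignores the numeric income column and uses no smoothing, so it is an illustration of Bayes' rule, not Weka's NaiveBayes implementation:

```python
# (homeowner, marital_status, loan) triples from the Naive.arff data above.
rows = [
    ("yes","single","no"), ("no","married","no"), ("no","single","no"),
    ("yes","married","no"), ("no","divorced","yes"), ("no","married","yes"),
    ("yes","divorced","no"), ("no","single","yes"), ("no","married","no"),
    ("no","single","yes"),
]

def predict(homeowner, marital):
    """Pick the class maximizing P(class) * P(homeowner|class) * P(marital|class)."""
    scores = {}
    for cls in ("yes", "no"):
        subset = [r for r in rows if r[-1] == cls]
        prior = len(subset) / len(rows)
        p_home = sum(r[0] == homeowner for r in subset) / len(subset)
        p_mar = sum(r[1] == marital for r in subset) / len(subset)
        scores[cls] = prior * p_home * p_mar
    return max(scores, key=scores.get)

print(predict("yes", "married"))  # no
print(predict("no", "divorced"))  # yes
```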
Week 9: Implement the following Clustering Algorithms on sample data set: a) K-Means
b) DBSCAN
Aim: This experiment illustrates the use of simple k-means clustering with Weka explorer. The
sample data set used for this example is based on the student data available in ARFF format. This
document assumes that appropriate preprocessing has been performed. This student dataset
includes 14 instances.
The following screenshot shows the clustering output generated when the simple k-means algorithm is applied to the given dataset.
Figure 1 :
Figure 2:
Figure 3:
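The algorithm Weka runs here alternates two steps: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. A compact Python sketch on illustrative 2-D points (not the student dataset):

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its cluster; repeat for a fixed number of iterations."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)          # initial centroids: k random points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups of 2-D points.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
cents, cls = kmeans(pts, 2)
print(sorted(len(c) for c in cls))  # [3, 3]
```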
(b)Implementation of DBSCAN clustering algorithm using sample dataset
Aim: This experiment illustrates the use of DBSCAN clustering with Weka explorer. The sample
data set used for this example is based on the student data available in ARFF format. This
document assumes that appropriate preprocessing has been performed. This student dataset
includes 14 instances.
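DBSCAN groups points by density rather than by distance to a centroid: core points (those with at least min_pts neighbours within eps) seed clusters that grow through other core points, while isolated points are labelled noise. A minimal Python sketch on illustrative 2-D points (not the student dataset):

```python
def dbscan(points, eps=1.5, min_pts=3):
    """Label each point with a cluster id, or -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # noise (may be re-labelled as a border point later)
            continue
        labels[i] = cid               # i is a core point: start a new cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid       # border point: joins the cluster, does not expand it
            if labels[j] is not None:
                continue
            labels[j] = cid
            nj = neighbors(j)
            if len(nj) >= min_pts:    # j is also a core point: expand further
                queue.extend(nj)
        cid += 1
    return labels

# Two dense groups plus one far-away outlier.
pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (20, 20)]
print(dbscan(pts))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```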