
L. D.

College of Engineering
Opp Gujarat University, Navrangpura, Ahmedabad - 380015

LAB MANUAL
Branch: Computer Engineering

Data Mining (3160714)


Semester: VI

Faculty Details:
1) Prof. (Dr.) V. B. Vaghela

2) Prof. H. A. Joshiyara
Sr. No.  CO No.  AIM                                                                    Date  Page  Marks  Sign

1        CO3     Introduction to various Data Mining Tools - WEKA, DTREG, DB Miner.
                 Compare them in terms of their special features, functionality and
                 limitations.
2        CO1     Demonstration of preprocessing on datasets like student.arff and
                 labor.arff using WEKA.
3        CO2     Demonstration of Association rule process on any data set using the
                 Apriori algorithm in WEKA.
4        CO2     Demonstration of classification rule process on a dataset using the
                 J48 algorithm in WEKA.
5        CO2     Demonstration of classification rule process on a dataset using the
                 ID3 algorithm in WEKA.
6        CO5     Implementation of any one classifier (Bayesian classifier) using JAVA
                 and verification of the result with WEKA.
7        CO5     Implementation of any one clustering algorithm (K-means) using JAVA
                 and verification of the result with WEKA.
8        CO2     Demonstration of Decision Tree classification using WEKA.
9        CO4     Case Study: Literature review of a research paper on Data Mining and
                 Web Mining which must include any of the special techniques as per
                 your syllabus and advanced topics/improved techniques.
10       CO1     To perform hands-on experiments of data preprocessing with sample
                 data sets on RapidMiner.

L. D. College of Engineering, Ahmedabad

Department of Computer Engineering


RUBRICS FOR LABORATORY PRACTICALS ASSESSMENT

Subject Name: Data Mining Subject Code: 3160714

Term: 2020-21

Rubrics ID   Criteria                          Marks   Good (2)                       Satisfactory (1)              Need Improvement (0)

RB1          Regularity                        02      High (>70%)                    Moderate (40-70%)             Poor (0-40%)

RB2          Problem Analysis                  02      Apt & Full Identification      Limited Identification        Very Less Identification
                                                       of the Problem                 of the Problem                of the Problem

RB3          Development of the Solution       03      Complete Solution for          Incomplete Solution for       Very Less Solution for
                                                       the Problem                    the Problem                   the Problem

RB4          Concept Clarity & Understanding   03      Concept is very clear with     Concept is clear up to        Concept is not clear
                                                       proper understanding           some extent

SIGN OF FACULTY

L. D. College of Engineering, Ahmedabad


Department of Computer Engineering
LABORATORY PRACTICALS ASSESSMENT

Subject Name: Data Mining


Term: 2020-21

Enroll. No.:
Class: 6th CE
Pract. No.   CO No.   RB1   RB2   RB3   RB4   Total   Date   Faculty Sign

1 CO3

2 CO1

3 CO2

4 CO2

5 CO2

6 CO5

7 CO5

8 CO2

9 CO4

10 CO1

GUJARAT TECHNOLOGICAL UNIVERSITY, AHMEDABAD,


COURSE CURRICULUM
COURSE TITLE: Data Mining
(Code: 3160714)
Degree Programmes in which this course is offered Semester in which offered

Computer Engineering 6th Semester

1. RATIONALE
 To teach the basic principles, concepts and applications of data warehousing and data
mining in Business Intelligence.
 To introduce the task of data mining as an important phase of the knowledge discovery process.
 To familiarize students with the conceptual, logical, and physical design of data warehouses,
OLAP applications and OLAP deployment.
 To characterize the kinds of patterns that can be discovered by association rule mining,
classification and clustering.
 To develop skill in selecting the appropriate data mining algorithms and tools for solving
practical problems.
 To master data mining techniques in various application contexts such as social, scientific
and environmental domains.

2. COMPETENCY

The primary purpose of data mining in business intelligence is to find correlations or patterns
among dozens of fields in large databases. The course exposes students to topics involving
planning, designing, building, populating, and maintaining a successful data warehouse and
implementing various data mining techniques in business applications.

3. COURSE OUTCOMES

After learning the course the students should be able to:


1. Perform the preprocessing of data and apply mining techniques on it.
2. Identify the association rules, classification, and clusters in large data sets.
3. Solve real world problems in business and scientific information using data mining.

4. Use data analysis tools for scientific applications.


5. Implement various supervised machine learning algorithms.

4. TEACHING AND EXAMINATION SCHEME

5. SUGGESTED LEARNING RESOURCES


A. LIST OF BOOKS

1. J. Han, M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann.

2. M. Kantardzic, "Data Mining: Concepts, Models, Methods and Algorithms", John Wiley & Sons Inc.
3. M. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education.
4. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, "Introduction to Data Mining", Pearson
Education.

B. LIST OF SOFTWARE / LEARNING WEBSITES


1. https://www.sas.com/en_us/insights/analytics/data-mining.html
2. https://www.ibm.com/cloud/learn/data-mining
3. https://en.wikipedia.org/wiki/Data_mining
4. https://www.investopedia.com/terms/d/datamining.asp
5. https://www.javatpoint.com/data-mining
6. http://www.lastnightstudy.com/show?id=37/data-mining-functionalities
7. https://www.jigsawacademy.com/blogs/data-science/data-mining-functionalities

Practical – 1
AIM: Introduction to various Data Mining Tools - WEKA, DTREG, DB Miner. Compare them in terms of
their special features, functionality and limitations.

 Objectives:
o Study various data mining tools
o Learn data mining features
o Learn data mining functionality
 Theory:
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine
learning software written in Java, developed at the University of Waikato, New Zealand.
Weka is free software available under the GNU General Public License. The Weka
workbench contains a collection of visualization tools and algorithms for data analysis and
predictive modeling, together with graphical user interfaces for easy access to this functionality.
Weka supports several standard data mining tasks, more specifically, data preprocessing,
clustering, classification, regression, visualization, and feature selection. Weka provides access to
SQL databases using Java Database Connectivity and can process the result returned by a database
query. It is not capable of multi-relational data mining, but there is separate software for
converting a collection of linked database tables into a single table that is suitable for processing
using Weka.
 Download and Installation Procedure:
Step 1: download from link: http://sourceforge.net/projects/weka/postdownload?source=dlp
Step 2: Installation
Run the downloaded installer and follow the WEKA setup wizard's on-screen steps to
complete the installation.

DTREG is a robust application that is installed easily on any Windows system. DTREG
reads Comma Separated Value (CSV) data files that are easily created from almost any data
source. Once you create your data file, just feed it into DTREG, and let DTREG do all of the
work of creating a decision tree, Support Vector Machine, K-Means clustering, Linear
Discriminant Function, Linear Regression or Logistic Regression model. Even complex analyses
can be set up in minutes.

DTREG accepts a dataset containing a number of rows, with a column for each
variable. One of the variables is the "target variable" whose value is to be
modeled and predicted as a function of the "predictor variables". DTREG
analyzes the data and generates a model showing how best to predict the values
of the target variable based on the values of the predictor variables.

DTREG can create classical, single-tree models and also TreeBoost and
Decision Tree Forest models consisting of ensembles of many trees. DTREG
can also generate Neural Network, Support Vector Machine (SVM), Gene
Expression Programming/Symbolic Regression, K-Means clustering, GMDH
polynomial network, Discriminant Analysis, Linear Regression, and Logistic
Regression models.

Installing DTREG

To install DTREG, run the installation program named DTREGsetup.exe. A "wizard"


screen will guide you through the installation process. You can accept the default
installation location (C:\Program files\DTREG) or select a different folder location. When
the installation finishes, you should see this icon for DTREG on your desktop:

To launch DTREG, double-click the Shortcut to DTREG icon on your desktop.


DTREG’s Main Screen
When you launch DTREG, its main screen displays:

From this screen, you can:


 Create a new project to build a model
 Open an existing project
 Set options and enter your registration key.

 Tools / Material Needed:


o Hardware:
o Software: Weka, DTREG, DB Miner

Signature of Faculty:

Practical – 2
AIM: Demonstration of preprocessing on dataset like student.arff and labor.arff using WEKA.

 Objectives: Perform preprocessing operation on some data set using WEKA.

 Theory:
o Data preprocessing is a data mining technique which is used to transform raw data into
a useful and efficient format. To ensure high-quality data, it is crucial to preprocess it. To
make the process easier, data preprocessing is divided into four stages: data cleaning,
data integration, data reduction, and data transformation.

1. Data Cleaning:

The data can have many irrelevant and missing parts. To handle this, data cleaning is
performed. It involves handling missing data, noisy data, etc.

2. Data Transformation:

This step is taken in order to transform the data into forms appropriate for the mining
process. Typical strategies include normalization, discretization, aggregation and
attribute (feature) construction.

3. Data Reduction:

Since data mining handles huge amounts of data, analysis becomes harder as the volume
of data grows. To deal with this, we use data reduction techniques. They aim to increase
storage efficiency and reduce data storage and analysis costs.

 Tools / Material Needed:


o Hardware:
o Software: Weka

 Procedure / Steps:
The sample dataset used for this example is the student data available in arff format.

Step1:
Load the data. We can load the dataset into weka by clicking on open button in
preprocessing interface and selecting the appropriate file.
Step2:
Once the data is loaded, Weka will recognize the attributes and, during the scan of the
data, compute some basic statistics on each attribute. The left panel shows the list of
recognized attributes, while the top panel indicates the names of the base relation (table)
and the current working relation (which are the same initially).
Step3:
Clicking on an attribute in the left panel will show the basic statistics for that attribute:
for categorical attributes the frequency of each attribute value is shown, while for
continuous attributes we can obtain the minimum, maximum, mean and standard
deviation.
Step4:
The right-hand panel shows a visualization in the form of a cross-tabulation across two
attributes.

Note: we can select another attribute using the dropdown list.


Step5:
Selecting or filtering attributes

Removing an attribute - When we need to remove an attribute, we can do this by using
the attribute filters in Weka. In the Filter panel, click on the Choose button. This will
show a popup window with a list of available filters.
Scroll down the list and select the "weka.filters.unsupervised.attribute.Remove" filter.
Step 6:
a) Next, click the textbox immediately to the right of the Choose button. In the resulting
dialog box, enter the index of the attribute to be filtered out.
b) Make sure that the invertSelection option is set to false. Then click OK; in the filter
box you will see "Remove -R 7".

c) Click the Apply button to apply the filter to this data. This will remove the attribute and
create a new working relation.

d) Save the new working relation as an ARFF file by clicking the Save button on the
top panel (e.g., save it as student.arff).

Dataset student .arff


@relation student
@attribute age {<30, 30-40, >40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}
@data
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
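The attribute removal described in Steps 5-6 can also be performed programmatically through
Weka's Java API. The following is a minimal, illustrative sketch (the file names and the
attribute index are assumptions to be adapted to your own dataset):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Minimal sketch: remove one attribute from an ARFF file using Weka's Java API.
public class RemoveAttributeDemo {
    public static void main(String[] args) throws Exception {
        // Load the dataset (path is illustrative)
        Instances data = DataSource.read("student.arff");

        // Configure the Remove filter: drop the attribute at the given 1-based index
        Remove remove = new Remove();
        remove.setAttributeIndices("1");   // adjust to the attribute you want to drop
        remove.setInvertSelection(false);
        remove.setInputFormat(data);

        // Apply the filter to obtain the new working relation
        Instances filtered = Filter.useFilter(data, remove);

        // Save the result as a new ARFF file
        ArffSaver saver = new ArffSaver();
        saver.setInstances(filtered);
        saver.setFile(new File("student_filtered.arff"));
        saver.writeBatch();
    }
}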

The following screenshot shows the effect of discretization.

Signature of Faculty:

Practical – 3
AIM: Demonstration of Association rule process on any data set using Apriori algorithm in WEKA.

 Objectives: to illustrate basic elements of the Apriori algorithm using Weka.

 Theory:
o Association rule mining finds interesting associations and relationships among large sets
of data items. A rule shows how frequently an itemset occurs in a transaction. A
typical example is Market Basket Analysis.
o Market Basket Analysis is one of the key techniques used by large retailers to show
associations between items. It allows retailers to identify relationships between the items
that people buy together frequently.

 Background / Preparation:
o Given a set of transactions, we can find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction.

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Before we start defining the rule, let us first see the basic definitions.

Support Count (σ) – Frequency of occurrence of an itemset.

Here σ({Milk, Bread, Diaper}) = 2

Frequent Itemset – An itemset whose support is greater than or equal to minsup threshold.

Association Rule– An implication expression of the form X -> Y, where X and Y are any 2
itemsets.

Example: {Milk, Diaper}->{Beer}

Rule Evaluation Metrics –

 Support(s) –
The number of transactions that include the items in both the {X} and {Y} parts of the
rule, as a percentage of the total number of transactions. It is a measure of how
frequently the collection of items occurs together, as a fraction of all transactions:
Supp(X => Y) = Supp(X U Y)
It is interpreted as the fraction of transactions that contain both X and Y.

 Confidence(c) –
It is the ratio of the number of transactions that include all items in both {X} and {Y}
to the number of transactions that include all items in {X}:
Conf(X => Y) = Supp(X U Y) / Supp(X)
It measures how often the items in Y appear in transactions that also contain the items
in X.

 Lift(l) –
The lift of the rule X => Y is the confidence of the rule divided by the expected
confidence, assuming that the itemsets X and Y are independent of each other. The
expected confidence is simply the support of {Y}:
Lift(X => Y) = Conf(X => Y) / Supp(Y)
A lift greater than 1 indicates that X and Y occur together more often than would be
expected if they were independent.
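For instance, using the five transactions listed above, the rule {Milk, Diaper} => {Beer}
evaluates as follows:

Supp({Milk, Diaper} => {Beer}) = σ({Milk, Diaper, Beer}) / 5 = 2/5 = 0.4
Conf({Milk, Diaper} => {Beer}) = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
Lift({Milk, Diaper} => {Beer}) = Conf / Supp({Beer}) = (2/3) / (3/5) ≈ 1.11

A lift of about 1.11 (> 1) suggests that Beer is bought slightly more often with Milk and
Diaper than would be expected if the items were independent.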

 Tools / Material Needed:


o Hardware:
o Software: Weka

 Procedure / Steps:
o Step1: Open the data file in Weka Explorer. It is presumed that the required data fields
have been discretized. In this example it is the age attribute.
o Step2: Clicking on the Associate tab will bring up the interface for association rule
algorithms.
o Step3: We will use the Apriori algorithm. This is the default algorithm.
o Step4: In order to change the parameters for the run (e.g., support, confidence) we
click on the text box immediately to the right of the Choose button.
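The same run can also be reproduced from Java code. The sketch below is illustrative (the
file name and parameter values are assumptions); it loads a nominal dataset and mines
association rules with Weka's Apriori implementation:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: mine association rules with Weka's Apriori from Java.
public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Load a nominal/discretized dataset (path is illustrative)
        Instances data = DataSource.read("contact-lenses.arff");

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);              // number of rules to report
        apriori.setMinMetric(0.9);            // minimum confidence
        apriori.setLowerBoundMinSupport(0.1); // minimum support

        apriori.buildAssociations(data);      // run the algorithm
        System.out.println(apriori);          // print the discovered rules
    }
}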

Dataset contactlenses.arff

The following screenshot shows the association rules that were generated when apriori
algorithm is applied on the given dataset.

Signature of Faculty:

Practical – 4
AIM: Demonstration of classification rule process on dataset using j48 algorithm using WEKA.

 Objectives: to learn the decision tree algorithm using Weka

 Theory:

Classification:

It is a data analysis task, i.e. the process of finding a model that describes and
distinguishes data classes and concepts. Classification is the problem of identifying to
which of a set of categories (sub-populations) a new observation belongs, on the basis
of a training set of data containing observations whose category membership is
known.

Types of Classification Techniques in Data Mining

Before we discuss the various classification algorithms in data mining, let's first look
at the types of classification techniques available. Primarily, we can divide the
classification algorithms into two categories:

1. Generative
2. Discriminative

Generative

A generative classification algorithm models the distribution of individual classes. It tries


to learn the model which creates the data through estimation of distributions and
assumptions of the model. You can use generative algorithms to predict unseen data.

A prominent generative algorithm is the Naive Bayes Classifier.

Discriminative

A discriminative classification algorithm directly determines a class for a row of data. It
models the decision using the observed data and depends on the data quality rather than
on its distributions.

 Background / Preparation:
This experiment illustrates the use of the J48 classifier in Weka. The sample data set used in
this experiment is the "student" data available in ARFF format. This document assumes that
appropriate data preprocessing has been performed.

 Tools / Material Needed:


o Hardware:
o Software: Weka

 Procedure / Steps:
o Steps involved in this experiment:
o Step-1:
 We begin the experiment by loading the data (student.arff)into weka.
o Step2:
 Next we select the "Classify" tab and click the "Choose" button to select the
"J48" classifier.
o Step3:
 Now we specify the various parameters. These can be specified by clicking in the
text box to the right of the Choose button. In this example, we accept the default
values. The default version does perform some pruning but does not perform error
pruning.
o Step4:
 Under the "Test options" in the main panel, we select 10-fold cross-validation
as our evaluation approach. Since we don't have a separate evaluation data set, this
is necessary to get a reasonable idea of the accuracy of the generated model.
o Step-5:
 We now click "Start" to generate the model. The ASCII version of the tree as well
as the evaluation statistics will appear in the right panel when the model construction
is complete.
o Step-6:
 Note that the classification accuracy of the model is about 69%. This indicates that
more work may be needed (either in preprocessing or in selecting the parameters
for the classification).
o Step-7:
 Weka also lets us view a graphical version of the classification tree. This
can be done by right-clicking the last result set and selecting "Visualize tree" from
the pop-up menu.
o Step-8:

 We will use our model to classify the new instances.


o Step-9:

 In the main panel, under "Test options", click the "Supplied test set" radio button
and then click the "Set" button. This will pop up a window which allows you
to open the file containing the test instances.

Dataset student .arff


@relation student
@attribute age {<30, 30-40, >40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}
@data
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no

The following screenshot shows the classification rules that were generated when j48
algorithm is applied on the given dataset.
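For reference, the same model can be built and evaluated from Java code. The snippet below
is a minimal, illustrative sketch (the file path is an assumption) that mirrors the GUI steps
above: it builds a J48 tree on student.arff and evaluates it with 10-fold cross-validation.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: J48 classification with 10-fold cross-validation via Weka's Java API.
public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("student.arff");   // path is illustrative
        data.setClassIndex(data.numAttributes() - 1);       // class attribute: buyspc

        J48 tree = new J48();                                // default parameters, as in Step 3
        tree.buildClassifier(data);
        System.out.println(tree);                            // ASCII version of the tree

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());          // accuracy and other statistics
    }
}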

Signature of Faculty:

Practical – 5
AIM: Demonstration of classification rule process on dataset using id3 algorithm using WEKA.

 Objectives: to learn the ID3 algorithm using Weka

 Theory:

o The ID3 algorithm, which stands for Iterative Dichotomiser 3, is a classification algorithm
that follows a greedy approach: it builds a decision tree by selecting, at each step, the
attribute that yields the maximum Information Gain (IG), i.e. the minimum Entropy (H).

o The ID3 algorithm begins with the original set as the root node. On each iteration of the
algorithm, it iterates through every unused attribute of the set and calculates the entropy
or the information gain of that attribute. It then selects the attribute which has the smallest
entropy (or largest information gain) value. The set is then split or partitioned by the
selected attribute to produce subsets of the data. (For example, a node can be split into
child nodes based upon the subsets of the population whose ages are less than 50,
between 50 and 100, and greater than 100.) The algorithm continues to recurse on each
subset, considering only attributes never selected before.
o Recursion on a subset may stop in one of these cases:
o Every element in the subset belongs to the same class; in which case the node is turned
into a leaf node and labelled with the class of the examples.
o There are no more attributes to be selected, but the examples still do not belong to the
same class. In this case, the node is made a leaf node and labelled with the most common
class of the examples in the subset.
o There are no examples in the subset, which happens when no example in the parent set
was found to match a specific value of the selected attribute. An example could be the
absence of a person among the population with age over 100 years. Then a leaf node is
created and labelled with the most common class of the examples in the parent node's set.
o Throughout the algorithm, the decision tree is constructed with each non-terminal node
(internal node) representing the selected attribute on which the data was split, and
terminal nodes (leaf nodes) representing the class label of the final subset of this branch.
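o For reference, the two quantities used by ID3 can be written as (standard definitions):

Entropy(S) = - Σ p_i * log2(p_i)                          (sum over the classes i present in S)
Gain(S, A) = Entropy(S) - Σ (|S_v| / |S|) * Entropy(S_v)  (sum over the values v of attribute A)

At each node, ID3 splits on the attribute A with the largest Gain(S, A).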

 Background / Preparation:
The sample data set used in this experiment is the "employee" data available in ARFF format.
This document assumes that appropriate data preprocessing has been performed.

 Tools / Material Needed:


o Hardware:
o Software: Weka

 Procedure / Steps:
o Steps involved in this experiment:
o Step1:
 We begin the experiment by loading the data (employee.arff) into weka.
o Step 2:
 Next we select the "Classify" tab and click the "Choose" button to select the
"Id3" classifier.
o Step 3:
 Now we specify the various parameters. These can be specified by clicking in the
text box to the right of the Choose button. In this example, we accept the default
values. (Note that Id3 handles only nominal attributes and performs no pruning.)
o Step 4:
 Under the "Test options" in the main panel, we select 10-fold cross-validation
as our evaluation approach. Since we don't have a separate evaluation data set, this
is necessary to get a reasonable idea of the accuracy of the generated model.
o Step 5:
 We now click "Start" to generate the model. The ASCII version of the tree as well
as the evaluation statistics will appear in the right panel when the model construction
is complete.
o Step 6:
 Note that the classification accuracy of the model is about 69%. This indicates that
more work may be needed (either in preprocessing or in selecting the parameters
for the classification).
o Step 7:
 Weka also lets us view a graphical version of the classification tree. This
can be done by right-clicking the last result set and selecting "Visualize tree" from
the pop-up menu.
o Step 8:
 We will use our model to classify the new instances.
o Step 9:
 In the main panel, under "Test options", click the "Supplied test set" radio button
and then click the "Set" button. This will show a pop-up window which allows
you to open the file containing the test instances.

Data set employee.arff:


@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary {10k, 15k, 17k, 20k, 25k, 30k, 32k, 34k, 35k}
@attribute performance {good, avg, poor}
@data
25, 10k, poor
27, 15k, poor
27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 34k, good
48, 32k, good
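As an illustration, the entropy of the class attribute (performance) in this dataset is

Entropy(S) = -(4/11)*log2(4/11) - (4/11)*log2(4/11) - (3/11)*log2(3/11) ≈ 1.57 bits

(4 poor, 4 avg and 3 good examples out of 11); ID3 chooses the attribute whose split
reduces this value the most.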

The following screenshot shows the classification rules that were generated when id3 algorithm
is applied on the given dataset.

Signature of Faculty:

Practical – 6
AIM: Implementation of Bayesian classifier using JAVA and verify result with WEKA.

 Objectives: to learn the Bayesian classifier

 Theory:
Naive Bayes classifiers are a collection of classification algorithms based on
Bayes’ Theorem. It is not a single algorithm but a family of algorithms where all of them
share a common principle, i.e. every pair of features being classified is independent of
each other.

 Background / Preparation:
 To start with, let us consider a dataset.
 Consider a fictional dataset that describes the weather conditions for playing a
game of golf. Given the weather conditions, each tuple classifies the conditions as
fit ("Yes") or unfit ("No") for playing golf.
 Here is a tabular representation of our dataset.

Outlook Temperature Humidity Windy Play Golf


0 Rainy Hot High False No
1 Rainy Hot High True No
2 Overcast Hot High False Yes
3 Sunny Mild High False Yes
4 Sunny Cool Normal False Yes
5 Sunny Cool Normal True No
6 Overcast Cool Normal True Yes
7 Rainy Mild High False No
8 Rainy Cool Normal False Yes
9 Sunny Mild Normal False Yes
10 Rainy Mild Normal True Yes
11 Overcast Mild High True Yes
12 Overcast Hot Normal False Yes
13 Sunny Mild High True No

The dataset is divided into two parts, namely, feature matrix and the response vector.

 Feature matrix contains all the vectors (rows) of the dataset, in which each vector
consists of the values of the dependent features. In the above dataset, the features are
'Outlook', 'Temperature', 'Humidity' and 'Windy'.
 Response vector contains the value of the class variable (prediction or output) for
each row of the feature matrix. In the above dataset, the class variable name is 'Play
golf'.

 Tools / Material Needed:


o Hardware:
o Software: Java, Weka
 Procedure / Steps:

Assumption:

The fundamental Naive Bayes assumption is that each feature makes an:

 independent
 equal

contribution to the outcome.

With relation to our dataset, this concept can be understood as:

 We assume that no pair of features is dependent. For example, the temperature
being 'Hot' has nothing to do with the humidity, and the outlook being 'Rainy' has
no effect on the winds. Hence, the features are assumed to be independent.
 Secondly, each feature is given the same weight (or importance). For example,
knowing only the temperature and humidity alone cannot predict the outcome
accurately. None of the attributes is irrelevant, and each is assumed to contribute
equally to the outcome.

Note: The assumptions made by Naive Bayes are not generally correct in real-world
situations. In fact, the independence assumption is never correct but often works well in
practice.

Bayes' Theorem provides a way that we can calculate the probability of a piece of data
belonging to a given class, given our prior knowledge. Bayes' Theorem is stated as:

 P(class|data) = (P(data|class) * P(class)) / P(data)

Where P(class|data) is the probability of class given the provided data.
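As a quick worked example from the golf dataset above: to judge whether to play when the
outlook is Sunny,

P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny) = (3/9 * 9/14) / (5/14) = 0.60

since 9 of the 14 days are "Yes", 5 of the 14 days are Sunny, and 3 of the 9 "Yes" days are
Sunny.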

The Naive Bayes implementation is broken down into 5 parts:

 Step 1: Separate By Class.


 We will need to calculate the probability of data by the class they belong
to, the so-called base rate.
 This means that we will first need to separate our training data by class. A
relatively straightforward operation.
 We can create a dictionary object where each key is the class value and
then add a list of all the records as the value in the dictionary.
 Step 2: Summarize Dataset.
 We need two statistics from a given set of data.
 We'll see how these statistics are used in the calculation of probabilities in
a few steps. The two statistics we require from a given dataset are the
mean and the standard deviation (average deviation from the mean).
 The mean is the average value and can be calculated as: mean = sum(x) / n
 Where x is the list of values for the column we are looking at and n = count(x)
is the number of values.
 Step 3: Summarize Data By Class.
 We require statistics from our training dataset organized by class.
 Steps 1 and 2 above give us, respectively, a way to separate a dataset into
rows by class and a way to calculate summary statistics for each column.
 We can put all of this together and summarize the columns in the dataset
organized by class values.
 Step 4: Gaussian Probability Density Function.
 Calculating the probability or likelihood of observing a given real-value
like X1 is difficult.
 One way we can do this is to assume that X1 values are drawn from a
distribution, such as a bell curve or Gaussian distribution.
 A Gaussian distribution can be summarized using only two numbers: the
mean and the standard deviation. Therefore, with a little math, we can
estimate the probability of a given value. This piece of math is called a
Gaussian Probability Distribution Function (or Gaussian PDF) and can be
calculated as:
 f(x) = (1 / (sqrt(2 * PI) * sigma)) * exp(-((x - mean)^2 / (2 *
sigma^2)))
 Where sigma is the standard deviation for x, mean is the mean for x and PI
is the value of pi.

 Step 5: Class Probabilities.


 Now it is time to use the statistics calculated from our training data to
calculate probabilities for new data.
 Probabilities are calculated separately for each class. This means that we
first calculate the probability that a new piece of data belongs to the first
class, then calculate probabilities that it belongs to the second class, and so
on for all the classes.
 The probability that a piece of data belongs to a class is calculated as
follows:
 P(class|data) = P(X|class) * P(class)
 You may note that this is different from the Bayes Theorem described
above.
 The division has been removed to simplify the calculation.
 This means that the result is no longer strictly a probability of the data
belonging to a class. The value is still maximized, meaning that the
calculation for the class that results in the largest value is taken as the
prediction. This is a common implementation simplification as we are
often more interested in the class prediction rather than the probability.
 The input variables are treated separately, giving the technique its name,
"naive". For an example with two input variables, the calculation of the
probability that a row belongs to the first class, 0, is:
 P(class=0|X1,X2) = P(X1|class=0) * P(X2|class=0) * P(class=0)
 Now you can see why we need to separate the data by class value. The
Gaussian Probability Density function in the previous step is how we
calculate the probability of a real value like X1 and the statistics we
prepared are used in this calculation.
 A function that ties all of this together (called calculateClassProbabilities()
in the Java sketch after these steps) takes a set of prepared summaries and a
new row as input arguments.
 First the total number of training records is calculated from the counts
stored in the summary statistics. This is used in the calculation of the
probability of a given class or P(class) as the ratio of rows with a given
class of all rows in the training data.
 Next, probabilities are calculated for each input value in the row using the
Gaussian probability density function and the statistics for that column and
of that class. Probabilities are multiplied together as they accumulated.
 This process is repeated for each class in the dataset.
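The calculations in Steps 4 and 5 can be sketched in Java as follows. This is a minimal,
illustrative sketch rather than a fixed API: the class and method names (e.g.
calculateClassProbabilities()) are our own, and the per-class, per-column summaries are
assumed to have been computed in Steps 1-3. The resulting predictions can then be compared
against Weka's NaiveBayes classifier, as required by the aim.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of Steps 4-5: Gaussian PDF and per-class probability calculation.
public class NaiveBayesSketch {

    // Summary of one numeric column within one class: mean, standard deviation, row count.
    public static class Summary {
        final double mean, stdev;
        final int count;
        Summary(double mean, double stdev, int count) {
            this.mean = mean; this.stdev = stdev; this.count = count;
        }
    }

    // Step 4: Gaussian probability density function.
    static double gaussianPdf(double x, double mean, double stdev) {
        double exponent = Math.exp(-((x - mean) * (x - mean)) / (2 * stdev * stdev));
        return (1.0 / (Math.sqrt(2 * Math.PI) * stdev)) * exponent;
    }

    // Step 5: P(class) * product of P(x_i | class) over the input columns of the row.
    static Map<String, Double> calculateClassProbabilities(
            Map<String, List<Summary>> summaries, double[] row, int totalRows) {
        Map<String, Double> probabilities = new HashMap<>();
        for (Map.Entry<String, List<Summary>> entry : summaries.entrySet()) {
            List<Summary> classSummaries = entry.getValue();
            double p = (double) classSummaries.get(0).count / totalRows;  // prior P(class)
            for (int i = 0; i < classSummaries.size(); i++) {
                Summary s = classSummaries.get(i);
                p *= gaussianPdf(row[i], s.mean, s.stdev);                // P(x_i | class)
            }
            probabilities.put(entry.getKey(), p);
        }
        return probabilities;  // the class with the largest value is taken as the prediction
    }
}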

 Questions:
1. What is a Naïve Bayes Classifier?
2. What is Statistical Significance?
3. Why Naive Bayes is called Naive?
4. How would you use Naive Bayes classifier for categorical features? What if some
features are numerical?

Signature of Faculty:

Practical – 7
AIM: Implementation of K-mean clustering algorithm using JAVA and verify result with WEKA.

 Objectives: to learn the K-mean clustering

 Theory:
K-Means Clustering is an unsupervised learning algorithm that is used to solve
clustering problems in machine learning and data science. In this practical, we will learn
what the K-means clustering algorithm is and how it works, and then implement it in Java
and verify the result with Weka.

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled


dataset into different clusters. Here K defines the number of pre-defined clusters that
need to be created in the process, as if K=2, there will be two clusters, and for K=3, there
will be three clusters, and so on.

It allows us to cluster the data into different groups and a convenient way to discover the
categories of groups in the unlabeled dataset on its own without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The


main aim of this algorithm is to minimize the sum of distances between the data point and
their corresponding clusters.

 Background / Preparation:

The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of
clusters, and repeats the process until it does not find the best clusters. The value of k
should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

 Determines the best value for K center points or centroids by an iterative process.
 Assigns each data point to its closest k-center. Those data points which are near to
the particular k-center, create a cluster.

Hence each cluster has data points with some commonalities, and it is away from other
clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

 Tools / Material Needed:


o Hardware:
o Software: Java, Weka
 Procedure / Steps:
o The working of the K-Means algorithm is explained in the steps below (a minimal Java
sketch of these steps follows the list):
 Step-1: Select the number K to decide the number of clusters.
 Step-2: Randomly select K points as the initial centroids (they need not come from
the input dataset).
 Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
 Step-4: Calculate the variance and place a new centroid for each cluster.
 Step-5: Repeat the third step, i.e. reassign each data point to the new
closest centroid of its cluster.
 Step-6: If any reassignment occurs, then go to Step-4; otherwise, FINISH.
 Step-7: The model is ready.
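A minimal, illustrative Java sketch of these steps is given below (the sample points in main()
are made up for demonstration). To verify the result with Weka, the same data can be
clustered with weka.clusterers.SimpleKMeans (setNumClusters(k), then buildClusterer(...))
and the cluster assignments compared.

import java.util.Arrays;
import java.util.Random;

// Minimal K-means sketch for numeric data, following Steps 1-7 above.
public class KMeansSketch {

    // Euclidean distance between two points.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    static int[] kMeans(double[][] points, int k, int maxIterations) {
        Random rnd = new Random(1);
        // Step 2: pick k random points as the initial centroids
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) centroids[i] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Steps 3 and 5: assign each point to its closest centroid
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (distance(points[p], centroids[c]) < distance(points[p], centroids[best]))
                        best = c;
                if (assignment[p] != best) { assignment[p] = best; changed = true; }
            }
            if (!changed) break;  // Step 6: no reassignment, so we are finished
            // Step 4: recompute each centroid as the mean of its cluster
            for (int c = 0; c < k; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] != c) continue;
                    count++;
                    for (int d = 0; d < sum.length; d++) sum[d] += points[p][d];
                }
                if (count > 0) {
                    for (int d = 0; d < sum.length; d++) sum[d] /= count;
                    centroids[c] = sum;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.5, 2}, {8, 8}, {9, 8.5}, {1, 0.5}, {8.5, 9} };
        // Points near (1, 1) end up in one cluster, points near (8.5, 8.5) in the other.
        System.out.println(Arrays.toString(kMeans(points, 2, 100)));
    }
}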
 Questions:
1. How to choose the value of "K number of clusters" in K-means Clustering?
2. Why do you prefer Euclidean distance over Manhattan distance in the K means
Algorithm?
3. What is a centroid point in K means Clustering?

Signature of Faculty:

Practical – 8
AIM: Demonstration of Decision Tree classification using WEKA.

 Objectives: to learn decision tree classifier using Weka

 Theory:

o A classification problem is about teaching your machine learning model how to


categorize a data value into one of many classes. It does this by learning the
characteristics of each type of class. For example, to predict whether an image is of a cat
or dog, the model learns the characteristics of the dog and cat on training data.
o A regression problem is about teaching your machine learning model how to predict the
future value of a continuous quantity. It does this by learning the pattern of the quantity
in the past affected by different variables. For example, a model trying to predict the
future share price of a company is a regression problem.
o Decision trees are also known as Classification And Regression Trees (CART). They
work by learning answers to a hierarchy of if/else questions leading to a decision. These
questions form a tree-like structure, and hence the name
o For example, let's say we want to predict whether a person will order food or not. We can
visualize the following decision tree for this:

Each node in the tree represents a question derived from the features present in your
dataset. Your dataset is split based on these questions until the maximum depth of the tree
is reached. The last node does not ask a question but represents which class the value
belongs to.

 The topmost node in the Decision tree is called the Root node
 The bottom-most node is called the Leaf node
 A node divided into sub-nodes is called a Parent node. The sub-nodes are
called Child nodes

 Background / Preparation:
o "Weka is a free open-source software with a range of built-in machine learning
algorithms that you can access through a graphical user interface!"
o WEKA stands for Waikato Environment for Knowledge Analysis and was developed
at the University of Waikato, New Zealand.
o Weka has multiple built-in functions for implementing a wide range of machine learning
algorithms from linear regression to neural network. This allows you to deploy the most
complex of algorithms on your dataset at just a click of a button! Not only this, Weka
gives support for accessing some of the most common machine learning library
algorithms of Python and R!
o With Weka you can preprocess the data, classify the data, cluster the data and even
visualize the data! This you can do on different formats of data files like ARFF, CSV,
C4.5, and JSON. Weka even allows you to add filters to your dataset through which you
can normalize your data, standardize it, and interchange features between nominal and
numeric values, and what not!

 Tools / Material Needed:


o Hardware:
o Software: Weka

 Procedure / Steps:

o We will take the Breast Cancer dataset from the UCI Machine Learning Repository. It is
recommended to read about the problem before moving forward.

Let us first load the dataset in Weka. To do that, follow the below steps:

1. Open Weka GUI


2. Select the “Explorer” option.
3. Select “Open file” and choose your dataset.

Your Weka window should now look like this

You can view all the features in your dataset on the left-hand side. Weka automatically
creates plots for your features which you will notice as you navigate through your features.

You can even view all the plots together if you click on the “Visualize All” button.

Now let's train our classification model!

Classification using Decision Tree in Weka

Implementing a decision tree in Weka is pretty straightforward. Just complete the


following steps:

1. Click on the “Classify” tab on the top


2. Click the “Choose” button
3. From the drop-down list, select “trees” which will open all the tree
algorithms
4. Finally, select the “RepTree” decision tree

"Reduced Error Pruning Tree (RepTree) is a fast decision tree learner that builds
a decision/regression tree using information gain as the splitting criterion, and
prunes it using the reduced error pruning algorithm."

“Decision tree splits the nodes on all available variables and then selects the split which
results in the most homogeneous sub-nodes.”

Information Gain is used to calculate the homogeneity of the sample at a split.

You can select your target feature from the drop-down just above the “Start” button. If you
don't do that, WEKA automatically selects the last feature as the target for you.

The “Percentage split” specifies how much of your data you want to keep for training the
classifier. The rest of the data is used during the testing phase to calculate the accuracy of the
model.

With “Cross-validation Fold” you can create multiple samples (or folds) from the training
dataset. If you decide to create N folds, then the model is iteratively run N times. And each
time one of the folds is held back for validation while the remaining N-1 folds are used for
training the model. The result of all the folds is averaged to give the result of cross-
validation.

The greater the number of cross-validation folds you use, the more reliable the estimate of
your model's performance becomes, since every instance is used for both training and
validation; the trade-off is a longer training time.

Finally, press the “Start” button for the classifier to do its magic!

Our classifier has got an accuracy of 92.4%. Weka even prints the Confusion matrix for
you which gives different metrics.

Decision Tree Parameters in Weka

Decision trees have a lot of parameters. We can tune these to improve our model's overall
performance. This is where a working knowledge of decision trees really plays a crucial role.

You can access these parameters by clicking on your decision tree algorithm on top:

Let's briefly talk about the main parameters:

 maxDepth – It determines the maximum depth of your decision tree. By default, it is


-1 which means the algorithm will automatically control the depth. But you can
manually tweak this value to get the best results on your data
 noPruning – Pruning means to automatically cut back on a leaf node that does not
contain much information. This keeps the decision tree simple and easy to interpret
 numFolds – The specified number of folds of data will be used for pruning the
decision tree. The rest will be used for growing the rules
 minNum – Minimum number of instances per leaf. If not mentioned, the tree will
keep splitting till all leaf nodes have only one class associated with it

You can always experiment with different values for these parameters to get the best
accuracy on your dataset.
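For reference, the same parameters can also be set programmatically on Weka's REPTree
class. The snippet below is an illustrative sketch (the parameter values are assumptions, not
recommendations):

import weka.classifiers.trees.REPTree;

// Illustrative sketch: setting the REPTree parameters discussed above from Java.
public class REPTreeParams {
    public static void main(String[] args) throws Exception {
        REPTree tree = new REPTree();
        tree.setMaxDepth(5);       // maximum depth of the tree (-1 = automatic)
        tree.setNoPruning(false);  // keep reduced-error pruning enabled
        tree.setNumFolds(3);       // folds of data used for pruning
        tree.setMinNum(2.0);       // minimum number of instances per leaf
        System.out.println("Configured: " + tree.getClass().getName());
    }
}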

Visualizing your Decision Tree in Weka

Weka even allows you to easily visualize the decision tree built on your dataset:

1. Go to the “Result list” section and right-click on your trained algorithm


2. Choose the “Visualise tree” option

Your decision tree will look like below:

Interpreting these values can be a bit intimidating but it's actually pretty easy once you get the
hang of it.

 The values on the lines joining nodes represent the splitting criteria based on the values in
the parent node feature
 In the leaf node:
o The value before the parenthesis denotes the classification value
o The first value in the first parenthesis is the total number of instances from the
training set in that leaf. The second value is the number of instances incorrectly
classified in that leaf
o The first value in the second parenthesis is the total number of instances from the
pruning set in that leaf. The second value is the number of instances incorrectly
classified in that leaf
 Questions:
1. What is the Decision Tree Algorithm?
2. List down the attribute selection measures used by the ID3 algorithm to construct a
Decision Tree.
3. List down the different types of nodes in Decision Trees.

Signature of Faculty:

Practical – 9
AIM: Case Study: Do literature review of research paper on Data Mining and Web Mining which must
include any of the special techniques as per your syllabus and advanced topics/improved
technique.

 Objectives: to learn about the research paper

 Theory:
o "Research paper." What image comes into mind as you hear those words: working with
stacks of articles and books, hunting the "treasure" of others' thoughts? Whatever image
you create, it's a sure bet that you're envisioning sources of information--articles, books,
people, artworks. Yet a research paper is more than the sum of your sources, more than a
collection of different pieces of information about a topic, and more than a review of the
literature in a field. A research paper analyzes a perspective or argues a point. Regardless
of the type of research paper you are writing, your finished research paper should present
your own thinking backed up by others' ideas and information.

 Techniques for writing a good quality computer science research paper:


o Choosing the topic: In most cases, the topic is selected by the interests of the author, but
it can also be suggested by the guides. You can have several topics, and then judge which
you are most comfortable with. This may be done by asking several questions of yourself,
like "Will I be able to carry out a search in this area? Will I find all necessary resources to
accomplish the search? Will I be able to find all information in this field area?" If the
answer to this type of question is "yes," then you ought to choose that topic. In most
cases, you may have to conduct surveys and visit several places. Also, you might have to
do a lot of work to find all the rises and falls of the various data on that subject.
Sometimes, detailed information plays a vital role, instead of short information.
o Evaluators are human: The first thing to remember is that evaluators are also human
beings. They are not only meant for rejecting a paper. They are here to evaluate your
paper. So present your best aspect.
o Think like evaluators: If you are in confusion or getting demotivated because your
paper may not be accepted by the evaluators, then think, and try to evaluate your paper
like an evaluator. Try to understand what an evaluator wants in your research paper, and
you will automatically have your answer.
o Make blueprints of paper: The outline is the
plan or framework that will help you to arrange your thoughts. It will make your paper
logical. But remember that all points of your outline must be related to the topic you have
chosen.
o Ask your guides: If you are having any difficulty with your research, then do not hesitate
to share your difficulty with your guide (if you have one). They will surely help you out
and resolve your doubts. If you can't clarify what exactly you require for your work, then
ask your supervisor to help you with an alternative. He or she might also provide you
with a list of essential readings.
o Use of computer is recommended: As you are doing research in the field of computer
science, then this point is quite obvious.
o Use the right software: Always use good quality
software packages. If you are not capable of judging good software, then you can lose the
quality of your paper unknowingly. There are various programs available to help you
which you can get through the internet.
o Use the internet for help: An excellent start for your paper is using Google. It is a
wondrous search engine, where you can have your doubts resolved. You may also read
some answers for the frequent question of how to write your research paper or find a
model research paper. You can download books from the internet. If you have all the
required books, place importance on reading, selecting, and analyzing the specified
information. Then sketch out your research paper.
o Use big pictures: You may use encyclopedias like Wikipedia to get pictures with the
best resolution.
o Bookmarks are useful: When you read any book or magazine, you generally use
bookmarks, right? It is a good habit which helps to not lose your continuity. You should
always use bookmarks while searching on the internet also, which will make your search
easier.
o Revise what you wrote: When you write anything, always read it, summarize it, and
then finalize it.
o Make every effort: Make every effort to mention what you are going to write in your
paper. That means always have a good start. Try to mention everything in the
introduction—what is the need for a particular research paper. Polish your work with
good writing skills and always give an evaluator what he wants.
o Make backups: When you are going to do any important thing like making a research
paper, you should always
have backup copies of it either on your computer or on paper. This protects you from
losing any portion of your important data.
o Produce good diagrams of your own: Always try to include good charts or diagrams in
your paper to improve quality. Using several unnecessary diagrams will degrade the
quality of your paper by creating a hodgepodge. So always try to include diagrams which
were made by you to improve the readability of your paper.
o Use of direct quotes: When
you do research relevant to literature, history, or current affairs, then use of quotes
becomes essential, but if the study is relevant to science, use of quotes is not preferable.
o Use proper verb tense: Use proper verb tenses in your paper. Use past tense to present
those events that have happened. Use present tense to indicate events that are going on.
Use future tense to indicate events that will happen in the future. Use of wrong tenses
will confuse the evaluator. Avoid sentences that are incomplete.

o Pick a good study spot: Always try to pick a spot for your research which is quiet. Not
every spot is good for studying.
o Know what you know: Always try to know what you know by making objectives,
otherwise you will be confused and unable to achieve your target.
o Use good grammar: Always use good grammar and words that will have a positive
impact on the evaluator; use of good vocabulary does not mean using tough words which
the evaluator has to find in a dictionary. Do not fragment sentences. Eliminate one-word
sentences. Do not ever use a big word when a smaller one would suffice.
o Verbs have to be in agreement with their subjects. In a research paper, do not start
sentences with conjunctions or finish them with prepositions. When writing formally, it is
advisable to never split an infinitive because someone will (wrongly) complain. Avoid
clichés like a disease. Always shun irritating alliteration. Use language which is simple
and straightforward. Put together a neat summary.
o Arrangement of information: Each section of the main body should start with an
opening sentence, and there should be a changeover at the end of the section. Give only
valid and powerful arguments for your topic. You may also maintain your arguments
with records.
o Never start at the last minute: Always allow enough time for research work. Leaving
everything to the last minute will degrade your paper and spoil your work.
o Multitasking in research is not good: Doing several things at the same time is a bad
habit in the case of research activity. Research is an area where everything has a
particular time slot. Divide your research work into parts, and do a particular part in a
particular time slot.
o Never copy others' work: Never copy others' work and give it your name because if the
evaluator has seen it anywhere, you will be in trouble.
o Take proper rest and food: No
matter how many hours you spend on your research activity, if you are not taking care of
your health, then all your efforts will have been in vain. For quality research, take proper
rest and food.
o Go to seminars: Attend seminars if the topic is relevant to your research area. Utilize all
your resources.
o Refresh your mind after intervals: Try to give your mind a rest by listening to soft
music or sleeping in intervals. This will also improve your memory.
o Acquire colleagues:
Always try to acquire colleagues. No matter how sharp you are, if you acquire
colleagues, they can give you ideas which will be helpful to your research.
o Think technically: Always think technically. If anything happens, search for its reasons,
benefits, and demerits.
o Think and then print: When you go to print your paper, check that
tables are not split, headings are not detached from their descriptions, and page sequence
is maintained.
o Adding unnecessary information: Do not add unnecessary information like "I have
used MS Excel to draw graphs." Irrelevant and inappropriate material is superfluous.

Foreign terminology and phrases are not apropos. One should never take a broad view.
Analogy is like feathers on a snake. Use words properly, regardless of how others use
them. Remove quotations. Puns are for kids, not grunt readers.
o Never oversimplify: When
adding material to your research paper, never go for oversimplification; this will
definitely irritate the evaluator. Be specific. Never use rhythmic redundancies.
Contractions shouldn't be used in a research paper. Comparisons are as terrible as clichés.
Give up ampersands, abbreviations, and so on. Remove commas that are not necessary.
Parenthetical words should be between brackets or commas. Understatement is always
the best way to put forward earth-shaking thoughts. Give a detailed literary review.
o Report concluded results: Use concluded results. From raw data, filter the results, and
then conclude your studies based on measurements and observations taken. An
appropriate number of decimal places should be used. Parenthetical remarks are
prohibited here. Proofread carefully at the final stage. At the end, give an outline of your
arguments and point out perspectives for further study of the subject. Justify your
conclusion sufficiently; it will probably include examples.
o Upon conclusion: Once you have concluded your research, the next most important step
is to present your findings. Presentation is extremely important, as it is the medium
through which your research reaches its readers in print. Care should be taken to
organize your thoughts well and present them in a logical and neat manner. A good
quality research paper format is essential because it serves to highlight your research
paper and bring to light all necessary aspects of your research.

Signature of Faculty:

Practical – 10
AIM: To perform hands-on experiments of data preprocessing with sample data sets on RapidMiner.

 Objectives: To learn about data preprocessing using RapidMiner.


 Theory:

RapidMiner Studio combines technology and applicability in a user-friendly integration of
the latest as well as established data mining techniques. Analysis processes are defined in
RapidMiner Studio by dragging and dropping operators, setting their parameters and
combining the operators.

As we will see in the following, processes can be produced from a large number of
almost arbitrarily nestable operators and are finally represented by a so-called
process graph (flow design). The process structure is described internally by
XML and developed by means of a graphical user interface. In the background,
RapidMiner Studio constantly checks the process currently being developed for
syntax conformity and automatically makes suggestions in case of problems. This
is made possible by the so-called meta data transformation, which transforms the
underlying meta data at the design stage in such a way that the form of the result
can already be foreseen and solutions can be identified in case of unsuitable
operator combinations (quick fixes). In addition, RapidMiner Studio offers the
possibility of defining breakpoints and therefore of inspecting virtually every
intermediate result. Successful operator combinations can be pooled into building
blocks and are therefore available again in later processes.
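
To make the idea of nestable operators and breakpoints concrete, the following is a small,
purely illustrative Java sketch. It is a toy model written for this manual, not RapidMiner's
actual operator API: a "process" runs its child operators in sequence and, when its breakpoint
flag is set, prints every intermediate result, much like placing a breakpoint after each operator
in RapidMiner Studio.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of nestable operators -- purely illustrative, not RapidMiner's API.
interface Operator {
    double[] apply(double[] data);
}

// A process is itself an operator that runs its child operators in sequence.
class MiniProcess implements Operator {
    private final List<Operator> steps = new ArrayList<>();
    private final boolean breakpoints;   // if true, print every intermediate result

    MiniProcess(boolean breakpoints) { this.breakpoints = breakpoints; }

    MiniProcess add(Operator op) { steps.add(op); return this; }

    public double[] apply(double[] data) {
        for (Operator op : steps) {
            data = op.apply(data);
            if (breakpoints) {
                System.out.println("intermediate result: " + Arrays.toString(data));
            }
        }
        return data;
    }
}

public class ProcessGraphDemo {
    public static void main(String[] args) {
        // Build a small process: drop negative values, then sort ascending.
        MiniProcess process = new MiniProcess(true)
                .add(d -> Arrays.stream(d).filter(v -> v >= 0).toArray())
                .add(d -> { double[] s = d.clone(); Arrays.sort(s); return s; });
        process.apply(new double[]{4.0, -1.0, 7.0, 2.0});
    }
}

Because MiniProcess itself implements Operator, one process can be nested inside another,
which is the sense in which RapidMiner's operators are "nestable".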

RapidMiner Studio contains more than 1500 operations altogether for all tasks
of professional data analysis. From data partitioning, to market basket analysis,
to attribute generation, it includes all the tools you need to make your data work
for you. Methods of text mining, web mining, automatic sentiment analysis of
Internet discussion forums (sentiment analysis, opinion mining), as well as time
series analysis and prediction are also available. RapidMiner Studio enables us
to use strong visualisations like 3-D graphs, scatter matrices and self-organizing
maps. It allows you to turn your data into fully customizable, exportable charts
with support for zooming, panning, and rescaling for maximum visual impact.
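
As a point of comparison with what preprocessing operators such as Replace Missing Values
and Normalize do, the following is a minimal, hand-rolled Java sketch written for this manual
(it does not use RapidMiner at all). It imputes missing entries of one numeric column with the
column mean and then applies min-max normalization; the column values are invented sample
data.

import java.util.Arrays;

// Hand-rolled sketch of two common preprocessing steps: mean imputation of
// missing values followed by min-max normalization of a numeric column.
public class PreprocessingDemo {

    // Replace NaN entries with the mean of the non-missing values.
    static double[] replaceMissingWithMean(double[] column) {
        double mean = Arrays.stream(column).filter(v -> !Double.isNaN(v)).average().orElse(0);
        return Arrays.stream(column).map(v -> Double.isNaN(v) ? mean : v).toArray();
    }

    // Rescale all values linearly into the range [0, 1].
    static double[] minMaxNormalize(double[] column) {
        double min = Arrays.stream(column).min().orElse(0);
        double max = Arrays.stream(column).max().orElse(1);
        double range = (max - min) == 0 ? 1 : (max - min);
        return Arrays.stream(column).map(v -> (v - min) / range).toArray();
    }

    public static void main(String[] args) {
        // Invented sample column with one missing (NaN) value.
        double[] age = {23, 35, Double.NaN, 41, 29};
        double[] imputed = replaceMissingWithMean(age);
        double[] normalized = minMaxNormalize(imputed);
        System.out.println("Imputed:    " + Arrays.toString(imputed));
        System.out.println("Normalized: " + Arrays.toString(normalized));
    }
}

The same two steps can be reproduced in RapidMiner Studio by dragging the corresponding
preprocessing operators into a process, which is the approach practised in this experiment.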

 Background / Preparation:

o Installation and First Repository:

Before you can work with RapidMiner Studio, you of course need to download and
install the software first. You will find it in the download area of the RapidMiner
website:
http://www.rapidminer.com
Download the appropriate installation package for your operating system and
install RapidMiner Studio according to the instructions on the website. All usual
Windows versions are supported, as well as Macintosh, Linux and Unix systems.
Please note that an up-to-date Java Runtime (at least version 7) is needed for
the latter.

If you are starting RapidMiner Studio for the first time, you will be asked to
create a new repository (Fig. 10.1). We will limit ourselves to a local repository
on your computer first of all - later on you can then define repositories in the
network, which you can also share with others:

Figure 10.1: Create a local repository on your computer when using RapidMiner
Studio for the first time

For a local repository you just need to specify a name (alias) and define any
directory on your hard drive (Fig. 10.2). You can select the directory directly by
clicking on the folder icon on the right. It is advisable to create a new directory
in a convenient place within the file dialog that then appears and then use this
new directory as a basis for your local repository. This repository serves as a
central storage location for your data and analysis processes and will accompany
you in the near future.

Figure 10.2: Definition of a new local repository for storing your data and
analysis processes. It is advisable to create a new directory as a basis.

o Perspectives and Views:
After choosing the repository you will be welcomed into the Home Perspective
(Fig. 10.3). The right section shows current news about RapidMiner, if you are
connected to the Internet. The list in the centre shows the typical actions which
you will perform frequently after starting RapidMiner Studio. Here are the details
of those actions:
1. New Process: Opens the design perspective and creates a new analysis process.
2. Open: Opens a repository browser, if you click on the button. You can choose and open
an existing process in the design perspective. If you click on the arrow button on the right
side, a list of recently opened processes appears. You can select one and it will be opened
in the design perspective. Either way, RapidMiner Studio will then automatically switch
to the Design Perspective.
3. Application Wizard: You can use the Application Wizard to solve typical data mining
problems with your data in three steps. The Direct Marketing Wizard allows you to find
marketing actions with the highest conversion rates. The Predictive Maintenance Wizard
predicts necessary maintenance activities. The Churn Analysis Wizard allows you to
identify which customers are most likely to churn and why. The Sentiment Analysis
Wizard analyses a social media stream and gives you an insight into customers' thinking.
4. Tutorials: Starts a tutorial window which shows several available tutorials from creating
the first analysis process to data transformation. Each tutorial can be used directly within
RapidMiner Studio and gives an introduction to some data mining concepts using a
selection of analysis processes.

Figure 10.3: Home Perspective of RapidMiner Studio.


 Tools / Material Needed:
o Hardware:
o Software: RapidMiner

 Procedure / Steps:
o When you have completed the tutorials, you can use RapidMiner Studio's built-in
samples repository, with explanatory help text, for more practice exercises. The sample
data and processes are located in the Repository panel:

 The data folder contains a dozen different data sets, which are used by the sample
exercises. They contain a variety of different data types.
 The processes folder contains over 130 sample processes, organized by function, that
demonstrate preprocessing, visualization, clustering, and many other topics.

To use the samples, expand the processes folder.

There are two mechanisms for using these processes:

 double-click to display the individual operators with help text. This method is best
for learning.
 drag-and-drop to have the process immediately available for running.

Signature of Faculty:

