
Course Description

Course Name : Data Mining & Data Warehousing Lab


Course Code : 18B1WCI675
Course Credits : 01
Faculty Coordinator : Dr. Yugal Kumar

Course Objectives:

1. To understand the basic concepts of data mining and data warehousing.


2. To understand the design and modeling of a data warehouse.
3. To understand the need for pre-processing methods and to convert raw data into
pre-processed data.
4. To understand the concept of the classification task.
5. To understand the concepts of clustering and association rule mining.
6. To understand the organization of a data warehouse and different OLAP operations.

Course Outcomes:

CO1 Students will be able to understand the data warehouse and the design model of a data
warehouse.
CO2 Students will be able to learn the steps of pre-processing.
CO3 Students will be able to understand the analytical operations on data.
CO4 Students will be able to discover patterns and knowledge from a data warehouse.
CO5 Students will be able to understand and implement classical algorithms in data mining.

Experiment No. 1
1. Aim: To study different pre-processing steps in data mining and data warehousing.
2. Objectives: From this experiment, the student will be able to
 Discover patterns from data warehouse
 Learn steps of pre-processing of data
 Obtain knowledge from data warehouse
3. Outcomes: The learner will be able to
 Recognize the need for data pre-processing.
 Identify, formulate and solve engineering problems.
 Match industry requirements in the domain of data warehousing.
4. Theory:
Data pre-processing is an often neglected but important step in the data mining process. The
phrase "Garbage In, Garbage Out" is particularly applicable to data mining and machine
learning. Data gathering methods are often loosely controlled, resulting in out-of-range
values (e.g., Income:-100), impossible data combinations (e.g., Gender: Male, Pregnant:
Yes), missing values, etc.
If there is much irrelevant and redundant information present or noisy and unreliable data,
then knowledge discovery during the training phase is more difficult. Data preparation and
filtering steps can take a considerable amount of processing time. Data pre-processing includes
cleaning, normalization, transformation, feature extraction and selection, etc. The product of
data pre-processing is the final training set.

Data Pre-processing Methods


Raw data is highly susceptible to noise, missing values, and inconsistency. The quality of
data affects the data mining results. In order to help improve the quality of the data and,
consequently, of the mining results, raw data is pre-processed so as to improve the efficiency
and ease of the mining process. Data pre-processing is one of the most critical steps in a data
mining process, which deals with the preparation and transformation of the initial dataset.
Data pre-processing methods are divided into the following categories:
1) Data Cleaning
2) Data Integration
3) Data Transformation
4) Data Reduction

Fig. Forms of data Preprocessing
Data Cleaning
Data that is to be analyzed by data mining techniques can be incomplete (lacking attribute
values or certain attributes of interest, or containing only aggregate data), noisy (containing
errors, or outlier values which deviate from the expected), and inconsistent (e.g., containing
discrepancies in the department codes used to categorize items). Incomplete, noisy, and
inconsistent data are commonplace properties of large, real-world databases and data
warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may
not always be available, such as customer information for sales transaction data. Other data
may not be included simply because it was not considered important at the time of entry.
Relevant data may not be recorded due to a misunderstanding, or because of equipment
malfunctions. Data that were inconsistent with other recorded data may have been deleted.
Furthermore, the recording of the history or modifications to the data may have been
overlooked. Missing data, particularly for tuples with missing values for some attributes, may
need to be inferred. Data can be noisy, having incorrect attribute values, owing to the
following. The data collection instruments used may be faulty. There may have been human
or computer errors occurring at data entry. Errors in data transmission can also occur. There
may be technology limitations, such as limited buffer size for coordinating synchronized data
transfer and consumption. Incorrect data may also result from inconsistencies in naming
conventions or data codes used. Duplicate tuples also require data cleaning. Data cleaning
routines work to "clean" the data by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies. Dirty data can cause
confusion for the mining procedure. Although most mining routines have some procedures
for dealing with incomplete or noisy data, they are not always robust. Instead, they may
concentrate on avoiding overfitting the data to the function being modelled. Therefore, a
useful pre-processing step is to run your data through some data cleaning routines.
Missing Values: If it is noted that there are many tuples that have no recorded value for
several attributes, then the missing values can be filled in for the attribute by various methods
described below:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification or description). This method is not very effective,
unless the tuple contains several attributes with missing values. It is especially poor
when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and
may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like "Unknown", or -∞. If missing values are
replaced by, say, "Unknown", then the mining program may mistakenly think that they
form an interesting concept, since they all have a value in common, that of
"Unknown". Hence, although this method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value (a simple Java sketch of mean
imputation is given after this list).
5. Use the attribute mean for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with
inference-based tools using a Bayesian formalism or decision tree induction.
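
As a concrete illustration of method 4 above, the short Java sketch below fills missing
numeric values with the attribute mean. It is a minimal sketch for a single numeric column
stored in an array, where missing entries are assumed to be encoded as Double.NaN; it is not
tied to any particular dataset or library.

import java.util.Arrays;

public class MeanImputation {

    // Replace every NaN ("missing") entry of the column with the mean of the observed values.
    static double[] fillWithMean(double[] column) {
        double sum = 0.0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {   // only observed values contribute to the mean
                sum += v;
                count++;
            }
        }
        double mean = (count == 0) ? 0.0 : sum / count;

        double[] filled = Arrays.copyOf(column, column.length);
        for (int i = 0; i < filled.length; i++) {
            if (Double.isNaN(filled[i])) {
                filled[i] = mean;     // impute the missing value with the attribute mean
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        // Hypothetical income column with two missing values.
        double[] income = {25000, Double.NaN, 40000, 31000, Double.NaN};
        System.out.println(Arrays.toString(fillWithMean(income)));
    }
}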
Inconsistent data: There may be inconsistencies in the data recorded for some transactions.
Some data inconsistencies may be corrected manually using external references. For
example, errors made at data entry may be corrected by performing a paper trace. This may
be coupled with routines designed to help correct the inconsistent use of codes. Knowledge
engineering tools may also be used to detect the violation of known data constraints. For
example, known functional dependencies between attributes can be used to find values
contradicting the functional constraints.

Data Integration
It is likely that your data analysis task will involve data integration, which combines data
from multiple sources into a coherent data store, as in data warehousing. These sources may
include multiple databases, data cubes, or flat files. There are a number of issues to consider
during data integration. Schema integration can be tricky. How can equivalent real-world entities
from multiple data sources be 'matched up'? This is referred to as the entity identification
problem. For example, how can the data analyst or the computer be sure that customer id in
one database, and cust_number in another refer to the same entity? Databases and data
warehouses typically have metadata - that is, data about the data. Such metadata can be used
to help avoid errors in schema integration. Redundancy is another important issue. An
attribute may be redundant if it can be "derived" from another table, such as annual revenue.
Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting
data set.
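
A common check for redundancy between two numeric attributes is correlation analysis: a
correlation coefficient close to 1 or -1 suggests that one attribute can largely be derived from
the other. The Java sketch below computes the Pearson correlation between two numeric
columns; the sample arrays and the 0.9 threshold are illustrative assumptions, not taken from
any dataset in this manual.

public class RedundancyCheck {

    // Pearson correlation coefficient between two equally sized numeric attributes.
    static double pearson(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n;
        meanB /= n;

        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            double da = a[i] - meanA, db = b[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Hypothetical attributes: monthly revenue and (roughly) 12 * monthly revenue.
        double[] monthly = {10, 12, 9, 15, 14};
        double[] annual  = {121, 144, 108, 181, 170};
        double r = pearson(monthly, annual);
        System.out.println("correlation = " + r);
        if (Math.abs(r) > 0.9) {
            System.out.println("Attributes are highly correlated; one may be redundant.");
        }
    }
}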

Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Data transformation can involve the following:
1. Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0 (a min-max normalization sketch in Java is given
after this list).
2. Smoothing works to remove the noise from data. Such techniques include binning,
clustering, and regression.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for analysis of the
data at multiple granularities.
4. Generalization of the data, where low level or 'primitive' (raw) data are replaced by
higher level concepts through the use of concept hierarchies. For example, categorical
attributes, like street, can be generalized to higher level concepts, like city or country.
Similarly, values for numeric attributes, like age, may be mapped to higher level
concepts, like young, middle-aged, and senior.
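
A minimal sketch of min-max normalization, which rescales a numeric attribute into the
[0, 1] range mentioned in point 1 above; the sample 'age' values are hypothetical.

import java.util.Arrays;

public class MinMaxNormalization {

    // Rescale each value v to (v - min) / (max - min), so the column falls in [0, 1].
    static double[] normalize(double[] column) {
        double min = Arrays.stream(column).min().orElse(0.0);
        double max = Arrays.stream(column).max().orElse(1.0);
        double range = max - min;

        double[] scaled = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            scaled[i] = (range == 0) ? 0.0 : (column[i] - min) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] age = {18, 25, 40, 60, 33};   // hypothetical 'age' attribute
        System.out.println(Arrays.toString(normalize(age)));
    }
}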
Data Reduction
Data reduction techniques help in analyzing a reduced representation of the dataset without
compromising the integrity of the original data, while still producing quality knowledge. The
concept of data reduction is commonly understood as either reducing the volume or reducing
the dimensions (number of attributes). A number of methods make it possible to analyze a
reduced volume or dimension of data and yet yield useful knowledge. Certain partition-based
methods work on partitions of the data tuples. That is, mining on the reduced data set should
be more efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following.
1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
2. Dimension reduction, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
3. Data compression, where encoding mechanisms are used to reduce the data set size.
The methods used for data compression are the wavelet transform and Principal
Component Analysis.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need store only the model
parameters instead of the actual data, e.g. regression and log-linear models), or
nonparametric methods such as clustering, sampling, and the use of histograms (a simple
random-sampling sketch in Java is given after this list).
5. Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or higher conceptual levels. Concept hierarchies allow the
mining of data at multiple levels of abstraction, and are a powerful tool for data
mining.
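
To illustrate numerosity reduction by sampling (point 4 above), the sketch below draws a
simple random sample without replacement from a list of tuples; the sample size of 3 and the
string "tuples" are only for illustration.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SimpleRandomSample {

    // Simple random sample without replacement: shuffle a copy and keep the first k tuples.
    static <T> List<T> sample(List<T> tuples, int k) {
        List<T> copy = new ArrayList<>(tuples);
        Collections.shuffle(copy);
        return copy.subList(0, Math.min(k, copy.size()));
    }

    public static void main(String[] args) {
        List<String> tuples = List.of("t1", "t2", "t3", "t4", "t5", "t6");
        System.out.println(sample(tuples, 3));   // reduced representation of the data
    }
}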

5. Conclusion:

Data pre-processing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviours or trends, and is likely to contain many errors. Data pre-processing is a
proven method of resolving such issues. Such pre-processing is thus studied.

6. Viva Questions:

 What is pre-processing of data?


 What is the need for data pre-processing?
 What kind of data can be cleaned?

7. References:

 Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd
Edition
 M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson Education

Experiment No. 2
1. Aim: Implementation of decision tree algorithm in JAVA.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical algorithms in data mining
 Identify the application of classification algorithm in data mining
3. Outcomes: The learner will be able to
 Assess the strength and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals,
organizations and society
4. Software Required: JDK for JAVA
5. Theory:
Decision Tree learning is one of the most widely used and practical methods for inductive
inference over supervised data. A decision tree represents a procedure for classifying
categorical data based on their attributes. It is also efficient for processing large amounts of
data, so it is often used in data mining operations. The construction of a decision tree does not
require any domain knowledge or parameter setting, and is therefore appropriate for
exploratory knowledge discovery.

Decision tree builds classification or regression models in the form of a tree structure. It
breaks down a dataset into smaller and smaller subsets while at the same time an associated
decision tree is incrementally developed. The final result is a tree with decision
nodes and leaf nodes.

The core algorithm for building decision trees, called ID3 and developed by J. R. Quinlan,
employs a top-down, greedy search through the space of possible branches with no backtracking.
ID3 uses entropy and information gain to construct a decision tree.

Entropy: A decision tree is built top-down from a root node and involves partitioning the
data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm
uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous
the entropy is zero, and if the sample is equally divided it has an entropy of one. To build a
decision tree, we need to calculate two types of entropy using frequency tables: the entropy of
the target attribute alone, and the entropy of the target with respect to each candidate attribute.

Information Gain: The information gain is based on the decrease in entropy after a dataset is
split on an attribute. Constructing a decision tree is all about finding the attribute that returns
the highest information gain (i.e., the most homogeneous branches).
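
For reference, entropy is E(S) = -Σ p_i log2(p_i) over the class proportions p_i, and the gain of
splitting S on attribute A is Gain(S, A) = E(S) - Σ (|S_v| / |S|) E(S_v). The Java sketch below
computes both quantities for class labels stored in lists; the small "play"/"outlook" example in
main is hypothetical sample data, not taken from this manual.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EntropyGain {

    // Entropy E(S) = - sum over classes of p * log2(p), computed from class labels.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : labels) counts.merge(c, 1, Integer::sum);
        double e = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / labels.size();
            e -= p * (Math.log(p) / Math.log(2));
        }
        return e;
    }

    // Gain(S, A) = E(S) - sum over values v of A of (|S_v| / |S|) * E(S_v).
    static double gain(List<String> labels, List<String> attribute) {
        Map<String, List<String>> partitions = new HashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            partitions.computeIfAbsent(attribute.get(i), k -> new ArrayList<>()).add(labels.get(i));
        }
        double remainder = 0.0;
        for (List<String> part : partitions.values()) {
            remainder += ((double) part.size() / labels.size()) * entropy(part);
        }
        return entropy(labels) - remainder;
    }

    public static void main(String[] args) {
        // Hypothetical training data: does "outlook" help predict the class "play"?
        List<String> play    = List.of("yes", "yes", "no", "no", "yes", "no");
        List<String> outlook = List.of("sunny", "rain", "sunny", "rain", "rain", "sunny");
        System.out.println("E(S) = " + entropy(play));
        System.out.println("Gain(S, outlook) = " + gain(play, outlook));
    }
}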

6. Procedure/Program:

1. Calculate entropy of the target

2. The dataset is then split on the different attributes. The entropy for each branch
is calculated. Then it is added proportionally, to get total entropy for the split.
The resulting entropy is subtracted from the entropy before the split. The result
is the Information Gain, or decrease in entropy.

3. Choose attribute with the largest information gain as the decision node

4. a) A branch with entropy of 0 is a leaf node.

   b) A branch with entropy more than 0 needs further splitting.

5. The ID3 algorithm is run recursively on the non-leaf branches, until all data is
classified.

7. Results:

Decision Tree to Decision Rules


A decision tree can easily be transformed into a set of rules by mapping from the root node
to the leaf nodes one by one.

8. Conclusion:
The different classification algorithms of data mining were studied and one among them
named decision tree (ID3) algorithm was implemented using JAVA. The need for
classification algorithm was recognized and understood.

9. Viva Questions:

 What are various classification algorithms?


 What is entropy?
 How do you find information gain?

10. References:

 Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd
Edition
 M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson Education

Experiment No. 3
1. Aim: Implementation of ID3 algorithm using WEKA tool.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical algorithms in data mining
 Identify the application of classification algorithm in data mining
3. Outcomes: The learner will be able to
 Assess the strength and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals,
organizations and society
4. Software Required: WEKA tool
5. Theory:
Decision tree learning is a method for assessing the most likely outcome value by taking into
account the known values of the stored data instances. This learning method is among the
most popular of inductive inference algorithms and has been successfully applied in a broad
range of tasks, such as assessing the credit risk of applicants and improving the loyalty of
regular customers.

6. Procedure:

1. Download a dataset for implementation of the ID3 algorithm (.csv or .arff file). Here the
bank-data.csv dataset has been taken for decision tree analysis.

2. Load data in WEKA tool

3. Select the "Classify" tab and click the "Choose" button to select the ID3 classifier

4. Specify the various parameters. These can be specified by clicking in the text box to
the right of the "Choose" button. In this example we accept the default values. The
default version does perform some pruning (using the subtree raising approach), but
does not perform error pruning

5. Under the "Test options" in the main panel we select 10-fold cross-validation as our
evaluation approach. Since we do not have a separate evaluation data set, this is
necessary to get a reasonable idea of the accuracy of the generated model. We now click
"Start" to generate the model.

6. We can view this information in a separate window by right clicking the last result set
(inside the "Result list" panel on the left) and selecting "View in separate window"
from the pop-up menu.

7. WEKA also provides a graphical rendition of the classification tree. This can be
done by right clicking the last result set (as before) and selecting "Visualize tree" from
the pop-up menu.

We will now use our model to classify the new instances. However, in the data
section, the value of the "pep" attribute is "?" (or unknown).

In the main panel, under "Test options" click the "Supplied test set" radio button, and
then click the "Set..." button. This will pop up a window which allows you to open the
file containing test instances.

In this case, we open the file "bank-new.arff" and upon returning to the main window,
we click the "Start" button. This once again generates the model from our training
data, but this time it applies the model to the new unclassified instances in the "bank-
new.arff" file in order to predict the value of the "pep" attribute.

The summary of the results in the right panel does not show any statistics. This is
because in our test instances the value of the class attribute ("pep") was left as "?", thus
WEKA has no actual values to which it can compare the predicted values of new
instances.

The GUI version of WEKA is used to create a file containing all the new instances along
with their predicted class value resulting from the application of the model.

First, right-click the most recent result set in the left "Result list" panel. In the resulting
pop-up window select the menu item "Visualize classifier errors". This brings up a
separate window containing a two-dimensional graph.

8. To save the file: In the new window, we click on the "Save" button and save the result
as the file: "bank-predicted.arff"

This file contains a copy of the new instances along with an additional column for the
predicted value of "pep". The top portion of the file can be seen in below figure.
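
The same workflow can also be scripted through WEKA's Java API instead of the Explorer
GUI. The sketch below is a minimal example, assuming weka.jar is on the classpath and that a
bank.arff file with the class as the last attribute is available; J48 is used here because the Id3
classifier is not bundled with all recent WEKA distributions.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaTreeDemo {
    public static void main(String[] args) throws Exception {
        // Load the training data (the file name is an assumption for this sketch).
        Instances data = new DataSource("bank.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

        // Build the decision tree classifier with default options.
        J48 tree = new J48();
        tree.buildClassifier(data);

        // 10-fold cross-validation, matching the "Test options" used in the Explorer.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(tree);
        System.out.println(eval.toSummaryString());
    }
}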

7. Conclusion:
The different classification algorithms of data mining were studied and one among them,
the decision tree (ID3) algorithm, was implemented using the WEKA tool. The need for a
classification algorithm was recognized and understood.

8. Viva Questions:
 What is the use of WEKA tool?

9. References:

 Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd
Edition
 M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson Education

Experiment No. 4
1. Aim: Implementation of K-means clustering in JAVA.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical clustering algorithms in data mining
 Identify the application of clustering algorithm in data mining
3. Outcomes: The learner will be able to
 Assess the strength and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals,
organizations and society
4. Software Required: JDK for JAVA
5. Theory:
Clustering is dividing data points into homogeneous classes or clusters:

 Points in the same group are as similar as possible


 Points in different group are as dissimilar as possible

When a collection of objects is given, we put the objects into groups based on similarity.
Clustering Algorithms:
A clustering algorithm tries to analyse natural groups of data on the basis of some
similarity. It locates the centroid of the group of data points. To carry out effective
clustering, the algorithm evaluates the distance between each point and the centroid of
the cluster. The goal of clustering is to determine the intrinsic grouping in a set of
unlabelled data.

K-means Clustering
K-means (Macqueen, 1967) is one of the simplest unsupervised learning algorithms
that solve the well-known clustering problem. K-means clustering is a method of
vector quantization, originally from signal processing, that is popular for cluster
analysis in data mining.

6. Procedure:
Input: K number of clusters, D is a data set containing n objects.
Output: A set of k clusters.
1. Arbitrarily choose K objects from D as the initial cluster centres
2. Partition the objects into k non-empty subsets
3. Identifying the cluster centroid (mean point) of the current partition.
4. Assigning each point to a specific cluster
5. Compute the distances from each point and allot points to the cluster where
the distance from the centroid is minimized.
6. After re-allotting the points, find the centroid of the new cluster formed.
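
A compact Java sketch of these steps for one-dimensional data is given below; the data
points, K = 2, and the fixed number of iterations are assumptions for illustration, and a real
implementation would iterate until the assignments stop changing and would handle
multi-dimensional points.

import java.util.Arrays;

public class KMeans1D {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 8.0, 8.5, 9.0};   // hypothetical data set D
        double[] centroids = {points[0], points[1]};          // step 1: arbitrary initial centres
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < 10; iter++) {               // iterate until (almost) stable
            // Steps 4-5: assign each point to the nearest centroid.
            for (int i = 0; i < points.length; i++) {
                assignment[i] = (Math.abs(points[i] - centroids[0])
                        <= Math.abs(points[i] - centroids[1])) ? 0 : 1;
            }
            // Steps 3 and 6: recompute the centroid (mean) of each cluster.
            for (int c = 0; c < 2; c++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) { sum += points[i]; count++; }
                }
                if (count > 0) centroids[c] = sum / count;
            }
        }
        System.out.println("Centroids:   " + Arrays.toString(centroids));
        System.out.println("Assignments: " + Arrays.toString(assignment));
    }
}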
7. Conclusion:
The different clustering algorithms of data mining were studied and one among them
named k-means clustering algorithm was implemented using JAVA. The need for
clustering algorithm was recognized and understood.

8. Viva Questions:
 What are the different clustering techniques?
 What is the difference between K-means and K-medoids?
 What is a dendrogram?

9. References:

 Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd
Edition
 M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson
Education

Experiment No. 5
1. Aim: To implement the K-means clustering algorithm using the WEKA tool.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical clustering algorithms in data mining
 Identify the application of clustering algorithm in data mining
3. Outcomes: The learner will be able to
 Assess the strength and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals,
organizations and society
4. Software Required: WEKA tool
5. Theory:
Weka is a landmark system in the history of the data mining and machine learning
research communities, because it is the only toolkit that has gained such widespread
adoption and survived for an extended period of time.
The key features responsible for Weka's success are:
• It provides many different algorithms for data mining and machine learning.
• It is open source and freely available.
• It is platform-independent.
• It is easily usable by people who are not data mining specialists.
• It provides flexible facilities for scripting experiments.
• It has been kept up-to-date, with new algorithms being added.
WEKA INTERFACE

The GUI Chooser consists of four buttons—one for each of the four major Weka
applications—and four menus. The buttons can be used to start the following applications:

• Explorer: An environment for exploring data with WEKA.


• Experimenter: An environment for performing experiments and conducting statistical tests
between learning schemes.

• KnowledgeFlow: This environment supports essentially the same functions as the Explorer
but with a drag-and-drop interface. One advantage is that it supports incremental learning.

• SimpleCLI: Provides a simple command-line interface that allows direct execution of


WEKA commands for operating systems that do not provide their own command line
interface.
WEKA CLUSTERER
It contains "clusterers" for finding groups of similar instances in a dataset. Some implemented
schemes are: k-Means, EM, Cobweb, X-means, FarthestFirst. Clusters can be visualized and
compared to "true" clusters.

6. Procedure:

The basic steps of k-means clustering are simple. In the beginning, we determine the number of
clusters K and we assume the centroid or centre of these clusters. We can take any random
objects as the initial centroids, or the first K objects can also serve as the initial centroids. Then
the k-means algorithm will repeat the three steps below until convergence, i.e., until it is stable
(no object moves between groups):

1. Determine the centroid coordinate

2. Determine the distance of each object to the centroid

3. Group the object based on minimum distance (find the closest centroid)

K-means in WEKA 3.7

The sample data set used is based on the "bank data" available in comma-separated format
bank-data.csv. The resulting data file is "bank.arff" and includes 600 instances. As an
illustration of performing clustering in WEKA, we will use its implementation of the K-
means algorithm to cluster the customers in this bank data set, and to characterize the
resulting customer segments.

To perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose"
button. This results in a drop down list of available clustering algorithms. In this case we
select "SimpleKMeans".

Next, click on the text box to the right of the "Choose" button to get the pop-up window
shown below, for editing the clustering parameters.

In the pop-up window we enter 2 as the number of clusters and we leave the value of "seed"
as is.

The seed value is used in generating a random number which is, in turn, used for making the
initial assignment of instances to clusters.
Once the options have been specified, we can run the clustering algorithm. Here we make
sure that in the "Cluster Mode" panel, the "Use training set" option is selected, and we click
"Start".

We can right click the result set in the "Result list" panel and view the results of clustering in
a separate window.

We can even visualize the assigned clusters, as shown below.

You can choose the cluster number and any of the other attributes for each of the three
different dimensions available (x-axis, y-axis, and color). Different combinations of choices
will result in a visual rendering of different relationships within each cluster.

Note that in addition to the "instance_number" attribute, WEKA has also added a "Cluster"
attribute to the original data set. In the data portion, each instance now has its assigned cluster
as the last attribute value (as shown below).
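
The same clustering can be scripted through WEKA's Java API; the sketch below mirrors the
Explorer settings used above (SimpleKMeans with 2 clusters and the seed left at its default
value) and assumes weka.jar on the classpath and a bank.arff file in the working directory.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaKMeansDemo {
    public static void main(String[] args) throws Exception {
        // Load the data set (the file name is an assumption for this sketch).
        Instances data = new DataSource("bank.arff").getDataSet();

        // Configure SimpleKMeans as in the Explorer: 2 clusters, seed left as is.
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2);
        kmeans.setSeed(10);
        kmeans.buildClusterer(data);

        System.out.println(kmeans);                       // cluster centroids and sizes
        for (int i = 0; i < data.numInstances(); i++) {
            int cluster = kmeans.clusterInstance(data.instance(i));
            System.out.println("Instance " + i + " -> cluster " + cluster);
        }
    }
}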

7. Conclusion:

The different clustering algorithms of data mining were studied and one among them,
the k-means clustering algorithm, was implemented using the WEKA tool. The need for a
clustering algorithm was recognized and understood.

8. References:
 Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd
Edition
 M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson Education

Experiment No. 6
1. Aim: To study and implement the Apriori algorithm.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical association mining algorithms
 Identify the application of association mining algorithms
3. Outcomes: The learner will be able to
 Assess the strength and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals,
organizations and society
4. Software Required: JDK for JAVA
5. Theory: The Apriori algorithm is a well-known association rule algorithm used in many
commercial products. It uses the large itemset property: any subset of a large itemset must be
large.
6. Procedure:
Input:
I    // set of items
D    // database of transactions
s    // minimum support
Output:
L    // large (frequent) itemsets
Apriori algorithm:
k = 0;
L = {};                     // L collects the large itemsets of every size
C1 = I;                     // initial candidates are the 1-itemsets
repeat
    k = k + 1;
    Lk = {};
    for each Ii in Ck do
        ci = 0;             // initialize the support count of each candidate
    for each transaction tj in D do
        for each Ii in Ck do
            if Ii is contained in tj then
                ci = ci + 1;
    for each Ii in Ck do
        if ci >= (s * |D|) then
            Lk = Lk U {Ii};
    L = L U Lk;
    Ck+1 = Apriori-Gen(Lk);
until Ck+1 = {};
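
Support and confidence, which drive the pruning step above, can be computed directly from a
transaction list. The Java sketch below does this for one candidate rule; the transactions and
the rule {milk} => {bread} are hypothetical examples, not part of any dataset in this manual.

import java.util.List;
import java.util.Set;

public class SupportConfidence {

    // Fraction of transactions that contain every item in 'itemset'.
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("milk", "bread", "butter"),
                Set.of("milk", "bread"),
                Set.of("bread", "jam"),
                Set.of("milk", "butter"));

        Set<String> antecedent = Set.of("milk");
        Set<String> both = Set.of("milk", "bread");

        double suppRule = support(transactions, both);                     // support of {milk, bread}
        double confidence = suppRule / support(transactions, antecedent);  // confidence of milk => bread

        System.out.println("support(milk => bread)    = " + suppRule);
        System.out.println("confidence(milk => bread) = " + confidence);
    }
}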

7. Conclusion:
The different association mining algorithms of data mining were studied and one
among them named Apriori association mining algorithm was implemented using
JAVA. The need for association mining algorithm was recognized and understood.

8. Viva Questions:

 What is support and confidence?


 What are the different types of association mining algorithms?
 What is the disadvantage of Apriori algorithm?

9. References:

 Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd
Edition
 M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson
Education

Experiment No. 7
1. Aim: Implementation of Apriori algorithm in WEKA.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical association mining algorithms
 Identify the application of association mining algorithms
3. Outcomes: The learner will be able to
 Assess the strength and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals,
organizations and society
4. Software Required: WEKA tool
5. Theory: The Apriori Algorithm is an influential algorithm for mining frequent item
sets for Boolean association rules. Some key concepts for Apriori algorithm are:
 Frequent Itemsets: The sets of items which have minimum support (denoted by
Li for the ith itemset).
 Apriori Property: Any subset of a frequent itemset must be frequent.
 Join Operation: To find Lk, a set of candidate k-itemsets is generated by
joining Lk-1 with itself.

6. Procedure:
WEKA implementation:

To learn the system, TEST_ITEM_TRANS.arff has been used.

Using the Apriori Algorithm we want to find the association rules that have
minSupport = 50% and minimum confidence = 50%. We launch the WEKA
application and open the TEST_ITEM_TRANS.arff file, as shown in the figure below.

Then we move to the Associate tab and we set up the configuration as shown below

After the algorithm is finished, we get the following results:


=== Run information ===
Scheme: weka.associations.Apriori -N 20 -T 0 -C 0.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c
-1
Relation: TEST_ITEM_TRANS
Instances: 15
Attributes: 8

A B C D E F G H

=== Associator model (full training set) ===

Apriori
=======
Minimum support: 0.5 (7 instances)
Minimum metric: 0.5
Number of cycles performed: 10

Generated sets of large itemsets:
Size of set of large itemsets L(1): 10
Size of set of large itemsets L(2): 12
Size of set of large itemsets L(3): 3

Best rules found:
1. E=TRUE 11 ==> H=TRUE 11 conf:(1)
2. B=TRUE 10 ==> H=TRUE 10 conf:(1)
3. C=TRUE 10 ==> H=TRUE 10 conf:(1)
4. A=TRUE 9 ==> H=TRUE 9 conf:(1)
5. G=FALSE 9 ==> H=TRUE 9 conf:(1)
6. D=TRUE 8 ==> H=TRUE 8 conf:(1)
7. F=FALSE 8 ==> H=TRUE 8 conf:(1)
8. D=FALSE 7 ==> H=TRUE 7 conf:(1)
9. F=TRUE 7 ==> H=TRUE 7 conf:(1)
10. B=TRUE E=TRUE 7 ==> H=TRUE 7 conf:(1)
11. C=TRUE G=FALSE 7 ==> H=TRUE 7 conf:(1)
12. E=TRUE G=FALSE 7 ==> H=TRUE 7 conf:(1)
13. G=FALSE 9 ==> C=TRUE 7 conf:(0.78)
14. G=FALSE 9 ==> E=TRUE 7 conf:(0.78)
15. G=FALSE H=TRUE 9 ==> C=TRUE 7 conf:(0.78)
16. G=FALSE 9 ==> C=TRUE H=TRUE 7 conf:(0.78)
17. G=FALSE H=TRUE 9 ==> E=TRUE 7 conf:(0.78)
18. G=FALSE 9 ==> E=TRUE H=TRUE 7 conf:(0.78)
19. H=TRUE 15 ==> E=TRUE 11 conf:(0.73)
20. B=TRUE 10 ==> E=TRUE 7 conf:(0.7)
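
The same associator can be run from Java code rather than the Associate tab. The sketch
below is a minimal example, assuming weka.jar on the classpath and the
TEST_ITEM_TRANS.arff file in the working directory, with the support and confidence
thresholds set to the 50% values used above.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaAprioriDemo {
    public static void main(String[] args) throws Exception {
        // Load the transaction data (the file name as used in this experiment).
        Instances data = new DataSource("TEST_ITEM_TRANS.arff").getDataSet();

        // Configure Apriori: minimum support 0.5, minimum confidence 0.5, up to 20 rules.
        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.5);
        apriori.setMinMetric(0.5);
        apriori.setNumRules(20);

        apriori.buildAssociations(data);
        System.out.println(apriori);   // prints the large itemsets and the best rules found
    }
}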

7. Conclusion:
The different association mining algorithms of data mining were studied and one
among them, the Apriori association mining algorithm, was implemented using the
WEKA tool. The need for an association mining algorithm was recognized and understood.

8. References:

 Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd
Edition
 M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson
Education

Experiment No. 8

1. Aim: Study of R Tool.


2. Objectives: From this experiment, the student will be able to
 Learn basics of mining tool
 Create web page for mobile shopping using editor tool
 Study the methodology of engineering legacy of data mining
3. Outcomes: The learner will be able to
 Use current techniques, skills and tools for mining.
 Engage them in life-long learning.
 Able to match industry requirements in domains of data mining
4. Software Required: R tool
5. Theory:
The R tool is a programming "environment", object-oriented and similar to S-Plus, freeware that
provides calculations on matrices, excellent graphics capabilities, and is supported by a large
user network.

Installing R:
1) Download from CRAN
2) Select a download site
3) Download the base package at a minimum
4) Download contributed packages as needed
R Basics / Components Of R:

 Objects
 Naming convention
 Assignment
 Functions
 Workspace
 History
Objects

 names
 types of objects: vector, factor, array, matrix, data.frame, ts, list
 attributes
o mode: numeric, character, complex, logical
o length: number of elements in object
 creation
o assign a value
o create a blank object

Naming Convention

 must start with a letter (A-Z or a-z)


 can contain letters, digits (0-9), and/or periods "."
 case-sensitive
o e.g. mydata is different from MyData
 do not use the underscore "_"
Assignment

 ―<-‖ used to indicate assignment


o egs. <-c(1,2,3,4,5,6,7)
<-c(1:7)
<-1:4

Functions

 actions can be performed on objects using functions (note: a function is itself an


object)
 have arguments and options, often there are defaults
 provide a result
 parentheses () are used to specify that a function is being called.
Workspace

 during an R session, all objects are stored in a temporary, working memory


 list objects
o ls()
 remove objects
o rm()
 objects that you want to access later must be saved in a "workspace"
o from the menu bar: File -> Save Workspace
o from the command line: save(x, file="MyData.Rdata")
History

 command line history


 can be saved, loaded, or displayed
o savehistory(file="MyData.Rhistory")
o loadhistory(file="MyData.Rhistory")
o history(max.show=Inf)
 during a session you can use the arrow keys to review the command history
Two most common object types for statistics:
A. Matrix
a matrix is a vector with an additional attribute (dim) that defines the number of columns
and rows; only one mode (numeric, character, complex, or logical) is allowed; a matrix can be
created using matrix()
x <- matrix(data=0, nr=2, nc=2)
or
o x<-matrix(0,2,2)
B. Data Frame
several modes are allowed within a single data frame; it can be created using data.frame()
L<-LETTERS[1:4] #A B C D
x<-1:4 #1 2 3 4
data.frame(x,L) #create data frame
attach() and detach()
o the database is attached to the R search path so that the database is searched by
R when it is evaluating a variable.
o objects in the database can be accessed by simply giving their names
Data Elements:

 select only one element


e.g. x[2]
 select a range of elements
e.g. x[1:3]
 select all but one element
e.g. x[-3]
 slicing: including only part of the object
e.g. x[c(1,2,5)]
 select elements based on a logical operator
e.g. x[x > 3]

Data Import & Entry:


Importing Data

 read.table(): reads in data from an external file


 data.entry(): create object first, then enter data
 c(): concatenate
 scan(): prompted data entry
 R has ODBC for connecting to other programs.
Data entry & editing

 start editor and save changes


o data.entry(x)
 start editor, changes not saved
o de(x)
 start text editor
o edit(x)
Useful Functions
 length(object) # number of elements or components
 str(object) # structure of an object
 class(object) # class or type of an object

 names(object) # names
 c(object,object,...) # combine objects into a vector
 cbind(object, object, ...) # combine objects as columns
 rbind(object, object, ...) # combine objects as rows
 ls() # list current objects
 rm(object) # delete an object
 newobject <- edit(object) # edit a copy and save it as newobject
 fix(object) # edit an object and save the changes in place
Exporting Data

 To A Tab Delimited Text File


o write.table(mydata, "c:/mydata.txt", sep="\t")
 To an Excel Spreadsheet
o library(xlsReadWrite)
write.xls(mydata, "c:/mydata.xls")
 To SAS
o library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")
Viewing Data
There are a number of functions for listing the contents of an object or dataset:
•list objects in the working environment: ls()
•list the variables in mydata: names(mydata)
•list the structure of mydata: str(mydata)
•list levels of factor v1 in mydata: levels(mydata$v1)
•dimensions of an object: dim(object)
•class of an object (numeric, matrix, dataframe, etc): class(object)
•print mydata :mydata
•print first 10 rows of mydata: head(mydata, n=10)
•print last 5 rows of mydata: tail(mydata, n=5)

Interfacing with R

 CSV Files
 Excel Files
 Binary Files
 XML Files
 JSON Files
 Web data
 Database
We can also create

 pie charts
 bar charts
 box plots
 histograms
 line graphs
 scatterplots
Data Types in R Tool

 Vectors
 Lists
 Matrices
 Arrays
 Factors
 Data Frames
Input

 The source( ) function runs a script in the current session.


 If the filename does not include a path, the file is taken from the current working
directory.
 #input a script
source("myfile")

Output

 The sink( ) function defines the direction of the output.


o # direct output to a file
 sink("myfile", append=FALSE, split=FALSE)
o # return output to the terminal
 sink()
 The append option controls whether output overwrites or adds to a file.
 The split option determines if output is also sent to the screen as well as the output
file.
Creating new variables

 Use the assignment operator <- to create new variables. A wide array of operators and
functions are available here.
# Three examples for doing the same computations
1. mydata$sum<- mydata$x1 + mydata$x2
mydata$mean<- (mydata$x1 + mydata$x2)/2
2. attach(mydata)
mydata$sum<- x1 + x2
mydata$mean<- (x1 + x2)/2
detach(mydata)

3. mydata<- transform( mydata,sum = x1 + x2,
mean = (x1 + x2)/2 )

Renaming variables

 You can rename variables programmatically or interactively.


o # rename interactively
o fix(mydata) # results are saved on close
o # rename programmatically
 library(reshape)
 mydata<- rename(mydata, c(oldname="newname"))
Sorting

 To sort a data frame in R, use the order( ) function.


 By default, sorting is ASCENDING.
 Prepend the sorting variable by a minus sign to indicate DESCENDING order.
Merging

 To merge two data frames (datasets) horizontally, use the merge function.
 In most cases, you join two data frames by one or more common key variables (i.e.,
an inner join).
Examples:
# merge two dataframes by ID -- total <- merge(dataframeA,dataframeB,by="ID")
# merge two dataframes by ID and Country --
total<- merge(dataframeA,dataframeB,by=c("ID","Country"))

6. Conclusion:

The R tool, a free software environment for statistical computing and graphics, was studied. Using
the R tool, various data mining algorithms were implemented. R and its packages, functions and
task views for the data mining process, and popular data mining techniques were learnt.

7. Viva Questions:
 How is the R tool used for mining big data?

8. References:
 Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd
Edition
 M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson
Education

Experiment No. 9
1. Aim: Study of Business Intelligence tools such as SPSS Clementine, XLMiner, etc.
2. Objectives: From this experiment, the student will be able to
 Learn basics of business intelligence
 Create web page for mobile shopping using editor tool
 Study the methodology of engineering legacy of data mining
3. Outcomes: The learner will be able to
 Use current techniques, skills and tools for mining.
 Engage them in life-long learning.
 Able to match industry requirements in domains of data mining
4. Software Required: BI tool - SPSS Clementine
5. Theory:
IBM SPSS Modeler is a data mining and text analytics software application built by IBM.
It is used to build predictive models and conduct other analytic tasks. It has a visual
interface which allows users to leverage statistical and data mining algorithms without
programming. IBM SPSS Modeler was originally named Clementine by its creators.
Applications:
SPSS Modeler has been used in these and other industries:
• Customer analytics and Customer relationship management (CRM)
• Fraud detection and prevention
• Optimizing insurance claims
• Risk management
• Manufacturing quality improvement
• Healthcare quality improvement
• Forecasting demand or sales
• Law enforcement and border security
• Education
• Telecommunications
• Entertainment: e.g., predicting movie box office receipts

SPSS is available in two separate bundles of features called editions.


1. SPSS Modeler Professional
2. SPSS Modeler Premium
The Premium edition additionally includes:
o text analytics
o entity analytics
o social network analysis
Both the editions are available in desktop and server configurations.
Earlier it was Unix-based and designed as a consulting tool, not for sale to
customers. It was originally developed by a UK company called Integral Solutions in
collaboration with Artificial Intelligence researchers at Sussex University. It mainly uses
two of the Poplog languages, Pop-11 and Prolog. It was the first data mining tool to use an
icon-based graphical user interface rather than requiring users to write code in a programming
language. Clementine is data mining software for business solutions.

The previous version had a stand-alone application architecture, while the new version has a
distributed architecture.

Fig. Previous version (stand alone)

Distributed Architecture

Fig. New version (Distributed architecture)

Multiple model building techniques in Clementine:

 Rule Induction
 Graph
 Clustering
 Association Rules
 Linear Regression
 Neural Networks
Functionalities:

 Classification: Rule Induction, neural Networks


 Association: Rule Induction, Apriori
 Clustering: Kohonen Networks, Rule Induction
 Sequence: Rule Induction, Neural Networks, Linear Regression
 Prediction: Rule Induction, Neural Networks
Applications:

 Predict market share


 Detect possible fraud
 Locate new retail sites
 Assess financial risk
 Analyze demographic trends and patterns

6. Conclusion:

IBM SPSS Modeler, a data mining and text analytics software application, was studied. It was
understood that it has a visual interface which allows users to leverage statistical and data
mining algorithms without programming.

7. Viva Questions:

 What are the functionalities of SPSS Clementine?

8. References:

 Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd
Edition
 M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson
Education

Experiment No. 10
1. Aim: One case study on Data warehouse System.
A. Write a detailed problem statement and create a dimensional model (star
and snowflake schema); implementation of dimensional modeling.
B. Implementation of all dimension tables and the fact table.
C. Implementation of OLAP operations.
2. Objectives: From this experiment, the student will be able to
 Understand the basics of Data Warehouse
 Understand the design model of Data Warehouse
 Study methodology of engineering legacy databases for data warehousing
3. Outcomes: The learner will be able to
 Apply knowledge of legacy databases in creating data warehouse
 Understand, identify, analyse and design the warehouse
 Use current techniques, skills and tools necessary for designing a data
warehouse
4. Software Required: Oracle 11g
5. Theory:
In computing, online analytical processing, or OLAP, is an approach to answering multi-
dimensional analytical (MDA) queries swiftly. OLAP is part of the broader category
of business intelligence, which also encompasses relational databases, report writing and data
mining. Typical applications of OLAP include business reporting for sales, marketing,
management reporting, business process management (BPM), budgeting and similar areas,
with new applications coming up, such as agriculture. The term OLAP was created as a slight
modification of the traditional database term online transaction processing (OLTP).

Dimensional modelling:
Dimensional modeling (DM) names a set of techniques and concepts used in data warehouse
design. It is considered to be different from Entity-Relationship (ER) modeling. Dimensional
modeling does not necessarily involve a relational database. The same modeling approach, at
the logical level, can be used for any physical form, such as a multidimensional database or
even flat files. DM is a design technique for databases intended to support end-user queries in
a data warehouse. It is oriented around understandability and performance.

Star Schema
- The fact table is in the middle and the dimension tables are arranged around the fact table.

Snowflake Schema
Normalization and expansion of the dimension tables in a star schema result in the
implementation of a snowflake design.
Snowflaking in the dimensional model can impact the understandability of the
dimensional model and result in a decrease in performance, because more tables will
need to be joined to satisfy queries.

6. Conclusion:

We have studied different schemas of a data warehouse, and using the methodology of
engineering legacy databases, a new data warehouse was built. Normalization was applied
wherever required on the star schema, and a snowflake schema was designed.

7. Viva Questions:
 What is data warehouse?
 What is multi-dimensional data?
 What is the difference between the star and snowflake schemas?

8. References:
 Paulraj Ponniah, "Data Warehousing: Fundamentals for IT Professionals", Wiley
India
 Reema Thareja, "Data Warehousing", Oxford University Press

Experiment No. 11
1. Aim: Study different OLAP operations
2. Objectives: From this experiment, the student will be able to
 Discover patterns from data warehouse
 Perform online analytical processing of data
 Obtain knowledge from data warehouse
3. Outcomes: The learner will be able to
 Recognize the need for online analytical processing.
 Identify, formulate and solve engineering problems.
 Match industry requirements in the domain of data warehousing.
4. Theory:
Following are the different OLAP operations
 Roll up (drill-up): summarize data
o by climbing up hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
o from higher level summary to lower level summary or detailed data, or
introducing new dimensions
 Slice and dice:
o project and select
 Pivot (rotate):
o reorient the cube, visualization, 3D to series of 2D planes

Fig. Fact table view and the corresponding multi-dimensional cube (dimension = 3)

Fig. Example - cube aggregation (roll up and drill down)

Fig. Example - slicing

Fig. Example - slicing and pivoting
5. Conclusion:

OLAP, which performs multidimensional analysis of business data and provides the
capability for complex calculations, trend analysis, and sophisticated data modelling, was
studied.

6. Viva Questions:

 What are OLAP operations?


 What is the difference between OLTP and OLAP?
 What is the difference between slicing and dicing?

7. References:

 Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd
Edition
 M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson
Education
