
Laboratory File

Data Mining and Business Intelligence

For 4th Year Students, CSE
Dept.: Computer Science Engineering
ETCS-457

Submitted To: Ms. Divya
Submitted By: Lakshay Bisht
Roll No.: 00576802719
CSE-III
INDEX

S No. | Name of Experiment | Date of Experiment | Date of Submission | Remarks | Teacher's Signature
1. | Introduction to WEKA. | 19-10-22 | 18-01-23 | |
2. | To study about ETL process & its tools. | 02-11-22 | 18-01-23 | |
3. | To create an .arff file. | 09-11-22 | 18-01-23 | |
4. | Implementation of Classification technique on ARFF files using WEKA. | 16-11-22 | 18-01-23 | |
5. | Implementation of Clustering technique on ARFF files using WEKA. | 07-12-22 | 18-01-23 | |
6. | Implementation of Association Rule technique on ARFF files using WEKA. | 14-12-22 | 18-01-23 | |
7. | To explore 2 graphs, view their ARFF files and apply an algorithm on them. | 21-12-22 | 18-01-23 | |
8. | To use the Numeric Transform filter and floor function to obtain precision up to the same value. | 28-12-22 | 18-01-23 | |
EXPERIMENT-1

AIM:
Introduction to WEKA

THEORY:
Introduction:

Named after a flightless New Zealand bird, Weka is a set of machine learning algorithms
that can be applied to a data set directly, or called from your own Java code. Weka
contains tools for data pre-processing, classification, regression, clustering, association
rules, and visualization.
Machine learning is a branch of artificial intelligence that enables computers to learn from data without being explicitly programmed. Machine learning systems crawl through the data to find patterns and, when these are found, adjust the program's actions accordingly.

Data mining analyses data from different perspectives and summarizes it into useful information. Data mining is closely related to machine learning; the difference is that data mining systems extract patterns from data for human comprehension.

About:
Weka is data mining software that uses a collection of machine learning algorithms.
These algorithms can be applied directly to the data or called from the Java code.
Weka is a collection of tools for:

1. Regression
2. Clustering
3. Association
4. Data pre-processing
5. Classification
6. Visualization
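
As noted above, Weka's algorithms can also be called from your own Java code. The listing below is a minimal, illustrative sketch (not part of the original lab record) that loads an ARFF file and builds a J48 decision tree through the Weka Java API; the file name iris.arff is only an assumed example.

// Minimal sketch: calling Weka from Java code (assumes weka.jar is on the classpath)
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;

public class WekaFromJava {
    public static void main(String[] args) throws Exception {
        // Load a dataset in ARFF format (iris.arff is an assumed example file)
        Instances data = DataSource.read("iris.arff");
        // Assume the last attribute is the class attribute
        data.setClassIndex(data.numAttributes() - 1);
        // Build a J48 decision tree on the full dataset
        J48 tree = new J48();
        tree.buildClassifier(data);
        // Print the learned tree in text form
        System.out.println(tree);
    }
}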

Features of Weka:
 Free availability under the GNU General Public License
 Portability, since it is fully implemented in Java and runs on almost any platform
 A comprehensive collection of data pre-processing and modelling techniques
 Ease of use through its graphical user interfaces

Weka’s application interfaces:
 Explorer – the main graphical interface for exploring data and applying algorithms
 Experimenter – for setting up and comparing experiments across algorithms and data sets
 KnowledgeFlow – a drag-and-drop interface for designing data-flow layouts
 Simple CLI – a command-line interface to Weka’s functionality
Installation of Weka:
You can download Weka from the official website http://www.cs.waikato.ac.nz/ml/weka/.
Once the download is completed, run the exe file and choose the default set-up.
To set the Weka environment variables for Java, execute the following commands at the command prompt (csh/tcsh syntax):
setenv WEKAHOME /usr/local/weka/weka-3-0-2
setenv CLASSPATH $WEKAHOME/weka.jar:$CLASSPATH

Weka data formats:


Weka uses the Attribute-Relation File Format (ARFF) for data analysis by default, but data can also be imported from the formats listed below:
 CSV (Comma Separated Values)
 ARFF
 Database using ODBC

Weka Explorer:
The Weka Explorer contains a total of six tabs.
1. Preprocess:- This allows us to choose the data file.

2. Classify:- This allows us to apply and experiment with different algorithms on preprocessed data files.

3. Cluster:- This allows us to apply different clustering tools, which identify clusters within
the data file.

4. Association:- This allows us to apply association rules, which identify the association
within the data.

5. Select attributes:- This allows us to see the effect of including or excluding attributes from the experiment.

6. Visualize:- This allows us to visualize the data set in a 2D format, as scatter plots and bar graphs.

Pre-processing:
Data pre-processing is a must. There are three ways to load data for pre-processing:
• Open File – enables the user to select a file from the local machine
• Open URL – enables the user to load a data file from a URL
• Open DB – enables the user to retrieve a data file from a database source

Classification:

Weka provides classifiers to predict nominal or numeric quantities. Available learning schemes include decision trees and lists, support vector machines, instance-based classifiers, logistic regression and Bayes nets.

Clustering:

The cluster tab enables the user to identify similarities or groups of occurrences within
the data set. Clustering can provide data for the user to analyse.

Association:

The only available scheme for association in Weka is the Apriori algorithm. It identifies statistical dependencies between groups of attributes and only works with discrete data. The Apriori algorithm computes all the rules that have minimum support and exceed a given confidence level.
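
For reference (added here for clarity, using the standard textbook definitions rather than anything from the original file), the two measures are:

support(X => Y) = (number of transactions containing both X and Y) / (total number of transactions)
confidence(X => Y) = support(X => Y) / support(X)

Apriori keeps only those rules whose support is at least the chosen minimum support and whose confidence exceeds the chosen confidence threshold.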

Attribute selection:

Attribute selection searches through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. The attribute selection method contains two parts.
 Search method: Best-first, forward selection, random, exhaustive, genetic algorithm,
ranking algorithm
 Evaluation method: Correlation-based, wrapper, information gain, chi-squared
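
As an illustration only (not from the original record), the same kind of attribute selection can be run from Java; CfsSubsetEval (a correlation-based evaluator) and BestFirst (a best-first search) are assumed choices, and iris.arff is an assumed example file.

// Illustrative sketch: correlation-based attribute selection with best-first search
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSelectionSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");    // assumed example file
        data.setClassIndex(data.numAttributes() - 1);
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());        // evaluation method
        selector.setSearch(new BestFirst());               // search method
        selector.SelectAttributes(data);
        System.out.println(selector.toResultsString());    // selected attribute subset
    }
}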

Visualization:

The user can see the final piece of the puzzle, derived throughout the process. It allows users
to visualise a 2D representation of data, and is used to determine the difficulty of the learning
problem. We can visualise single attributes (1D) and pairs of attributes (2D), and rotate 3D
visualisations in Weka.

EXPERIMENT-2

AIM:
To study about ETL process & its tools.

THEORY:
ETL is a process that extracts the data from different source systems, then transforms the
data (like applying calculations, concatenations, etc.) and finally loads the data into the
Data Warehouse system. The full form of ETL is Extract, Transform and Load. It is tempting to think that creating a Data Warehouse is simply a matter of extracting data from multiple sources and loading it into the database of a Data Warehouse. This is far from the truth; it requires a complex ETL process.

Need for ETL:


There are many reasons for adopting ETL in an organization:
 It helps companies to analyze their business data for taking critical business decisions.
 Transactional databases cannot answer the complex business questions that can be answered with ETL.
 A Data Warehouse provides a common data repository.
 ETL provides a method of moving data from various sources into a Data Warehouse.
 As data sources change, the Data Warehouse will automatically update.
 A well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
 It allows verification of data transformation, aggregation and calculation rules.
 The ETL process allows sample data comparison between the source and the target system.
 The ETL process can perform complex transformations, though it requires an extra staging area to store the data.

ETL Process in Data Warehouses:

ETL is a 3-step process. The steps are given below:


1. Extraction:

In this step, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Hence one needs a logical data map before data is extracted and loaded physically. This data map describes the relationship between the source and target data.
Three Data Extraction methods:
1. Full Extraction
2. Partial Extraction- without update notification.
3. Partial Extraction- with update notification.

2. Transformation:

Data extracted from the source server is raw and not usable in its original form. Therefore it needs to be cleansed, mapped and transformed. In this step, you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data. The following are common data integrity problems:
1. Different spelling of the same person like Jon, John, etc.
2. There are multiple ways to denote company name like Google, Google Inc.
3. Use of different names like Cleaveland, Cleveland.
4. There may be a case that different account numbers are generated by various
applications for the same customer.
5. In some data, required fields remain blank.
6. Invalid product codes collected at the point of sale (POS), as manual entry can lead to mistakes.
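
As a purely illustrative sketch (not from the original file), a transformation step of this kind can be expressed in Java as a simple cleansing function; the lookup table below is an assumption based on the spelling examples above.

// Illustrative cleansing step for the Transformation phase
import java.util.HashMap;
import java.util.Map;

public class TransformSketch {
    // Assumed lookup table that standardizes known spelling/name variants
    private static final Map<String, String> STANDARD_NAMES = new HashMap<>();
    static {
        STANDARD_NAMES.put("Jon", "John");
        STANDARD_NAMES.put("Cleaveland", "Cleveland");
        STANDARD_NAMES.put("Google Inc.", "Google");
    }

    // Returns the standardized value, or the input unchanged (pass-through data)
    static String cleanse(String value) {
        String trimmed = value.trim();
        return STANDARD_NAMES.getOrDefault(trimmed, trimmed);
    }

    public static void main(String[] args) {
        System.out.println(cleanse("Cleaveland"));   // prints Cleveland
        System.out.println(cleanse("Google Inc.")); // prints Google
    }
}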

3. Loading:
Loading data into the target data warehouse database is the last step of the ETL process. In a typical Data Warehouse, a huge volume of data needs to be loaded in a relatively short period (overnight). Hence, the load process should be optimized for performance.

Types of Loading:
 Initial Load — populating all the Data Warehouse tables
 Incremental Load — applying ongoing changes periodically, as and when needed.
 Full Refresh —erasing the contents of one or more tables and reloading with fresh data.

Load verification:
 Ensure that the key field data is neither missing nor null.
 Test modeling views based on the target tables.
 Check that combined values and calculated measures are correct.
 Perform data checks in the dimension tables as well as the history tables.
 Check the BI reports on the loaded fact and dimension tables.

ETL tools:
Many Data Warehousing tools are available in the market. Here are some of the most prominent ones.

MarkLogic:
MarkLogic is a data warehousing solution which makes data integration easier and faster
using an array of enterprise features. It can query different types of data like documents,
relationships, and metadata. https://developer.marklogic.com/products/

Oracle:
Oracle is the industry-leading database. It offers a wide range of Data Warehouse solutions for both on-premises and cloud deployments. It helps to optimize customer experiences by increasing operational efficiency.
Features:
• Distributes data in the same way across disks to offer uniform performance
• Works for single-instance and real application clusters
• Offers real application testing

Amazon Redshift:
Amazon Redshift is a data warehouse tool. It is a simple and cost-effective tool for analyzing all types of data using standard SQL and existing BI tools. It also allows running complex queries against petabytes of structured data.
Features:
• No up-front costs for its installation
• It allows automating most of the common administrative tasks to monitor, manage and scale your data warehouse
• It is possible to change the number or type of nodes.

EXPERIMENT-3

AIM:
To create an .arff file.

THEORY:
The Attribute-Relation File Format (ARFF) has two parts:
1. The header section defines the relation (data set) name, the attribute names and their types.
2. The data section lists the data instances.
An ARFF file requires the declaration of the relation, attributes and data.
 @relation:- This is the first line in any ARFF file, written in the header section, followed by the relation/data set name. The relation name must be a string and, if it contains spaces, it should be enclosed in quotes.
 @attribute:- Attributes are declared with their names and their type or range in the header section. Weka supports the following data types for attributes:
 numeric
 <nominal-specification>
 string
 date
 @data:- This declaration starts the data section, which is followed by the list of data instances.

Procedure:
To create an .arff file the steps involved are:
1. Open Notepad or any text editor.
2. Make a relation Employee.
3. Give it attributes such as name, age and gender.
4. Give values to the attributes.
5. Once everything is done, save the file as “filename.arff” and click OK.
6. Open WEKA and load the .arff file created.
7. Go to Classify and click on the cross-validation radio button.
8. Set the value of Folds equal to the number of instances and press Start.
9. Perform the same actions in all tabs.

Code:
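
The original file shows only a screenshot here. The listing below is a hedged reconstruction of what the employee .arff file could look like, assuming the name, age and gender attributes described in the procedure; the data rows are purely illustrative values.

@relation Employee

@attribute name string
@attribute age numeric
@attribute gender {male, female}

@data
'Amit', 25, male
'Priya', 30, female
'Rahul', 28, male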

Classifier Output:

Clustered Output:

Visualization:

EXPERIMENT-4

AIM:
Implementation of Classification technique on ARFF files using WEKA.

THEORY:
This experiment illustrates the use of the J48 classifier in Weka. The sample data set used in this experiment is the “student” data available in .arff format. This document assumes that appropriate data pre-processing has been performed.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data (employee1.arff) into Weka.

Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “J48” classifier.

Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example, we accept the default values. The default version does perform some pruning but does not perform error pruning.

Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.

Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the right panel when the model construction is complete.

Step 6: Note that the classification accuracy of the model is about 78.57%. This indicates that more work may be needed (either in preprocessing or in selecting better parameters for the classification).

Step 7: Weka also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting “Visualize tree” from the pop-up menu.

Step 8: We will use our model to classify new instances.

Step 9: In the main panel, under “Test options”, click the “Supplied test set” radio button and then click the “Set” button. This will pop up a window that allows you to open the file containing the test instances.
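
The same workflow can also be scripted. The sketch below (an assumption for illustration, not the original code) uses the Weka Java API to load employee1.arff, build a J48 tree and evaluate it with 10-fold cross-validation.

// Illustrative sketch of the Experiment-4 workflow via the Weka Java API
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Experiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("employee1.arff");      // data set named in Step 1
        data.setClassIndex(data.numAttributes() - 1);            // assume the last attribute is the class
        J48 tree = new J48();                                    // default J48: pruned tree
        tree.buildClassifier(data);
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));  // 10-fold cross-validation
        System.out.println(tree);                                // ASCII version of the tree
        System.out.println(eval.toSummaryString());              // accuracy and other statistics
    }
}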

Code:

Classification Window:

Classification Tree:

EXPERIMENT-5

AIM:
Implementation of Clustering technique on ARFF files using WEKA.

THEORY:
This experiment illustrates the use of simple k-mean clustering with Weka explorer. The
sample data set used for this example is based on the iris data available in ARFF format. This
document assumes that appropriate preprocessing has been performed. This iris dataset
includes 150 instances.
Steps involved in this Experiment:
Step 1: Run the Weka Explorer and load the data file iris.arff in the preprocessing interface.
Step 2: In order to perform clustering, select the ‘Cluster’ tab in the Explorer and click on the Choose button. This step results in a dropdown list of available clustering algorithms.
Step 3: In this case we select ‘SimpleKMeans’.
Step 4: Next, click on the text box to the right of the Choose button to get the popup window shown in the screenshots. In this window we enter six as the number of clusters and leave the value of the seed as it is. The seed value is used in generating a random number, which is used for making the initial assignment of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm. In the ‘Cluster mode’ panel we make sure that the ‘Use training set’ option is selected, and then we click the ‘Start’ button. This process and the resulting window are shown in the following screenshots.
Step 6: The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters. The centroids are the mean vectors of each cluster and can be used to characterize the clusters. For example, the centroid of cluster 1 (class Iris-versicolor) shows a mean sepal length of 5.4706, sepal width of 2.4765, petal length of 3.7941 and petal width of 1.1294.
Step 7: Another way of understanding the characteristics of each cluster is through visualization. To do this, right-click the result set in the result list panel and select ‘Visualize cluster assignments’.
Step 8: From the above visualization, we can understand the distribution of sepal length and petal length in each cluster. By changing the colour dimension to other attributes, we can see their distribution within each of the clusters. We can also save the resulting data set, which includes each instance along with its assigned cluster. To do so, we click the Save button in the visualization window and save the result as iris-k-means. The top portion of this file is shown in the following figure.
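
As a hedged supplement (not part of the original record), the same clustering can be run programmatically; removing the class attribute first is an assumed choice so that only the four numeric measurements are clustered.

// Illustrative sketch of the Experiment-5 workflow via the Weka Java API
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");
        // Drop the class attribute (assumed to be the last one)
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances input = Filter.useFilter(data, remove);
        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(6);        // six clusters, as entered in Step 4
        kMeans.setSeed(10);              // default seed value left as it is
        kMeans.buildClusterer(input);
        System.out.println(kMeans);      // prints the centroid of each cluster
    }
}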

The following screenshot shows the clustering output that was generated when the SimpleKMeans algorithm was applied to the given dataset.

Visualization:-

EXPERIMENT-6

AIM:
Implementation of Association Rule technique on ARFF files using WEKA.

THEORY:
This experiment illustrates some of the basic elements of association rule mining using WEKA. The sample dataset used for this example is contactlenses.arff. This document assumes that appropriate preprocessing has been performed.

Step 1: Open the data file in the Weka Explorer. It is presumed that the required data fields have been discretized; in this example it is the age attribute.

Step 2: Clicking on the Associate tab will bring up the interface for the association rule algorithms.

Step 3: We will use the Apriori algorithm. This is the default algorithm.

Step 4: In order to change the parameters for the run (e.g. support, confidence, etc.), we click on the text box immediately to the right of the Choose button.

The following screenshot shows the association rules that were generated when the Apriori algorithm was applied to the given dataset.

Dataset contactlenses.arff
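
Since the screenshots are not reproduced here, the following is a hedged Java sketch of the same association-rule run; the default Apriori settings and the file name given above are assumed.

// Illustrative sketch of the Experiment-6 workflow via the Weka Java API
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriExperiment {
    public static void main(String[] args) throws Exception {
        // contactlenses.arff as named above; all attributes are nominal/discretized
        Instances data = DataSource.read("contactlenses.arff");
        Apriori apriori = new Apriori();          // default parameters
        // Parameters such as support and confidence could be changed here, e.g.:
        // apriori.setLowerBoundMinSupport(0.1);
        // apriori.setMinMetric(0.9);             // minimum confidence
        apriori.buildAssociations(data);
        System.out.println(apriori);              // prints the best rules found
    }
}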

EXPERIMENT-7

AIM:
To explore 2 graphs, view their ARFF files and apply an algorithm on them.

THEORY:
We will use weather.arff and weather.nominal.arff files for the following experiment and
use a classification algorithm on them.

Steps for exploring a graph:

1. Open Weka
2. Go to Applications then Explorer
3. In Pre-processor click on ‘open file’
4. Set path to ‘C:\Program Files\Weka-3-8-4\data’ then select the file you want to view the
graph for
5. Click on ‘Visualize All’

Fig: weather.arff file in Weka

Output: Graph Visualization of two files:

Fig: Graph for weather.arff
Fig: Graph for weather.nominal.arff

Steps for viewing the arff file’s database:

1. Go to Tools and click on ArffViewer.
2. Go to File and open the file you want to view from the data folder.
3. Choose weather.arff & weather.nominal.arff.

Output: Database of two arff files:

Fig: Database View of weather.arff

Fig: Database View of weather.nominal.arff

Steps for applying an algorithm to the arff file:

1. Go to Classify
2. Choose the OneR algorithm (with the default bucket size option -B 6) and click Start to apply it.
3. Right-click on the result list.
4. Click on ‘Visualize Cost Curve’ and choose ‘yes’.
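
As an illustrative supplement (not the original screenshots), the OneR run can also be scripted; weather.nominal.arff and the default bucket size of 6 are assumed here.

// Illustrative sketch of the Experiment-7 classification step via the Weka Java API
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);   // 'play' is the last attribute
        OneR oneR = new OneR();
        oneR.setMinBucketSize(6);                       // the -B 6 option referred to above
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(oneR, data, 10, new Random(1));
        oneR.buildClassifier(data);                     // build on the full set to print the rule
        System.out.println(oneR);                       // the single rule found by OneR
        System.out.println(eval.toSummaryString());
    }
}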

Output: Classifier Output of two arff files:

Classifier Output of weather.arff:

Fig: Cost Curve of weather.arff

Classifier Output of weather.nominal.arff:

Fig: Cost Curve of weather.nominal.arff

EXPERIMENT-8

AIM:
To use the NumericTransform filter and the floor function to obtain precision up to the same value (i.e. round all numeric attribute values down to whole numbers).
THEORY:
Steps to be followed:

1. Open segment-challenge.arff in the explorer in weka.

2. Then select the choose option after opening the file.

3. Perform the following: Choose -> filters -> unsupervised -> attribute -> NumericTransform, and set the method name to floor.

4. Click and fill the index of the column whose values are to be rounded off.

5. Apply these changes to all the columns by selecting all of them and clicking Apply.

6. Click Edit to see the Viewer as shown.
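
For completeness, a hedged Java sketch of the same filtering step is given below; segment-challenge.arff and ‘first-last’ as the attribute range are assumed.

// Illustrative sketch of the Experiment-8 filtering step via the Weka Java API
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericTransform;

public class FloorTransform {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment-challenge.arff");
        NumericTransform transform = new NumericTransform();
        transform.setClassName("java.lang.Math");     // class providing the method (default)
        transform.setMethodName("floor");             // round every value down
        transform.setAttributeIndices("first-last");  // apply to all numeric columns
        transform.setInputFormat(data);
        Instances rounded = Filter.useFilter(data, transform);
        System.out.println(rounded.firstInstance());  // inspect one transformed row
    }
}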

CONCLUSION:
By using the NumericTransform filter with the floor method, the required values have been obtained.
