
DWDM 191290116048

Practical 1
Aim: Case study on different data mining tools.
 What is Data Mining?
Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis.
Data mining techniques and tools enable enterprises to predict future trends and
make more-informed business decisions.
Data mining is a crucial component of successful analytics initiatives in
organizations. The information it generates can be used in business
intelligence (BI) and advanced analytics applications that involve analysis of
historical data, as well as real-time analytics applications that examine streaming
data as it's created or collected.
 Data Mining Tools
DataMelt Data Mining
Orange Data Mining
Oracle Data Mining
SAS Data Mining
RapidMiner
Teradata
KNIME
Rattle
Weka
H2O


Orange Data Mining

Orange is an open-source machine learning and data mining software suite. It supports data visualization and is a component-based application written in the Python programming language, developed at the Bioinformatics Laboratory of the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.
Because the software is component-based, the components of Orange are called
"widgets." These widgets range from data pre-processing and visualization to the
assessment of algorithms and predictive modelling.
 Features
Data loaded into Orange is quickly formatted to the desired pattern, and the widgets can easily be moved and connected wherever they are needed.
Orange allows its users to make smarter decisions in a short time by rapidly
comparing and analysing the data.
It is a good open-source data visualization and evaluation tool that suits both beginners and professionals.
Data mining can be performed via visual programming or Python scripting (a short scripting sketch follows below).
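As an illustration of the scripting route, here is a minimal sketch that loads the bundled Iris data and cross-validates a tree learner through Orange's Python API. It assumes the Orange3 package is installed, and the CrossValidation call style has shifted slightly between Orange 3 releases, so check it against your installed version.

# Minimal Orange scripting sketch (assumes Orange3 is installed: pip install Orange3).
import Orange

data = Orange.data.Table("iris")                   # bundled sample dataset
learner = Orange.classification.TreeLearner()      # scripting counterpart of the Tree widget

# 5-fold cross-validation; recent Orange 3 releases instantiate CrossValidation
# with k and then call it with the data and a list of learners.
cv = Orange.evaluation.CrossValidation(k=5)
results = cv(data, [learner])

print("Classification accuracy:", Orange.evaluation.CA(results))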
 User Interface


RapidMiner
RapidMiner is a free-to-use data mining tool. It is used for data preparation, machine
learning, and model deployment. This free data mining software offers a range of
products for building new data mining processes and setting up predictive analysis.
 Features
Allows multiple data management methods.
GUI or batch processing.
Integrates with in-house databases.
Interactive, shareable dashboards.
Big Data predictive analytics.
Remote analysis processing.
Data filtering, joining, merging, and aggregating.
Build, train and validate predictive models.
Reports and triggered notifications.
 User Interface


Teradata
Teradata is a massively parallel processing (MPP) system for developing large-scale
data warehousing applications. Teradata can run on Unix, Linux, and Windows
server platforms.
 Features
Teradata Optimizer can handle up to 64 joins in a query.
Teradata has a low total cost of ownership. It is easy to set up, maintain, and
administer.
It supports SQL for interacting with the data stored in tables and provides its own extensions.
It distributes the data across disks automatically, with no manual
intervention.
Teradata provides load and unload utilities to move data into and out of the Teradata
system.
 User Interface


KNIME
KNIME is open-source software for creating data science applications and
services. It is one of the best tools for data mining that helps you to understand
data and to design data science workflows.
 Features
Helps you to build an end-to-end data science workflow.
Blend data from any source.
Allows you to aggregate, sort, filter, and join data either on your local machine,
in-database or in distributed big data environments.
Build machine learning models for classification, regression, dimension
reduction.
 User Interface


H2O
H2O is another excellent open-source data mining tool. It is used to
perform data analysis on data held in cloud computing application systems.
 Features
H2O allows you to take advantage of the computing power of distributed systems
and in-memory computing.
It allows fast and easy deployment into production with Java and binary format.
It lets you use programming languages such as R and Python to build models in H2O (see the sketch after this feature list).
Distributed, In-memory Processing.
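As a hedged illustration of the Python route, the sketch below starts a local H2O cluster and trains a gradient boosting model; the file name "students.csv" and the "result" column are placeholders, not part of this practical.

# Minimal H2O-in-Python sketch (assumes the h2o package is installed: pip install h2o).
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()                                         # start (or connect to) a local H2O cluster

frame = h2o.import_file("students.csv")            # hypothetical dataset; replace with your own
frame["result"] = frame["result"].asfactor()       # treat the target as a categorical class
predictors = [c for c in frame.columns if c != "result"]

model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=predictors, y="result", training_frame=frame)

print(model.model_performance(frame))
h2o.cluster().shutdown()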
 User Interface

Signature:

Date:


Practical 2
Aim: Analysis of mining techniques using Weka Tool.

 Weka Tool
Weka: Waikato Environment for Knowledge Analysis


 Start Weka
Start Weka. This may involve finding it in your program launcher or double-clicking
on the weka.jar file. This will start the Weka GUI Chooser.
The Weka GUI Chooser lets you choose one of the Explorer, Experimenter,
KnowledgeFlow and the Simple CLI (command line interface).
Click the “Explorer” button to launch the Weka Explorer.
This GUI lets you load datasets and run classification algorithms. It also provides
other features, like data filtering, clustering, association rule extraction, and
visualization, but we won’t be using these features right now.
 Open the data/iris.arff Dataset
Click the “Open file…” button to open a data set and double click on the “data”
directory.
Weka provides a number of small common machine learning datasets that you
can use to practice on.
Select the “iris.arff” file to load the Iris dataset.
The Iris Flower dataset is a famous dataset from statistics that is widely used by
researchers in machine learning. It contains 150 instances (rows), 4
attributes (columns), and a class attribute for the species of iris flower.
 Select and Run an Algorithm
Now that you have loaded a dataset, it’s time to choose a machine learning
algorithm to model the problem and make predictions.
Click the “Classify” tab. This is the area for running algorithms against a loaded
dataset in Weka.
You will note that the “ZeroR” algorithm is selected by default.
Click the “Start” button to run this algorithm.
The ZeroR algorithm selects the majority class in the dataset (all three species of
iris are equally present in the data, so it picks the first one: setosa) and uses that
to make all predictions. This is the baseline for the dataset and the measure
against which all other algorithms can be compared. The result is 33%, as expected (3 classes,
each equally represented; assigning one of the three to each prediction gives
33% classification accuracy).
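The same baseline can be reproduced outside the Weka GUI. The sketch below is not Weka itself; it uses scikit-learn's DummyClassifier as a stand-in for ZeroR, with 10-fold cross-validation on the Iris data, and should land near 33%.

# Majority-class baseline on Iris, analogous to Weka's ZeroR (scikit-learn stand-in).
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
baseline = DummyClassifier(strategy="most_frequent")   # always predicts the majority class

scores = cross_val_score(baseline, X, y, cv=10)        # 10-fold cross-validation, as in the Explorer
print("Baseline accuracy: %.2f" % scores.mean())       # about 0.33 for three balanced classes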


You will also note that the test options select Cross Validation by default with 10
folds. This means that the dataset is split into 10 parts: the first 9 are used to train
the algorithm, and the 10th is used to assess the algorithm. This process is
repeated, allowing each of the 10 parts of the split dataset a chance to be the held-
out test set.
The ZeroR algorithm is important, but boring.
Click the “Choose” button in the “Classifier” section and click on “trees” and
click on the “J48” algorithm.
This is an implementation of the C4.8 algorithm in Java (“J” for Java, 48 for C4.8,
hence the J48 name) and is a minor extension to the famous C4.5 algorithm.
Click the “Start” button to run the algorithm.
 Review Results
After running the J48 algorithm, you can note the results in the “Classifier output”
section.
The algorithm was run with 10-fold cross-validation: this means it was given an
opportunity to make a prediction for each instance of the dataset (with different
training folds) and the presented result is a summary of those predictions.
Firstly, note the Classification Accuracy. You can see that the model achieved a
result of 144/150 correct or 96%, which seems a lot better than the baseline of
33%.
Secondly, look at the Confusion Matrix. You can see a table of actual classes
compared to predicted classes and you can see that there was 1 error where an
Iris-setosa was classified as an Iris-versicolor, 2 cases where Iris-virginica was
classified as an Iris-versicolor, and 3 cases where an Iris-versicolor was classified
as an Iris-setosa (a total of 6 errors). This table can help to explain the accuracy
achieved by the algorithm.
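To mirror this run programmatically, the hedged sketch below uses scikit-learn's DecisionTreeClassifier as a C4.5-style stand-in for J48, obtains one cross-validated prediction per instance, and prints the accuracy and confusion matrix; the exact numbers will differ slightly from the Explorer output.

# Decision-tree analogue of the J48 run (scikit-learn stand-in, not Weka's J48 itself).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix

iris = load_iris()
tree = DecisionTreeClassifier(random_state=1)

# One prediction per instance, each made by a model trained on the other nine folds.
predictions = cross_val_predict(tree, iris.data, iris.target, cv=10)

print("Accuracy:", accuracy_score(iris.target, predictions))
print(confusion_matrix(iris.target, predictions))       # rows: actual class, columns: predicted class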

Signature:

Date:


Practical 3
Aim: Demonstration of classification rule process on dataset using J48
algorithm.
 Step 1: Select Database Student.


 Step 2: Select ARFF file.

 Step 3: Select J48 Algorithm from Trees Classifier.


 Step 4: Show Summary of Dataset Using J48 Algorithm.

 Step 5: Show Tree of Dataset.
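For reference, the steps above can also be scripted. The sketch below is a hedged Python equivalent, not the Weka GUI itself: it reads an ARFF file with SciPy, fits a decision tree as a stand-in for J48, and prints the tree as text. The file name "student.arff" and the assumption that the class attribute is the last column are placeholders for whatever dataset was loaded above.

# Reading an ARFF file and fitting a decision tree (Python stand-in for the J48 demonstration).
import pandas as pd
from scipy.io import arff
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

raw, meta = arff.loadarff("student.arff")               # hypothetical ARFF file
df = pd.DataFrame(raw).apply(
    lambda col: col.str.decode("utf-8") if col.dtype == object else col)

X = OrdinalEncoder().fit_transform(df.iloc[:, :-1])     # encode nominal attributes numerically
y = df.iloc[:, -1]                                      # class attribute assumed to be last

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=list(df.columns[:-1])))   # text view of the learned tree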

Signature:

Date:


Practical 4
Aim: Demonstration of classification rule process on dataset using ID3
algorithm.
 Step 1: Select Database Employee.


 Step 2: Select ARFF File.

 Step 3: Select ID3 Algorithm from Trees Classifier.


 Step 4: Show Summary of Dataset Using ID3 Algorithm.
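ID3 grows its tree by repeatedly splitting on the attribute with the highest information gain. The short sketch below computes entropy and information gain for a tiny, purely hypothetical employee table, to show the criterion the summary above is based on.

# Entropy and information gain, the splitting criterion at the heart of ID3.
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels.
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    # Entropy reduction obtained by splitting the records on one attribute.
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy records; a real Employee dataset would have more rows and attributes.
rows = [
    {"dept": "IT", "senior": "yes", "promoted": "yes"},
    {"dept": "IT", "senior": "no",  "promoted": "no"},
    {"dept": "HR", "senior": "yes", "promoted": "yes"},
    {"dept": "HR", "senior": "no",  "promoted": "yes"},
]
print(information_gain(rows, "senior", "promoted"))     # ID3 splits first on the largest gain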

Signature:

Date:


Practical 5
Aim: Demonstration of classification rule process on dataset using Naive
Bayes algorithm.
 Step 1: Select Database Student.


 Step 2: Select ARFF File.

 Step 3: Select Naive Bayes Algorithm from Classifier.


 Step 4: Show Summary of Dataset Using Naive Bayes Algorithm.
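The same kind of run can be scripted. The sketch below uses scikit-learn's CategoricalNB as a stand-in for Weka's NaiveBayes classifier on a tiny hypothetical student table; the column names and values are illustrative only.

# Naive Bayes on nominal attributes (scikit-learn stand-in for Weka's NaiveBayes).
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# Hypothetical student records; replace with the attributes of your own dataset.
df = pd.DataFrame({
    "attendance": ["high", "low", "high", "low", "high"],
    "internals":  ["good", "poor", "good", "good", "poor"],
    "result":     ["pass", "fail", "pass", "pass", "fail"],
})

encoder = OrdinalEncoder()
X = encoder.fit_transform(df[["attendance", "internals"]])
y = df["result"]

model = CategoricalNB().fit(X, y)
sample = encoder.transform([["low", "good"]])           # a new, unseen student
print(model.predict(sample), model.predict_proba(sample))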

Signature:

Date:


Practical 6
Aim: Demonstration of clustering rule process on dataset iris using simple
k-means.
 Step 1: Select Database IRIS.


 Step 2: Select ARFF File.

 Step 3: Show Attributes of Current Relation IRIS.


 Step 4: Select Simple K Means Cluster from Clusterers.

 Step 5: Show Cluster Output of Dataset IRIS using Simple K Means Cluster.
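The clustering shown above can also be reproduced in code. The sketch below is a scikit-learn stand-in for Weka's SimpleKMeans, run with k = 3 (one cluster per iris species) on the four numeric attributes.

# K-means clustering of the Iris attributes (scikit-learn stand-in for SimpleKMeans).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

print("Cluster sizes:", np.bincount(kmeans.labels_))
print("Within-cluster SSE:", kmeans.inertia_)           # analogous to the SSE Weka reports
print("Centroids:\n", kmeans.cluster_centers_)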

Signature:

Date:


Practical 7
Aim: Demonstration of clustering rule process on dataset student using
simple k-means.
 Step 1: Select Database Student.


 Step 2: Select ARFF File.

 Step 3: Show Attributes of Current Relation Student.


 Step 4: Select Simple K Means Cluster from Clusterers.

 Step 5: Show Cluster Output of Dataset Student using Simple K Means Cluster.

Signature:

Date:


Practical 8
Aim: Demonstration of Association rule process on dataset supermarket
using Apriori.
 Step 1: Select Database Supermarket.


 Step 2: Select ARFF File.

 Step 3: Show Attributes of Current Relation Supermarket.


 Step 4: Select Apriori Associator from Associations.

 Step 5: Best Rules Found from Supermarket Dataset using Apriori Associator.
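The rule mining itself can be sketched in Python. The example below uses the mlxtend library as a stand-in for Weka's Apriori associator on a tiny hypothetical one-hot basket table; a real supermarket dataset would be far larger, and the support and confidence thresholds are illustrative.

# Frequent itemsets and association rules with Apriori (mlxtend stand-in for Weka's associator).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions: each row is a basket, each column an item (True = purchased).
baskets = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [1, 1, 0, 0],
     [1, 0, 1, 1]],
    columns=["bread", "milk", "butter", "biscuits"],
).astype(bool)

frequent = apriori(baskets, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])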

Signature:

Date:


Practical 9
Aim: Demonstrate how we can add a particular algorithm to Weka through an
external package.
 Step 1: Select Package Manager from Tools.

 Step 2: Select a Package from the Package Manager to Add an External Package.


 Step 3: After Selecting the Package from the Package Manager, Install the External Package.

Signature:

Date:


Practical 10
Aim: Study and Analyze DTREG Data Mining Tool.
 DTREG Data Mining Tool.
DTREG is a robust application that is installed easily on any Windows system.
DTREG reads Comma Separated Value (CSV) data files that are easily created
from almost any data source.
Once you create your data file, just feed it into DTREG, and let DTREG do all of
the work of creating a decision tree, Support Vector Machine, K-Means
clustering, Linear Discriminant Function, Linear Regression or Logistic
Regression model. Even complex analyses can be set up in minutes.

 Features.
Data Import: DTREG can import data from various sources such as CSV, Excel,
SQL, ODBC, and Oracle. It also supports importing data from SAS datasets.
Data Visualization: DTREG provides a range of visualization tools such as scatter
plots, histograms, box plots, and line charts, to help users understand the
distribution and relationships among variables in their datasets.
Feature Selection: DTREG offers multiple feature selection methods such as
correlation-based feature selection, backward feature elimination, and forward
feature selection, which helps users to select the most relevant variables for
building predictive models.
Model Building: DTREG supports various algorithms for model building,
including decision trees, regression analysis, neural networks, and support vector
machines (SVMs). Users can choose the algorithm that best suits their data and
research question.
Model Evaluation: DTREG provides several evaluation metrics such as root
mean square error (RMSE), mean absolute error (MAE), and coefficient of
determination (R-squared) to help users assess the accuracy of their predictive
models.
Model Deployment: DTREG allows users to export their predictive models as
C++ or Java code, which can be integrated into other software applications.


 Step 1: Select the Zoo DTREG (.dtr) Dataset.

 Step 2: Show the Zoo Dataset Variables.

 Step 3: Show the Tree of Zoo Dataset.


 Step 4: Show the Model Size and Error Rate of Zoo Dataset.

Signature:

Date:
