
DATA WAREHOUSE AND DATA MINING
(IT405)
Lab File

SUBMITTED BY:
GAURAV PATEL
2K20/EE/104

DELHI TECHNOLOGICAL UNIVERSITY


BAWANA ROAD, DELHI - 110042
INDEX

SNO. AIM DATE REMARKS
1. Introduction to launching the WEKA Tool.
2. Introduction to WEKA Explorer.
3. Introduction to the classification of mining techniques.
4. Introduction to Attribute Relation File Format (ARFF).
5. To perform preprocessing, classification, and visualization techniques on the customer dataset. This experiment illustrates the basic preprocessing, classification, and visualization techniques using the WEKA Explorer. The sample dataset used is the customer dataset in ARFF format.
6. To perform preprocessing, classification, and visualization techniques on the agriculture dataset. This experiment illustrates the basic preprocessing, classification, and visualization techniques using the WEKA Explorer. The sample dataset used is the agriculture dataset, which comprises eucalyptus plants, in ARFF format.
7. To perform classification, preprocessing, and visualization techniques on the weather dataset.
8. To perform clustering techniques on the weather dataset using k-means and Cobweb.
Experiment 1

AIM: Introduction to launching the WEKA Tool

THEORY:
Weka is a popular and widely-used open-source data mining and machine learning
software tool. It provides a graphical user interface (GUI) and a collection of machine
learning algorithms for data preprocessing, classification, regression, clustering,
association rule mining, and more. Weka is written in Java and is compatible with
various platforms, including Windows, macOS, and Linux.
Weka is known for its user-friendly interface and extensive documentation. If you're new
to Weka, consider exploring the official Weka documentation and tutorials to get started
with data mining and machine learning using this versatile tool.

Features of Weka:
● Data Preprocessing: Weka offers various tools for preprocessing data, including
filtering, normalization, and feature selection. It helps you clean and prepare your
data before applying machine learning algorithms.
● Machine Learning Algorithms: Weka includes a vast collection of machine
learning algorithms, making it suitable for both beginners and experienced data
scientists. These algorithms cover classification, regression, clustering,
association, and more.
● Data Visualization: Weka provides data visualization tools to help you understand
your data better. You can create scatter plots, histograms, and other
visualizations to explore your datasets.
● Experimentation and Evaluation: Weka enables you to perform experiments by
setting up machine learning workflows, evaluating models, and comparing
different algorithms to choose the best one for your task.

Launching Weka:
To launch Weka, follow these steps:
● Download Weka: First, download the Weka software from the official website
(https://www.cs.waikato.ac.nz/ml/weka/). Choose the version that is compatible
with your operating system.
● Install Weka (if necessary): Depending on your operating system, you may need
to install Weka after downloading it. Follow the installation instructions provided
for your specific platform.
Launch Weka:
● Windows: After installation, you can typically find a shortcut to launch Weka in
your Start menu. Alternatively, you can navigate to the Weka installation directory
and run the "weka.exe" or "Weka.jar" file.
● macOS: On macOS, you can launch Weka by double-clicking the Weka icon in
the Applications folder.
● Linux: On Linux, you can launch Weka by opening a terminal, navigating to the
Weka installation directory, and running the "weka" command.
● Explore the Weka GUI: Once Weka is launched, you'll be presented with the
graphical user interface. You can start loading your datasets, applying machine
learning algorithms, and experimenting with various features and tools provided
by Weka.
Learnings:
We successfully downloaded and launched the Weka tool.
Experiment 2

AIM: Introduction to Weka Explorer

THEORY:
The Weka Explorer component is specifically designed to help users with the process of
exploring, preprocessing, and analyzing datasets for various data mining and machine learning
tasks.

The common tabs typically found in the Weka Explorer are:
● Preprocess: This tab is typically used for data preprocessing tasks. It includes
options for loading and saving datasets, handling missing values, filtering data,
and transforming attributes. Users can clean and prepare their data for analysis
in this tab.
● Classify: The "Classify" tab is where users can select and configure machine
learning classification algorithms to build predictive models. It provides options to
choose the target class, set algorithm parameters, and perform model evaluation.
● Cluster: This tab is used for clustering tasks, where the goal is to group data
points into clusters based on similarity. Users can select clustering algorithms,
set parameters, and visualize the results.
● Associate: The "Associate" tab is used for association rule mining. It allows users
to find interesting patterns and associations in their data using algorithms like
Apriori. Users can configure the algorithm and view discovered rules.
● Select Attributes: In this tab, users can perform feature selection, which involves
choosing the most relevant attributes or features from the dataset. This can help
improve model performance and reduce complexity.
● Test options (a panel within the "Classify" and "Cluster" tabs, not a separate tab): This panel configures how the data will be split into training and testing sets for model evaluation. Users can set parameters for cross-validation, percentage split, and other evaluation techniques.
● Start (a button, not a tab): After configuring the options above, users click "Start" to run the chosen algorithm or process their data.
● Visualize: This tab allows users to visualize the results of their data mining or machine learning tasks. It provides various graphical representations of data, including scatterplots, ROC curves, and more.
Note that unsupervised tasks such as clustering, where there is no predefined target class, are handled in the "Cluster" tab, and further options for feature selection and attribute evaluation are available in the "Select Attributes" tab.

Learnings:
We successfully discussed various components of Weka Explorer.
Experiment 3

AIM: Introduction to classification of mining techniques

THEORY:
Data mining draws on methods and technologies from the intersection of machine learning, database management, and statistics. Professionals in the field have devoted their careers to understanding how to process and draw conclusions from huge amounts of data, but what are the methods they use to make it happen?
In recent data mining projects, several major data mining techniques have been developed and used, including association, classification, clustering, prediction, sequential patterns, and regression.

1. Classification:
This technique is used to obtain important and relevant information about data and
metadata. This data mining technique helps to classify data into different classes.

2. Clustering:
Clustering divides information into groups of connected objects. Describing the data by a few clusters loses certain fine details but achieves simplification: the data are modeled by their clusters. This technique helps to recognize the differences and similarities between the data. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.

3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the value of a specific target variable. Regression is primarily a form of planning and modeling.

4. Outlier Detection:
This type of data mining technique relates to the observation of data items in the data
set which do not match an expected pattern or expected behavior. This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier Mining. An outlier is a data point that diverges too much
from the rest of the dataset. The majority of real-world datasets contain outliers.
Outlier detection plays a significant role in the data mining field.

5. Sequential Patterns:
The sequential pattern is a data mining technique that evaluates sequential data to discover patterns. It comprises finding interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of different criteria like length, occurrence frequency, etc. In other words, this technique of data mining helps to discover or recognize similar patterns in transaction data over a period of time.

6. Prediction:
Prediction uses other data mining techniques, such as trend analysis, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.

7. Association Rules:
This data mining technique helps to discover links between two or more items and finds hidden patterns in the data set. Association rules are if-then statements that capture the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to uncover sales correlations in transactional data or patterns in medical data sets.
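As a minimal sketch of these ideas, the support of an itemset and the confidence of a rule can be computed directly. The transactions and item names below are purely illustrative, not taken from any real dataset:

```python
# Toy transaction database (illustrative only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the full rule divided by support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: {bread} -> {milk}
print(support({"bread", "milk"}, transactions))          # 0.5 (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, transactions))     # ~0.667 (2 of 3 bread transactions)
```

Algorithms such as Apriori speed this up by pruning itemsets whose subsets are already infrequent, but the support/confidence arithmetic is exactly this.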

LEARNINGS:
We successfully discussed the various techniques of data mining.
Experiment 4

AIM: Introduction to the Attribute-Relation File Format (ARFF)

THEORY:
The Attribute-Relation File Format (ARFF) is a widely used file format in the field of data
mining and machine learning. ARFF files are commonly associated with the Weka
software, which is a popular toolkit for data mining and machine learning tasks. ARFF
files serve as a standardized way to represent datasets, making it easier to load and
work with data in Weka and other similar tools.

1. Structure of ARFF Files:


ARFF files have a specific structure designed to represent datasets with attributes and
their corresponding data instances. The key components of an ARFF file are:
● Header Section: The header section contains meta-information about the dataset
and the attributes. It typically includes the following:
- The dataset's name and a brief description.
- A list of attributes, along with their names and data types.
- The class attribute, which specifies the target variable or class label (for classification tasks).
- Any additional comments or information about the dataset.
● Data Section: The data section follows the header section and contains the
actual data instances. Each line in this section represents an individual data
instance, and the values are separated by commas (or other delimiters).
Sample Databases
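As an illustration, here is a minimal ARFF file. It mirrors the classic weather dataset that ships with Weka; only the first few data rows are shown:

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
```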

From a sample ARFF file such as the weather dataset, you can infer the following points:

1. The @relation tag defines the name of the dataset.
2. The @attribute tag defines the attributes.
3. The @data tag starts the list of data rows, each containing comma-separated fields.
4. Attributes can take nominal values, as in the case of outlook: @attribute outlook {sunny, overcast, rainy}
5. Attributes can take real values, as in: @attribute temperature real
6. You can also set a target or class variable, here called play: @attribute play {yes, no}
The target assumes two nominal values, yes or no.

LEARNINGS:
We successfully discussed the concept of attribute relation file format(ARFF).
Experiment 5

AIM: To perform preprocessing, classification, and visualization techniques on the customer dataset. This experiment illustrates the basic preprocessing, classification, and visualization techniques using the WEKA Explorer. The sample dataset used is the customer dataset in ARFF format.

THEORY:
Data preprocessing is the process of transforming raw data into an understandable
format. It is also an important step in data mining as we cannot work with raw data.
The quality of the data should be checked before applying machine learning or
data mining algorithms.
In Weka, there are three ways to load the data for preprocessing:

1. Open File - enables the user to select the file from the local machine.
2. Open URL - enables the user to select a data file from a different location.
3. Open Database - enables the user to retrieve a data file from a database source.

A screen for selecting a file from the local machine to be preprocessed is shown in the picture below.
After loading the data in the Explorer, we can refine the data by selecting different options. We can also select or remove attributes as per our needs and even apply filters to the data to refine the result.

Classification
To predict nominal or numeric quantities, we have classifiers in Weka. The concept of classification is basically to distribute data among the various classes defined on a data set. Classification algorithms learn this distribution from a given training set and then try to classify test data, for which the class is not specified, correctly. The values that specify these classes on the dataset are given a label name and are used to determine the class of data given during testing.
Before running any classification algorithm, we need to set the test options. The available test options are listed below:
● Use training set: Evaluation is based on how well the classifier can predict the class of the instances it was trained on.
● Supplied test set: Evaluation is based on how well the classifier can predict the class of a set of instances loaded from a file.
● Cross-validation: Evaluation is based on cross-validation, using the number of folds entered in the 'Folds' text field.
● Percentage split: Evaluation is based on how well the classifier can predict a certain percentage of the data held out for testing, using the value entered in the '%' field.
To classify the data set based on the characteristics of attributes, Weka uses classifiers.
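The cross-validation option above can be sketched as follows. This is a simplified illustration of fold construction only; WEKA additionally stratifies the folds so that each fold has roughly the same class distribution:

```python
import random

def cross_validation_folds(n_instances, n_folds=10, seed=1):
    """Shuffle instance indices and partition them into n_folds test sets,
    mirroring the 'Cross-validation' test option: each instance is tested
    exactly once, using a model trained on the remaining folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        splits.append((train, test))
    return splits

splits = cross_validation_folds(20, n_folds=10)
# Every instance appears in exactly one test fold.
all_test = sorted(i for _, test in splits for i in test)
print(all_test == list(range(20)))  # True
```

With 10 folds, each model is trained on 90% of the data, which is why cross-validation gives a reasonable accuracy estimate when no separate test set is available.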

Visualization
The user can see the final piece of the puzzle derived throughout the process. Visualization allows users to see a 2D representation of the data and is used to determine the difficulty of the learning problem. We can visualize single attributes (1D) and pairs of attributes (2D), and rotate 3D visualizations in Weka. It has a Jitter option to deal with nominal attributes and to detect 'hidden' data points.

Dataset
The customer dataset initially has 6 attributes/features and 15 instances. The last attribute in the dataset is the label, which indicates, based on the relationships among the attributes, whether a given entity is a customer or not. The attributes include age, salary, marital status, children, and the label.

PROCEDURE:

Data Preprocessing
Loading the dataset - The dataset has to be imported in ARFF format. If it is in CSV format, convert it to ARFF format. In Weka, this is done by opening the file from its location.
Attribute selection - In a dataset, some attributes are not required and are also responsible for skew and problems in classification. After obtaining the ranks of the attributes, the ones with low ranks are removed, i.e., Age and Children, as they don't contribute to classification.
Removing the features with lower ranks - Features with lower ranks are removed.

Visualization
Visualization of attributes/features - All the attributes are represented separately along with their values. This feature is present in the Visualize tab of the Weka Explorer.

Classification
Perform classification and obtain the results - In the Classify tab of the Weka Explorer, we have various options for which algorithm to use for the classification. Here we have used the J-48 algorithm, which takes the decision tree approach.
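J-48 grows its decision tree by repeatedly splitting on the attribute that best separates the classes. A C4.5-style learner actually uses gain ratio, but plain information gain conveys the idea; the toy rows below are made up for illustration and are not the actual customer dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr_index, labels):
    """Entropy reduction from splitting the rows on one attribute."""
    total = len(rows)
    gain = entropy(labels)
    # Partition the instances by the attribute's value.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    for subset in partitions.values():
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy rows: (salary_band, marital_status); label: customer yes/no
rows = [("high", "single"), ("high", "married"), ("low", "single"), ("low", "married")]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, 0, labels))  # 1.0: salary_band separates the classes perfectly
print(information_gain(rows, 1, labels))  # 0.0: marital_status tells us nothing here
```

The tree learner would split on salary_band first, which is exactly why low-rank attributes (those with little gain) can be removed in preprocessing.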

DISCUSSIONS:
Preprocessing is a crucial step and helps analyze whether an attribute aids classification. In preprocessing, the missing values are removed, the attributes with low ranks are removed, and the dataset is made robust. The visualization of the attributes is done in two ways, i.e., separately as well as among themselves. Finally, the classification can be done in many ways, but here we use the J-48 approach, which involves decision trees. The label of the dataset signifies whether a person is a customer or not.

LEARNINGS:
We have successfully learnt the preprocessing, classification, and visualization techniques using the WEKA Explorer.
EXPERIMENT-6

AIM: To perform preprocessing, classification, and visualization techniques on the agriculture dataset. This experiment illustrates the basic preprocessing, classification, and visualization techniques using the WEKA Explorer. The sample dataset used is the agriculture dataset that comprises eucalyptus plants in ARFF format.
Dataset
The dataset has the details of the eucalyptus plant. Comprising 20 attributes such as altitude, longitude, and latitude, it classifies whether there is any utility associated with this plant. There is a total of 736 instances in the dataset.
PROCEDURE

Preprocessing

Loading the dataset - The dataset has to be imported in ARFF format. If it is in CSV format, convert it to ARFF format. In Weka, this is done by opening the file from its location.
Removing the missing values - There were certain null values in the dataset, which were removed as they do not contribute in any way towards the dataset.
Attribute selection - In a dataset, some attributes are not required and are also responsible for skew and problems in classification. After obtaining the ranks of the attributes, the ones with low ranks are removed, i.e., Frosts, Altitude and Rep, as they don't contribute to classification.
Removing the features with low ranks - After the ranks are obtained, the features/attributes with low ranks are removed.

Nominal to binary transformation of features - Some attributes have values in the form of Yes/No or True/False. Still, it is better to have binary values in the form of 0/1, so that the input to the model becomes uniform.
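Weka's NominalToBinary filter performs this transformation inside the Explorer; conceptually it amounts to a mapping like the illustrative helper below (the function name and value sets are ours, not Weka's):

```python
def nominal_to_binary(values, positive=frozenset({"yes", "true"})):
    """Map nominal Yes/No or True/False values to 1/0 so that model
    inputs are uniform. Case-insensitive; illustrative helper only."""
    return [1 if str(v).lower() in positive else 0 for v in values]

print(nominal_to_binary(["Yes", "no", "TRUE", "false"]))  # [1, 0, 1, 0]
```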
Visualization

Visualization of attributes/features - All the attributes are represented separately along with their values. This feature is present in the Visualize tab of the Weka Explorer.
Visualization of the relations among attributes - The relations among attributes help in finding the connections among various attributes and how one attribute contributes towards the others.
Classification

Perform classification and obtain the results - In the Classify tab of the Weka Explorer, we have various options for which algorithm to use for the classification. Here we have used the J-48 algorithm, which takes the decision tree approach.
DISCUSSIONS:
Preprocessing is a crucial step and helps analyze whether an attribute aids classification. In preprocessing, the missing values are removed, the attributes with low ranks are removed, and the dataset is made robust. The visualization of the attributes is done in two ways, i.e., separately as well as among themselves. Finally, the classification can be done in many ways, but here we use the J-48 approach, which involves decision trees. The label of the dataset signifies whether there is any utility associated with the eucalyptus plant.
LEARNINGS:
We have successfully learnt the preprocessing, classification, and visualization techniques using the WEKA Explorer.
EXPERIMENT-7

AIM: To perform classification, preprocessing, and visualization techniques on the weather dataset.

THEORY:
Classification is a data science task of predicting the value of a categorical variable (target or class) by building a model based on one or more numerical and/or categorical variables (predictor attributes). Visualization is the process of representing data graphically and interacting with these representations to gain insight into the data. Data preprocessing refers to the steps applied to make data more suitable for mining; these steps usually involve selecting data objects and attributes for the analysis or creating/changing attributes.
This experiment illustrates how preprocessing, classification, and visualization of the results are performed on the weather dataset through Weka, using the ZeroR and J-48 decision tree algorithms.

PROCEDURE:
Step 1: We begin the experiment by loading the data (employee.arff) into WEKA.
Step 2: Once the data is loaded, WEKA will recognize the attributes and, during the scan of the data, will compute some basic statistics for each attribute. The left panel in the above figure shows the list of recognized attributes, while the top panel indicates the names of the base relation (table) and the current working relation (which are the same initially).
Clicking on an attribute in the left panel shows the basic statistics for that attribute. For categorical attributes, the frequency of each attribute value is shown, while for continuous attributes we can obtain the min, max, mean, standard deviation, etc. For preprocessing, we can remove redundant or irrelevant attributes and apply filters to them.
Step 3: Next, we select the "Classify" tab and click the "Choose" button to select the 'J48' classifier.
Step 4: Now, we specify the various parameters. These can be specified by clicking in the text box to the right of the "Choose" button. In this example, we accept the default values; the default version does perform some pruning but does not perform reduced-error pruning.
Step 5: Under the "Test options" in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we don't have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.
Step 6: We now click "Start" to generate the model. The ASCII version of the tree, as well as the evaluation statistics, will appear in the right panel when the model construction is complete.
Step 7: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed (either in preprocessing or in selecting better parameters for the classification).
Step 8: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting "Visualize tree" from the pop-up menu.
Step 9: We will use our model to classify new instances.
Step 10: In the main panel, under "Test options", click the "Supplied test set" radio button and then click the "Set" button. This will pop up a window allowing you to open the file containing the test instances.
Dataset: employee.arff
LEARNINGS:
We successfully performed preprocessing, classification and visualization on the
weather dataset.
EXPERIMENT-8
AIM: To perform clustering techniques on the weather dataset using k-means and Cobweb.
THEORY:
Clustering is the method of dividing a set of abstract objects into groups; a point to keep in mind is that a set of data objects can be viewed as a single entity. When performing cluster analysis, we divide the data set into groups based on data similarity, then assign labels to the groups. Some clustering techniques produce a hierarchical clustering - a tree of clusters. Clusters at one level break up into child subclusters, and so on; leaves contain indivisible clusters consisting of one or more instances.
Cobweb is a hierarchical clustering method based on a measure of clustering quality called "category utility", denoted CU(clusters). Higher values of CU indicate better clusterings.
This experiment illustrates how clustering algorithms like k-means and Cobweb are performed on the weather dataset using Weka. The dataset contains 14 instances.
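Weka's SimpleKMeans performs the k-means procedure internally: assign each instance to its nearest centroid, then recompute each centroid as the mean of its cluster, and repeat. A simplified sketch of that loop follows; the 2-D points are made up for illustration (e.g. temperature/humidity readings), not the actual weather data:

```python
import math

def kmeans(points, k, iters=10):
    """Minimal k-means: assign points to the nearest centroid, then
    recompute each centroid as the mean of its cluster members."""
    centroids = points[:k]  # naive initialization: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Two obvious groups of 2-D points
points = [(64, 65), (65, 70), (66, 68), (83, 86), (80, 90), (85, 85)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

The seed mentioned in the procedure below plays the role of this sketch's initialization: different starting centroids can converge to different final clusterings.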
PROCEDURE:

Step 1: Run the WEKA Explorer and load the data file weather.arff in the Preprocess interface.

Step 2: In order to perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose" button. This step results in a dropdown list of available clustering algorithms.

Step 3: In this case, we select 'SimpleKMeans'.

Step 4: Next, click on the text box to the right of the "Choose" button to get the popup window shown in the screenshot. In this window, we enter six as the number of clusters and leave the seed value as it is. The seed value is used in generating a random number, which is used for the internal assignment of instances to clusters.

Step 5: Once the options have been specified, we run the clustering algorithm. We must make sure that, in the 'Cluster mode' panel, the 'Use training set' option is selected, and then we click the 'Start' button. This process and the resulting window are shown in the following screenshots.

Step 6: The result window shows the centroid of each cluster, as well as statistics on the number and percentage of instances assigned to the different clusters. Here the cluster centroids are the mean vectors for each cluster; these can be used to characterize the clusters.

Step 7: Another way of understanding the characteristics of each cluster is through visualization. To do this, right-click the result set in the result list panel and select "Visualize cluster assignments".

Step 8: We can save the resulting dataset, which includes each instance along with its assigned cluster. To do so, we click the "Save" button in the visualization window and save the result as student k-means. The top portion of this file is shown in the following figure.

Step 9: Next, we do the same clustering using Cobweb and visualize the clustering using a tree.
OUTPUT:

K-means:

Cobweb:
LEARNINGS:
We successfully performed clustering on the weather dataset using k-means and Cobweb clustering.
