
JSPM’S

Bhivarabai Sawant Institute of Technology & Research


Pune-412207

Department Of Computer Engineering

Academic Year 2019-20

Mini Project Report


On

Submitted by:
Charul Joshi(BEA_40)
Kirti Reddy(BEA_39)
Danesh Bastani(BEA_48)

Under the guidance of


Prof. Nilufar Zaman

Subject: Laboratory Practice II


DEPARTMENT OF COMPUTER ENGINEERING

BHIVARABAI SAWANT INSTITUTE OF TECHNOLOGY & RESEARCH

WAGHOLI, PUNE – 412 207

CERTIFICATE

This is to certify that Charul Joshi (BEA_40), Kirti Reddy (BEA_39) and Danesh Bastani (BEA_48)
submitted their project report under my guidance and supervision. The work has been done to my
satisfaction during the academic year 2019-20 under Savitribai Phule Pune University guidelines.

Date:

Place: BSIOTR, PUNE.

Prof. Nilufar Zaman                          Dr. Gayatri Bhandari
Project Guide                                H.O.D.
ACKNOWLEDGEMENT

It is a great pleasure and immense satisfaction to express our deepest sense of
gratitude and thanks to everyone who has directly or indirectly helped us in
completing our project work successfully.
We express our gratitude towards our guide, Prof. Nilufar Zaman, and Dr. G. M.
Bhandari, Head of the Department of Computer Engineering, Bhivarabai Sawant
Institute of Technology and Research, Wagholi, Pune, who guided and encouraged
us in completing the project work in the scheduled time. We would also like to
thank our Principal for allowing us to pursue our project at this institute.

Charul Joshi(BEA_40)
Kirti Reddy(BEA_39)
Danesh Bastani(BEA_48)

INDEX

Sr. No.  Chapter
         CERTIFICATE
         ACKNOWLEDGEMENT
         ABSTRACT
         INDEX
         LIST OF FIGURES
1.       INTRODUCTION
2.       OBJECTIVES AND SCOPE
3.       PROPOSED SYSTEM METHODOLOGY
4.       RESULTS AND DISCUSSIONS
5.       ADVANTAGES AND DISADVANTAGES
6.       CONCLUSION
7.       REFERENCES
LIST OF FIGURES

Fig. No.  Name of the Figure
3.1       WEKA architecture
3.2       Classification steps
ABSTRACT
The concepts of “Artificial Intelligence”, “Deep Learning” and “Machine
Learning” have become very popular in society, but here let us take a look at
data mining. The fundamental purpose of data mining (DM) is to analyse data
from various points of view, classify it and summarize it; DM has become
widespread in each and every application. Although we have huge magnitudes of
data, we do not always have useful information in every field, and there are
many DM software tools that help us extract the useful information. One of
them is WEKA. WEKA is a data mining/machine learning application developed by
the University of Waikato in New Zealand. We can use WEKA for the essential DM
steps, such as preprocessing the data (removing outliers, replacing missing
values, etc.), attribute selection (choosing only the relevant attributes and
removing irrelevant and redundant ones), classification, and the assessment of
various classifier models. The WEKA software is useful for many types of
applications. The tool consists of a large number of algorithms for attribute
selection, classification, regression and clustering. WEKA provides access to
SQL databases using Java Database Connectivity (JDBC) and can process the
results returned by a database query. WEKA also provides access to deep
learning.
WEKA is a collection of machine learning algorithms for data mining tasks. It
supports several standard functions, such as:
- data mining tasks
- data pre-processing
- clustering
- classification
- regression
- visualization

CHAPTER 1

INTRODUCTION

Fisher’s Iris dataset (Fisher, 1936) is perhaps the best-known dataset in the
pattern recognition literature. The dataset contains 3 classes of 50 instances
each, where each class refers to a type of iris plant. One class is linearly
separable from the other two; the latter are not linearly separable from each other.
The dataset contains the following attributes:
1) sepal length in cm
2) sepal width in cm
3) petal length in cm
4) petal width in cm
On the basis of these attributes, we classify an iris plant into one of three
classes:
- Iris Setosa
- Iris Versicolour
- Iris Virginica
First we start with data preprocessing, where we handle the null values in the
data and deal with outliers (we need to manage values that are not within the
expected range). The next step is exploratory data analysis, where we perform
visualization and compute the correlation between each attribute and the output
(correlation always varies between +1 and -1), and we plot graphs for all the
attributes in order to identify the important features.
Data preprocessing transforms the initial dataset. The steps of data
preprocessing are described below:
- Data cleaning: fill in missing values, resolve inconsistencies and smooth
noisy data.
- Data integration: combine multiple databases or files.
- Data transformation: aggregation and normalization.
- Data reduction: reduce the volume of the data while producing similar
analytical results.
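As an illustration, the sketch below shows how the data-cleaning step can be performed with WEKA's Java API. It is a minimal sketch under stated assumptions: the file name iris.arff and the choice of the ReplaceMissingValues filter are assumptions for this example, not taken from the project's own code.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PreprocessIris {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file (file name is an assumption for this sketch)
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1); // class is the last attribute

        // Data cleaning: fill in missing values with attribute means/modes
        ReplaceMissingValues rmv = new ReplaceMissingValues();
        rmv.setInputFormat(data);
        Instances clean = Filter.useFilter(data, rmv);

        System.out.println("Instances after cleaning: " + clean.numInstances());
    }
}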
CHAPTER 2
OBJECTIVES AND SCOPE

The main objectives of WEKA are to:

1. Make machine learning (ML) techniques generally available.
2. Apply them to practical problems such as Iris flower classification.
3. Analyze the dataset well and display the results graphically.
4. Generate more insightful views of the dataset analysis.

AREA OR SCOPE OF INVESTIGATION:

This project requires investigation in the following areas:

1. Iris flower classification.
2. Data mining techniques.
3. The best-fit model for accurate prediction.

The goal is to demonstrate the process of building a neural-network-based classifier
that solves the classification problem. In this project, neural-network-based approaches
will be shown, the process of building various neural network architectures will be
demonstrated, and finally the classification results will be presented; a sketch of one
such approach follows.
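As a hedged sketch of such a neural-network approach (not the exact configuration used in this project), WEKA's MultilayerPerceptron classifier can be trained on the Iris data as follows; the learning rate, momentum, epoch count and file path below are illustrative assumptions.

import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IrisNeuralNet {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setLearningRate(0.3);  // illustrative values, not from this report
        mlp.setMomentum(0.2);
        mlp.setTrainingTime(500);  // number of training epochs
        mlp.setHiddenLayers("a");  // 'a' = (attributes + classes) / 2 neurons

        mlp.buildClassifier(data);
        System.out.println(mlp);   // prints the learned network
    }
}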
CHAPTER 3

PROPOSED SYSTEM METHODOLOGY

Fig 3.1 WEKA architecture

Data mining is defined as extracting information from huge sets of data; in other
words, data mining is the procedure of mining knowledge from data. Data mining is
a promising and flourishing frontier in the analysis of data, and the results of
that analysis have many applications. Data mining is also referred to as Knowledge
Discovery from Data (KDD). It functions as the machine-driven or convenient
extraction of patterns representing knowledge implicitly stored or captured in
huge databases, data warehouses, the Web, other data repositories, and information
streams. Data mining is a multidisciplinary field, encompassing areas such as
information technology, machine learning, statistics, pattern recognition,
information retrieval, neural networks, knowledge-based systems, artificial
intelligence and data visualization.
Classification:

The classification process in WEKA follows these steps (Fig 3.2):

1. Select the dataset (in ARFF format).
2. Preprocess the dataset.
3. Choose a classifier.
4. Train the classifier.
5. Tune the performance of the classifier.
6. Evaluate the model on the test dataset.
7. Find the performance criteria.

Fig 3.2 Classification steps


Cross-validation:
Cross-validation is a technique used to assess how well the results of a
statistical analysis generalize to an independent dataset; it is also known as
rotation estimation. Cross-validation is an extension of the simple train/test
split: the purpose of k-fold cross-validation is to test how well a model
trained on given data performs on data it has not seen. The data is partitioned
into k folds; each fold is held out for testing exactly once while the model is
trained on the remaining folds, so that each and every data point is tested at
least once.
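A minimal sketch of 10-fold cross-validation with WEKA's Evaluation API, assuming the Iris ARFF file is available locally (the file name and choice of Naïve Bayes here are assumptions for illustration):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation: every instance is tested exactly once
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}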
Discretize:
Data discretization converts a large number of data values into a smaller
number of bins, so that data evaluation and data management become much easier.
One reason to discretize continuous features is to improve the signal-to-noise
ratio: fitting a model to bins reduces the impact that small fluctuations in the
data have on the model, since such fluctuations are often just noise. Each bin
"smooths out" the fluctuations in its section of the data.
Normalize:
In statistics and its applications, normalization can have a range of meanings.
In the simplest cases, normalization of ratings means adjusting values measured
on different scales to a notionally common scale, often prior to averaging.
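The run information in Chapter 4 shows the Normalize filter applied with scale 1.0 and translation 0.0 (options -S 1.0 -T 0.0), which maps every numeric attribute into [0, 1]. A sketch of the same filter via the Java API (file name assumed):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeIris {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        Normalize norm = new Normalize();
        norm.setScale(1.0);       // -S 1.0: width of the target interval
        norm.setTranslation(0.0); // -T 0.0: lower bound of the target interval
        norm.setInputFormat(data);
        Instances scaled = Filter.useFilter(data, norm);
        System.out.println(scaled.instance(0)); // values now lie in [0, 1]
    }
}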
J48:
C4.5 (implemented in WEKA as J48) is an algorithm used to generate a decision
tree, developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3
algorithm. The decision trees generated by C4.5 can be used for classification,
and for this reason C4.5 is often referred to as a statistical classifier.
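The J48 run in Chapter 4 uses the options -C 0.25 -M 2 (pruning confidence 0.25, minimum 2 instances per leaf). A sketch of building and printing such a tree through the API (file name assumed):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Iris {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f); // -C 0.25: pruning confidence
        tree.setMinNumObj(2);            // -M 2: min instances per leaf
        tree.buildClassifier(data);
        System.out.println(tree);        // prints the pruned tree
    }
}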
Naïve Bayes:
Naive Bayes classifiers are a collection of classification algorithms based on
Bayes' theorem. It is not a single algorithm but a family of algorithms that all
share a common principle: every pair of features being classified is assumed to
be independent of each other, given the class.
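The classifier runs in Chapter 4 use WEKA's "evaluate on training data" test mode; a sketch of reproducing that mode for Naïve Bayes through the API (file name assumed, and note that evaluation on the training set gives an optimistic accuracy estimate):

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesIris {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);

        // Evaluate on the training data itself, as in the Chapter 4 runs
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(nb, data);
        System.out.println(eval.toSummaryString());
    }
}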
LMT:
Logistic model trees are based on the earlier idea of a model tree: a decision
tree that has linear regression models at its leaves, providing a piecewise
linear regression model (where an ordinary decision tree with constants at its
leaves would produce a piecewise constant model). In the logistic variant,
logistic regression is used to produce an LR model at every node in the tree;
the node is then split using the C4.5 criterion.
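The LMT run in Chapter 4 uses the option string -I -1 -M 15 -W 0.0; a sketch of setting the same options through the API (Utils.splitOptions parses the option string shown in the Explorer; the file name is assumed):

import weka.classifiers.trees.LMT;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class LMTIris {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        LMT lmt = new LMT();
        // Same option string as the run information in Chapter 4
        lmt.setOptions(Utils.splitOptions("-I -1 -M 15 -W 0.0"));
        lmt.buildClassifier(data);
        System.out.println(lmt); // prints the logistic model tree
    }
}

The complete Iris dataset used in this project follows, in ARFF format: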

@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
%
%
%
CHAPTER 4
RESULTS AND DISCUSSIONS

Iris Flower Dataset using WEKA

Screenshots captured:

Pre-processing done using the Normalize filter:


Fig 4.1 Normalization

Pre-processing done using the Discretize filter:


Fig 4.2 Discretization

Pre-processing done using the replace-missing-values filter:


Fig 4.3 Replacement of null values

Classifier LMT used for classification, with 97% accuracy:


=== Run information ===

Scheme: weka.classifiers.trees.LMT -I -1 -M 15 -W 0.0


Relation: vote-weka.filters.unsupervised.attribute.Normalize-S1.0-T0.0-
weka.filters.unsupervised.attribute.Normalize-S1.0-T0.0
Instances: 435
Attributes: 17
handicapped-infants
water-project-cost-sharing
adoption-of-the-budget-resolution
physician-fee-freeze
el-salvador-aid
religious-groups-in-schools
anti-satellite-test-ban
aid-to-nicaraguan-contras
mx-missile
immigration
synfuels-corporation-cutback
education-spending
superfund-right-to-sue
crime
duty-free-exports
export-administration-act-south-africa
Class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Logistic model tree


------------------
: LM_1:3/3 (435)

Number of Leaves : 1

Size of the Tree : 1


LM_1:
Class democrat :
0.41 +
[adoption-of-the-budget-resolution=y] * 0.81 +
[physician-fee-freeze=y] * -1.8 +
[synfuels-corporation-cutback=y] * 0.8

Class republican :
-0.41 +
[adoption-of-the-budget-resolution=y] * -0.81 +
[physician-fee-freeze=y] * 1.8 +
[synfuels-corporation-cutback=y] * -0.8
Time taken to build model: 0.66 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.02 seconds

=== Summary ===

Correctly Classified Instances 419 96.3218 %


Incorrectly Classified Instances 16 3.6782 %
Kappa statistic 0.9224
Mean absolute error 0.1066
Root mean squared error 0.1841
Relative absolute error 22.478 %
Root relative squared error 37.8067 %
Total Number of Instances 435

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.970 0.048 0.970 0.970 0.970 0.922 0.986 0.987 democrat
0.952 0.030 0.952 0.952 0.952 0.922 0.986 0.972 republican
Weighted Avg. 0.963 0.041 0.963 0.963 0.963 0.922 0.986 0.981

=== Confusion Matrix ===

a b <-- classified as
259 8 | a = democrat
8 160 | b = republican
Classifier Naïve Bayes used for classification, with 96% accuracy:

=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: vote-weka.filters.unsupervised.attribute.Normalize-S1.0-T0.0-
weka.filters.unsupervised.attribute.Normalize-S1.0-T0.0
Instances: 435
Attributes: 17
handicapped-infants
water-project-cost-sharing
adoption-of-the-budget-resolution
physician-fee-freeze
el-salvador-aid
religious-groups-in-schools
anti-satellite-test-ban
aid-to-nicaraguan-contras
mx-missile
immigration
synfuels-corporation-cutback
education-spending
superfund-right-to-sue
crime
duty-free-exports
export-administration-act-south-africa
Class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Naive Bayes Classifier

Class
Attribute democrat republican
(0.61) (0.39)
===============================================================
handicapped-infants
n 103.0 135.0
y 157.0 32.0
[total] 260.0 167.0

water-project-cost-sharing
n 120.0 74.0
y 121.0 76.0
[total] 241.0 150.0
adoption-of-the-budget-resolution
n 30.0 143.0
y 232.0 23.0
[total] 262.0 166.0

physician-fee-freeze
n 246.0 3.0
y 15.0 164.0
[total] 261.0 167.0

el-salvador-aid
n 201.0 9.0
y 56.0 158.0
[total] 257.0 167.0

religious-groups-in-schools
n 136.0 18.0
y 124.0 150.0
[total] 260.0 168.0

anti-satellite-test-ban
n 60.0 124.0
y 201.0 40.0
[total] 261.0 164.0

aid-to-nicaraguan-contras
n 46.0 134.0
y 219.0 25.0
[total] 265.0 159.0

mx-missile
n 61.0 147.0
y 189.0 20.0
[total] 250.0 167.0

immigration
n 140.0 74.0
y 125.0 93.0
[total] 265.0 167.0

synfuels-corporation-cutback
n 127.0 139.0
y 130.0 22.0
[total] 257.0 161.0

education-spending
n 214.0 21.0
y 37.0 136.0
[total] 251.0 157.0

superfund-right-to-sue
n 180.0 23.0
y 74.0 137.0
[total] 254.0 160.0

crime
n 168.0 4.0
y 91.0 159.0
[total] 259.0 163.0

duty-free-exports
n 92.0 143.0
y 161.0 15.0
[total] 253.0 158.0

export-administration-act-south-africa
n 13.0 51.0
y 174.0 97.0
[total] 187.0 148.0

Time taken to build model: 0.01 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.02 seconds

=== Summary ===

Correctly Classified Instances 393 90.3448 %


Incorrectly Classified Instances 42 9.6552 %
Kappa statistic 0.7999
Mean absolute error 0.0975
Root mean squared error 0.2944
Relative absolute error 20.555 %
Root relative squared error 60.469 %
Total Number of Instances 435

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.891 0.077 0.948 0.891 0.919 0.802 0.974 0.984 democrat
0.923 0.109 0.842 0.923 0.881 0.802 0.974 0.960 republican
Weighted Avg. 0.903 0.089 0.907 0.903 0.904 0.802 0.974 0.975

=== Confusion Matrix ===

a b <-- classified as
238 29 | a = democrat
13 155 | b = republican

Classifier J48 used for classification, with 96% accuracy:

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2


Relation: vote-weka.filters.unsupervised.attribute.Normalize-S1.0-T0.0-
weka.filters.unsupervised.attribute.Normalize-S1.0-T0.0
Instances: 435
Attributes: 17
handicapped-infants
water-project-cost-sharing
adoption-of-the-budget-resolution
physician-fee-freeze
el-salvador-aid
religious-groups-in-schools
anti-satellite-test-ban
aid-to-nicaraguan-contras
mx-missile
immigration
synfuels-corporation-cutback
education-spending
superfund-right-to-sue
crime
duty-free-exports
export-administration-act-south-africa
Class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree


------------------

physician-fee-freeze = n: democrat (253.41/3.75)


physician-fee-freeze = y
| synfuels-corporation-cutback = n: republican (145.71/4.0)
| synfuels-corporation-cutback = y
| | mx-missile = n
| | | adoption-of-the-budget-resolution = n: republican (22.61/3.32)
| | | adoption-of-the-budget-resolution = y
| | | | anti-satellite-test-ban = n: democrat (5.04/0.02)
| | | | anti-satellite-test-ban = y: republican (2.21)
| | mx-missile = y: democrat (6.03/1.03)

Number of Leaves : 6

Size of the tree : 11

Time taken to build model: 0.01 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.02 seconds

=== Summary ===

Correctly Classified Instances 423 97.2414 %


Incorrectly Classified Instances 12 2.7586 %
Kappa statistic 0.9418
Mean absolute error 0.0519
Root mean squared error 0.1506
Relative absolute error 10.9481 %
Root relative squared error 30.9353 %
Total Number of Instances 435

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.978 0.036 0.978 0.978 0.978 0.942 0.986 0.987 democrat
0.964 0.022 0.964 0.964 0.964 0.942 0.986 0.970 republican
Weighted Avg. 0.972 0.031 0.972 0.972 0.972 0.942 0.986 0.981

=== Confusion Matrix ===

a b <-- classified as
261 6 | a = democrat
6 162 | b = republican
Fig 4.6 J48
Cross-validation performed on Naïve Bayes:

Fig 4.7 Cross-validation performed on Naïve Bayes

Cross-validation performed on Logistic:


Fig 4.8 Cross-validation performed on Logistic

Cross-validation performed on Random Forest:

Fig 4.9 Cross-validation performed on Random Forest

Overall analysis of the classification performed:

Sr. No.  Classifier used                      Instances correctly  Instances incorrectly  Overall
                                              classified           classified             accuracy
1.       LMT                                  146                  4                      97.3%
2.       Naïve Bayes                          144                  6                      96%
3.       J48                                  144                  6                      96%
4.       Cross-validation on Naïve Bayes      142                  8                      94.6%
5.       Cross-validation on Logistic         144                  6                      96%
6.       Cross-validation on Random Forest    143                  7                      95.3%

Fig 4.10 Overall analysis of classification

So, we have concluded that the LMT algorithm works best for our Iris flower
dataset analysis, giving an accuracy of 97%, and it is hereby considered
suitable enough for analyzing the given dataset.
CHAPTER 5
ADVANTAGES AND DISADVANTAGES

ADVANTAGES:
1. Freely available under the GNU General Public License.
2. Portable, since it is fully implemented in the Java programming language.
3. Runs on almost any modern computing platform.
4. Easy to use due to its graphical user interface.

DISADVANTAGES:
1. It can only handle small datasets.
2. Blockchain can be a thing to consider.
3. Using it via the command line is painful without the readline capability of
the shell.
CHAPTER 6

CONCLUSION

Finally, after all the analysis, we obtained the results for the given dataset.
We observe that LMT is the best of the classification algorithms analyzed,
followed by J48 and Naïve Bayes, whose accuracies are close to that of LMT;
indeed, J48 and Naïve Bayes show the same level of accuracy on our data. We
have concluded that the LMT algorithm works best for our Iris flower dataset
analysis, giving an accuracy of 97%, and it is hereby considered suitable
enough for analyzing the given dataset.
REFERENCES

1. https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/iris.arff
2. https://www.cs.waikato.ac.nz/ml/weka/Witten_et_al_2016_appendix.pdf
3. https://courses.soe.ucsc.edu/courses/tim245/Spring12/01/pages/attached-files/attachments/11549
