This action might not be possible to undo. Are you sure you want to continue?

# MSC INTERNET SYSTEMS ENGINEERING

COMPARITIVE STUDIES OF DATA MINING TECHINIQUES BY MAREPALLY ARUNKUMAR REDDY 0853756 SUBMITTED TO UNIVERSITY OF EAST LONDON

0853756

1

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

Table of contents

1. Abstract 2. Introduction

2.1 Introduction to WEKA 2.2 Data Mining 2.3 Implementation of Algorithms 2.3.1 Dataset Introduction: Habermans Survival Data 2.3.2 Data Cleaning 2.3.3 Dataset Transformation 2.3.4 Attribute Description 2.3.5 Class Description 2.3.6 Input Encoding/Input Representation 2.3.7 Input Type 2.4 Implementing the Algorithms 2.4.1 Reasons why WEKA is chosen 2.4.1.1 Implementation of CRUISE 2.4.2 Implementing Algorithms in WEKA 2.4.3 In-depth Analysis of Results of three Algorithms J48, Multilayer Perceptron and Naïve Bayes 2.4.3.1 Evaluation of J48: 2.4.3.2 Calculation of Accuracy & Confusion Matrix in J48 2.4.3.3 Evaluation of Multi Layer Perceptron 2.4.3.4 Evaluation of Naïve Bayes 2.5 Overall Accuracy rate of the algorithms implemented 2.6 Difficulties faced while using WEKA

3. Research analysis

3.1 Introduction 3.2 Decision Tree 3.3 Neural Networks 3.4 Comparison of J48 Decision Tree and Neural Networks-Multilayer Perceptron 3.5 Results based on J48 and Multilayer Perceptron 3.6 Conclusion

0853756

2

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

**Comparative Studies of Data Mining Techniques 1. Abstract:
**

This paper here discusses about implementing Data Mining techniques using WEKA. This paper is divided into two parts the first part gives the information about how the implementation of dataset is done using trees like J48 (Decision tree), Naïve Bayes and Neural Networks (Multi layer Perceptron). The comparison is done on the basis of accuracy rate, correctly classified instances and results before and after pruning by implementing the three algorithms. Visualization of trees, step by step screen dumps of algorithms and graphs are involved in the first part. Second part involves comparison and evaluation of J48 classifier; Naïve Bayes classifier and implementation Multi layer Perceptron of Neural Networks. This paper also gives a brief description on how the accuracy rate is calculated in the algorithms.

2. Introduction:

The main objective of this paper is to look at various tools and techniques in data mining based on classification of testing dataset which gives us results based on accuracy, correctly classified instances, incorrectly classified instances and misclassified instances of the particular dataset.

2.1 Introduction to WEKA:

A software tool used to develop various data mining techniques. It is developed by University of Waikato, New Zealand. I have used this software tool to test the dataset based on three different classifications like J48 decision tree, implementing multi layer perceptron of Neural Networks and probability based algorithm like Naïve Bayes.

**2.2 Data Mining: is defined as acquiring information or
**

knowledge from huge amount of data. In the data mining area various algorithms which can be referred as intelligent algorithms are applied to the data in order to get different patterns. Keywords: Data mining, Multilayer Perceptron (Neural Networks), J48 (Decision Trees)

0853756

3

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

2.3 Implementation of Algorithms:

In this section we will come across how the Decision Trees, Naïve Bayes algorithm and Neural Networks are being implemented to the dataset and are compared to each of the algorithm based on its results.

**2.3.1 Dataset Introduction: Habermans Survival Data
**

The dataset Habermans survival data is been chosen from the archives of UCI – Machine Learning Repository KDD [2]. The dataset is the multivariate data type and its default task is classification which consists of 306 instances and 3 attributes and one class attribute. All the instances, attributes and class attributes are represented in CSV (comma separated values) format.

2.3.2 Dataset Cleaning:

According to the UCI Archives there are some datasets which posses missing values and inconsistent values but the dataset Habermans Survival Data has no missing values. If there are missing values they are represented as ‘?’ in the data.

2.3.3 Dataset Transformation:

The dataset which is downloaded is then stored in a word document. The following is the format to save the data. @Relation ‘Dataset name’ @Attribute ‘------‘ @Attribute ‘------‘ @data ‘Downloaded data’ After following this then the file is saved as ARFF (.arff - Attribute Relation File Format) format in order to use it in WEKA and is then opened with WEKA software in order to get the patterns.

**2.3.4 Attribute Description: Attribute
**

0853756

Description

4

Type

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

Patient age Operation year Number of nodes

Patients age at the time of operation. Patient’s year of operation. (19581990) Number of positive auxiliary nodes detected.

Continuous Discrete Discrete

**2.3.5 Class Description: Class Survival status Description
**

Survival status of the patient.

**2.3.6 Input Encoding/ Input Representation:
**

The attributes in the dataset consists of both Discrete and Continuous data types. A simple definition of these data types are, if the data is changing for example age and experience then the data is Continuous. If the data remains unchanged such as gender, name, id etc then the data is Discrete. Here in the data set which was chosen has a continuous data attribute which can cause some defects in accuracy and performance of the dataset. Continuous data can be converted into discrete data by following supervised discrete procedure in WEKA.

2.3.7 Input Type:

The input file which we open in WEKA should be in .arff file format. ARFF – Attribute Relation File Format.

0853756

5

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

Fig: ARFF viewer of the dataset.

**2.4 Implementing the Algorithms: 2.4.1 Reasons why WEKA is chosen:
**

In this coursework I have implemented the algorithms to a particular dataset using software called WEKA. The main reason for choosing this software is its user friendly access to the beginners. Various pre-processed classification, clustering and association algorithms are being implemented by giving an input file which must be .arff format. The reason why the file should be in ARFF format is mostly various data mining tools operate with CSV but in WEKA the file has to be modified by altering the description of the dataset. Data format for WEKA: Input file @RELATION HabemansSurvivalData @ATTRIBUTE @ATTRIBUTE @ATTRIBUTE @ATTRIBUTE @DATA 0853756 6 DATA MINING PatientAge NUMERIC operationYear NUMERIC NumberOfNodes NUMERIC class {1, 2}

MSC INTERNET SYSTEMS ENGINEERING 30, 64, 1, 1 30, 62, 3, 1 And so on One more tool which I have compared with WEKA is CRUISE ()

2.4.1.1 Implementation of CRUISE:

CRUISE is one of the data mining software mainly used for developing decision tree classifications. In this tool the input is in two files where as in WEKA the input is just one file. Here in CRUISE, we give the input in two ways one id the description file and another is the data file which can be seen in the Fig 2below. Data format for CRUISE: 1st input file – Descriptive file Habermans.txt ? (If there are any missing values?) Column, varname, vartype 1, PatientAge, n 2, operationYear, n 3, NumberOfNodes, n 4, class, d 2nd input file – Data file 30, 64, 1, 1 30, 62, 3, 1 And so on.

0853756

7

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

Fig 2: CRUISE input format.

0853756

8

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

**2.4.2 Implementing algorithms in WEKA:
**

The dataset which is saved in .arff format is then retrieved through WEKA tool for implementing the classification algorithms. 1. After opening the file if there are too many continuous data types then the dataset has to follow supervised discrete procedure under filters.

Fig: Discretize procedure for the dataset. 2. Then after choosing the filter click on the classify tab to choose the algorithm. By clicking on the ‘choose’ button under classifier we can choose j48 under trees and click start.

0853756

9

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

Fig: choosing J48 Algorithm 3. In this coursework we come across three algorithms J48 – Decision Tree, Naïve Bayes and Multi Layer Perceptron – Neural Networks.

0853756

10

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING 4. After implementing the J48 algorithm a multi layer perceptron is implemented. The Multi Layer Perceptron is implemented under functions.

Fig: Implementation of Multi layer Perceptron. 5. After Implementing Multi layer Perceptron another algorithm Naïve Bayes is introduced to the dataset.

0853756

11

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

Fig: Implementation of Naïve Bayes algorithm

**2.4.3 In depth analysis of Results of three algorithms J48, Multi Layer Perceptron and Naïve Bayes:
**

Here after implementing the algorithms WEKA produces output based on accuracy and performance of the dataset. The evaluation of these algorithm results are based on correctly classified instances, incorrectly classified instances and misclassified instances (if applicable).

2.4.3.1 Evaluation of J48:

Here in J48 algorithm the results are compared on the basis of ‘Test Options’ through which the dataset can be tested using J48 decision tree. Each results of accuracy differ from each ‘Test Option’ used. The evaluation is also done on the basis of raw and discrete data.

Data type Raw Data

J48Binary split=Fal se 71.8954 %

J48Binary split=Tr ue 71.8954 %

0853756

12

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING Discret 72.2222 72.2222 e Data % % Table: Accuracy Rate in J48. Algorithm J48 Test Option Use Training Set J48 Cross Validation (10 Folds) J48 Percentage Split (66%) Table: Results based on Results 73.5294% 72.2222% 76.9231% Test options

Algorithm

Confusion Matrix J48 225 0 81 0 J48 210 1 5 70 11 J48 Percentag 71 1 e Split 2 (66%) 12 9 Table: Confusion Matrix based on Test Options

Test Options Use Training Set Cross Validation (10 Folds)

**2.4.3.2 Calculation of Accuracy and Confusion Matrix in J48:
**

Total number of Instances: 306 Accuracy rate = number of correctly classified instance/ number of instances Accuracy rate of J48 (test option- use training set) = 225/306 = 0.735294= 73.5294% (as shown in table) Similarly accuracy rate of J48 – Cross Validation is done. But when we use Percentage Split (66%) the Accuracy rate is calculated in the following way: As it says 66% the remaining 34% of total number of instances (306) is = 104 So here the total number of instances used is 104 and Accuracy rate = 80/104= 0.769231= 76.9231% (as shown in table)

0853756

13

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

Confusion matrix of J48 (Cross Validation) is solved in the following way: a b 210 70 A 210+ 70+ 15 11 B 15 11 Result 225 81 306(total no of instances)

**2.4.3.3 Evaluation of Multi Layer Perceptron:
**

Here in this Multi Layer Perceptron algorithm the data is evaluated on both Raw and Discrete data and the results are compared on the basis of accuracy and time taken to build the model. Data type Multi Layer Perceptro n 72.8758% Time taken to build

Raw Data

0.47 seconds Discrete 73.5294% 0.42 Data seconds Table: Accuracy of Multi Layer Perceptron. After calculating and comparing the accuracy of the data then errors are classified. The accuracy rate is more to discrete data when compared to raw data because in supervised discretization class label is known and also there is improvement in time taken to build the model.

0853756

14

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

Error classifier gives the graph which shows the wrongly classified instances. By clicking on the instance in the graph it displays “instance info” which shows “wrongly classified instance number” and what could be the prediction class. By changing the prediction class in the data there could be a certain amount of increasing difference in accuracy.

**2.4.3.4 Evaluation of Naïve Bayes:
**

The Naïve Bayes algorithm is implemented by taking both the raw and discrete data. The raw data is changed to discrete in supervised filter method. Data Type Raw Supervised Discretization=Fa lse 74.8366% Supervised Discretization=T rue 72.8758%

Discret 72.8758% 72.8758% e Table: accuracy rate of Naïve Bayes. Here in Naïve Bayes algorithm there is decrease in accuracy rate when the data is discretized in supervised filter. But when the data is changed to discrete in Unsupervised filter there is increase in accuracy rate (76.7974%).

0853756

15

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING

**2.5 Overall Accuracy rate of the algorithms implemented:
**

The table here shows the accuracy rates of the algorithms which are chosen to evaluate the dataset. data j48 bs=f multi layer naïve j48 bs=t naïve type perceptron bayes bayes Disc=F disc=t Raw 71.8954 72.8758% 74.8366 71.8954 72.8758 % % % % Discret 72.2222 73.5294% 72.8758 72.2222 72.8758 e % % % % Time 0 sec 0.47sec 0sec 0.02sec 0sec taken to build for Raw data Time 0 sec 0.42sec 0sec 0.02sec 0sec taken to build discret e data Table: percentage of instances classified correctly. From the above table the accuracy rate of Naïve Bayes is more when the supervised discretization is set to false for raw data. And when the raw data is discretized the highest accuracy rate is shown by Multi layer perceptron. Here in this comparison of accuracy rates with the algorithms there is not much difference in the percentage. There is small amount of increase only when the raw data is discretized.

**2.6 Difficulties faced while using WEKA:
**

• • • Initially saving the downloaded file into a text document and saving it to .arff format had some errors. It was reduced by saving the file in an already saved .arff file. WEKA is easy to use when compared to CRUISE because WEKA has only one input file where as CRUISE have two inputs (data file, description file).

3. Research analysis:

In this section detailed analysis between the two data mining techniques is done based on results obtained.

3.1 Introduction:

0853756 16 DATA MINING

MSC INTERNET SYSTEMS ENGINEERING As my chosen dataset deals with Medical and Health care industry which is growing undoubtedly day by day with the patients electronic health records. In terms of Medical and health care industry data mining can be defined as a research every minute about new diseases and its cure. Here research is nothing but to acquire, retrieve and gain knowledge from small to large amount of available data. Therefore to get the patterns from the data I have used two techniques Decision Trees and Neural Networks which are one of the techniques in data mining and also comparison and evaluation of J48 and Multi Layer Perceptron algorithm based on Neural Networks.

3.2 Decision Tree:

According to my view from my past learning in Unified Modelling Language (UML) a decision tree can be defined as a visual representation of the data which makes human understandable. A decision tree has a root, nodes, sub nodes. In order to get the patterns from huge amount of available data various approaches are being used and Decision tree is one among the methods which is adopted in areas such as science and technology, knowledge discovery databases and data mining. [1] Decision tree calculates and predicts the input which is given which could be numeric or text and it handles the input easily and generates decision tree even when there are missing values, unclassified instances, wrongly classified instances. If the input data has huge amount of instances there could be some difficulties in understanding the decision tree according to which node is connected to which of the decision node.

Fig: Decision Tree

3.3 Neural Networks:

0853756

17

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING In the data mining tools Neural Network can be used to solve data patterns and relations between input and output data which are complex in nature. Using this Neural Networks information can be gained from large databases. Neural Networks can be used in clustering algorithms, classification pattern recognition, function approximation and prediction. Neural Networks are assigned to only specific tasks. A simple artificial neuron has input and output based on the data but when compared to human brain contains more than 10billion nerve cells or neurons. Neural Networks are used in Face recognition and each pixel is divided into binary splits. If Neural Networks is operating 10 face recognitions, it first looks for 10 common features. In data mining Multilayer Perceptron is under Neural Networks algorithm which has input values, output values and hidden values. As we can see the output model generated in WEKA which shows that the input is assigned as weights.

Fig: Neural Network for the chosen dataset.

**3.4 Comparison of J48 (Decision Tree) and Neural Networks (Multilayer Perceptron):
**

A decision tree has leaf nodes and decision nodes which are similar to use cases in UML where the actor has use cases represented in tree fashion. Leaf nodes are nothing but classification of the instances and decision nodes are the nodes used to test the leaf nodes. In Neural Networks there are neurons which are used for classification. J48 is used as algorithm in decision trees and Multilayer algorithm is used in neural networks. Implementing J48 to a dataset can visualise the decision trees and implementing Multilayer perceptron and changing the properties such GUI (Graphic User Interface) to true gives the neural network for the chosen dataset.

0853756

18

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING In Multilayer Perceptron the values of the binary split cannot be changed where as in J48 the binary splits can be modified to ‘true’ which can increase its accuracy rate. The accuracy rate of Multilayer perceptron is relatively high when compared to J48 decision tree.

**3.5 Results based on J48 and Multilayer Perceptron:
**

From the above analysis of J48 and Multilayer perceptron results based on accuracy gives a clear view that Multilayer perceptron has more accuracy rate than J48 for both the raw data and discrete data. Neural Networks take more time to compile when compared to J48 decision tree because of its high network analysis of data. The time taken for compiling the model in Multilayer perceptron is 0.47 seconds and in J48 the time taken to build the model is 0 seconds. Multilayer perceptron gives the output which is more easily understood by the users where as decision trees sometimes can be very large and complex in nature.

3.6 Conclusion:

The Dataset used in this coursework deals with Medical and Health care area where the database has huge amount of information. Data mining in this area means applying different kinds of algorithms to derive patterns of data. By applying algorithms like J48 and Multilayer perceptron efficiency is maintained throughout the data. Even though by using the WEKA tool there are some accuracy differences between the algorithms which are applied, each algorithm plays a vital role in developing Decision trees or Neural Networks.

References: 1. 1.Rokach, L. and O. Maimon (2008). Data mining with decision trees: theory and applications, World Scientific Pub Co Inc. 2. Li, Y. and E. Hung "Building a Decision Cluster Forest Model to Classify High Dimensional Data with Multi-classes." Advances in Machine Learning: 263-277. 3. http://archive.ics.uci.edu/ml/datasets.html 4. Han, J. (1996). Data mining techniques, ACM.

0853756

19

DATA MINING

MSC INTERNET SYSTEMS ENGINEERING 5. Berkhin, P. (2006). "A survey of clustering data mining techniques." Grouping Multidimensional Data: 25-71. 6. Gardner, M. and S. Dorling (1998). "Artificial neural networks (the multilayer perceptron)—A review of applications in the atmospheric sciences." Atmospheric Environment 32(14-15): 2627-2636. 7. SUTER, B. (1990). "The multilayer perceptron as an approximation to a Bayes optimal discriminant function." IEEE Transactions on Neural Networks 1(4): 291. 8. Mitchell, T. (1997). "Decision tree learning." Machine Learning 414. 9. Lewis, D. (1998). "Naive (Bayes) at forty: The independence assumption in information retrieval." Machine Learning: ECML-98: 415.

0853756

20

DATA MINING