
Machine Learning

Assignment 2

Name: Sakshi Jaju


Class: IT TY-C
Batch: C3
Roll No.: 333062
PRN no.: 22011167
Problem Statement: Study assignment write-up on data preprocessing, plotting graphs, and measuring performance using Python and the corresponding libraries.

WEKA offers a wide range of sample datasets for applying machine learning algorithms. Users can perform machine learning tasks such as classification, regression, attribute selection, and association on these sample datasets, and can also use them to learn the tool.

WEKA Explorer is used for performing several functions, starting with preprocessing. Preprocessing takes a .arff file as input, processes it, and produces output that can be used by other programs.

In WEKA, the output of preprocessing lists the attributes present in the dataset, which can then be used for statistical analysis and comparison with the class labels.

WEKA also offers many classification algorithms for decision trees. J48 is one of the popular classification algorithms and outputs a decision tree. Using the Classify tab, the user can visualize the decision tree. If the decision tree is too crowded, it can be simplified from the Preprocess tab by removing the attributes that are not required and running the classification again.

There are several ways of using Weka in a Python or Python-like environment.

Jython

If you're starting from scratch, you might want to consider Jython, an implementation of Python that integrates seamlessly with Java. The drawback is that you can only use libraries that run under Jython, not C-based ones like NumPy or SciPy. The article Using WEKA from Jython explains how to use WEKA classes from Jython and how to implement a new classifier in Jython, with an example of ZeroR implemented in Jython.

Jepp

An approach that makes use of the javax.script package (new in Java 6) is Jepp, Java embedded Python. Jepp seems to have the same limitations as Jython: it cannot import SciPy or NumPy, but pure Python libraries can be imported. The article Using WEKA via Jepp contains more information and examples.

JPype

Another solution for accessing Java from within Python applications is JPype.

python-weka-wrapper3

You can use the python-weka-wrapper3 library (for Python 3) to access most of the non-GUI
functionality of Weka (3.9.x):

● pypi
● github
● examples
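As a minimal sketch of what using this wrapper can look like, the function below cross-validates Weka's J48 on an ARFF file. This assumes a local Java runtime and `pip install python-weka-wrapper3`; the imports are deferred into the function so the file can be read without those dependencies installed, and the ARFF path is whatever you pass in:

```python
def evaluate_j48(arff_path):
    """Cross-validate Weka's J48 on an ARFF file; returns the summary string.

    Sketch using the python-weka-wrapper3 API; requires a Java runtime
    and the python-weka-wrapper3 package to actually run.
    """
    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.classifiers import Classifier, Evaluation
    from weka.core.classes import Random

    jvm.start(packages=False)          # bring up the JVM hosting Weka
    try:
        loader = Loader(classname="weka.core.converters.ArffLoader")
        data = loader.load_file(arff_path)
        data.class_is_last()           # last attribute is the class label
        cls = Classifier(classname="weka.classifiers.trees.J48")
        evl = Evaluation(data)
        evl.crossvalidate_model(cls, data, 10, Random(1))  # 10-fold CV
        return evl.summary()
    finally:
        jvm.stop()
```

Called as, for example, `evaluate_j48("iris.arff")`, it prints nothing itself but returns Weka's usual evaluation summary text.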

sklearn-weka-plugin

With the sklearn-weka-plugin library, you can use Weka from within the scikit-learn
framework. The library itself uses python-weka-wrapper3 under the hood to make use of the
Weka algorithms.

● pypi
● github
● examples

The WEKA machine learning tool provides a directory of some sample datasets. These datasets
can be directly loaded into WEKA for users to start developing models immediately.

The WEKA datasets can be found in the “C:\Program Files\Weka-3-8\data” directory. The
datasets are in .arff format.
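Since all of these files share the same plain-text structure, a minimal pure-Python sketch can illustrate how the .arff format is laid out. This toy parser handles only the basic @relation/@attribute/@data directives (real ARFF files also allow comments, strings, and sparse data), and the embedded three-row snippet is a made-up sample, not one of the bundled datasets:

```python
# Minimal ARFF reader sketch: handles only @relation, @attribute, and @data.
ARFF_TEXT = """\
@relation weather-sample
@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,FALSE,no
overcast,FALSE,yes
rainy,TRUE,no
"""

def parse_arff(text):
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            attributes.append(line.split(None, 2)[1])  # attribute name only
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            data.append([v.strip() for v in line.split(",")])
    return relation, attributes, data

relation, attributes, data = parse_arff(ARFF_TEXT)
print(relation, attributes, len(data))
```

The header declares the relation name and the attributes (here all nominal, with their allowed values in braces), and each line after @data is one instance with comma-separated attribute values.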
Sample WEKA Datasets

Some sample datasets present in WEKA are listed in the table below:

S.No. Sample Datasets
1. airline.arff
2. breast-cancer.arff
3. contact-lens.arff
4. cpu.arff
5. cpu.with-vendor.arff
6. credit-g.arff
7. diabetes.arff
8. glass.arff
9. hypothyroid.arff
10. ionosphere.arff
11. iris.2D.arff
12. iris.arff
13. labor.arff
14. ReutersCorn-train.arff
15. ReutersCorn-test.arff
16. ReutersGrain-train.arff
17. ReutersGrain-test.arff
18. segment-challenge.arff
19. segment-test.arff
20. soybean.arff
21. supermarket.arff
22. unbalanced.arff
23. vote.arff
24. weather.numeric.arff
25. weather.nominal.arff

Let’s take a look at some of these:


contact-lens.arff

The contact-lens.arff dataset is a database for fitting contact lenses. It was donated by
Benoit Julien in 1990.

Database: This database is complete and noise-free. It has 24 instances and 4 attributes.

Attributes: All four attributes are nominal. There are no missing attribute values. The four
attributes are as follows:

#1) Age of the patient: The attribute age can take values:
● young
● pre-presbyopic
● presbyopic

#2) Spectacle prescription: This attribute can take values:

● myope
● hypermetrope

#3) Astigmatic: This attribute can take values

● no
● yes

#4) Tear production rate: The values can be

● reduced
● normal

Class: Three class labels are defined here. These are:

● the patient should be fitted with hard contact lenses.
● the patient should be fitted with soft contact lenses.
● the patient should not be fitted with contact lenses.

Class Distribution: The number of instances classified into each class label is listed below:

Class Label — No. of Instances
1. Hard contact lenses: 4
2. Soft contact lenses: 5
3. No contact lenses: 15

iris.arff

The iris.arff dataset was created in 1988 by Michael Marshall. It is the Iris Plants database.

Database: This database is used for pattern recognition. The dataset contains 3 classes of 50
instances each. Each class represents a type of iris plant. One class is linearly separable from the
other 2, but the latter are not linearly separable from each other. The task is to predict to which
of the 3 iris species an observation belongs, which makes this a multi-class classification dataset.

Attributes: It has 4 numeric, predictive attributes and the class. There are no missing attribute values.

The attributes are:

● sepal length in cm
● sepal width in cm
● petal length in cm
● petal width in cm
● class:
○ Iris Setosa
○ Iris Versicolour
○ Iris Virginica

Summary Statistics:

Attribute     Min  Max  Mean  SD    Class Correlation
sepal length  4.3  7.9  5.84  0.83   0.7826
sepal width   2.0  4.4  3.05  0.43  -0.4194
petal length  1.0  6.9  3.76  1.76   0.9490 (high!)
petal width   0.1  2.5  1.20  0.76   0.9565 (high!)

Class Distribution: 33.3% for each of 3 classes
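These summary statistics can be reproduced in Python, in line with the problem statement's focus on data processing with Python libraries. The sketch below assumes NumPy and scikit-learn are installed; scikit-learn bundles the same Iris data, and the class correlation is computed here as the Pearson correlation between each attribute and the numeric class label:

```python
# Reproduce the iris summary statistics (min, max, mean, SD, class correlation).
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target          # 150 x 4 attribute matrix, class labels 0-2

for i, name in enumerate(iris.feature_names):
    col = X[:, i]
    corr = np.corrcoef(col, y)[0, 1]   # correlation of attribute with the class
    print(f"{name}: min={col.min():.1f} max={col.max():.1f} "
          f"mean={col.mean():.2f} sd={col.std(ddof=1):.2f} corr={corr:.4f}")
```

Running this prints one line per attribute, matching the table above (petal length and petal width show the high class correlations).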

Some other Datasets:

diabetes.arff

The database behind this dataset is Pima Indians Diabetes. The task is to predict whether a
patient will develop diabetes within the next 5 years. The patients in this dataset are all females
of at least 21 years of age of Pima Indian heritage. It has 768 instances and 8 numerical
attributes plus a class. This is a binary classification dataset, where the predicted output variable
is nominal with two classes.

ionosphere.arff

This is a popular dataset for binary classification. Each instance in this dataset describes the
properties of radar returns from the atmosphere. It is used to predict whether the ionosphere
shows some structure or not. It has 34 numerical attributes and a class.

The class attribute is “good” or “bad” and is predicted from the 34 attribute observations. The
received signals were processed by an autocorrelation function taking the time of a pulse and the
pulse number as arguments.

Regression Datasets

The regression datasets can be downloaded from the WEKA webpage “Collections of datasets”.
The collection has 37 regression problems obtained from different sources. The downloaded file
will create a numeric/ directory containing the regression datasets in .arff format.

The popular datasets in the directory are the Longley economic dataset (longley.arff), the
Boston house price dataset (housing.arff), and the sleep in mammals dataset (sleep.arff).

Let us now see how to identify real-valued and nominal attributes in the dataset using WEKA
explorer.

What Are Real-Valued and Nominal Attributes?

Real-valued attributes are numeric attributes containing only real values. These are measurable
quantities. Such attributes can be interval-scaled, such as temperature, or ratio-scaled, such as
weight or length.

Nominal attributes represent names or some representation of things. There is no order in such
attributes and they represent some category. For example, color.

Follow the steps listed below to use WEKA for identifying the real-valued and nominal
attributes in a dataset.

#1) Open WEKA and select “Explorer” under ‘Applications’.


#2) Select the “Preprocess” tab and click on “Open File”. WEKA ships with sample files that
users can access directly.

#3) Select the input file from the Weka-3-8 folder stored on the local system. Select the
predefined .arff file “credit-g.arff” and click on “Open”.
#4) An attribute list will open on the left panel. Selected attribute statistics will be shown on the
right panel along with the histogram.

Analysis of the dataset:

In the left panel the current relation shows:

● Relation name: german_credit, the name of the sample dataset.
● Instances: 1000, the number of data rows in the dataset.
● Attributes: 21, the number of attributes in the dataset.

The panel below the current relation shows the names of the attributes.

In the right panel, the selected attribute statistics are displayed. Select the attribute
“checking_status”.

It shows:

● Name: the name of the attribute.
● Missing: any missing values of the attribute in the dataset; 0% in this case.
● Distinct: the attribute has 4 distinct values.
● Type: the attribute is of the nominal type, i.e., it does not take numeric values.
● Count: among the 1000 instances, the count of each distinct value is given in the
count column.
● Histogram: it displays the output class label for the attribute. The class label in
this dataset is either good or bad. There are 700 instances of good (marked in
blue) and 300 instances of bad (marked in red).
○ For the value < 0, the numbers of good and bad instances are almost the
same.
○ For the value 0 <= X < 200, there are more instances with decision good
than with bad.
○ Similarly, for the value >= 200, most instances are good, and the “no
checking” value also has more instances with decision good.

Next, select the attribute “duration”.

The right panel shows:

● Name: the name of the attribute.
● Type: the type of the attribute is numeric.
● Missing: the attribute does not have any missing values.
● Distinct: it has 33 distinct values among the 1000 instances.
● Unique: it has 5 unique values, i.e., values that occur only once in the dataset.
● Minimum value: the minimum value of the attribute is 4.
● Maximum value: the maximum value of the attribute is 72.
● Mean: the sum of all values divided by the number of instances.
● Standard deviation: the standard deviation of the duration attribute.
● Histogram: for a duration of 4 units, most instances belong to the good class. As
the duration increases toward 38 units, the number of instances with the good
class label decreases. At 72 units there is only one instance, classified as bad.
The class is the classification feature of the nominal type. It has two distinct values: good and
bad. The good class label has 700 instances and the bad class label has 300 instances.
To visualize all the attributes of the dataset, click on “Visualize All”.

#5) To keep only the numeric attributes, click on the Filter button. From there, click
Choose -> filters -> unsupervised -> attribute -> RemoveType.

WEKA filters provide many functionalities for transforming the attribute values of a dataset to
make it suitable for the algorithms, for example, the numeric transformation of attributes.

Filtering the nominal and real-valued attributes from the dataset is another example of using
WEKA filters.
#6) Click on RemoveType in the filter bar. An object editor window will open. Select the
attributeType “Delete nominal attributes” and click on OK.
#7) Apply the filter. Only the numeric attributes (plus the class) will be shown.

The class attribute is of the nominal type. Since it classifies the output, it is not deleted and is
therefore shown alongside the numeric attributes.
Output:

The real-valued and nominal attributes in the dataset are identified. The visualization against the
class label is shown in the form of histograms.
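The same kind of class-colored histogram can be drawn directly in Python with matplotlib, in line with the plotting part of the problem statement. The sketch below uses made-up toy “duration” values and class labels (they are illustrative only, not the credit-g data) and renders off-screen to a PNG file:

```python
# Sketch: a class-colored histogram like WEKA's attribute view.
# The duration values and class labels below are toy data, not credit-g.
import matplotlib
matplotlib.use("Agg")                  # render off-screen, no display needed
import matplotlib.pyplot as plt

durations = [4, 6, 9, 12, 12, 18, 24, 24, 30, 36, 48, 60, 72]
labels = ["good"] * 9 + ["bad"] * 4
good = [d for d, c in zip(durations, labels) if c == "good"]
bad = [d for d, c in zip(durations, labels) if c == "bad"]

plt.hist([good, bad], bins=6, stacked=True,
         color=["tab:blue", "tab:red"], label=["good", "bad"])
plt.xlabel("duration")
plt.ylabel("instances")
plt.legend()
plt.savefig("duration_hist.png")       # write the histogram to a file
```

Stacking the two class-wise histograms reproduces WEKA's blue/red attribute view: each bar shows how many good and bad instances fall into that duration range.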

Weka Decision Tree Classification Algorithms

Now, we will see how to implement decision tree classification on weather.nominal.arff dataset
using the J48 classifier.

weather.nominal.arff

It is a sample dataset present in the data directory of WEKA. This dataset predicts if the weather
is suitable for playing cricket. The dataset has 5 attributes and 14 instances. The class label
“play” classifies the output as “yes” or “no”.

What Is a Decision Tree?

A decision tree is a classification technique that consists of three components: the root node,
branches (edges or links), and leaf nodes. A node represents a test condition on an attribute, a
branch represents a possible outcome of that test, and a leaf node contains the label of the class
to which an instance belongs. The root node is at the start of the tree, which is also called the
top of the tree.

J48 Classifier

J48 is an algorithm to generate a decision tree, based on C4.5 (an extension of ID3). It is also
known as a statistical classifier. For decision tree classification, we need a dataset.
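The attribute-selection heuristic at the heart of ID3/C4.5 can be sketched in plain Python. Computing the information gain of each attribute on the weather.nominal data shows why “outlook” ends up at the root of the tree; this is a simplified sketch, since J48 actually uses gain ratio and adds pruning:

```python
from math import log2
from collections import Counter

# weather.nominal: (outlook, temperature, humidity, windy, play)
DATA = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
ATTRS = ["outlook", "temperature", "humidity", "windy"]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(rows, attr_index):
    """Entropy reduction from splitting on one attribute."""
    base = entropy([r[-1] for r in rows])
    splits = Counter(r[attr_index] for r in rows)
    remainder = sum(
        n / len(rows) * entropy([r[-1] for r in rows if r[attr_index] == v])
        for v, n in splits.items()
    )
    return base - remainder

gains = {a: info_gain(DATA, i) for i, a in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(best, round(gains["outlook"], 3))   # -> outlook 0.247
```

“outlook” has the highest gain (about 0.247, against 0.152 for humidity), which is exactly why the tree in the output below branches on outlook first.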

Steps include:

#1) Open WEKA explorer.

#2) Select the weather.nominal.arff file via “Open file” under the Preprocess tab.
#3) Go to the “Classify” tab for classifying the unclassified data. Click on the “Choose” button.
From this, select “trees -> J48”. Let us also have a quick look at the other categories in the
Choose button:

● bayes: Bayesian classifiers such as NaiveBayes.
● functions: function-based classifiers such as Logistic regression and SMO.
● lazy: instance-based (lazy) learners such as IBk.
● meta: meta-classifiers such as Bagging and AdaBoostM1, which combine base classifiers.
● rules: rule-based learners such as OneR and JRip.
● trees: decision tree learners such as J48 and RandomForest.
#4) Click on the Start button. The classifier output will be seen in the right-hand panel. It shows
the run information in the panel as:

● Scheme: the classification algorithm used.
● Instances: the number of data rows in the dataset.
● Attributes: the dataset has 5 attributes.
● The number of leaves and the size of the tree, which describe the decision tree.
● Time taken to build the model.
● The full J48 pruned tree with the attributes and the number of instances.
#5) To visualize the tree, right-click on the result and select “Visualize tree”.
Output:

The output is in the form of a decision tree. The root attribute is “outlook”.

If the outlook is sunny, the tree further tests the humidity. If humidity is high, the class label is
play = “no”; if humidity is normal, play = “yes”.

If the outlook is overcast, the class label play is “yes”. The number of instances that obey this
classification is 4.

If the outlook is rainy, further classification takes place on the attribute “windy”. If
windy = true, then play = “no”. The number of instances that obey the classification
outlook = rainy and windy = true is 2.
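The learned tree can be written out as a plain Python function, which makes the rules easy to check by hand. This is a sketch of the rules J48 prints for this dataset (here windy is passed as a boolean rather than the dataset's TRUE/FALSE strings):

```python
def predict_play(outlook, humidity, windy):
    """Decision rules from J48's pruned tree on weather.nominal (sketch)."""
    if outlook == "overcast":
        return "yes"                            # overcast days: always play
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    # remaining case: outlook == "rainy", decided by wind
    return "no" if windy else "yes"

print(predict_play("sunny", "high", False))     # -> no
```

Each if-branch corresponds to one path from the root to a leaf in the visualized tree.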

Conclusion

WEKA offers a wide range of sample datasets, such as iris.arff, weather.nominal.arff, and
credit-g.arff, on which users can practice classification, regression, attribute selection, and
association while learning the tool. Using the Explorer, we preprocessed a .arff dataset,
identified its real-valued and nominal attributes, and visualized them against the class labels as
histograms. We then built a J48 decision tree on the weather.nominal dataset and visualized it
from the Classify tab; if a tree is too crowded, it can be simplified by removing unneeded
attributes in the Preprocess tab and running the classification again. Finally, Weka can also be
driven from Python through wrappers such as python-weka-wrapper3 and sklearn-weka-plugin.
