
Exploring Datasets

Instances
• The input to a machine learning scheme is a set of instances.

• These instances are to be classified, associated, or clustered.
Dataset
• Each dataset is represented as a matrix of instances versus attributes, called a flat file.
• Classification problem (predicting the class value)
• Whether or not to play a specific game
• By default, the class is the last attribute
• To change default
Nominal and Numeric attributes

• Numeric attributes, sometimes called continuous attributes, measure numbers, either real or integer valued.

• Nominal attributes take on values in a prespecified, finite set of possibilities and are sometimes called categorical.
• Class data can be discrete or continuous

• If the class is discrete: classification problem

• If the class is continuous: regression problem
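
The rule above can be sketched in a few lines of Python. This is only an illustration: the `(name, type)` pair representation and the `problem_type` helper are assumptions made for this example and are not part of Weka. A nominal attribute is modeled as a list of its allowed values, and a numeric one as the string "numeric".

```python
# Hypothetical representation: each attribute is a (name, type) pair.
# A nominal type is a list of allowed values; a numeric type is a string.
NUMERIC_TYPES = ("numeric", "real", "integer")


def problem_type(attributes, class_index=-1):
    """Return the kind of learning problem implied by the class attribute.

    class_index defaults to -1, i.e. the last attribute is the class.
    """
    _, class_type = attributes[class_index]
    if isinstance(class_type, list):
        return "classification"  # discrete (nominal) class
    if class_type in NUMERIC_TYPES:
        return "regression"  # continuous (numeric) class
    raise ValueError(f"unsupported class type: {class_type!r}")


# Illustrative attribute list modeled on the weather data.
weather = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("humidity", "numeric"),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]

default_problem = problem_type(weather)                 # class = play
other_problem = problem_type(weather, class_index=1)    # class = temperature
```

Passing a different `class_index` mirrors choosing a class other than the default last attribute.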


• In this data set, temperature and humidity are numeric, while in the previous data set they were nominal
Click the Visualize All option
• 7 different types of glass
• RI: Refractive Index
• Shows percentage of Si in a glass
• An ARFF (Attribute-Relation File Format) file is
an ASCII text file that describes a list of instances
sharing a set of attributes. 

• ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software.
ARFF file format
• Lines that begin with % are comments
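
To make the layout concrete, here is a minimal Python sketch that reads a simple ARFF text: it skips % comment lines, collects the `@relation` name and the `@attribute` declarations, and then reads the comma-separated instances after `@data`. The `parse_arff` function and the sample text are assumptions written for this illustration, not Weka code. The sketch ignores quoted names, string and date attributes, and the sparse format.

```python
def parse_arff(text):
    """Parse a simple ARFF text into (relation, attributes, rows)."""
    relation = None
    attributes = []  # (name, type); a nominal type is a list of values
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and % comments
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: values listed inside braces.
                values = [v.strip() for v in spec.strip("{}").split(",")]
                attributes.append((name, values))
            else:
                attributes.append((name, spec.lower()))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Illustrative sample modeled on the weather data (not the shipped file).
SAMPLE = """\
% A comment line: ignored by the parser
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
```

All values are kept as strings here; converting numeric columns to numbers is left out to keep the sketch short.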
Activity
• Open the iris dataset.
• 1. How many instances are there?
• 2. How many attributes are there?
• 3. How many possible values does the class
attribute have?
• 4. Do an image search on the web to find pictures
of Iris setosa, Iris virginica and Iris versicolor to
see what the different types look like.
• 5. Label these images of irises according to their
type by choosing the correct sequence:
• 6. Does the class Iris-setosa tend to have high or
low values of sepal length?
• Low
• 7. Does the class Iris-virginica tend to have high
or low values of petalwidth?
• High
• 8. Which of these attributes, taken by itself, gives
the best indication of the class?
 sepallength
 sepalwidth
 petalwidth
• petalwidth
• 9. Examine the Iris ARFF file header. When was
the dataset first used?
 1936
 1973
 1980
 1988
• 1936
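
Questions 1 to 3 come down to counting rows, columns, and class values. The sketch below shows that counting on a hand-written, truncated sample. The `summarize` helper and the `(name, type)` representation are assumptions for this example. Only two rows are included, so the instance count it reports is for the sample and not for the full iris file.

```python
def summarize(attributes, rows, class_index=-1):
    """Count instances, attributes, and possible class values."""
    _, class_values = attributes[class_index]
    return {
        "instances": len(rows),
        "attributes": len(attributes),
        "class_values": len(class_values),
    }


# Header modeled on the iris data; the class is the last attribute.
iris_attributes = [
    ("sepallength", "numeric"),
    ("sepalwidth", "numeric"),
    ("petallength", "numeric"),
    ("petalwidth", "numeric"),
    ("class", ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]),
]

# Two sample rows only; the real file contains many more.
iris_rows = [
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
]

summary = summarize(iris_attributes, iris_rows)
```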
