
Class 4

Data Mining
Practical Machine Learning Tools and Techniques

I. H. Witten, E. Frank,
M. A. Hall, and C. J. Pal
ISBN: 978-0-12-804291-5
Creating ARFF files
More Data Mining with Weka

Class 5 – Lesson 1
Simple neural networks

Ian H. Witten
Department of Computer
Science University of Waikato
New Zealand

weka.waikato.ac.nz
Lesson 5.1: Simple neural networks

Class 1 Exploring Weka’s interfaces; working with big data
Class 2 Discretization and text classification
Class 3 Classification rules, association rules, and clustering
Class 4 Selecting attributes and counting the cost
Class 5 Neural networks, learning curves, and performance optimization

Lesson 5.1 Simple neural networks
Lesson 5.2 Multilayer Perceptrons
Lesson 5.3 Learning curves
Lesson 5.4 Performance optimization
Lesson 5.5 ARFF and XRFF
Lesson 5.6 Summary
Lesson 5.1: Simple neural networks

Many people love neural networks (not me!)
… the very name is suggestive of intelligence …
Lesson 5.1: Simple neural networks

Perceptron: simplest form
 Determine the class using a linear combination of k attributes
 for test instance a, x = w0 + w1a1 + w2a2 + … + wkak = Σj=0..k wjaj (taking a0 = 1)
 if x > 0 then class 1, if x < 0 then class 2
 – works most naturally with numeric attributes

Set all weights to zero
Until all instances in the training data are classified correctly
  For each instance i in the training data
    If i is classified incorrectly
      If i belongs to the first class, add it to the weight vector;
      else subtract it from the weight vector

Perceptron convergence theorem
 – converges if you cycle repeatedly through the training data
 – provided the problem is “linearly separable”
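
The loop above is easy to state in plain Java. Below is a minimal sketch (an illustration, not Weka's implementation), assuming numeric attributes, a fixed bias input a0 = 1, and labels coded +1/-1 for the two classes:

    // Minimal perceptron trainer following the pseudocode above (a sketch,
    // not Weka's code). data[i][0] should be 1 (the bias input);
    // label[i] is +1 for the first class, -1 for the second.
    class Perceptron {
        static double[] train(double[][] data, int[] label, int maxPasses) {
            double[] w = new double[data[0].length];       // all weights start at zero
            for (int pass = 0; pass < maxPasses; pass++) {
                boolean allCorrect = true;                 // cycle until no mistakes
                for (int i = 0; i < data.length; i++) {
                    double x = 0;
                    for (int j = 0; j < w.length; j++) x += w[j] * data[i][j];
                    if (x * label[i] <= 0) {               // instance classified incorrectly
                        allCorrect = false;
                        for (int j = 0; j < w.length; j++)
                            w[j] += label[i] * data[i][j]; // add or subtract the instance
                    }
                }
                if (allCorrect) return w;  // converged (guaranteed if linearly separable)
            }
            return w;                      // not separable, or needs more passes
        }
    }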
Lesson 5.1: Simple neural networks

Linear decision boundaries
 Recall Support Vector Machines (Data Mining with Weka, lesson 4.5)
 – also restricted to linear decision boundaries
 – but can get more complex boundaries with the “Kernel trick” (not explained)
 Perceptron can use the same trick to get non-linear boundaries

Voted perceptron (in Weka)
 Store all weight vectors and let them vote on test examples
 – weight them according to their “survival” time
 Claimed to have many of the advantages of Support Vector Machines
 … faster, simpler, and nearly as good
Lesson 5.1: Simple neural networks

How good is VotedPerceptron?

                                            VotedPerceptron   SMO
Ionosphere dataset      ionosphere.arff          86%          89%
German credit dataset   credit-g.arff            70%          75%
Breast cancer dataset   breast-cancer.arff       71%          70%
Diabetes dataset        diabetes.arff            67%          77%

Is it faster? … yes

Ionosphere.arff: Classifier > Functions > VotedPerceptron
Ionosphere.arff: Classifier > Functions > SMO
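
The same comparison can be scripted against the Weka Java API. A sketch, assuming ionosphere.arff is in the working directory (CompareVP is just a demo class name); it also times each run, since speed is the point of the comparison:

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.VotedPerceptron;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareVP {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("ionosphere.arff");
            data.setClassIndex(data.numAttributes() - 1);  // class is the last attribute
            for (Classifier c : new Classifier[] {new VotedPerceptron(), new SMO()}) {
                long start = System.currentTimeMillis();
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));  // 10-fold CV
                System.out.printf("%s: %.1f%% (%d ms)%n",
                        c.getClass().getSimpleName(), eval.pctCorrect(),
                        System.currentTimeMillis() - start);
            }
        }
    }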
Lesson 5.1: Simple neural networks
History of the Perceptron
 1957: Basic perceptron algorithm
– Derived from theories about how the brain works
– “A perceiving and recognizing automaton”
– Rosenblatt “Principles of neurodynamics: Perceptrons
and the theory of brain mechanisms”
 1970: Suddenly went out of fashion
– Minsky and Papert “Perceptrons”
 1986: Returned, rebranded “connectionism”
– Rumelhart and McClelland “Parallel distributed
processing”
– Some claim that artificial neural networks mirror brain
function
 Multilayer perceptrons
– Nonlinear decision boundaries
Lesson 5.1: Simple neural networks
 Basic Perceptron algorithm: linear decision boundary
– Like classification-by-regression
– Works with numeric attributes
– Iterative algorithm, order dependent
 My MSc thesis (1971) describes a simple improvement!
– Still not impressed, sorry
 Modern improvements (1999):
– get more complex boundaries using the “Kernel trick”
– more sophisticated strategy with multiple weight vectors and
voting

Course text
 Section 4.6 Linear classification using the Perceptron
More Data Mining with Weka

Class 5 – Lesson 2
Multilayer Perceptrons

Ian H. Witten
Department of Computer
Science University of Waikato
New Zealand

weka.waikato.ac.nz
Lesson 5.2: Multilayer Perceptrons

Class 1 Exploring Weka’s interfaces; working with big data
Class 2 Discretization and text classification
Class 3 Classification rules, association rules, and clustering
Class 4 Selecting attributes and counting the cost
Class 5 Neural networks, learning curves, and performance optimization

Lesson 5.1 Simple neural networks
Lesson 5.2 Multilayer Perceptrons
Lesson 5.3 Learning curves
Lesson 5.4 Performance optimization
Lesson 5.5 ARFF and XRFF
Lesson 5.6 Summary
weather.numeric.arff
Lesson 5.2: Multilayer Perceptrons

Basic Perceptron: output greater or less than 0
Multilayer Perceptron: sigmoid function
 Network of sigmoid perceptrons
 Input layer, hidden layer(s), and output layer
 Each connection has a weight (a number)
 Each node performs a weighted sum of its inputs
 – and thresholds the result, usually with a sigmoid function
 – nodes are often called “neurons”

[Figure: example networks with input, hidden (3 nodes), and output layers]

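A single node's computation is small enough to sketch directly (a plain-Java illustration; the names are made up):

    // One "neuron": weighted sum of the inputs, thresholded by the sigmoid.
    class SigmoidNode {
        static double output(double[] weights, double[] inputs, double bias) {
            double sum = bias;
            for (int j = 0; j < inputs.length; j++)
                sum += weights[j] * inputs[j];    // weighted sum of inputs
            return 1.0 / (1.0 + Math.exp(-sum));  // sigmoid squashes to (0, 1)
        }
    }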
Lesson 5.2: Multilayer Perceptrons

How many layers, how many nodes in each?
 Input layer: one node for each attribute (attributes are numeric, or binary)
 Output layer: one node for each class (or just one if the class is numeric)
 How many hidden layers? — Big Question #1
 Zero hidden layers:
 – standard Perceptron algorithm
 – suitable if data is linearly separable
 One hidden layer:
 – suitable for a single convex region of the decision space
 Two hidden layers:
 – can generate arbitrary decision boundaries
 How big are they? — Big Question #2 (in general, the mean of input and output)
 – usually chosen somewhere between the input and output layers
 – common heuristic: mean value of input and output layers (Weka’s default)
Lesson 5.2: Multilayer Perceptrons
What are the weights?
 They’re learned from the training set
 Iteratively minimize the error using steepest descent
 Gradient is determined using the “backpropagation” algorithm
 Change in weight computed by multiplying the gradient by the “learning
rate” and adding the previous change in weight multiplied by the
“momentum”:
Wnext = W + ΔW
ΔW = – learning_rate × gradient + momentum × ΔWprevious
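
In code, that update rule for one weight might look like the following hand-rolled sketch (an illustration, not Weka's internals; 0.3 and 0.2 are Weka's default learningRate and momentum):

    // Gradient-descent update with momentum for a single weight.
    class WeightUpdate {
        double learningRate = 0.3, momentum = 0.2;  // Weka's defaults
        double previousDelta = 0;                   // ΔW from the last step

        double step(double w, double gradient) {
            double delta = -learningRate * gradient + momentum * previousDelta;
            previousDelta = delta;                  // becomes ΔWprevious next time
            return w + delta;                       // Wnext = W + ΔW
        }
    }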

Can get excellent results


 Often involves (much) experimentation
– number and size of hidden layers
Lesson 5.2: Multilayer Perceptrons

Classifier > Functions > MultilayerPerceptron

MultilayerPerceptron performance
weather.numeric.arff
 Numeric weather data 79%!
 (J48, NaiveBayes both 64%, SMO 57%, IBk 79%)
 On real problems does quite well – but slow

Parameters
 hiddenLayers: set GUI to true and try 5, 10, 20
 learningRate, momentum
 makes multiple passes (“epochs”) through the
data
 training continues until
– error on the validation set consistently increases
– or training time is exceeded
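
The same parameters can be set through the Weka API rather than the GUI. A sketch, assuming weather.numeric.arff is in the working directory (MLPDemo is a made-up class name):

    import weka.classifiers.functions.MultilayerPerceptron;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MLPDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.numeric.arff");
            data.setClassIndex(data.numAttributes() - 1);
            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setHiddenLayers("5");  // try "5", "10", "20"; "a" is the default heuristic
            mlp.setLearningRate(0.3);
            mlp.setMomentum(0.2);
            mlp.setTrainingTime(500);  // number of epochs
            mlp.setGUI(false);         // true pops up the editable network view
            mlp.buildClassifier(data);
            System.out.println(mlp);   // prints the learned weights
        }
    }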
Lesson 5.2: Multilayer Perceptrons
Create your own network structure!
 Selecting nodes
– click to select
– right-click in empty space to deselect
 Creating/deleting nodes
– click in empty space to create
– right-click (with no node
selected) to delete
 Creating/deleting connections
– with a node selected, click on
another to connect to it
– … and another, and another
– right-click to delete connection
 Can set parameters here too
Lesson 5.2: Multilayer Perceptrons
Are they any good?
 Experimenter with 6 datasets
– Iris, breast-cancer, credit-g, diabetes, glass, ionosphere
 9 algorithms
– MultilayerPerceptron, ZeroR, OneR, J48, NaiveBayes, IBk,
SMO, AdaBoostM1, VotedPerceptron
 MultilayerPerceptron wins on 2 datasets
 Other wins:
– SMO on 2 datasets
– J48 on 1 dataset
– IBk on 1 dataset
 But … 10–2000 times slower than other methods
Lesson 5.2: Multilayer Perceptrons
 Multilayer Perceptrons implement arbitrary decision
boundaries
– given two (or more) hidden layers, that are large enough
– and are trained properly
 Training by backpropagation
– iterative algorithm based on gradient descent
 In practice??
– Quite good performance, but extremely slow
– Still not impressed, sorry
– Might be a lot more impressive on more complex datasets

Course text
 Section 4.6 Linear classification using the Perceptron
More Data Mining with Weka

Class 1 – Lesson 2
Exploring the Experimenter

Ian H. Witten
Department of Computer
Science University of Waikato
New Zealand

weka.waikato.ac.nz
Lesson 1.2: Exploring the Experimenter

Class 1 Exploring Weka’s interfaces; working with big data
Class 2 Discretization and text classification
Class 3 Classification rules, association rules, and clustering
Class 4 Selecting attributes and counting the cost
Class 5 Neural networks, learning curves, and performance optimization

Lesson 1.1 Introduction
Lesson 1.2 Exploring the Experimenter
Lesson 1.3 Comparing classifiers
Lesson 1.4 Knowledge Flow interface
Lesson 1.5 Command Line interface
Lesson 1.6 Working with big data
Lesson 1.2: Exploring the Experimenter

Trying out classifiers/filters
Performance comparisons
Graphical interface
Command-line interface
Lesson 1.2: Exploring the Experimenter

Use the Experimenter for …
 determining mean and standard deviation performance of a classification algorithm on a dataset
 … or several algorithms on several datasets
 Is one classifier better than another on a particular dataset?
 … and is the difference statistically significant?
 Is one parameter setting for an algorithm better than another?
 The result of such tests can be expressed as an ARFF file
 Computation may take days or weeks
 … and can be distributed over several computers
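
What the Experimenter automates can be sketched by hand with the Weka API: repeat cross-validation under different seeds and collect the mean and standard deviation (the dataset path is a placeholder):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RepeatedCV {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("segment-challenge.arff");
            data.setClassIndex(data.numAttributes() - 1);
            int runs = 10;
            double sum = 0, sumSq = 0;
            for (int seed = 1; seed <= runs; seed++) {  // 10 repetitions
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(new J48(), data, 10, new Random(seed));
                double acc = eval.pctCorrect();
                sum += acc;
                sumSq += acc * acc;
            }
            double mean = sum / runs;
            double sd = Math.sqrt((sumSq - runs * mean * mean) / (runs - 1));
            System.out.printf("mean = %.2f%%, std dev = %.2f%n", mean, sd);
        }
    }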
Lesson 1.2: Exploring the Experimenter

[Figure: training data → ML algorithm → classifier → deploy!; test data → classifier → evaluation results]

Basic assumption: training and test sets produced by independent sampling from an infinite population
Lesson 1.2: Exploring the Experimenter

Evaluate J48 on segment-challenge (Data Mining with Weka, Lesson 2.3)
 With segment-challenge.arff and J48 (trees>J48) …
 Set percentage split to 90%
 Run it: 96.7% accuracy
 Repeat with seeds 2, 3, 4, 5, 6, 7, 8, 9, 10 [More options]

Results: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947
Lesson 1.2: Exploring the Experimenter

Evaluate J48 on segment-challenge (Data Mining with Weka, Lesson 2.3)

Results: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947

Sample mean          x̄ = Σ xi / n
Variance             σ² = Σ (xi – x̄)² / (n – 1)
Standard deviation   σ

x̄ = 0.949, σ = 0.018
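
The same statistics can be computed directly from the ten results above; a plain-Java sketch (values transcribed from the slide, so small rounding differences from the quoted x̄ and σ are possible):

    // Sample mean and standard deviation of the ten percentage-split runs.
    class SplitStats {
        public static void main(String[] args) {
            double[] acc = {0.967, 0.940, 0.940, 0.967, 0.953,
                            0.967, 0.920, 0.947, 0.933, 0.947};
            double sum = 0;
            for (double a : acc) sum += a;
            double mean = sum / acc.length;               // x̄ = Σ xi / n
            double ss = 0;
            for (double a : acc) ss += (a - mean) * (a - mean);
            double sd = Math.sqrt(ss / (acc.length - 1)); // σ² = Σ (xi – x̄)² / (n – 1)
            System.out.printf("mean = %.3f, sd = %.3f%n", mean, sd);
        }
    }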
Lesson 1.2: Exploring the Experimenter

10-fold cross-validation (Data Mining with Weka, Lesson 2.5)
 Divide dataset into 10 parts (folds)
 Hold out each part in turn
 Average the results
 Each data point used once for testing, 9 times for training

Stratified cross-validation
 Ensure that each fold has the right proportion of each class value
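
The fold mechanics can be made explicit through the Weka API; a sketch in which stratify() gives each fold the right class proportions:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ManualCV {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("segment-challenge.arff");
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));
            data.stratify(10);                // right class proportions in each fold
            double sum = 0;
            for (int fold = 0; fold < 10; fold++) {
                Instances train = data.trainCV(10, fold);  // 9 folds for training
                Instances test = data.testCV(10, fold);    // held-out fold for testing
                J48 tree = new J48();
                tree.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(tree, test);
                sum += eval.pctCorrect();
            }
            System.out.printf("average accuracy: %.2f%%%n", sum / 10);
        }
    }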
Lesson 1.2: Exploring the Experimenter

Setup panel
 click New
 note defaults
 – 10-fold cross-validation, repeat 10 times
 under Datasets, click Add new, open segment-challenge.arff
 under Algorithms, click Add new, open trees>J48

Run panel
 click Start

Analyse panel
 click Experiment
 Select Show std. deviations
 Click Perform test
Repeat with 90% split, segment-challenge.arff, J48; save results in Clase_Experiemnter.cvs

Lesson 1.2: Exploring the Experimenter
To get detailed results
 return to Setup panel
 select .csv file
 enter filename for results
 Train/Test Split; 90%
Lesson 1.2: Exploring the Experimenter

 switch to Run panel
 click Start
 Open results spreadsheet
Lesson 1.2: Exploring the Experimenter

Re-run cross-validation experiment
(Repeat cross-validation, segment-challenge.arff, J48; save results in Clase_Experiemnter.cvs)
 Open results spreadsheet

Lesson 1.2: Exploring the Experimenter

Re-run cross-validation AND 90% split experiment
(Repeat cross-validation, segment-challenge.arff, J48; save results in Clase_Experiemnter.cvs)
 Open results spreadsheet

Lesson 1.2: Exploring the Experimenter

Setup panel
 Save/Load an experiment
 Save the results in ARFF file … or in a database
 Preserve order in Train/Test split (can’t do repetitions)
 Use several datasets, and several classifiers
 Advanced mode

Run panel
Analyse panel
Lesson 1.2: Exploring the Experimenter
 Open Experimenter
 Setup, Run, Analyse panels
 Evaluate one classifier on one dataset
… using cross-validation, repeated 10 times
… using percentage split, repeated 10 times
 Examine spreadsheet output
 Analyse panel to get mean and standard
deviation
 Other options on Setup and Run panels
Course text
 Chapter 13 The Experimenter
More Data Mining with Weka

Class 1 – Lesson 3
Comparing classifiers

Ian H. Witten
Department of Computer
Science University of Waikato
New Zealand

weka.waikato.ac.nz
Lesson 1.3: Comparing classifiers

Class 1 Exploring Weka’s interfaces; working with big data
Class 2 Discretization and text classification
Class 3 Classification rules, association rules, and clustering
Class 4 Selecting attributes and counting the cost
Class 5 Neural networks, learning curves, and performance optimization

Lesson 1.1 Introduction
Lesson 1.2 Exploring the Experimenter
Lesson 1.3 Comparing classifiers
Lesson 1.4 Knowledge Flow interface
Lesson 1.5 Command Line interface
Lesson 1.6 Working with big data
Lesson 1.3: Comparing classifiers

Is J48 better than (a) ZeroR and (b) OneR on the Iris data?
 In the Explorer, open iris.arff
 Using cross-validation, evaluate classification accuracy with …
   ZeroR (rules>ZeroR)   33%
   OneR  (rules>OneR)    92%
   J48   (trees>J48)     96%
 But how reliable is this?
 What would happen if you used a different random number seed??
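
To see how sensitive these figures are, a Weka API sketch that cross-validates all three classifiers (IrisCompare is a made-up class name; vary the Random seed and watch the estimates move):

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class IrisCompare {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);
            for (Classifier c : new Classifier[] {new ZeroR(), new OneR(), new J48()}) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));  // vary the seed
                System.out.printf("%-5s %.1f%%%n",
                        c.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }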
Lesson 1.3: Comparing classifiers

 In the Experimenter, click New
 Under Datasets, click Add new, open iris.arff
 Under Algorithms, click Add new, open trees>J48, rules>OneR, rules>ZeroR
Lesson 1.3: Comparing classifiers

 Switch to Run; click Start
 Switch to Analyse, click Experiment, click Perform test
 Statistical significance: the “null hypothesis”
   Classifier A’s performance is the same as B’s

Lesson 1.3: Comparing classifiers

v  significantly better
*  significantly worse

 ZeroR (33.3%) is significantly worse than J48 (94.7%)
 Cannot be sure that OneR (92.5%) is significantly worse than J48
 … at the 5% level of statistical significance
 J48 seems better than ZeroR: pretty sure (5% level) that this is not due to chance
 … and better than OneR; but this may be due to chance (can’t rule it out at 5% level)
Lesson 1.3: Comparing classifiers

J48 is significantly (5% level) better than
 both OneR and ZeroR on glass, ionosphere, segment
 OneR on breast-cancer, german_credit
 ZeroR on iris, pima_diabetes
Lesson 1.3: Comparing classifiers

Comparing OneR with ZeroR
 Change “Test base” on Analyse panel
 significantly worse on german-credit
 about the same on breast-cancer
 significantly better on all the rest
Lesson 1.3: Comparing classifiers

 Row: select Scheme (not Dataset)
 Column: select Dataset (not Scheme)
Lesson 1.3: Comparing classifiers

 Statistical significance: the “null hypothesis”
   Classifier A’s performance is the same as B’s
 The observed result is highly unlikely if the null hypothesis is true
   “The null hypothesis can be rejected at the 5% level” [of statistical significance]
   “A performs significantly better than B at the 5% level”
 Can change the significance level (5% and 1% are common)
 Can change the comparison field (we have used % correct)
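
To make the test concrete, here is the core of a plain paired t-test over matched runs (a simplified sketch: Weka's Experimenter actually uses a corrected resampled t-test, which inflates the variance term because runs share training data):

    // Paired t statistic over per-run accuracy differences between A and B.
    class PairedT {
        static double t(double[] a, double[] b) {
            int n = a.length;
            double meanDiff = 0;
            for (int i = 0; i < n; i++) meanDiff += a[i] - b[i];
            meanDiff /= n;
            double var = 0;                        // sample variance of differences
            for (int i = 0; i < n; i++) {
                double d = a[i] - b[i] - meanDiff;
                var += d * d;
            }
            var /= (n - 1);
            return meanDiff / Math.sqrt(var / n);  // compare to t table, n-1 d.f.
        }
    }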
