
Class 4

Data Mining
Practical Machine Learning Tools and Techniques

I. H. Witten, E. Frank,
M. A. Hall, and C. J. Pal
ISBN: 978-0-12-804291-5
Creating ARFF files
More Data Mining with Weka

Class 5 – Lesson 1
Simple neural networks

Ian H. Witten
Department of Computer
Science University of Waikato
New Zealand

weka.waikato.ac.nz
Lesson 5.1: Simple neural networks

Class 1 Exploring Weka’s interfaces; working with big data
Class 2 Discretization and text classification
Class 3 Classification rules, association rules, and clustering
Class 4 Selecting attributes and counting the cost
Class 5 Neural networks, learning curves, and performance optimization

Lesson 5.1 Simple neural networks
Lesson 5.2 Multilayer Perceptrons
Lesson 5.3 Learning curves
Lesson 5.4 Performance optimization
Lesson 5.5 ARFF and XRFF
Lesson 5.6 Summary
Lesson 5.1: Simple neural networks

Many people love neural networks (not me!)
… the very name is suggestive of intelligence …
Lesson 5.1: Simple neural networks

Perceptron: simplest form
 Determine the class using a linear combination of k attributes
 for test instance a, x = w0 + w1a1 + w2a2 + … + wkak = Σj=0..k wjaj (taking a0 = 1)
 if x > 0 then class 1, if x < 0 then class 2
 – works most naturally with numeric attributes

Set all weights to zero
Until all instances in the training data are classified correctly
  For each instance i in the training data
    If i is classified incorrectly
      If i belongs to the first class, add it to the weight vector;
      else subtract it from the weight vector

Perceptron convergence theorem
 – converges if you cycle repeatedly through the training data
 – provided the problem is “linearly separable”
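
The loop above is easy to state in plain Java. Below is a minimal sketch (an illustration, not Weka's implementation), assuming numeric attributes, a fixed bias input a0 = 1, and labels coded +1/-1 for the two classes:

    // Minimal perceptron trainer following the pseudocode above (a sketch,
    // not Weka's code). data[i][0] should be 1 (the bias input);
    // label[i] is +1 for the first class, -1 for the second.
    class Perceptron {
        static double[] train(double[][] data, int[] label, int maxPasses) {
            double[] w = new double[data[0].length];       // all weights start at zero
            for (int pass = 0; pass < maxPasses; pass++) {
                boolean allCorrect = true;                 // cycle until no mistakes
                for (int i = 0; i < data.length; i++) {
                    double x = 0;
                    for (int j = 0; j < w.length; j++) x += w[j] * data[i][j];
                    if (x * label[i] <= 0) {               // instance classified incorrectly
                        allCorrect = false;
                        for (int j = 0; j < w.length; j++)
                            w[j] += label[i] * data[i][j]; // add or subtract the instance
                    }
                }
                if (allCorrect) return w;  // converged (guaranteed if linearly separable)
            }
            return w;                      // not separable, or needs more passes
        }
    }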
Lesson 5.1: Simple neural networks

Linear decision boundaries
 Recall Support Vector Machines (Data Mining with Weka, lesson 4.5)
 – also restricted to linear decision boundaries
 – but can get more complex boundaries with the “Kernel trick” (not explained)
 Perceptron can use the same trick to get non-linear boundaries

Voted perceptron (in Weka)
 Store all weight vectors and let them vote on test examples
 – weight them according to their “survival” time
 Claimed to have many of the advantages of Support Vector Machines
 … faster, simpler, and nearly as good
Lesson 5.1: Simple neural networks

How good is VotedPerceptron?

                                            VotedPerceptron   SMO
Ionosphere dataset      ionosphere.arff          86%          89%
German credit dataset   credit-g.arff            70%          75%
Breast cancer dataset   breast-cancer.arff       71%          70%
Diabetes dataset        diabetes.arff            67%          77%

Is it faster? … yes

Ionosphere.arff: Classifier > Functions > VotedPerceptron
Ionosphere.arff: Classifier > Functions > SMO
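
The same comparison can be scripted against the Weka Java API. A sketch, assuming ionosphere.arff is in the working directory (CompareVP is just a demo class name); it also times each run, since speed is the point of the comparison:

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.VotedPerceptron;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareVP {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("ionosphere.arff");
            data.setClassIndex(data.numAttributes() - 1);  // class is the last attribute
            for (Classifier c : new Classifier[] {new VotedPerceptron(), new SMO()}) {
                long start = System.currentTimeMillis();
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));  // 10-fold CV
                System.out.printf("%s: %.1f%% (%d ms)%n",
                        c.getClass().getSimpleName(), eval.pctCorrect(),
                        System.currentTimeMillis() - start);
            }
        }
    }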
Lesson 5.1: Simple neural networks
History of the Perceptron
 1957: Basic perceptron algorithm
– Derived from theories about how the brain works
– “A perceiving and recognizing automaton”
– Rosenblatt “Principles of neurodynamics: Perceptrons
and the theory of brain mechanisms”
 1970: Suddenly went out of fashion
– Minsky and Papert “Perceptrons”
 1986: Returned, rebranded “connectionism”
– Rumelhart and McClelland “Parallel distributed
processing”
– Some claim that artificial neural networks mirror brain
function
 Multilayer perceptrons
– Nonlinear decision boundaries
Lesson 5.1: Simple neural networks
 Basic Perceptron algorithm: linear decision boundary
– Like classification-by-regression
– Works with numeric attributes
– Iterative algorithm, order dependent
 My MSc thesis (1971) describes a simple improvement!
– Still not impressed, sorry
 Modern improvements (1999):
– get more complex boundaries using the “Kernel trick”
– more sophisticated strategy with multiple weight vectors and
voting

Course text
 Section 4.6 Linear classification using the Perceptron
More Data Mining with Weka

Class 5 – Lesson 2
Multilayer Perceptrons

Ian H. Witten
Department of Computer
Science University of Waikato
New Zealand

weka.waikato.ac.nz
Lesson 5.2: Multilayer Perceptrons

Class 1 Exploring Weka’s interfaces; working with big data
Class 2 Discretization and text classification
Class 3 Classification rules, association rules, and clustering
Class 4 Selecting attributes and counting the cost
Class 5 Neural networks, learning curves, and performance optimization

Lesson 5.1 Simple neural networks
Lesson 5.2 Multilayer Perceptrons
Lesson 5.3 Learning curves
Lesson 5.4 Performance optimization
Lesson 5.5 ARFF and XRFF
Lesson 5.6 Summary
weather.numeric.arff
Lesson 5.2: Multilayer Perceptrons

Basic Perceptron: output greater or less than 0
Multilayer Perceptron: sigmoid function
 Network of sigmoid perceptrons
 Input layer, hidden layer(s), and output layer
 Each connection has a weight (a number)
 Each node performs a weighted sum of its inputs
 – and thresholds the result, usually with a sigmoid function
 – nodes are often called “neurons”

[Figure: example networks with input, hidden (3 nodes), and output layers]

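A single node's computation is small enough to sketch directly (a plain-Java illustration; the names are made up):

    // One "neuron": weighted sum of the inputs, thresholded by the sigmoid.
    class SigmoidNode {
        static double output(double[] weights, double[] inputs, double bias) {
            double sum = bias;
            for (int j = 0; j < inputs.length; j++)
                sum += weights[j] * inputs[j];    // weighted sum of inputs
            return 1.0 / (1.0 + Math.exp(-sum));  // sigmoid squashes to (0, 1)
        }
    }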
Lesson 5.2: Multilayer Perceptrons

How many layers, how many nodes in each?
 Input layer: one node for each attribute (attributes are numeric, or binary)
 Output layer: one node for each class (or just one if the class is numeric)
 How many hidden layers? — Big Question #1
 Zero hidden layers:
 – standard Perceptron algorithm
 – suitable if data is linearly separable
 One hidden layer:
 – suitable for a single convex region of the decision space
 Two hidden layers:
 – can generate arbitrary decision boundaries
 How big are they? — Big Question #2 (in general, the mean of input and output)
 – usually chosen somewhere between the input and output layers
 – common heuristic: mean value of input and output layers (Weka’s default)
Lesson 5.2: Multilayer Perceptrons
What are the weights?
 They’re learned from the training set
 Iteratively minimize the error using steepest descent
 Gradient is determined using the “backpropagation” algorithm
 Change in weight computed by multiplying the gradient by the “learning
rate” and adding the previous change in weight multiplied by the
“momentum”:
Wnext = W + ΔW
ΔW = – learning_rate × gradient + momentum × ΔWprevious
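
In code, that update rule for one weight might look like the following hand-rolled sketch (an illustration, not Weka's internals; 0.3 and 0.2 are Weka's default learningRate and momentum):

    // Gradient-descent update with momentum for a single weight.
    class WeightUpdate {
        double learningRate = 0.3, momentum = 0.2;  // Weka's defaults
        double previousDelta = 0;                   // ΔW from the last step

        double step(double w, double gradient) {
            double delta = -learningRate * gradient + momentum * previousDelta;
            previousDelta = delta;                  // becomes ΔWprevious next time
            return w + delta;                       // Wnext = W + ΔW
        }
    }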

Can get excellent results


 Often involves (much) experimentation
– number and size of hidden layers
Lesson 5.2: Multilayer Perceptrons

Classifier > Functions > MultilayerPerceptron

MultilayerPerceptron performance
weather.numeric.arff
 Numeric weather data 79%!
 (J48, NaiveBayes both 64%, SMO 57%, IBk 79%)
 On real problems does quite well – but slow

Parameters
 hiddenLayers: set GUI to true and try 5, 10, 20
 learningRate, momentum
 makes multiple passes (“epochs”) through the
data
 training continues until
– error on the validation set consistently increases
– or training time is exceeded
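
The same parameters can be set through the Weka API rather than the GUI. A sketch, assuming weather.numeric.arff is in the working directory (MLPDemo is a made-up class name):

    import weka.classifiers.functions.MultilayerPerceptron;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MLPDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.numeric.arff");
            data.setClassIndex(data.numAttributes() - 1);
            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setHiddenLayers("5");  // try "5", "10", "20"; "a" is the default heuristic
            mlp.setLearningRate(0.3);
            mlp.setMomentum(0.2);
            mlp.setTrainingTime(500);  // number of epochs
            mlp.setGUI(false);         // true pops up the editable network view
            mlp.buildClassifier(data);
            System.out.println(mlp);   // prints the learned weights
        }
    }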
Lesson 5.2: Multilayer Perceptrons
Create your own network structure!
 Selecting nodes
– click to select
– right-click in empty space to deselect
 Creating/deleting nodes
– click in empty space to create
– right-click (with no node
selected) to delete
 Creating/deleting connections
– with a node selected, click on
another to connect to it
– … and another, and another
– right-click to delete connection
 Can set parameters here too
Lesson 5.2: Multilayer Perceptrons
Are they any good?
 Experimenter with 6 datasets
– Iris, breast-cancer, credit-g, diabetes, glass, ionosphere
 9 algorithms
– MultilayerPerceptron, ZeroR, OneR, J48, NaiveBayes, IBk,
SMO, AdaBoostM1, VotedPerceptron
 MultilayerPerceptron wins on 2 datasets
 Other wins:
– SMO on 2 datasets
– J48 on 1 dataset
– IBk on 1 dataset
 But … 10–2000 times slower than other methods
Lesson 5.2: Multilayer Perceptrons
 Multilayer Perceptrons implement arbitrary decision
boundaries
– given two (or more) hidden layers, that are large enough
– and are trained properly
 Training by backpropagation
– iterative algorithm based on gradient descent
 In practice??
– Quite good performance, but extremely slow
– Still not impressed, sorry
– Might be a lot more impressive on more complex datasets

Course text
 Section 4.6 Linear classification using the Perceptron
More Data Mining with Weka

Class 1 – Lesson 2
Exploring the Experimenter

Ian H. Witten
Department of Computer
Science University of Waikato
New Zealand

weka.waikato.ac.nz
Lesson 1.2: Exploring the Experimenter

Class 1 Exploring Weka’s interfaces; working with big data
Class 2 Discretization and text classification
Class 3 Classification rules, association rules, and clustering
Class 4 Selecting attributes and counting the cost
Class 5 Neural networks, learning curves, and performance optimization

Lesson 1.1 Introduction
Lesson 1.2 Exploring the Experimenter
Lesson 1.3 Comparing classifiers
Lesson 1.4 Knowledge Flow interface
Lesson 1.5 Command Line interface
Lesson 1.6 Working with big data
Lesson 1.2: Exploring the Experimenter

Trying out classifiers/filters
Performance comparisons
Graphical interface
Command-line interface
Lesson 1.2: Exploring the Experimenter

Use the Experimenter for …
 determining mean and standard deviation performance of a classification algorithm on a dataset
 … or several algorithms on several datasets
 Is one classifier better than another on a particular dataset?
 … and is the difference statistically significant?
 Is one parameter setting for an algorithm better than another?
 The result of such tests can be expressed as an ARFF file
 Computation may take days or weeks
 … and can be distributed over several computers
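
What the Experimenter automates can be sketched by hand with the Weka API: repeat cross-validation under different seeds and collect the mean and standard deviation (the dataset path is a placeholder):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RepeatedCV {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("segment-challenge.arff");
            data.setClassIndex(data.numAttributes() - 1);
            int runs = 10;
            double sum = 0, sumSq = 0;
            for (int seed = 1; seed <= runs; seed++) {  // 10 repetitions
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(new J48(), data, 10, new Random(seed));
                double acc = eval.pctCorrect();
                sum += acc;
                sumSq += acc * acc;
            }
            double mean = sum / runs;
            double sd = Math.sqrt((sumSq - runs * mean * mean) / (runs - 1));
            System.out.printf("mean = %.2f%%, std dev = %.2f%n", mean, sd);
        }
    }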
Lesson 1.2: Exploring the Experimenter

[Figure: training data → ML algorithm → classifier → deploy!; test data → classifier → evaluation results]

Basic assumption: training and test sets produced by independent sampling from an infinite population
Lesson 1.2: Exploring the Experimenter

Evaluate J48 on segment-challenge (Data Mining with Weka, Lesson 2.3)
 With segment-challenge.arff and J48 (trees>J48) …
 Set percentage split to 90%
 Run it: 96.7% accuracy
 Repeat with seeds 2, 3, 4, 5, 6, 7, 8, 9, 10 [More options]

Results: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947
Lesson 1.2: Exploring the Experimenter

Evaluate J48 on segment-challenge (Data Mining with Weka, Lesson 2.3)

Results: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947

Sample mean          x̄ = Σ xi / n
Variance             σ² = Σ (xi – x̄)² / (n – 1)
Standard deviation   σ

x̄ = 0.949, σ = 0.018
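
The same statistics can be computed directly from the ten results above; a plain-Java sketch (values transcribed from the slide, so small rounding differences from the quoted x̄ and σ are possible):

    // Sample mean and standard deviation of the ten percentage-split runs.
    class SplitStats {
        public static void main(String[] args) {
            double[] acc = {0.967, 0.940, 0.940, 0.967, 0.953,
                            0.967, 0.920, 0.947, 0.933, 0.947};
            double sum = 0;
            for (double a : acc) sum += a;
            double mean = sum / acc.length;               // x̄ = Σ xi / n
            double ss = 0;
            for (double a : acc) ss += (a - mean) * (a - mean);
            double sd = Math.sqrt(ss / (acc.length - 1)); // σ² = Σ (xi – x̄)² / (n – 1)
            System.out.printf("mean = %.3f, sd = %.3f%n", mean, sd);
        }
    }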
Lesson 1.2: Exploring the Experimenter

10-fold cross-validation (Data Mining with Weka, Lesson 2.5)
 Divide dataset into 10 parts (folds)
 Hold out each part in turn
 Average the results
 Each data point used once for testing, 9 times for training

Stratified cross-validation
 Ensure that each fold has the right proportion of each class value
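
The fold mechanics can be made explicit through the Weka API; a sketch in which stratify() gives each fold the right class proportions:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ManualCV {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("segment-challenge.arff");
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));
            data.stratify(10);                // right class proportions in each fold
            double sum = 0;
            for (int fold = 0; fold < 10; fold++) {
                Instances train = data.trainCV(10, fold);  // 9 folds for training
                Instances test = data.testCV(10, fold);    // held-out fold for testing
                J48 tree = new J48();
                tree.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(tree, test);
                sum += eval.pctCorrect();
            }
            System.out.printf("average accuracy: %.2f%%%n", sum / 10);
        }
    }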
Lesson 1.2: Exploring the Experimenter

Setup panel
 click New
 note defaults
 – 10-fold cross-validation, repeat 10 times
 under Datasets, click Add new, open segment-challenge.arff
 under Algorithms, click Add new, open trees>J48

Run panel
 click Start

Analyse panel
 click Experiment
 Select Show std. deviations
 Click Perform test
Repeat with 90% split, segment-challenge.arff, J48; save results in Clase_Experiemnter.cvs

Lesson 1.2: Exploring the Experimenter
To get detailed results
 return to Setup panel
 select .csv file
 enter filename for results
 Train/Test Split; 90%
Lesson 1.2: Exploring the Experimenter

 switch to Run panel
 click Start
 Open results spreadsheet
Lesson 1.2: Exploring the Experimenter

Re-run cross-validation experiment
(Repeat cross-validation, segment-challenge.arff, J48; save results in Clase_Experiemnter.cvs)
 Open results spreadsheet

Lesson 1.2: Exploring the Experimenter

Re-run cross-validation AND 90% split experiment
(Repeat cross-validation, segment-challenge.arff, J48; save results in Clase_Experiemnter.cvs)
 Open results spreadsheet

Lesson 1.2: Exploring the Experimenter

Setup panel
 Save/Load an experiment
 Save the results in ARFF file … or in a database
 Preserve order in Train/Test split (can’t do repetitions)
 Use several datasets, and several classifiers
 Advanced mode

Run panel
Analyse panel
Lesson 1.2: Exploring the Experimenter
 Open Experimenter
 Setup, Run, Analyse panels
 Evaluate one classifier on one dataset
… using cross-validation, repeated 10 times
… using percentage split, repeated 10 times
 Examine spreadsheet output
 Analyse panel to get mean and standard
deviation
 Other options on Setup and Run panels
Course text
 Chapter 13 The Experimenter
More Data Mining with Weka

Class 1 – Lesson 3
Comparing classifiers

Ian H. Witten
Department of Computer
Science University of Waikato
New Zealand

weka.waikato.ac.nz
Lesson 1.3: Comparing classifiers

Class 1 Exploring Weka’s interfaces; working with big data
Class 2 Discretization and text classification
Class 3 Classification rules, association rules, and clustering
Class 4 Selecting attributes and counting the cost
Class 5 Neural networks, learning curves, and performance optimization

Lesson 1.1 Introduction
Lesson 1.2 Exploring the Experimenter
Lesson 1.3 Comparing classifiers
Lesson 1.4 Knowledge Flow interface
Lesson 1.5 Command Line interface
Lesson 1.6 Working with big data
Lesson 1.3: Comparing classifiers

Is J48 better than (a) ZeroR and (b) OneR on the Iris data?
 In the Explorer, open iris.arff
 Using cross-validation, evaluate classification accuracy with …
   ZeroR (rules>ZeroR)   33%
   OneR  (rules>OneR)    92%
   J48   (trees>J48)     96%
 But how reliable is this?
 What would happen if you used a different random number seed??
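
To see how sensitive these figures are, a Weka API sketch that cross-validates all three classifiers (IrisCompare is a made-up class name; vary the Random seed and watch the estimates move):

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class IrisCompare {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");
            data.setClassIndex(data.numAttributes() - 1);
            for (Classifier c : new Classifier[] {new ZeroR(), new OneR(), new J48()}) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));  // vary the seed
                System.out.printf("%-5s %.1f%%%n",
                        c.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }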
Lesson 1.3: Comparing classifiers

 In the Experimenter, click New
 Under Datasets, click Add new, open iris.arff
 Under Algorithms, click Add new, open trees>J48, rules>OneR, rules>ZeroR
Lesson 1.3: Comparing classifiers

 Switch to Run; click Start
 Switch to Analyse, click Experiment, click Perform test
 Statistical significance: the “null hypothesis”
   Classifier A’s performance is the same as B’s

Lesson 1.3: Comparing classifiers

v  significantly better
*  significantly worse

 ZeroR (33.3%) is significantly worse than J48 (94.7%)
 Cannot be sure that OneR (92.5%) is significantly worse than J48
 … at the 5% level of statistical significance
 J48 seems better than ZeroR: pretty sure (5% level) that this is not due to chance
 … and better than OneR; but this may be due to chance (can’t rule it out at 5% level)
Lesson 1.3: Comparing classifiers

J48 is significantly (5% level) better than
 both OneR and ZeroR on glass, ionosphere, segment
 OneR on breast-cancer, german_credit
 ZeroR on iris, pima_diabetes
Lesson 1.3: Comparing classifiers

Comparing OneR with ZeroR
 Change “Test base” on Analyse panel
 significantly worse on german-credit
 about the same on breast-cancer
 significantly better on all the rest
Lesson 1.3: Comparing classifiers

 Row: select Scheme (not Dataset)
 Column: select Dataset (not Scheme)
Lesson 1.3: Comparing classifiers

 Statistical significance: the “null hypothesis”
   Classifier A’s performance is the same as B’s
 The observed result is highly unlikely if the null hypothesis is true
   “The null hypothesis can be rejected at the 5% level” [of statistical significance]
   “A performs significantly better than B at the 5% level”
 Can change the significance level (5% and 1% are common)
 Can change the comparison field (we have used % correct)
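
To make the test concrete, here is the core of a plain paired t-test over matched runs (a simplified sketch: Weka's Experimenter actually uses a corrected resampled t-test, which inflates the variance term because runs share training data):

    // Paired t statistic over per-run accuracy differences between A and B.
    class PairedT {
        static double t(double[] a, double[] b) {
            int n = a.length;
            double meanDiff = 0;
            for (int i = 0; i < n; i++) meanDiff += a[i] - b[i];
            meanDiff /= n;
            double var = 0;                        // sample variance of differences
            for (int i = 0; i < n; i++) {
                double d = a[i] - b[i] - meanDiff;
                var += d * d;
            }
            var /= (n - 1);
            return meanDiff / Math.sqrt(var / n);  // compare to t table, n-1 d.f.
        }
    }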
