
Discussion of Assignments 3 and 4

and
Cross-Validation Methods

Padhraic Smyth
Information and Computer Science
CS 175, Fall 2007
Review of Assignment 3 (Perceptron)
perceptron.m function

function [thresholded_outputs] = perceptron(weights,data)


% function [thresholded_outputs] = perceptron(weights,data)
%
% Compute the class predictions for perceptron (linear classifier)
% Sample code for CS 175
%
% Inputs
% weights: 1 x (d+1) row vector of weights
% data: N x (d+1) matrix of training data
%
% Outputs
% thresholded_outputs: N x 1 vector of thresholded perceptron outputs (+1 or -1)

% error checking
if size(weights,1) ~= 1
error('The first argument (weights) should be a row vector');
end

if size(data,2) ~= size(weights,2)
error('The arguments (weights and data) should have the same number of columns');
end

% calculate the thresholded outputs of the perceptron (vectorized)


thresholded_outputs = sign(data*weights');
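As a quick, hedged illustration (the weights and data below are made up; the bias input is placed in the first column purely for this example):

% example usage with made-up weights and three examples (2 features + bias column)
data    = [1  2.0  1.5;
           1 -1.0  0.5;
           1  0.2 -2.0];
weights = [0.1 0.5 -0.4];
outputs = perceptron(weights, data)     % returns [1; -1; 1]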

perceptron_error.m function
function [cerror, mse] = perceptron_error(weights,data,targets)
% function [cerror, mse] = perceptron_error(weights,data,targets)
%
% Compute mis-classification error and mean squared error for
% a perceptron (linear) classifier
% Sample code for CS 175
%
% Inputs
% weights: 1 x (d+1) row vector of weights
% data: N x (d+1) matrix of training data
% targets: N x 1 vector of target values (+1 or -1)
%
% Outputs
% cerror: the percentage of examples misclassified (between 0 and 100)
% mse: the mean-square error (sum of squared errors divided by N)

perceptron_error.m function
N = size(data, 1);

% error checking
if nargin ~= 3
error('The function takes three arguments (weights, data, targets)');
end

if size(weights,1) ~= 1
error('The first argument (weights) should be a row vector');
end

if size(data,2) ~= size(weights,2)
error('The first two arguments (weights and data) should have the same number of columns');
end

if size(data,1) ~= size(targets,1)
error('The last two arguments (targets and data) should have the same number of rows');
end

perceptron_error.m function
% calculate the unthresholded outputs, for all rows in data, N x 1 vector
f = (weights * data')';

% compare thresholded output to the target values to get the accuracy


cerror = 100 * sum(sign(f) ~= targets)/N;

% calculate the sigmoid version of the outputs, for all rows in data, N x 1 vector
outputs = sigmoid(f);

% compare sigmoid output vector to the target vector to get the mse
mse = sum((outputs-targets).^2)/N;

function s = sigmoid(x)
% Computes sigmoid function (scaled to -1, +1)
s = 2./(1+exp(-x)) - 1;
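Continuing the made-up example from perceptron.m above (the targets are equally hypothetical):

% example usage with the same made-up weights and data as above
targets = [1; -1; -1];
[cerror, mse] = perceptron_error(weights, data, targets)
% cerror is 33.33 here: one of the three examples is misclassified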

perceptron_error.m function
% calculate the unthresholded outputs, for all rows in data, N x 1 vector
f = (weights * data')';

% compare thresholded output to the target values to get the accuracy
% (vectorized computation of the classification error rate)
cerror = 100 * sum(sign(f) ~= targets)/N;

% calculate the sigmoid version of the outputs, for all rows in data, N x 1 vector
% (vectorized computation of the sigmoid output)
outputs = sigmoid(f);

% compare sigmoid output vector to the target vector to get the mse
% (vectorized computation of the MSE)
mse = sum((outputs-targets).^2)/N;

function s = sigmoid(x)
% Computes sigmoid function (scaled to -1, +1)
% (local function defining the sigmoid; note that it works on vectors)
s = 2./(1+exp(-x)) - 1;

Principle of Gradient Descent
Gradient descent algorithm:
– Start with some initial guess at w
– Move downhill in “small steps” in the direction of steepest descent

– What is the direction of steepest descent?


• The negative of the gradient, evaluated at w

– What is the gradient?


• Gradient = vector of derivatives with respect to each component of w
• E.g., if w = [ w1, w2, w3] then
gradient[g(w)] = [ d g(w)/ dw1, d g(w)/dw2, d g(w)/dw3 ]
• Note that the gradient is itself a vector (or a “direction”)

– After moving, recompute the gradient, get a new downhill direction, and move
again.

– Keep repeating this until the decrease in g(w) is less than some threshold, i.e.,
we appear to be on a flat part of the g(w) surface (see the sketch below).
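For concreteness, a minimal MATLAB sketch of this procedure on a hypothetical two-weight function (the function g, the starting point, and the learning rate are all made up for illustration):

% minimal gradient descent sketch on a hypothetical g(w) = w1^2 + 10*w2^2
g      = @(w) w(1)^2 + 10*w(2)^2;      % function to minimize
grad_g = @(w) [2*w(1), 20*w(2)];       % gradient = vector of partial derivatives
w    = [4, 2];                         % initial guess at w
rate = 0.04;                           % learning rate (step size)
old_g = g(w);
while true
    w = w - rate * grad_g(w);          % step in the direction of the negative gradient
    if abs(old_g - g(w)) < 1e-8        % stop when the decrease in g(w) is tiny
        break;
    end
    old_g = g(w);
end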

Illustration of Gradient Descent

[Figure: the error surface g(w) plotted over weight space (w1, w2). The direction of steepest descent is the direction of the negative gradient; a gradient-descent step moves from the original point in weight space to a new point in weight space.]
Gradient Descent Algorithm

• Algorithm converges to either


– Global minimum if g(w) is convex (has a single minimum)
• this is the case for the perceptron

– Local minimum if g(w) has multiple local minima


• This is the case for multilayer neural networks
• To avoid local minima, in practice we rerun the gradient
descent algorithm from multiple random starting points and
pick the solution with the lowest MSE (see the sketch below)
• Note that the backpropagation algorithm is based on
gradient descent (using a clever way to calculate the
gradient)

– Note that the algorithm need not converge at all if the learning
rate (i.e., step size) is too large
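The restart strategy in the bullet above can be sketched as follows (purely illustrative: train_model and model_mse are hypothetical stand-ins for any gradient-descent trainer and its error measure):

% hypothetical sketch of multiple random restarts
best_mse = Inf;
for seed = 1:10
    w = train_model(data, targets, seed);        % gradient descent from a random start
    current_mse = model_mse(w, data, targets);
    if current_mse < best_mse
        best_mse = current_mse;
        best_weights = w;
    end
end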

Gradient Descent Algorithm

Mathematically, the Gradient Descent Rule:


w_new = w_old - η ∇g(w_old)
where
∇g(w) is the gradient and
η is the learning rate (small, positive)

Gradient Descent Algorithm

Mathematically, the Gradient Descent Rule:


w_new = w_old - η ∇g(w_old)
where
∇g(w) is the gradient and
η is the learning rate (small, positive)

In MATLAB, for the perceptron with sigmoid outputs this translates into the
following update rule:
weights = weights - rate * (o - targets(i)) * dsigmoid(o) * data(i, :);

(The part after the learning rate, (o - targets(i)) * dsigmoid(o) * data(i, :), is the gradient, evaluated at the current weight vector.)
learn_perceptron.m function
function [weights,mse,acc] = learn_perceptron(data,targets,rate,threshold,init_method,random_seed,plotflag,k)

% function [weights,mse,acc] = learn_perceptron(data,targets,rate,threshold,init_method,random_seed,plotflag,k)
%
% Learn the weights for a perceptron (linear) classifier to minimize its
% mean squared error.
% Sample code for CS 175
%
% Inputs
% data: N x (d+1) matrix of training data
% targets: N x 1 vector of target values (+1 or -1)
% rate: learning rate for the perceptron algorithm (e.g., rate = 0.001)
% threshold: if the reduction in MSE from one iteration to the next is *less*
% than threshold, then halt learning (e.g., threshold = 0.000001)
% init_method: method used to initialize the weights (1 = random, 2 = half
% way between 2 random points in each group, 3 = half way between
% the centroids in each group)
% random_seed: this is an integer used to "seed" the random number generator
% for either methods 1 or 2 for initialization (this is useful
% to be able to recreate a particular run exactly)
% plotflag: 1 means plotting is turned on, default value is 0
% k: how many iterations between plotting (e.g., k = 100)
%
% Outputs
% weights: 1 x (d+1) row vector of learned weights
% mse: mean squared error for learned weights
% acc: classification accuracy for learned weights (percentage, between 0 and 100)

learn_perceptron.m function
[N, d] = size(data);

% error checking
if nargin < 4
error('The function takes at least 4 arguments (data, targets, rate, threshold)');
end

if size(data,1) ~= size(targets,1)
error('The number of rows in the first two arguments (data, targets) does not match!');
end

% initialize the input arguments


if ~exist('k')
k = 100;
end
if ~exist('plotflag')
plotflag = 0;
end
if ~exist('random_seed')
random_seed = 1234;
end
if ~exist('init_method')
init_method = 1;
end

learn_perceptron.m function
% initialize the weights
weights = initialize_weights175(data,targets,init_method,random_seed);

iteration=0;
while iteration < 2 | ( abs(mse(iteration) - mse(iteration-1)) > threshold )

iteration = iteration + 1;

% cycle through all of the examples


for i=1:N
% calculate the unthresholded output for the ith row of "data"
o = sigmoid( weights * data(i,:)' );
% update the weight vector
weights = weights + rate * (targets(i) - o) * dsigmoid(o) * data(i, :);
end

% calculate the errors using current parameter values


[cerror(iteration), mse(iteration)] = perceptron_error(weights, data, targets);

% visualize the decision boundary if needed


if plotflag == 1 & mod(iteration, k) == 0
t = strcat ('Decision boundary at iteration # ', num2str(iteration));
weightplot175(data, targets, weights, t);

pause(0.0001);
end
end

learn_perceptron.m function
% create the plots of the MSE and Accuracy Vs. iteration number
if (plotflag == 1)
figure(2);
subplot(2, 1, 1);
plot(mse,'b-');
xlabel('iteration');
ylabel('MSE');

subplot(2, 1, 2);
plot(100-cerror,'b-');
xlabel('iteration');
ylabel('Accuracy');
end

% classification accuracy output (assumed to be 100 minus the final error rate)
acc = 100 - cerror(end);

% local functions…..
function s = sigmoid(x)
% Compute the sigmoid function, scaled from -1 to +1
s = 2./(1+exp(-x)) - 1;

function ds = dsigmoid(x)
% Compute the derivative of the (rescaled) sigmoid
ds = .5 .* (sigmoid(x)+1) .* (1-sigmoid(x));
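Note on dsigmoid: for the rescaled sigmoid s(x) = 2./(1+exp(-x)) - 1, the derivative is ds/dx = 0.5*(1+s(x))*(1-s(x)), which is exactly the expression computed above.

A hedged example call (data and targets are hypothetical and must follow the N x (d+1) and N x 1 formats in the header comment; the other arguments use the example values suggested there):

% example call; mse is returned per iteration, so it is really an MSE history
[w, mse_history] = learn_perceptron(data, targets, 0.001, 0.000001, 3, 1234, 1, 100);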

MATLAB Demonstration

Download MATLAB demo code (Zip file) from Web page

Run demo_perceptron_image_classification.m

Additional Concepts in Classification
(Relevant to Assignment 4)

Assignment 4

• threshold_image.m
– Simple function to display thresholded images

• knn_dispset.m
– Finds and displays the k-nearest-neighbors for a given image

• test_classifiers.m
– Uses cross-validation to compare classifiers
– (code is provided)

• test_imageclassifiers.m
– Compare different classification methods on image data
– Uses cross-validation

Assignment 4: using kNN to find similar images

function  [list] = knndispset(imageset,i,j,k,plotflag)


  % function  [list] = knndispset(imageset,i,j,k, plotflag)
 %
  %  a brief description of what the function does
  %  ......
  %                            Your Name, CS 175, date
 %
  %  Inputs
  %     imageset:  an array structure of images (CS 175 format)
  %     i, j:  integers specifying that imageset(i,j).image is the query image
  %     k: number of neighbors to find
  %    plotflag: display the k nearest neighbors if plotflag = 1;
 %
  %  Outputs
  %    list: a k x 2 matrix, where the first row contains the indices from
%          imageset of the nearest neighbor, the second row contains the
%          indices of the 2nd nearest neighbor, and so forth.
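A minimal sketch of the core neighbor search (illustrative only; variable names other than the inputs above are hypothetical, and it assumes all images in imageset have the same size):

% sketch: Euclidean distance from the query image to every other image
query = double(imageset(i,j).image(:));
[m, n] = size(imageset);
dists = [];                                    % each row: [distance, row index, column index]
for a = 1:m
    for b = 1:n
        if a == i && b == j
            continue;                          % skip the query image itself
        end
        x = double(imageset(a,b).image(:));
        dists = [dists; sqrt(sum((x - query).^2)), a, b];
    end
end
[sorted_d, order] = sort(dists(:,1));
list = dists(order(1:k), 2:3);                 % k x 2 matrix of (row, column) indices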

MATLAB demo of knndispset

• knndispset(i2straight,5,1,15,1);

MATLAB demo of knndispset

• knndispset(i2straight,18,1,15,1);

Training Data and Test Data

• Training data
– labeled data used to build a classifier
• Test data
– new data, not used in the training process, to evaluate how well a
classifier does on new data

• Memorization versus Generalization
– better training_accuracy: “memorizing” the training data
– better test_accuracy: “generalizing” to new data
– in general, we would like our classifier to perform well on new test
data, not just on training data
• i.e., we would like it to generalize well to new data
• Test accuracy is more important than training accuracy

Test Accuracy and Generalization

• The accuracy of our classifier on new unseen data is a fair/honest
assessment of the performance of our classifier

• Why is training accuracy not good enough?


– Training accuracy is optimistic
– a classifier like nearest-neighbor can construct boundaries which
always separate all training data points, but which do not separate
new points
• e.g., what is the training accuracy of kNN, k = 1?
– A flexible classifier can “overfit” the training data
• in effect it just memorizes the training data, but does not learn
the general relationship between x and C

• Generalization
– We are really interested in how our classifier generalizes to new
data
– test data accuracy is a good estimate of generalization performance

Another Example

[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), with a decision boundary separating Decision Region 1 from Decision Region 2.]

A More Complex Decision Boundary

[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), with a more complex decision boundary separating Decision Region 1 from Decision Region 2.]

Example: The Overfitting Phenomenon

A Complex Model

Y = high-order polynomial in X

The True (simpler) Model

Y = a X + b + noise

How Overfitting affects Prediction

[Figure: predictive error versus model complexity, showing the error on training data.]

How Overfitting affects Prediction

[Figure: predictive error versus model complexity, showing the error on test data and the error on training data.]

How Overfitting affects Prediction

[Figure: predictive error versus model complexity, showing the error on test data and the error on training data, annotated with underfitting (too little complexity), overfitting (too much complexity), and the ideal range for model complexity in between.]

Comparing Two Classifiers

• Say we have 2 classifiers, C1 and C2


• We want to choose the best one to use for future predictions
– e.g., medical diagnosis
– e.g., email filtering

• Can we use Training Accuracy to choose between them?


– No:
• e.g., C1 = perceptron, C2=kNN
• e.g., training accuracy(kNN) = 100%, but it is not necessarily best

• We can choose according to whichever of test_accuracy(C1) or
test_accuracy(C2) is larger

Training and Validation Data

Full Data Set

[Figure: the full data set is split into Training Data and Validation Data.]

Idea: train each model on the “training data” and then test each model’s accuracy on the validation data.

The v-fold Cross-Validation Method

• Why just choose one particular 90/10 “split” of the data?


– In principle we could do this multiple times

• “v-fold Cross-Validation” (e.g., v=10)


– randomly partition our full data set into v disjoint subsets (each
roughly of size n/v, n = total number of training data points)
• for i = 1:10 (here v = 10)
– train on 90% of data,
– Acc(i) = accuracy on other 10%
• end
• Cross-Validation-Accuracy = (1/v) Σi Acc(i)
– choose the method with the highest cross-validation accuracy
– common values for v are 5 and 10
– Can also do “leave-one-out” where v = n

Disjoint Validation Data Sets

Full Data Set

[Figure: the full data set, with one block held out as Validation Data and the rest used as Training Data (1st partition).]

Disjoint Validation Data Sets

Full Data Set

[Figure: the full data set again, with a different, disjoint block held out as Validation Data and the rest used as Training Data (1st and 2nd partitions).]

More on Cross-Validation

• Notes
– cross-validation generates an approximate estimate of how well
the classifier will do on “unseen” data

– by averaging over different partitions it is more robust than just a
single train/validate partition of the data

– “v-fold” cross-validation is a generalization
• partition data into disjoint validation subsets of size n/v
• train, validate, and average over the v partitions
• e.g., v=10 is commonly used

– v-fold cross-validation is approximately v times computationally
more expensive than just fitting a model to all of the data

Sample MATLAB code for Cross-Validation
% first randomly order the data (n = number of data points)
rand('state',rseed);
index = randperm(n);
data = ordereddata(index,:);
labels = orderedlabels(index);

Sample MATLAB code for Cross-Validation
% now perform v-fold cross-validation
olddata = data;
oldlabels = labels;
nvalidate = floor(n/v);
for i=1:v
% set testdata and testlabels to be the first nvalidate rows of olddata,oldlabels
…..
% set traindata and trainlabels to be the rest of rows of olddata,oldlabels
…...
% call classifiers with traindata, trainlabels, testdata, testlabels
cvaccuracy(i) = classifier(…..)

olddata = [traindata; testdata];


oldlabels = [trainlabels; testlabels];
end
overall_cvaccuracy = mean(cvaccuracy);
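One possible way to fill in the elided steps above (a sketch only; it uses the minimum_distance classifier described later in these slides as the example classifier):

% sketch of the elided steps inside the loop
testdata    = olddata(1:nvalidate, :);
testlabels  = oldlabels(1:nvalidate);
traindata   = olddata(nvalidate+1:end, :);
trainlabels = oldlabels(nvalidate+1:end);
cvaccuracy(i) = minimum_distance(traindata, trainlabels, testdata, testlabels);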

Assignment 4: Cross-Validation code (provided)

function [cvacc, trainacc] = test_classifiers(data1,data2,kvalues,v,rseed)


% function [cvacc, trainacc] = test_classifiers(data1,data2,kvalues,v,rseed)
%
% cross-validation results with minimum distance and
% knn classifiers
%
% INPUTS:
% data1: n1 x d feature data for class 1
% data2: n2 x d feature data for class 2
% kvalues: row vector of values of k for knn
% v: for "v-fold" cross-validation
% rseed: random seed setting before permuting the data order
%
% OUTPUT:
% cvacc: accuracy estimated using cross-validation
% trainacc: accuracy on the training data
% (accuracy expressed as a percentage, between 0 and 100%)

Example of running cross-validation code

>> test_classifiers(d1,d2,1,5,1234)
Training Data Results:
Minimum distance accuracy = 87.50
KNN, k=1, accuracy = 100.00

Cross Validation Results (v=5):


Minimum distance accuracy = 85.00
KNN, k=1, accuracy = 82.50

If we change to k=3 nearest-neighbors, the results are as follows:


>> test_classifiers(d1,d2,3,5,1234)
Training Data Results:
Minimum distance accuracy = 87.50
KNN, k=3, accuracy = 95.00

Cross Validation Results (v=5):


Minimum distance accuracy = 85.00
KNN, k=3, accuracy = 85.00

Assignment 4: Classifying images
function [cvacc, trainacc] = test_imageclassifiers(imageset1,imageset2,plotflag,kvalues,v,rseed)
%
% Learns a classifier to classify images in imageset1
% from images in imageset2, using minimum distance and knn classifiers,
% and returns the training and cross-validation accuracies.
%
%                                     Your name, CS 175A
%
% INPUTS:
%   imageset1, imageset2: arrays (of size m x n, and m2 x n2)
%       of structures, where imageset1(i,j).image is a matrix of
%       pixel (image) values of size nx by ny. It is assumed
%       that all images are of the same size in both imageset1
%       and imageset2.
%   plotflag: if plotflag=1, plot the mean image for each set,
%   and plot the difference of the means of the images in the two sets.
%   kvalues: an K x 1 vector of k values for the knn classifier
%   v: number of "folds" for v-fold cross-validation

The Minimum-Distance Classifier

• A very simple classifier

• Assume we have data from M classes (e.g., M=2)

• Calculate the mean for each class, e.g., Mean1 and Mean2
– mean vector = sum of all vectors/number of vectors
– mean vector ~ “centroid” of points

• Classify each new point x as follows


– for j = 1: M
• calculate the distance dj = Euclidean distance(x, Meanj)
– distance from x to the jth “class mean”
– choose the minimum distance as the predicted class
– assign x to the closest mean (see the sketch below)
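A minimal sketch of this rule for M = 2 (illustrative only; traindata, trainlabels, and the new point x follow the formats used elsewhere in these slides, and class labels 1 and 2 are an assumption):

% minimal sketch of the minimum-distance rule for two classes
Mean1 = mean(traindata(trainlabels == 1, :), 1);   % centroid of class 1
Mean2 = mean(traindata(trainlabels == 2, :), 1);   % centroid of class 2
d1 = sqrt(sum((x - Mean1).^2));                    % distance from new point x to each mean
d2 = sqrt(sum((x - Mean2).^2));
if d1 < d2
    predicted_class = 1;
else
    predicted_class = 2;
end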

Assignment 4: Minimum Distance Classifier (provided)

function acc = minimum_distance(traindata,trainlabels,testdata,testlabels)


%
% implementation of a minimum distance classifier
%
% INPUTS:
% traindata: N1 x d matrix of feature data
% trainlabels: N1 x 1 column vector of classlabels
% testdata: N2 x d matrix of feature data
% testlabels: N2 x 1 column vector of classlabels
%
% OUTPUTS:
% acc: accuracy (percentage) on the test data for a classifier
% trained on the training data

Summary

• Assignment 3
– Perceptron code
– Can use perceptrons (or any classifier) to classify images

• Assignment 4
– Nearest-neighbor with images
– Cross-validation
– Minimum distance classifier

– Due Tuesday at 9:30am

