
Discussion of Assignments 3 and 4

and
Cross-Validation Methods

Padhraic Smyth
Information and Computer Science
CS 175, Fall 2007
Review of Assignment 3 (Perceptron)
perceptron.m function

function [thresholded_outputs] = perceptron(weights,data)


% function [thresholded_outputs] = perceptron(weights,data)
%
% Compute the class predictions for perceptron (linear classifier)
% Sample code for CS 175
%
% Inputs
% weights: 1 x (d+1) row vector of weights
% data: N x (d+1) matrix of training data
%
% Outputs
% thresholded_outputs: N x 1 vector of thresholded perceptron outputs (+1 or -1)

% error checking
if size(weights,1) ~= 1
error('The first argument (weights) should be a row vector');
end

if size(data,2) ~= size(weights,2)
error('The arguments (weights and data) should have the same number of columns');
end

% calculate the thresholded outputs of the perceptron (vectorized)


thresholded_outputs = sign(data*weights');
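As a quick, hedged illustration (the weights and data below are made up; the bias input is placed in the first column purely for this example):

% example usage with made-up weights and three examples (2 features + bias column)
data    = [1  2.0  1.5;
           1 -1.0  0.5;
           1  0.2 -2.0];
weights = [0.1 0.5 -0.4];
outputs = perceptron(weights, data)     % returns [1; -1; 1]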

perceptron_error.m function
function [cerror, mse] = perceptron_error(weights,data,targets)
% function [cerror, mse] = perceptron_error(weights,data,targets)
%
% Compute mis-classification error and mean squared error for
% a perceptron (linear) classifier
% Sample code for CS 175
%
% Inputs
% weights: 1 x (d+1) row vector of weights
% data: N x (d+1) matrix of training data
% targets: N x 1 vector of target values (+1 or -1)
%
% Outputs
% cerror: the percentage of examples misclassified (between 0 and 100)
% mse: the mean-square error (sum of squared errors divided by N)

perceptron_error.m function
N = size(data, 1);

% error checking
if nargin ~= 3
error('The function takes three arguments (weights, data, targets)');
end

if size(weights,1) ~= 1
error('The first argument (weights) should be a row vector');
end

if size(data,2) ~= size(weights,2)
error('The first two arguments (weights and data) should have the same number of columns');
end

if size(data,1) ~= size(targets,1)
error('The last two arguments (targets and data) should have the same number of rows');
end

perceptron_error.m function
% calculate the unthresholded outputs, for all rows in data, N x 1 vector
f = (weights * data')';

% compare thresholded output to the target values to get the accuracy


cerror = 100 * sum(sign(f) ~= targets)/N;

% calculate the sigmoid version of the outputs, for all rows in data, N x 1 vector
outputs = sigmoid(f);

% compare sigmoid output vector to the target vector to get the mse
mse = sum((outputs-targets).^2)/N;

function s = sigmoid(x)
% Computes sigmoid function (scaled to -1, +1)
s = 2./(1+exp(-x)) - 1;
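Continuing the made-up example from perceptron.m above (the targets are equally hypothetical):

% example usage with the same made-up weights and data as above
targets = [1; -1; -1];
[cerror, mse] = perceptron_error(weights, data, targets)
% cerror is 33.33 here: one of the three examples is misclassified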

perceptron_error.m function
% calculate the unthresholded outputs, for all rows in data, N x 1 vector
f = (weights * data')';

% compare thresholded output to the target values to get the accuracy
% (vectorized computation of the classification error rate)
cerror = 100 * sum(sign(f) ~= targets)/N;

% calculate the sigmoid version of the outputs, for all rows in data, N x 1 vector
% (vectorized computation of the sigmoid output)
outputs = sigmoid(f);

% compare sigmoid output vector to the target vector to get the mse
% (vectorized computation of the MSE)
mse = sum((outputs-targets).^2)/N;

function s = sigmoid(x)
% Computes sigmoid function (scaled to -1, +1)
% (local function defining the sigmoid; note that it works on vectors)
s = 2./(1+exp(-x)) - 1;

Principle of Gradient Descent
Gradient descent algorithm:
– Start with some initial guess at w
– Move downhill in “small steps” in the direction of steepest descent

– What is the direction of steepest descent?


• The negative of the gradient, evaluated at w

– What is the gradient?


• Gradient = vector of derivatives with respect to each component of w
• E.g., if w = [ w1, w2, w3] then
gradient[g(w)] = [ d g(w)/ dw1, d g(w)/dw2, d g(w)/dw3 ]
• Note that the gradient is itself a vector (or a “direction”)

– After moving, recompute the gradient, get a new downhill direction, and move
again.

– Keep repeating this until the decrease in g(w) is less than some threshold, i.e.,
we appear to be on a flat part of the g(w) surface (see the sketch below).
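For concreteness, a minimal MATLAB sketch of this procedure on a hypothetical two-weight function (the function g, the starting point, and the learning rate are all made up for illustration):

% minimal gradient descent sketch on a hypothetical g(w) = w1^2 + 10*w2^2
g      = @(w) w(1)^2 + 10*w(2)^2;      % function to minimize
grad_g = @(w) [2*w(1), 20*w(2)];       % gradient = vector of partial derivatives
w    = [4, 2];                         % initial guess at w
rate = 0.04;                           % learning rate (step size)
old_g = g(w);
while true
    w = w - rate * grad_g(w);          % step in the direction of the negative gradient
    if abs(old_g - g(w)) < 1e-8        % stop when the decrease in g(w) is tiny
        break;
    end
    old_g = g(w);
end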

Illustration of Gradient Descent

[Figure: the error surface g(w) plotted over weight space (w1, w2). The direction of steepest descent is the direction of the negative gradient; a gradient-descent step moves from the original point in weight space to a new point in weight space.]
Gradient Descent Algorithm

• Algorithm converges to either


– Global minimum if g(w) is convex (has a single minimum)
• this is the case for the perceptron

– Local minimum if g(w) has multiple local minima


• This is the case for multilayer neural networks
• To avoid local minima, in practice we rerun the gradient
descent algorithm from multiple random starting points and
pick the solution with the lowest MSE (see the sketch below)
• Note that the backpropagation algorithm is based on
gradient descent (using a clever way to calculate the
gradient)

– Note that the algorithm need not converge at all if the learning
rate (i.e., step size) is too large
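The restart strategy in the bullet above can be sketched as follows (purely illustrative: train_model and model_mse are hypothetical stand-ins for any gradient-descent trainer and its error measure):

% hypothetical sketch of multiple random restarts
best_mse = Inf;
for seed = 1:10
    w = train_model(data, targets, seed);        % gradient descent from a random start
    current_mse = model_mse(w, data, targets);
    if current_mse < best_mse
        best_mse = current_mse;
        best_weights = w;
    end
end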

Gradient Descent Algorithm

Mathematically, the Gradient Descent Rule:


w_new = w_old - η ∇g(w_old)
where
∇g(w) is the gradient and
η is the learning rate (small, positive)

Gradient Descent Algorithm

Mathematically, the Gradient Descent Rule:


w_new = w_old - η ∇g(w_old)
where
∇g(w) is the gradient and
η is the learning rate (small, positive)

In MATLAB, for the perceptron with sigmoid outputs this translates into the
following update rule:
weights = weights - rate * (o - targets(i)) * dsigmoid(o) * data(i, :);

(The part after the learning rate, (o - targets(i)) * dsigmoid(o) * data(i, :), is the gradient, evaluated at the current weight vector.)
learn_perceptron.m function
function [weights,mse,acc] = learn_perceptron(data,targets,rate,threshold,init_method,random_seed,plotflag,k)

% function [weights,mse,acc] = learn_perceptron(data,targets,rate,threshold,init_method,random_seed,plotflag,k)
%
% Learn the weights for a perceptron (linear) classifier to minimize its
% mean squared error.
% Sample code for CS 175
%
% Inputs
% data: N x (d+1) matrix of training data
% targets: N x 1 vector of target values (+1 or -1)
% rate: learning rate for the perceptron algorithm (e.g., rate = 0.001)
% threshold: if the reduction in MSE from one iteration to the next is *less*
% than threshold, then halt learning (e.g., threshold = 0.000001)
% init_method: method used to initialize the weights (1 = random, 2 = half
% way between 2 random points in each group, 3 = half way between
% the centroids in each group)
% random_seed: this is an integer used to "seed" the random number generator
% for either methods 1 or 2 for initialization (this is useful
% to be able to recreate a particular run exactly)
% plotflag: 1 means plotting is turned on, default value is 0
% k: how many iterations between plotting (e.g., k = 100)
%
% Outputs
% weights: 1 x (d+1) row vector of learned weights
% mse: mean squared error for learned weights
% acc: classification accuracy for learned weights (percentage, between 0 and 100)

learn_perceptron.m function
[N, d] = size(data);

% error checking
if nargin < 4
error('The function takes at least 4 arguments (data, targets, rate, threshold)');
end

if size(data,1) ~= size(targets,1)
error('The number of rows in the first two arguments (data, targets) does not match!');
end

% initialize the input arguments


if ~exist('k')
k = 100;
end
if ~exist('plotflag')
plotflag = 0;
end
if ~exist('random_seed')
random_seed = 1234;
end
if ~exist('init_method')
init_method = 1;
end

learn_perceptron.m function
% initialize the weights
weights = initialize_weights175(data,targets,init_method,random_seed);

iteration=0;
while iteration < 2 | ( abs(mse(iteration) - mse(iteration-1)) > threshold )

iteration = iteration + 1;

% cycle through all of the examples


for i=1:N
% calculate the unthresholded output for the ith row of "data"
o = sigmoid( weights * data(i,:)' );
% update the weight vector
weights = weights + rate * (targets(i) - o) * dsigmoid(o) * data(i, :);
end

% calculate the errors using current parameter values


[cerror(iteration), mse(iteration)] = perceptron_error(weights, data, targets);

% visualize the decision boundary if needed


if plotflag == 1 & mod(iteration, k) == 0
t = strcat ('Decision boundary at iteration # ', num2str(iteration));
weightplot175(data, targets, weights, t);

pause(0.0001);
end
end

learn_perceptron.m function
% create the plots of the MSE and Accuracy Vs. iteration number
if (plotflag == 1)
figure(2);
subplot(2, 1, 1);
plot(mse,'b-');
xlabel('iteration');
ylabel('MSE');

subplot(2, 1, 2);
plot(100-cerror,'b-');
xlabel('iteration');
ylabel('Accuracy');
end

% classification accuracy output (assumed to be 100 minus the final error rate)
acc = 100 - cerror(end);

% local functions…..
function s = sigmoid(x)
% Compute the sigmoid function, scaled from -1 to +1
s = 2./(1+exp(-x)) - 1;

function ds = dsigmoid(x)
% Compute the derivative of the (rescaled) sigmoid
ds = .5 .* (sigmoid(x)+1) .* (1-sigmoid(x));
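Note on dsigmoid: for the rescaled sigmoid s(x) = 2./(1+exp(-x)) - 1, the derivative is ds/dx = 0.5*(1+s(x))*(1-s(x)), which is exactly the expression computed above.

A hedged example call (data and targets are hypothetical and must follow the N x (d+1) and N x 1 formats in the header comment; the other arguments use the example values suggested there):

% example call; mse is returned per iteration, so it is really an MSE history
[w, mse_history] = learn_perceptron(data, targets, 0.001, 0.000001, 3, 1234, 1, 100);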

MATLAB Demonstration

Download MATLAB demo code (Zip file) from Web page

Run demo_perceptron_image_classification.m

Additional Concepts in Classification
(Relevant to Assignment 4)

Assignment 4

• threshold_image.m
– Simple function to display thresholded images

• knn_dispset.m
– Finds and displays the k-nearest-neighbors for a given image

• test_classifiers.m
– Uses cross-validation to compare classifiers
– (code is provided)

• test_imageclassifiers.m
– Compare different classification methods on image data
– Uses cross-validation

Assignment 4: using kNN to find similar images

function  [list] = knndispset(imageset,i,j,k,plotflag)


  % function  [list] = knndispset(imageset,i,j,k, plotflag)
 %
  %  a brief description of what the function does
  %  ......
  %                            Your Name, CS 175, date
 %
  %  Inputs
  %     imageset:  an array structure of images (CS 175 format)
  %     i, j:  integers specifying that imageset(i,j).image is the query image
  %     k: number of neighbors to find
  %    plotflag: display the k nearest neighbors if plotflag = 1;
 %
  %  Outputs
  %    list: a k x 2 matrix, where the first row contains the indices from
%          imageset of the nearest neighbor, the second row contains the
%          indices of the 2nd nearest neighbor, and so forth.
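A minimal sketch of the core neighbor search (illustrative only; variable names other than the inputs above are hypothetical, and it assumes all images in imageset have the same size):

% sketch: Euclidean distance from the query image to every other image
query = double(imageset(i,j).image(:));
[m, n] = size(imageset);
dists = [];                                    % each row: [distance, row index, column index]
for a = 1:m
    for b = 1:n
        if a == i && b == j
            continue;                          % skip the query image itself
        end
        x = double(imageset(a,b).image(:));
        dists = [dists; sqrt(sum((x - query).^2)), a, b];
    end
end
[sorted_d, order] = sort(dists(:,1));
list = dists(order(1:k), 2:3);                 % k x 2 matrix of (row, column) indices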

MATLAB demo of knndispset

• knndispset(i2straight,5,1,15,1);

MATLAB demo of knndispset

• knndispset(i2straight,18,1,15,1);

Training Data and Test Data

• Training data
– labeled data used to build a classifier
• Test data
– new data, not used in the training process, to evaluate how well a
classifier does on new data

• Memorization versus Generalization
– better training_accuracy: “memorizing” the training data
– better test_accuracy: “generalizing” to new data
– in general, we would like our classifier to perform well on new test
data, not just on training data
• i.e., we would like it to generalize well to new data
• Test accuracy is more important than training accuracy

Test Accuracy and Generalization

• The accuracy of our classifier on new unseen data is a fair/honest
assessment of the performance of our classifier

• Why is training accuracy not good enough?


– Training accuracy is optimistic
– a classifier like nearest-neighbor can construct boundaries which
always separate all training data points, but which do not separate
new points
• e.g., what is the training accuracy of kNN, k = 1?
– A flexible classifier can “overfit” the training data
• in effect it just memorizes the training data, but does not learn
the general relationship between x and C

• Generalization
– We are really interested in how our classifier generalizes to new
data
– test data accuracy is a good estimate of generalization performance

Another Example

[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), with a decision boundary separating Decision Region 1 from Decision Region 2.]

A More Complex Decision Boundary

[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), with a more complex decision boundary separating Decision Region 1 from Decision Region 2.]

Example: The Overfitting Phenomenon

A Complex Model

Y = high-order polynomial in X

The True (simpler) Model

Y = a X + b + noise

How Overfitting affects Prediction

[Figure: predictive error versus model complexity, showing the error on training data.]

How Overfitting affects Prediction

[Figure: predictive error versus model complexity, showing the error on test data and the error on training data.]

How Overfitting affects Prediction

[Figure: predictive error versus model complexity, showing the error on test data and the error on training data, annotated with underfitting (too little complexity), overfitting (too much complexity), and the ideal range for model complexity in between.]

Comparing Two Classifiers

• Say we have 2 classifiers, C1 and C2


• We want to choose the best one to use for future predictions
– e.g., medical diagnosis
– e.g., email filtering

• Can we use Training Accuracy to choose between them?


– No:
• e.g., C1 = perceptron, C2=kNN
• e.g., training accuracy(kNN) = 100%, but it is not necessarily best

• We can choose according to whichever of test_accuracy(C1) or
test_accuracy(C2) is larger

Training and Validation Data

Full Data Set

[Figure: the full data set is split into Training Data and Validation Data.]

Idea: train each model on the “training data” and then test each model’s accuracy on the validation data.

The v-fold Cross-Validation Method

• Why just choose one particular 90/10 “split” of the data?


– In principle we could do this multiple times

• “v-fold Cross-Validation” (e.g., v=10)


– randomly partition our full data set into v disjoint subsets (each
roughly of size n/v, n = total number of training data points)
• for i = 1:10 (here v = 10)
– train on 90% of data,
– Acc(i) = accuracy on other 10%
• end
• Cross-Validation-Accuracy = (1/v) Σi Acc(i)
– choose the method with the highest cross-validation accuracy
– common values for v are 5 and 10
– Can also do “leave-one-out” where v = n

Disjoint Validation Data Sets

Full Data Set

[Figure: the full data set, with one block held out as Validation Data and the rest used as Training Data (1st partition).]

Disjoint Validation Data Sets

Full Data Set

[Figure: the full data set again, with a different, disjoint block held out as Validation Data and the rest used as Training Data (1st and 2nd partitions).]

More on Cross-Validation

• Notes
– cross-validation generates an approximate estimate of how well
the classifier will do on “unseen” data

– by averaging over different partitions it is more robust than just a
single train/validate partition of the data

– “v-fold” cross-validation is a generalization
• partition data into disjoint validation subsets of size n/v
• train, validate, and average over the v partitions
• e.g., v=10 is commonly used

– v-fold cross-validation is approximately v times computationally
more expensive than just fitting a model to all of the data

Sample MATLAB code for Cross-Validation
% first randomly order the data (n = number of data points)
rand('state',rseed);
index = randperm(n);
data = ordereddata(index,:);
labels = orderedlabels(index);

Sample MATLAB code for Cross-Validation
% now perform v-fold cross-validation
olddata = data;
oldlabels = labels;
nvalidate = floor(n/v);
for i=1:v
% set testdata and testlabels to be the first nvalidate rows of olddata,oldlabels
…..
% set traindata and trainlabels to be the rest of rows of olddata,oldlabels
…...
% call classifiers with traindata, trainlabels, testdata, testlabels
cvaccuracy(i) = classifier(…..)

olddata = [traindata; testdata];


oldlabels = [trainlabels; testlabels];
end
overall_cvaccuracy = mean(cvaccuracy);
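One possible way to fill in the elided steps above (a sketch only; it uses the minimum_distance classifier described later in these slides as the example classifier):

% sketch of the elided steps inside the loop
testdata    = olddata(1:nvalidate, :);
testlabels  = oldlabels(1:nvalidate);
traindata   = olddata(nvalidate+1:end, :);
trainlabels = oldlabels(nvalidate+1:end);
cvaccuracy(i) = minimum_distance(traindata, trainlabels, testdata, testlabels);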

Assignment 4: Cross-Validation code (provided)

function [cvacc, trainacc] = test_classifiers(data1,data2,kvalues,v,rseed)


% function [cvacc, trainacc] = test_classifiers(data1,data2,kvalues,v,rseed)
%
% cross-validation results with minimum distance and
% knn classifiers
%
% INPUTS:
% data1: n1 x d feature data for class 1
% data2: n2 x d feature data for class 2
% kvalues: row vector of values of k for knn
% v: for "v-fold" cross-validation
% rseed: random seed setting before permuting the data order
%
% OUTPUT:
% cvacc: accuracy estimated using cross-validation
% trainacc: accuracy on the training data
% (accuracy expressed as a percentage, between 0 and 100%)

Example of running cross-validation code

>> test_classifiers(d1,d2,1,5,1234)
Training Data Results:
Minimum distance accuracy = 87.50
KNN, k=1, accuracy = 100.00

Cross Validation Results (v=5):


Minimum distance accuracy = 85.00
KNN, k=1, accuracy = 82.50

If we change to k=3 nearest-neighbors, the results are as follows:


>> test_classifiers(d1,d2,3,5,1234)
Training Data Results:
Minimum distance accuracy = 87.50
KNN, k=3, accuracy = 95.00

Cross Validation Results (v=5):


Minimum distance accuracy = 85.00
KNN, k=3, accuracy = 85.00

Assignment 4: Classifying images
function [cvacc, trainacc] = test_imageclassifiers(imageset1,imageset2,plotflag,kvalues,v,rseed)
%
% Learns a classifier to classify images in imageset1
% from images in imageset2, using minimum distance and knn classifiers,
% and returns the training and cross-validation accuracies.
%
%                                     Your name, CS 175A
%
% INPUTS:
%   imageset1, imageset2: arrays (of size m x n, and m2 x n2)
%       of structures, where imageset1(i,j).image is a matrix of
%       pixel (image) values of size nx by ny. It is assumed
%       that all images are of the same size in both imageset1
%       and imageset2.
%   plotflag: if plotflag=1, plot the mean image for each set,
%   and plot the difference of the means of the images in the two sets.
%   kvalues: an K x 1 vector of k values for the knn classifier
%   v: number of "folds" for v-fold cross-validation

The Minimum-Distance Classifier

• A very simple classifier

• Assume we have data from M classes (e.g., M=2)

• Calculate the mean for each class, e.g., Mean1 and Mean2
– mean vector = sum of all vectors/number of vectors
– mean vector ~ “centroid” of points

• Classify each new point x as follows


– for j = 1: M
• calculate the distance dj = Euclidean distance(x, Meanj)
– distance from x to the jth “class mean”
– choose the minimum distance as the predicted class
– assign x to the closest mean (see the sketch below)
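A minimal sketch of this rule for M = 2 (illustrative only; traindata, trainlabels, and the new point x follow the formats used elsewhere in these slides, and class labels 1 and 2 are an assumption):

% minimal sketch of the minimum-distance rule for two classes
Mean1 = mean(traindata(trainlabels == 1, :), 1);   % centroid of class 1
Mean2 = mean(traindata(trainlabels == 2, :), 1);   % centroid of class 2
d1 = sqrt(sum((x - Mean1).^2));                    % distance from new point x to each mean
d2 = sqrt(sum((x - Mean2).^2));
if d1 < d2
    predicted_class = 1;
else
    predicted_class = 2;
end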

Assignment 4: Minimum Distance Classifier (provided)

function acc = minimum_distance(traindata,trainlabels,testdata,testlabels)


%
% implementation of a minimum distance classifier
%
% INPUTS:
% traindata: N1 x d matrix of feature data
% trainlabels: N1 x 1 column vector of classlabels
% testdata: N2 x d matrix of feature data
% testlabels: N2 x 1 column vector of classlabels
%
% OUTPUTS:
% acc: accuracy (percentage) on the test data for a classifier
% trained on the training data

Summary

• Assignment 3
– Perceptron code
– Can use perceptrons (or any classifier) to classify images

• Assignment 4
– Nearest-neighbor with images
– Cross-validation
– Minimum distance classifier

– Due Tuesday at 9:30am

