
Appendix

A1: Software used in this book


A2: Data used in this book

A1: Software used in this book


Classification and Regression Tree; CART
We used the Matlab functions treefit and treeval for learning and
prediction, respectively, with Gini's diversity index as the splitting
criterion. See also Note 1(c) at the end of Chapter 7.
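As an illustration, a minimal sketch of the calls involved is given below,
assuming the older treefit/treeval interface of the Statistics Toolbox and
hypothetical variable names; in that interface Gini's diversity index is the
default splitting criterion for classification trees. Check the documentation
of your toolbox version for the exact signatures.

% LearnSamples: n x m predictor matrix; LearnLabels: 0/1 vector
t = treefit(LearnSamples, LearnLabels, 'method', 'classification');
% predicted classes for the test cases; cname holds the class labels
[yfit, nodes, cname] = treeval(t, TestSamples);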

k-Nearest Neighbor; k-NN


k-NN algorithms are relatively simple to implement, but the best
implementations are truly fast. We used several implementations and list two
that are available at Matlab Central: an implementation by Yi Cao (Cranfield
University, 25 March 2008) called Efficient K-Nearest Neighbor Search using
JIT, http://www.mathworks.com/matlabcentral/fileexchange/19345-efficient-k-nearest-neighbor-search-using-jit,
and an implementation by Luigi Giaccari called Fast k-Nearest Neighbors
Search, http://www.mathworks.es/matlabcentral/fileexchange/22190.
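For concreteness, a minimal brute-force k-NN classifier in plain Matlab is
sketched below. This is our own illustration, not code from either package,
and assumes binary 0/1 labels and a hypothetical function name; the cited
packages compute the same neighbors far faster on large datasets.

function decisions = simpleknn(LearnSamples, LearnLabels, TestSamples, k)
% classify each test case by majority vote among its k nearest
% training cases, using squared Euclidean distance
nlearn = size(LearnSamples, 1);
ntest = size(TestSamples, 1);
decisions = zeros(ntest, 1);
for i = 1:ntest
    d = sum((LearnSamples - repmat(TestSamples(i,:), nlearn, 1)).^2, 2);
    [ds, idx] = sort(d);                              % nearest first
    decisions(i) = mean(LearnLabels(idx(1:k))) > 0.5; % majority vote
end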

Support Vector Machines; SVM


We used the implementation SVMlight, which can be found at
http://svmlight.joachims.org/.
A number of other software packages for SVMs can be found at
http://www.support-vector-machines.org/SVM_soft.html.

Fisher Linear Discriminant Analysis; LDA


Matlab code for performing Fisher LDA is provided below.
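For reference, the code maximizes the standard Fisher criterion (restated
here in our notation, matching the quantities computed in fLDA.m):

\[
J(w) = \frac{\bigl(w^{\top}(\mu_{+}-\mu_{-})\bigr)^{2}}{w^{\top}(S_{+}+S_{-})\,w},
\qquad
w \propto (S_{+}+S_{-})^{-1}(\mu_{+}-\mu_{-}),
\]

where \(\mu_{+}\), \(\mu_{-}\) are the class means and \(S_{+}\), \(S_{-}\)
the within-class scatter matrices returned by getSmat.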

Logistic Regression
Matlab code for performing logistic regression is provided below.
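For reference, the fitting branch of the code implements the standard
iteratively reweighted least squares (Newton-Raphson) update, restated here
in our notation:

\[
z = X\beta + W^{-1}(y - p), \qquad
\beta^{\mathrm{new}} = (X^{\top} W X)^{-1} X^{\top} W z, \qquad
W = \mathrm{diag}\bigl(p_i(1-p_i)\bigr),
\]

where \(p_i\) is the current fitted probability for case \(i\) and \(X\)
includes a final column of ones for the intercept.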

Neural Networks
Neural Networks is a broad term that covers a number of related
implementations. In this book we used one that optimizes the number of nodes
in the hidden layers. This is derived from the Broyden–Fletcher–Goldfarb–Shanno
(BFGS) method and from work by D. J. C. MacKay; see MacKay (1992a,b). The
implementation we used was written for Matlab by Sigurdur Sigurdsson (2002)
and is based on an older neural classifier written by Morten With Pedersen.
It is available in the ANN:DTU Toolbox,
http://isp.imm.dtu.dk/toolbox/ann/index.html.
As stated on that website, all code can be used freely in research and other
nonprofit applications. If you publish results obtained with the ANN:DTU
Toolbox you are asked to cite the relevant sources.
Multiple neural network packages are available for R (search “neural
network” at http://cran.r-project.org/).
Still other free packages for neural network classification (NuMap and
NuClass, available only for Windows) can be found at
http://www-ee.uta.edu/eeweb/IP/Software/Software.htm.
A convenient place to find a collection of Matlab implementations is
“Matlab Central” http://www.mathworks.com/matlabcentral/.
For example, Neural Network Classifiers written by Sebastien Paris is
available at http://www.mathworks.com/matlabcentral/fileexchange/17415.
A commercial package “Neural Network Toolbox” is also available for
Matlab.

Boosting
We used BoosTexter, available at
http://www.cs.princeton.edu/~schapire/boostexter.html.
For this implementation see Schapire and Singer (2000). As stated on the
home page above, “the object code for BoosTexter is available free from
AT&T for non-commercial research or educational purposes.”

Random Forests; RF
Random Forests (written for R) can be obtained from
http://cran.r-project.org/web/packages/randomForest/index.html.


SAS® programs
Logistic regressions for the stroke study analysis were done in SAS® version
9.1.3 PROC LOGISTIC.
Custom SAS® version 8.2 PROC IML code, the macro %GOFLOGIT written by
Oliver Kuss (2002), was used for model goodness-of-fit analysis in logistic
regression.

R code for comparison of correlated error estimates


We applied an adjusted Wald confidence interval and the Tango score
interval, using the R code described in Agresti and Min (2005).
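For readers working in Matlab rather than R, a minimal sketch of the adjusted
Wald interval is given below. Our reading of Agresti and Min (2005) is that
0.5 is added to each cell of the paired 2 x 2 table before applying the usual
Wald formula; consult the paper and its accompanying R code for the
definitive version. The function name is hypothetical, and norminv requires
the Statistics Toolbox.

function ci = adjwaldci(tbl, alpha)
% tbl: paired 2 x 2 table of counts [n11 n12; n21 n22]
% ci:  confidence interval for the difference of the two
%      correlated (marginal) proportions
if nargin < 2, alpha = 0.05; end
t = tbl + 0.5;                           % assumed Agresti-Min adjustment
n = sum(t(:));
d = (t(1,2) - t(2,1)) / n;               % difference of marginal proportions
v = (t(1,2) + t(2,1) - (t(1,2) - t(2,1))^2 / n) / n^2;
z = norminv(1 - alpha/2);
ci = [d - z*sqrt(v), d + z*sqrt(v)];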

Matlab code
Code for Fisher Linear Discriminant Analysis: fLDA.m

function [ConfMatrix,decisions,prms]=fLDA(LearnSamples, ...
LearnLabels,TestSamples,TestLabels)
% Usage:
% [ConfMatrix,decisions,prms]=fLDA(LearnSamples,LearnLabels,
% TestSamples,TestLabels)
%
% The code expects LearnSamples and TestSamples to be
% n x m matrices, where n is the number of cases (samples)
% and each row contains the m predictor values for that case.
% Otherwise, transpose the data, i.e. uncomment the lines below:
% LearnSamples=LearnSamples';
% TestSamples=TestSamples';

% separate the data into "positives" and "negatives"


ipos=find(LearnLabels==1);
ineg=find(LearnLabels==0);
predpos=LearnSamples(ipos,:);
predneg=LearnSamples(ineg,:);
nsamp_pos=size(predpos,1); % number of positive cases
nsamp_neg=size(predneg,1); % number of negative cases

% obtain the scatter matrix and the mean for "positives" and
% "negatives"
[Spos, meanpos]=getSmat(predpos);
[Sneg, meanneg]=getSmat(predneg);


% obtain Fisher LDA projection wproj that maximizes J(w)


wproj=inv(Spos+Sneg)*(meanpos-meanneg)';
wproj=wproj/norm(wproj);

% find appropriate decision threshold


sp=sqrt(trace(Spos)/nsamp_pos);
sm=sqrt(trace(Sneg)/nsamp_neg);
mnsm=(sm*meanpos+sp*meanneg)/(sm+sp);
cthresh=mnsm*wproj;

if nargout>2 % if asking for it, provide the parameters of fLDA


prms={wproj,cthresh,mnsm};
end

decisions=[];
ConfMatrix=[];
if nargin>2 % if testsamples provided
% run the discriminant on the testing data
cpred=TestSamples*wproj;
decisions=(cpred>cthresh)';
if nargin>3 % if testlabels provided
% obtain the confusion matrix (1 indicates that we
% want the raw counts)
ConfMatrix=GetConfTable(decisions,TestLabels,1);
end
end

% Supporting functions
% get the scatter matrix (unnormalized covariance) and mean of x
function [Smat,meanx]=getSmat(x)
meanx=mean(x);
zmn=x-repmat(meanx,size(x,1),1);
Smat=zmn'*zmn;

% get a confusion matrix


function [ConfTable,labls]=GetConfTable(FinalDecision,...
TestLabels,counts)
% function [ConfTable,labls]=GetConfTable(FinalDecision,
% TestLabels,counts)
% Returns confusion table based on the machine decisions
% (FinalDecision) and the known outcomes.
% If the flag counts=0 returns confusion matrix in terms
% of fractions (percentages), otherwise returns confusion
% matrix as raw counts

if nargin<3, counts=0; end


% labls=fliplr(flipud(unique(TestLabels)));
labls=flipud(unique(TestLabels(:))); % order labels with 1 (positive) first
nlabls=length(labls);
if nlabls == 1 % if all test labels were the same
nlabls=2;
if labls(1), labls=[labls(1),0]; else labls=[1,0]; end
end
ConfTable=zeros(nlabls);
for ilb=1:nlabls
i1=find(TestLabels==labls(ilb));
for jlb=1:nlabls
i2=find(FinalDecision(i1)==labls(jlb));
if counts
ConfTable(jlb,ilb)=length(i2);
else
ConfTable(jlb,ilb)=length(i2)/length(i1);
end
end
end
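A hypothetical call to fLDA.m, with LearnLabels and TestLabels coded 0/1
(variable names are for illustration only):

[ConfMatrix, decisions, prms] = fLDA(LearnSamples, LearnLabels, ...
    TestSamples, TestLabels);
% ConfMatrix holds raw counts; prms = {wproj, cthresh, mnsm}, so further
% samples can be classified via NewSamples*prms{1} > prms{2}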

Code for Logistic Regression: logisticr.m

function lrresult = logisticr(LRSamples, LRinp, fitit, wts)


% function lrresult = logisticr(LRSamples, LRinp, fitit, wts)
% the parameter fitit determines whether fitting or evaluation is done
% for fitit=1:
% input: LearnSamples as LRSamples and LearnLabels as LRinp
% returns the LR model (npredictors+1) parameters in lrresult
% for fitit=0 or fitit=2:
% input: TestingSamples (LRSamples), LR parameters (LRinp)
% returns probabilities (fitit=0) or
% log-odds (fitit=2) in lrresult
%
% Example (using a decision threshold "logisticthreshold"):
% lrprms=logisticr(LearnSamples, LearnLabels, 1);
% decisions=logisticr(TestSamples,lrprms) > logisticthreshold;
% ConfMatrix=GetConfTable(decisions,TestLabels);

if nargin<3, fitit=0; end

switch fitit
case {0,2} % return the LR values
[nsamps, mpred] = size(LRSamples);
inputmat=[LRSamples,ones(nsamps,1)];
liny = inputmat * LRinp;
if fitit==2
lrresult = liny;
else
lrresult = invlogit(liny);
end
case 1 % Obtain LR parameters (fitting)
% specify desired precision and maximal number of
% Newton-Raphson iterations; trade precision (small itereps,
% e.g. 1e-12) for speed (larger itereps, e.g. 1e-7)
itereps = 1e-9;
maxiter = 100;
[nsamps, mpred] = size(LRSamples);
inputmat=[LRSamples,ones(nsamps,1)];
mprms=mpred+1;

% assume all weights equal, if not specified


if nargin < 4, wts = ones(nsamps,1); end

% initialize iterations
lrresult=zeros(mprms,1);
lrnlabels=LRinp(:); % ensure labels form a column vector
prevexpy=-ones(size(lrnlabels));
for iter=1:maxiter
liny=inputmat*lrresult;
expy=invlogit(liny);
% LR weights based on derivative of invlogit (=p(1-p))
lrw=max(5*eps, expy.*(1-expy)); % avoid zero lrw for extreme probabilities
liny=liny+(lrnlabels-expy)./lrw; % update with W^(-1)(y-p)
% adjust prescribed weights "wts" with LR weights to
% obtain the final weights matrix
weights=spdiags(lrw.*wts, 0, nsamps, nsamps);

% the updated parameters can then be obtained as an equivalent
% weighted linear regression to the modified liny
lrresult=inv(inputmat'*weights*inputmat) ...
*inputmat'*weights* liny;
if sum(abs(expy-prevexpy)) < nsamps*itereps
break;
end
prevexpy=expy;
end
otherwise
disp(sprintf('The value fitit=%d not implemented yet', fitit))
end

function logodds=logit(p)
logodds=log(p./(1-p));

function p=invlogit(lodds)
p=1./(1+exp(-lodds));
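A hypothetical fit/predict sequence using logisticr.m, with a decision
threshold of 0.5 chosen purely for illustration (note that GetConfTable is a
subfunction of fLDA.m and would have to be saved in its own file before it
could be called here):

lrprms = logisticr(LearnSamples, LearnLabels, 1);  % fit: fitit=1
probs = logisticr(TestSamples, lrprms, 0);         % predicted probabilities
decisions = probs > 0.5;                           % illustrative threshold
ConfMatrix = GetConfTable(decisions, TestLabels, 1);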

A2: Data used in this book


We principally used three datasets: (1) the German stroke dataset, (2) the
lupus dataset, and (3) the simulated cholesterol data. These are all discussed
in Chapter 4. Unfortunately, neither the German stroke data nor the lupus
data are available for public use. As an alternative we suggest the thoroughly
edited and maintained data collection at the University of California (Irvine)
machine learning website: http://archive.ics.uci.edu/ml/.
As of Spring 2010, there are nearly 200 datasets available, providing source
material for the study of nearly every aspect of machine learning and drawn
from an astonishing range of subjects.
One caution, however, as discussed in Section 2.10: the benchmarking
problem raised by using data to declare single winners. This is indeed a
problem since, for any given dataset on which a given scheme does best
among a set of schemes, there is a counter dataset and an alternative
scheme such that the alternative scheme can be selected to be Bayes
consistent, and do better than the original “winning” scheme at every sample
size.
Moreover, as discussed in Chapter 12, for any finite collection of machines
there is an ensemble machine that does at least as well as the best in that
collection. That is to say, declaring a single winner in a machine arms
race is a misdirected use of computing resources and brain power.
