
Appendix

A1: Software used in this book


A2: Data used in this book

A1: Software used in this book


Classification and Regression Tree; CART
We used the Matlab functions treefit and treeval for learning and
prediction, respectively, with Gini's diversity index as the splitting
criterion. See also Note 1(c) at the end of Chapter 7.
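As an illustration, a minimal sketch of the calls involved is given below,
assuming the older treefit/treeval interface of the Statistics Toolbox and
hypothetical variable names; in that interface Gini's diversity index is the
default splitting criterion for classification trees. Check the documentation
of your toolbox version for the exact signatures.

% LearnSamples: n x m predictor matrix; LearnLabels: 0/1 vector
t = treefit(LearnSamples, LearnLabels, 'method', 'classification');
% predicted classes for the test cases; cname holds the class labels
[yfit, nodes, cname] = treeval(t, TestSamples);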

k-Nearest Neighbor; k-NN


k-NN algorithms are relatively simple to implement, but the best
implementations are truly fast. We used several implementations and list two
that are available at Matlab Central: an implementation by Yi Cao (Cranfield
University, 25 March 2008) called Efficient K-Nearest Neighbor Search using
JIT, http://www.mathworks.com/matlabcentral/fileexchange/19345-efficient-k-nearest-neighbor-search-using-jit,
and an implementation by Luigi Giaccari called Fast k-Nearest Neighbors
Search, http://www.mathworks.es/matlabcentral/fileexchange/22190.
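For concreteness, a minimal brute-force k-NN classifier in plain Matlab is
sketched below. This is our own illustration, not code from either package,
and assumes binary 0/1 labels and a hypothetical function name; the cited
packages compute the same neighbors far faster on large datasets.

function decisions = simpleknn(LearnSamples, LearnLabels, TestSamples, k)
% classify each test case by majority vote among its k nearest
% training cases, using squared Euclidean distance
nlearn = size(LearnSamples, 1);
ntest = size(TestSamples, 1);
decisions = zeros(ntest, 1);
for i = 1:ntest
    d = sum((LearnSamples - repmat(TestSamples(i,:), nlearn, 1)).^2, 2);
    [ds, idx] = sort(d);                              % nearest first
    decisions(i) = mean(LearnLabels(idx(1:k))) > 0.5; % majority vote
end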

Support Vector Machines; SVM


We used the implementation SVMlight, which can be found at
http://svmlight.joachims.org/.
A number of other software packages for SVMs can be found at
http://www.support-vector-machines.org/SVM_soft.html.

Fisher Linear Discriminant Analysis; LDA


Matlab code for performing Fisher LDA is provided below.
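For reference, the code maximizes the standard Fisher criterion (restated
here in our notation, matching the quantities computed in fLDA.m):

\[
J(w) = \frac{\bigl(w^{\top}(\mu_{+}-\mu_{-})\bigr)^{2}}{w^{\top}(S_{+}+S_{-})\,w},
\qquad
w \propto (S_{+}+S_{-})^{-1}(\mu_{+}-\mu_{-}),
\]

where \(\mu_{+}\), \(\mu_{-}\) are the class means and \(S_{+}\), \(S_{-}\)
the within-class scatter matrices returned by getSmat.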

Logistic Regression
Matlab code for performing logistic regression is provided below.
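For reference, the fitting branch of the code implements the standard
iteratively reweighted least squares (Newton-Raphson) update, restated here
in our notation:

\[
z = X\beta + W^{-1}(y - p), \qquad
\beta^{\mathrm{new}} = (X^{\top} W X)^{-1} X^{\top} W z, \qquad
W = \mathrm{diag}\bigl(p_i(1-p_i)\bigr),
\]

where \(p_i\) is the current fitted probability for case \(i\) and \(X\)
includes a final column of ones for the intercept.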

Neural Networks
Neural Networks is a broad term that covers a number of related
implementations. In this book we used one that optimizes the number of nodes
in the hidden layers. This is derived from the Broyden–Fletcher–Goldfarb–Shanno
(BFGS) method and from work by D. J. C. MacKay; see MacKay (1992a,b). The
implementation we used was written for Matlab by Sigurdur Sigurdsson (2002)
and is based on an older neural classifier written by Morten With Pedersen.
It is available in the ANN:DTU Toolbox,
http://isp.imm.dtu.dk/toolbox/ann/index.html.
As stated on that website, all code can be used freely in research and other
nonprofit applications. If you publish results obtained with the ANN:DTU
Toolbox you are asked to cite the relevant sources.
Multiple neural network packages are available for R (search “neural
network” at http://cran.r-project.org/).
Still other free packages for neural network classification (NuMap and
NuClass, available only for Windows) can be found at
http://www-ee.uta.edu/eeweb/IP/Software/Software.htm.
A convenient place to find a collection of Matlab implementations is
“Matlab Central” http://www.mathworks.com/matlabcentral/.
For example, Neural Network Classifiers written by Sebastien Paris is
available at http://www.mathworks.com/matlabcentral/fileexchange/17415.
A commercial package “Neural Network Toolbox” is also available for
Matlab.

Boosting
We used BoosTexter, available at
http://www.cs.princeton.edu/~schapire/boostexter.html.
For this implementation see Schapire and Singer (2000). As stated on the
home page above, “the object code for BoosTexter is available free from
AT&T for non-commercial research or educational purposes.”

Random Forests; RF
Random Forests (written for R) can be obtained from
http://cran.r-project.org/web/packages/randomForest/index.html.


SAS® programs
Logistic regressions for the stroke study analysis were done in SAS® version
9.1.3 PROC LOGISTIC.
Custom SAS® version 8.2 PROC IML code, the macro %GOFLOGIT written by
Oliver Kuss (2002), was used for model goodness-of-fit analysis in logistic
regression.

R code for comparison of correlated error estimates


We applied an adjusted Wald confidence interval and the Tango score
interval, using the R code described in Agresti and Min (2005).
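For readers working in Matlab rather than R, a minimal sketch of the adjusted
Wald interval is given below. Our reading of Agresti and Min (2005) is that
0.5 is added to each cell of the paired 2 x 2 table before applying the usual
Wald formula; consult the paper and its accompanying R code for the
definitive version. The function name is hypothetical, and norminv requires
the Statistics Toolbox.

function ci = adjwaldci(tbl, alpha)
% tbl: paired 2 x 2 table of counts [n11 n12; n21 n22]
% ci:  confidence interval for the difference of the two
%      correlated (marginal) proportions
if nargin < 2, alpha = 0.05; end
t = tbl + 0.5;                           % assumed Agresti-Min adjustment
n = sum(t(:));
d = (t(1,2) - t(2,1)) / n;               % difference of marginal proportions
v = (t(1,2) + t(2,1) - (t(1,2) - t(2,1))^2 / n) / n^2;
z = norminv(1 - alpha/2);
ci = [d - z*sqrt(v), d + z*sqrt(v)];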

Matlab code
Code for Fisher Linear Discriminant Analysis: fLDA.m

function [ConfMatrix,decisions,prms]=fLDA(LearnSamples, ...
LearnLabels,TestSamples,TestLabels)
% Usage:
% [ConfMatrix,decisions,prms]=fLDA(LearnSamples,LearnLabels,
% TestSamples,TestLabels)
%
% The code expects LearnSamples and TestSamples to be
% n x m matrices, where n is the number of cases (samples)
% and each row contains the m predictor values for that case.
% Otherwise, transpose the data, i.e. uncomment the lines below:
% LearnSamples=LearnSamples';
% TestSamples=TestSamples';

% separate the data into "positives" and "negatives"


ipos=find(LearnLabels==1);
ineg=find(LearnLabels==0);
predpos=LearnSamples(ipos,:);
predneg=LearnSamples(ineg,:);
nsamp_pos=size(predpos,1); % number of positive cases
nsamp_neg=size(predneg,1); % number of negative cases

% obtain the scatter matrix and the mean for "positives" and
% "negatives"
[Spos, meanpos]=getSmat(predpos);
[Sneg, meanneg]=getSmat(predneg);


% obtain Fisher LDA projection wproj that maximizes J(w)


wproj=inv(Spos+Sneg)*(meanpos-meanneg)';
wproj=wproj/norm(wproj);

% find appropriate decision threshold


sp=sqrt(trace(Spos)/nsamp_pos);
sm=sqrt(trace(Sneg)/nsamp_neg);
mnsm=(sm*meanpos+sp*meanneg)/(sm+sp);
cthresh=mnsm*wproj;

if nargout>2 % if asking for it, provide the parameters of fLDA


prms={wproj,cthresh,mnsm};
end

decisions=[];
ConfMatrix=[];
if nargin>2 % if testsamples provided
% run the discriminant on the testing data
cpred=TestSamples*wproj;
decisions=(cpred>cthresh)';
if nargin>3 % if testlabels provided
% obtain the confusion matrix (1 indicates that we
% want the raw counts)
ConfMatrix=GetConfTable(decisions,TestLabels,1);
end
end

% Supporting functions
% get the scatter matrix (unnormalized covariance) and mean of x
function [Smat,meanx]=getSmat(x)
meanx=mean(x);
zmn=x-repmat(meanx,size(x,1),1);
Smat=zmn'*zmn;

% get a confusion matrix


function [ConfTable,labls]=GetConfTable(FinalDecision,...
TestLabels,counts)
% function [ConfTable,labls]=GetConfTable(FinalDecision,
% TestLabels,counts)
% Returns confusion table based on the machine decisions
% (FinalDecision) and the known outcomes.
% If the flag counts=0 returns confusion matrix in terms
% of fractions (percentages), otherwise returns confusion
% matrix as raw counts

if nargin<3, counts=0; end


% labls=fliplr(flipud(unique(TestLabels)));
labls=flipud(unique(TestLabels(:))); % order labels with 1 (positive) first
nlabls=length(labls);
if nlabls == 1 % if all test labels were the same
nlabls=2;
if labls(1), labls=[labls(1),0]; else labls=[1,0]; end
end
ConfTable=zeros(nlabls);
for ilb=1:nlabls
i1=find(TestLabels==labls(ilb));
for jlb=1:nlabls
i2=find(FinalDecision(i1)==labls(jlb));
if counts
ConfTable(jlb,ilb)=length(i2);
else
ConfTable(jlb,ilb)=length(i2)/length(i1);
end
end
end
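A hypothetical call to fLDA.m, with LearnLabels and TestLabels coded 0/1
(variable names are for illustration only):

[ConfMatrix, decisions, prms] = fLDA(LearnSamples, LearnLabels, ...
    TestSamples, TestLabels);
% ConfMatrix holds raw counts; prms = {wproj, cthresh, mnsm}, so further
% samples can be classified via NewSamples*prms{1} > prms{2}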

Code for Logistic Regression: logisticr.m

function lrresult = logisticr(LRSamples, LRinp, fitit, wts)


% function lrresult = logisticr(LRSamples, LRinp, fitit, wts)
% the parameter fitit determines whether fitting or evaluation is done
% for fitit=1:
% input: LearnSamples as LRSamples and LearnLabels as LRinp
% returns the LR model (npredictors+1) parameters in lrresult
% for fitit=0 or fitit=2:
% input: TestingSamples (LRSamples), LR parameters (LRinp)
% returns probabilities (fitit=0) or
% log-odds (fitit=2) in lrresult
%
% Example (using a decision threshold "logisticthreshold"):
% lrprms=logisticr(LearnSamples, LearnLabels, 1);
% decisions=logisticr(TestSamples,lrprms) > logisticthreshold;
% ConfMatrix=GetConfTable(decisions,TestLabels);

if nargin<3, fitit=0; end

switch fitit
case {0,2} % return the LR values
[nsamps, mpred] = size(LRSamples);
inputmat=[LRSamples,ones(nsamps,1)];
liny = inputmat * LRinp;
if fitit==2
lrresult = liny;
else
lrresult = invlogit(liny);
end
case 1 % Obtain LR parameters (fitting)
% specify desired precision and maximal number of
% Newton-Raphson iterations; trade precision (small itereps,
% e.g. 1e-12) for speed (larger itereps, e.g. 1e-7)
itereps = 1e-9;
maxiter = 100;
[nsamps, mpred] = size(LRSamples);
inputmat=[LRSamples,ones(nsamps,1)];
mprms=mpred+1;

% assume all weights equal, if not specified


if nargin < 4, wts = ones(nsamps,1); end

% initialize iterations
lrresult=zeros(mprms,1);
lrnlabels=LRinp(:); % ensure labels form a column vector
prevexpy=-ones(size(lrnlabels));
for iter=1:maxiter
liny=inputmat*lrresult;
expy=invlogit(liny);
% LR weights based on derivative of invlogit (=p(1-p))
lrw=max(5*eps, expy.*(1-expy)); % avoid zero lrw for extreme probabilities
liny=liny+(lrnlabels-expy)./lrw; % update with W^(-1)(y-p)
% adjust prescribed weights "wts" with LR weights to
% obtain the final weights matrix
weights=spdiags(lrw.*wts, 0, nsamps, nsamps);

% the updated parameters can then be obtained as an equivalent
% weighted linear regression to the modified liny
lrresult=inv(inputmat'*weights*inputmat) ...
*inputmat'*weights* liny;
if sum(abs(expy-prevexpy)) < nsamps*itereps
break;
end
prevexpy=expy;
end
otherwise
disp(sprintf('The value fitit=%d not implemented yet', fitit))
end

function logodds=logit(p)
logodds=log(p./(1-p));

function p=invlogit(lodds)
p=1./(1+exp(-lodds));
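A hypothetical fit/predict sequence using logisticr.m, with a decision
threshold of 0.5 chosen purely for illustration (note that GetConfTable is a
subfunction of fLDA.m and would have to be saved in its own file before it
could be called here):

lrprms = logisticr(LearnSamples, LearnLabels, 1);  % fit: fitit=1
probs = logisticr(TestSamples, lrprms, 0);         % predicted probabilities
decisions = probs > 0.5;                           % illustrative threshold
ConfMatrix = GetConfTable(decisions, TestLabels, 1);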

A2: Data used in this book


We principally used three datasets: (1) the German stroke dataset, (2) the
lupus dataset, and (3) the simulated cholesterol data. These are all discussed
in Chapter 4. Unfortunately, neither the German stroke data nor the lupus
data are available for public use. As an alternative we suggest the thoroughly
edited and maintained data collection at the University of California (Irvine)
machine learning website: http://archive.ics.uci.edu/ml/.
As of Spring 2010, there are nearly 200 datasets available, providing source
material for the study of nearly every aspect of machine learning and drawn
from an astonishing range of subjects.
One caution, however, as discussed in Section 2.10: the benchmarking
problem raised by using data to declare single winners. This is indeed a
problem since, for any given dataset on which a given scheme does best
among a set of schemes, there is a counter dataset and an alternative
scheme such that the alternative scheme can be selected to be Bayes
consistent, and do better than the original “winning” scheme at every sample
size.
Moreover, as discussed in Chapter 12, for any finite collection of machines
there is an ensemble machine that does at least as well as the best in that
collection. That is to say, declaring a single winner in a machine arms
race is a misdirected use of computing resources and brain power.
