
Module 7

Machine Learning in Practice


References:
1. Ethem Alpaydin, "Introduction to Machine Learning", MIT Press / Prentice Hall of India.
Introduction
• Questions:
• Assessment of the expected error of a learning algorithm: Is the error rate of 1-NN less than 2%?
• Comparing the expected errors of two algorithms: Is k-NN more accurate than MLP?
• Training/validation/test sets
• Resampling methods: K-fold cross-validation
Algorithm Preference
• Criteria (Application-dependent):
• Misclassification error, or risk (loss functions)
• Training time/space complexity
• Testing time/space complexity
• Interpretability
• Easy programmability
• Cost-sensitive learning
Factors and Response
Strategies of Experimentation
Response surface design for approximating and maximizing the response function in terms of the controllable factors
Guidelines for ML experiments
A. Aim of the study
B. Selection of the response variable
C. Choice of factors and levels
D. Choice of experimental design
E. Performing the experiment
F. Statistical Analysis of the Data
G. Conclusions and Recommendations
Resampling and
K-Fold Cross-Validation
• The need for multiple training/validation sets
  $\{X_i, V_i\}_i$: training/validation sets of fold $i$
• K-fold cross-validation: Divide $X$ into $K$ parts $X_i$, $i = 1, \ldots, K$
  $V_1 = X_1 \qquad T_1 = X_2 \cup X_3 \cup \cdots \cup X_K$
  $V_2 = X_2 \qquad T_2 = X_1 \cup X_3 \cup \cdots \cup X_K$
  $\;\;\vdots$
  $V_K = X_K \qquad T_K = X_1 \cup X_2 \cup \cdots \cup X_{K-1}$
• Any two training sets $T_i$ share $K - 2$ parts
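For illustration, a minimal sketch of generating the $(T_i, V_i)$ index pairs with scikit-learn's KFold (assuming scikit-learn is available; the toy dataset is made up):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)          # toy dataset with 10 instances (made up)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # T_i is the union of the other K-1 parts, V_i is part i
    print(f"fold {i}: |T| = {len(train_idx)}, |V| = {len(val_idx)}")
```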


5×2 Cross-Validation
• 5 times 2 fold cross-validation (Dietterich, 1998)
  $T_1 = X_1^{(1)} \qquad V_1 = X_1^{(2)}$
  $T_2 = X_1^{(2)} \qquad V_2 = X_1^{(1)}$
  $T_3 = X_2^{(1)} \qquad V_3 = X_2^{(2)}$
  $T_4 = X_2^{(2)} \qquad V_4 = X_2^{(1)}$
  $\;\;\vdots$
  $T_9 = X_5^{(1)} \qquad V_9 = X_5^{(2)}$
  $T_{10} = X_5^{(2)} \qquad V_{10} = X_5^{(1)}$
  where $X_i^{(1)}, X_i^{(2)}$ are the two halves of replication $i$
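A minimal sketch of building the ten training/validation pairs of 5×2 cv, again with scikit-learn's KFold (toy data, made-up shapes):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40).reshape(20, 2)          # toy dataset with 20 instances (made up)

pairs = []                                # the 10 training/validation index pairs
for i in range(5):                        # replications i = 1, ..., 5
    kf = KFold(n_splits=2, shuffle=True, random_state=i)
    half1, half2 = (test for _, test in kf.split(X))   # X_i^(1), X_i^(2)
    pairs.append((half1, half2))          # T = X_i^(1), V = X_i^(2)
    pairs.append((half2, half1))          # T = X_i^(2), V = X_i^(1)
print(len(pairs))                         # 10
```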
Bootstrapping
• Draw instances from a dataset with replacement
• Probability that we do not pick a given instance in N draws:
  $\left(1 - \dfrac{1}{N}\right)^{N} \approx e^{-1} \approx 0.368$
  that is, about 36.8% of the instances are never drawn and remain new (unseen); the bootstrap sample contains only about 63.2% of the original instances
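A quick numeric check of the 36.8% figure (assuming NumPy; the value of N is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
sample = rng.integers(0, N, size=N)        # draw N instance indices with replacement
out_of_bag = N - len(np.unique(sample))    # instances never drawn
print(out_of_bag / N)                      # ~0.368, matching (1 - 1/N)^N ≈ e^-1
```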


Measuring Error

• Error rate = # of errors / # of instances = (FN+FP) / N


• Recall = # of found positives / # of positives
= TP / (TP+FN) = sensitivity = hit rate
• Precision = # of found positives / # of instances declared positive
= TP / (TP+FP)
• Specificity = TN / (TN+FP)
• False alarm rate = FP / (FP+TN) = 1 - Specificity
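For illustration, the same quantities computed from a made-up set of confusion-matrix counts:

```python
# Minimal sketch: TP, FP, TN, FN below are made-up numbers.
TP, FP, TN, FN = 40, 10, 45, 5
N = TP + FP + TN + FN

error_rate  = (FN + FP) / N
recall      = TP / (TP + FN)          # sensitivity, hit rate
precision   = TP / (TP + FP)
specificity = TN / (TN + FP)
false_alarm = FP / (FP + TN)          # = 1 - specificity
print(error_rate, recall, precision, specificity, false_alarm)
```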
ROC Curve
Precision and Recall
Interval Estimation
• $X = \{x^t\}_t$ where $x^t \sim \mathcal{N}(\mu, \sigma^2)$
• $m \sim \mathcal{N}(\mu, \sigma^2/N)$
  $\dfrac{\sqrt{N}\,(m - \mu)}{\sigma} \sim Z$
  $P\left( -1.96 < \dfrac{\sqrt{N}\,(m - \mu)}{\sigma} < 1.96 \right) = 0.95$
  $P\left( m - 1.96\,\dfrac{\sigma}{\sqrt{N}} < \mu < m + 1.96\,\dfrac{\sigma}{\sqrt{N}} \right) = 0.95$
  $P\left( m - z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{N}} < \mu < m + z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{N}} \right) = 1 - \alpha$
  → 100(1 − α) percent confidence interval
• One-sided interval:
  $P\left( \dfrac{\sqrt{N}\,(m - \mu)}{\sigma} < 1.64 \right) = 0.95$
  $P\left( m - 1.64\,\dfrac{\sigma}{\sqrt{N}} < \mu \right) = 0.95$
  $P\left( m - z_{\alpha}\,\dfrac{\sigma}{\sqrt{N}} < \mu \right) = 1 - \alpha$
• When σ² is not known:
  $S^2 = \sum_t (x^t - m)^2 / (N - 1) \qquad \dfrac{\sqrt{N}\,(m - \mu)}{S} \sim t_{N-1}$
  $P\left( m - t_{\alpha/2, N-1}\,\dfrac{S}{\sqrt{N}} < \mu < m + t_{\alpha/2, N-1}\,\dfrac{S}{\sqrt{N}} \right) = 1 - \alpha$
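A minimal sketch of both intervals with SciPy (the sample x and the "known" σ are made up for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([0.12, 0.15, 0.09, 0.11, 0.14, 0.10, 0.13, 0.12])  # e.g. fold error rates (made up)
N, m = len(x), x.mean()
alpha = 0.05

# sigma known (an assumed sigma = 0.02): z interval
sigma = 0.02
z = stats.norm.ppf(1 - alpha / 2)                  # ~1.96
print(m - z * sigma / np.sqrt(N), m + z * sigma / np.sqrt(N))

# sigma unknown: t interval with S and N-1 degrees of freedom
S = x.std(ddof=1)
t = stats.t.ppf(1 - alpha / 2, df=N - 1)
print(m - t * S / np.sqrt(N), m + t * S / np.sqrt(N))
```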
Hypothesis Testing
• Reject a null hypothesis if not supported by the sample with
enough confidence
• X = { xt }t where xt ~ N ( μ, σ2)
H0: μ = μ0 vs. H1: μ ≠ μ0
Accept H0 with level of significance α if μ0 is in the
100(1- α) confidence interval
N m   0 
  z / 2 , z / 2 

Two-sided test
• One-sided test: H0: μ ≤ μ0 vs. H1: μ > μ0
N m   0 
Accept if    , z 

• Variance unknown: Use t, instead of z


N m   0 
Accept H0: μ = μ0 if   t / 2 ,N 1 , t / 2 ,N 1 
S
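A minimal sketch of the variance-unknown, two-sided case using SciPy's one-sample t test (the sample and μ0 are made up):

```python
import numpy as np
from scipy import stats

x = np.array([0.12, 0.15, 0.09, 0.11, 0.14, 0.10, 0.13, 0.12])
mu0, alpha = 0.10, 0.05

t_stat, p_value = stats.ttest_1samp(x, popmean=mu0)   # two-sided by default
print("accept H0" if p_value > alpha else "reject H0", t_stat, p_value)
```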
Assessing Error: H0: p ≤ p0 vs. H1: p > p0
• Single training/validation set: Binomial Test
  If the error probability is $p_0$, the probability that there are $e$ errors or fewer in $N$ validation trials is
  $P\{X \le e\} = \sum_{j=1}^{e} \binom{N}{j} p_0^{\,j} (1 - p_0)^{N-j}$
  Accept $H_0$ if this probability is less than $1 - \alpha$

[Figure: binomial distribution of the number of errors for N = 100, e = 20, with the 1 − α region shaded]
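A minimal sketch for the slide's N = 100, e = 20 example; p0 is not given on the slide, so an assumed p0 = 0.25 is used here:

```python
from scipy import stats

N, e, p0, alpha = 100, 20, 0.25, 0.05      # p0 is an assumed value
prob = stats.binom.cdf(e, N, p0)           # P{X <= e} under H0
print("accept H0: p <= p0" if prob < 1 - alpha else "reject H0", prob)
```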
Normal Approximation to the Binomial

• The number of errors X is approximately normal with mean $Np_0$ and variance $Np_0(1 - p_0)$:
  $\dfrac{X - Np_0}{\sqrt{Np_0(1 - p_0)}} \sim Z$
  Accept $H_0$ if this statistic, evaluated at $X = e$, is less than $z_{1-\alpha}$
[Figure: standard normal density with the 1 − α acceptance region shaded]
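The corresponding normal-approximation check, with the same assumed p0 = 0.25 as above:

```python
import numpy as np
from scipy import stats

N, e, p0, alpha = 100, 20, 0.25, 0.05      # p0 is an assumed value
z_stat = (e - N * p0) / np.sqrt(N * p0 * (1 - p0))
print("accept H0" if z_stat < stats.norm.ppf(1 - alpha) else "reject H0", z_stat)
```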
t Test
• Multiple training/validation sets
• $x_i^t = 1$ if instance $t$ is misclassified on fold $i$, 0 otherwise
• Error rate of fold $i$:
  $p_i = \dfrac{\sum_{t=1}^{N} x_i^t}{N}$
• With $m$ and $S^2$ the average and variance of the $p_i$, we accept that the error is $p_0$ or less if
  $\dfrac{\sqrt{K}\,(m - p_0)}{S} \sim t_{K-1}$
  is less than $t_{\alpha, K-1}$
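A minimal sketch with made-up fold error rates p_i (K = 10 folds):

```python
import numpy as np
from scipy import stats

p = np.array([0.11, 0.14, 0.09, 0.12, 0.10, 0.13, 0.08, 0.12, 0.11, 0.10])  # made up
p0, alpha = 0.12, 0.05
K, m, S = len(p), p.mean(), p.std(ddof=1)

t_stat = np.sqrt(K) * (m - p0) / S
crit = stats.t.ppf(1 - alpha, df=K - 1)
print("accept H0: error <= p0" if t_stat < crit else "reject H0", t_stat)
```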
Comparing Classifiers:
H0: μ0 = μ1 vs. H1: μ0 ≠ μ1
• Single training/validation set: McNemar’s Test

• Under $H_0$, we expect $e_{01} = e_{10} = (e_{01} + e_{10})/2$, where $e_{01}$ is the number of instances misclassified by classifier 1 but not by 2, and $e_{10}$ the converse
  $\dfrac{\left(\left|e_{01} - e_{10}\right| - 1\right)^2}{e_{01} + e_{10}} \sim \chi^2_1$
  Accept $H_0$ if this statistic is less than $\chi^2_{\alpha, 1}$
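A minimal sketch with made-up disagreement counts e01 and e10:

```python
from scipy import stats

e01, e10 = 12, 25        # made-up counts: misclassified by 1 only / by 2 only
alpha = 0.05

chi2_stat = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
crit = stats.chi2.ppf(1 - alpha, df=1)
print("accept H0 (same error)" if chi2_stat < crit else "reject H0", chi2_stat)
```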
K-Fold CV Paired t Test
• Use K-fold cv to get K training/validation folds
• $p_i^1$, $p_i^2$: errors of classifiers 1 and 2 on fold $i$
• $p_i = p_i^1 - p_i^2$: paired difference on fold $i$
• The null hypothesis is that $p_i$ has mean 0:
  $H_0: \mu = 0$ vs. $H_1: \mu \ne 0$
  $m = \dfrac{\sum_{i=1}^{K} p_i}{K} \qquad s^2 = \dfrac{\sum_{i=1}^{K} (p_i - m)^2}{K - 1}$
  $\dfrac{\sqrt{K}\,(m - 0)}{s} = \dfrac{\sqrt{K}\, m}{s} \sim t_{K-1}$
  Accept $H_0$ if this is in $\left(-t_{\alpha/2, K-1},\, t_{\alpha/2, K-1}\right)$
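A minimal sketch using SciPy's paired t test, whose statistic equals √K·m/s; the per-fold error rates of the two classifiers are made up:

```python
import numpy as np
from scipy import stats

p1 = np.array([0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.09, 0.15, 0.11, 0.12])  # classifier 1
p2 = np.array([0.11, 0.12, 0.13, 0.10, 0.14, 0.11, 0.10, 0.13, 0.12, 0.11])  # classifier 2
alpha = 0.05

t_stat, p_value = stats.ttest_rel(p1, p2)   # paired, two-sided
print("accept H0 (same error)" if p_value > alpha else "reject H0", t_stat, p_value)
```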
5×2 cv Paired t Test
• Use 5×2 cv to get 5 replications of 2-fold training/validation splits (Dietterich, 1998)
• $p_i^{(j)}$: difference between the errors of classifiers 1 and 2 on fold $j = 1, 2$ of replication $i = 1, \ldots, 5$
  $\bar{p}_i = \dfrac{p_i^{(1)} + p_i^{(2)}}{2} \qquad s_i^2 = \left(p_i^{(1)} - \bar{p}_i\right)^2 + \left(p_i^{(2)} - \bar{p}_i\right)^2$
  $t = \dfrac{p_1^{(1)}}{\sqrt{\sum_{i=1}^{5} s_i^2 / 5}} \sim t_5$
Two-sided test: Accept H0: μ0 = μ1 if $t \in \left(-t_{\alpha/2, 5},\, t_{\alpha/2, 5}\right)$
One-sided test: Accept H0: μ0 ≤ μ1 if $t < t_{\alpha, 5}$
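A minimal sketch with a made-up 5×2 matrix of error differences p_i^(j):

```python
import numpy as np
from scipy import stats

p = np.array([[ 0.02, -0.01],      # replication 1: folds 1 and 2 (made up)
              [ 0.01,  0.03],
              [-0.02,  0.01],
              [ 0.00,  0.02],
              [ 0.01, -0.01]])
alpha = 0.05

p_bar = p.mean(axis=1)                                # mean difference per replication
s2 = (p[:, 0] - p_bar) ** 2 + (p[:, 1] - p_bar) ** 2  # s_i^2
t_stat = p[0, 0] / np.sqrt(s2.mean())                 # p_1^(1) / sqrt(sum s_i^2 / 5)
crit = stats.t.ppf(1 - alpha / 2, df=5)
print("accept H0" if abs(t_stat) < crit else "reject H0", t_stat)
```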
5×2 cv Paired F Test

  $f = \dfrac{\sum_{i=1}^{5} \sum_{j=1}^{2} \left(p_i^{(j)}\right)^2}{2 \sum_{i=1}^{5} s_i^2} \sim F_{10,5}$
Two-sided test: Accept H0: μ0 = μ1 if $f < F_{\alpha, 10, 5}$
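The corresponding F statistic, computed on the same kind of made-up 5×2 difference matrix:

```python
import numpy as np
from scipy import stats

p = np.array([[0.02, -0.01], [0.01, 0.03], [-0.02, 0.01], [0.00, 0.02], [0.01, -0.01]])
alpha = 0.05
p_bar = p.mean(axis=1)
s2 = (p[:, 0] - p_bar) ** 2 + (p[:, 1] - p_bar) ** 2

f_stat = (p ** 2).sum() / (2 * s2.sum())              # ~ F(10, 5) under H0
crit = stats.f.ppf(1 - alpha, dfn=10, dfd=5)
print("accept H0" if f_stat < crit else "reject H0", f_stat)
```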


Comparing L>2 Algorithms: Analysis of Variance (ANOVA)
  $H_0: \mu_1 = \mu_2 = \cdots = \mu_L$
• Errors of L algorithms on K folds:
  $X_{ij} \sim \mathcal{N}(\mu_j, \sigma^2), \quad j = 1, \ldots, L, \quad i = 1, \ldots, K$

• We construct two estimators of σ².


One is valid if H0 is true, the other is always valid.
We reject H0 if the two estimators disagree.
If $H_0$ is true:
  $m_j = \sum_{i=1}^{K} \dfrac{X_{ij}}{K} \sim \mathcal{N}(\mu, \sigma^2/K)$
  $m = \dfrac{\sum_{j=1}^{L} m_j}{L} \qquad S^2 = \dfrac{\sum_{j=1}^{L} (m_j - m)^2}{L - 1}$
  Thus an estimator of $\sigma^2$ is $K \cdot S^2$, namely
  $\hat{\sigma}^2 = K \cdot \dfrac{\sum_{j=1}^{L} (m_j - m)^2}{L - 1}$
  $\sum_{j} \dfrac{(m_j - m)^2}{\sigma^2 / K} \sim \chi^2_{L-1} \qquad SS_b \equiv K \sum_{j} (m_j - m)^2$
  So when $H_0$ is true, we have
  $\dfrac{SS_b}{\sigma^2} \sim \chi^2_{L-1}$
Regardless of $H_0$, our second estimator of $\sigma^2$ is the average of the group variances $S_j^2$:
  $S_j^2 = \dfrac{\sum_{i=1}^{K} (X_{ij} - m_j)^2}{K - 1} \qquad \hat{\sigma}^2 = \sum_{j=1}^{L} \dfrac{S_j^2}{L} = \dfrac{\sum_j \sum_i (X_{ij} - m_j)^2}{L(K - 1)}$
  $SS_w \equiv \sum_j \sum_i (X_{ij} - m_j)^2$
  $\dfrac{(K - 1)\, S_j^2}{\sigma^2} \sim \chi^2_{K-1} \qquad \dfrac{SS_w}{\sigma^2} \sim \chi^2_{L(K-1)}$
  $\dfrac{\left(SS_b / \sigma^2\right) / (L - 1)}{\left(SS_w / \sigma^2\right) / \left(L(K - 1)\right)} = \dfrac{SS_b / (L - 1)}{SS_w / \left(L(K - 1)\right)} \sim F_{L-1,\, L(K-1)}$
  Accept $H_0: \mu_1 = \mu_2 = \cdots = \mu_L$ if this statistic is less than $F_{\alpha,\, L-1,\, L(K-1)}$
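A minimal sketch of the SSb/SSw computation on a made-up K×L matrix of fold errors (K = 10 folds, L = 3 algorithms), cross-checked against SciPy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(loc=[0.10, 0.11, 0.14], scale=0.02, size=(10, 3))   # X[i, j], made up
K, L = X.shape
alpha = 0.05

m_j = X.mean(axis=0)                    # per-algorithm means
m = m_j.mean()
SSb = K * ((m_j - m) ** 2).sum()        # between-group sum of squares
SSw = ((X - m_j) ** 2).sum()            # within-group sum of squares

F = (SSb / (L - 1)) / (SSw / (L * (K - 1)))
crit = stats.f.ppf(1 - alpha, L - 1, L * (K - 1))
print("accept H0" if F < crit else "reject H0", F)

# Cross-check with SciPy's one-way ANOVA (columns as groups)
print(stats.f_oneway(*(X[:, j] for j in range(L))))
```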
ANOVA table

If ANOVA rejects, we do pairwise posthoc tests


  $H_0: \mu_i = \mu_j$ vs. $H_1: \mu_i \ne \mu_j$
  $t = \dfrac{m_i - m_j}{\sqrt{2\,\hat{\sigma}_w^2 / K}} \sim t_{L(K-1)}$
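A minimal sketch of one pairwise post-hoc comparison (same kind of made-up error matrix; algorithms 0 and 1 compared):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(loc=[0.10, 0.11, 0.14], scale=0.02, size=(10, 3))   # made-up K x L errors
K, L = X.shape
alpha = 0.05

m_j = X.mean(axis=0)
sigma2_w = ((X - m_j) ** 2).sum() / (L * (K - 1))   # SSw / (L(K-1))

t_stat = (m_j[0] - m_j[1]) / np.sqrt(2 * sigma2_w / K)
crit = stats.t.ppf(1 - alpha / 2, df=L * (K - 1))
print("same" if abs(t_stat) < crit else "different", t_stat)
```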
Comparison over Multiple Datasets
• Comparing two algorithms:
Sign test: Count how many times A beats B over N datasets, and check whether this could have occurred by chance if A and B had the same error rate
• Comparing multiple algorithms:
Kruskal-Wallis test: Calculate the average rank of each algorithm over the N datasets, and check whether these could have occurred by chance if all algorithms had equal error
If KW rejects, we do pairwise posthoc tests to find which ones have a significant rank difference
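Minimal sketches of both tests with SciPy (per-dataset error rates and win counts are made up; binomtest requires SciPy 1.7+):

```python
from scipy import stats

# Sign test: A beats B on 9 of 12 datasets; under H0 the number of wins ~ Binomial(12, 0.5)
wins, n_datasets = 9, 12
print(stats.binomtest(wins, n_datasets, p=0.5).pvalue)

# Kruskal-Wallis over three algorithms' error rates on the same datasets
err_A = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10]
err_B = [0.11, 0.13, 0.10, 0.12, 0.14, 0.11]
err_C = [0.15, 0.16, 0.14, 0.15, 0.17, 0.16]
print(stats.kruskal(err_A, err_B, err_C))
```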
