
Statistical Significance Testing

The Purpose of Statistical Significance Testing
• The purpose of Statistical Significance Testing is to answer the following questions:
  • Can the observed results be attributed to real characteristics of the classifiers under scrutiny, or were they obtained by chance?
  • Were the data sets representative of the problems to which the classifier will be applied in the future?


Important: Unfortunately, because of the inductive nature of the problem, such questions cannot be fully answered. The user should instead accept that, no matter what evaluation procedures are followed, they only allow us to gather some evidence about the classifiers' behaviour. They are almost never conclusive.
Current Disagreements with Statistical Significance Testing
• There is currently a controversy in the statistical community, with some scholars calling for the rejection of Null Hypothesis Significance Testing (NHST) because:
  • It is commonly misinterpreted, causing over-confidence in meaningless results.
  • Its results can be manipulated to show statistical significance even when that significance is not practically meaningful.
• While remaining cautious about these issues, we believe that NHST is the best tool we have to answer the previous two questions, and we will continue to use it.
Overview of Statistical Tests
• We consider two categories of approaches:
  • Parametric approaches, which make strong assumptions about the distribution of the population, and
  • Non-parametric approaches, whose assumptions are not as strong.
• Parametric approaches are typically more powerful than non-parametric ones when their assumptions hold.
• Note 1: The quality of a statistical test is measured using two quantities: the Type I error of the test, which denotes the probability of incorrectly detecting a difference when no such difference between the two classifiers exists; and the power of the test, which denotes its ability to detect differences when they do exist (a small simulation of the Type I error follows below).
• Note 2: We assume that, in all cases, the algorithms are tested on the same domains (matched samples).
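To make Note 1 concrete, here is a small, purely illustrative R simulation (all numbers are assumed): when two matched classifiers truly perform identically, a test calibrated at alpha = 0.05 should wrongly report a difference in about 5% of repeated experiments; that rejection rate is its Type I error.

  # Estimate the Type I error of the paired t-test at alpha = 0.05
  # when no real difference between the classifiers exists.
  set.seed(1)
  rejections <- replicate(10000, {
    d <- rnorm(30)              # matched performance differences, true mean 0
    t.test(d)$p.value < 0.05    # one-sample t-test on differences = paired t-test
  })
  mean(rejections)              # should be close to 0.05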
Overview of the Different Tests I
• Comparison of two algorithms on a single domain I:
  • Parametric: the t-test, along with a measure of the effect size (Cohen's d statistic).
  • Explanation: The t-test determines whether the observed difference in the performance measures of the classifiers is statistically significant. However, it cannot confirm whether this difference, although statistically significant, is also of any practical importance. That is, it detects the presence of an effect but does not measure the size of that effect. The size can be assessed using one of the available effect-size statistics, such as Cohen's d (see the sketch after this list).
  • When is it appropriate to use the t-test?
    • If the samples come from a normal or pseudo-normal distribution (i.e., the test set has at least 30 examples in it => 300 examples are necessary if we are running 10-fold cross-validation experiments).
    • If the samples were selected at random [Can we really know?]
    • If the two populations have equal variances.
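A minimal R sketch of this procedure (the per-fold accuracies below are hypothetical, standing in for matched results from a 10-fold cross-validation):

  # Hypothetical matched per-fold accuracies of two classifiers.
  acc_a <- c(0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81)
  acc_b <- c(0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.79, 0.81, 0.78, 0.77)

  # Paired t-test: is the mean difference in accuracy significantly non-zero?
  t.test(acc_a, acc_b, paired = TRUE)

  # Cohen's d for paired samples: mean difference over the
  # standard deviation of the differences.
  d <- acc_a - acc_b
  mean(d) / sd(d)

By Cohen's rough convention, |d| around 0.2 is a small effect, 0.5 a medium one, and 0.8 a large one.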
Overview of the Different Tests II
• Comparison of two algorithms on a single domain II:
  • Non-parametric: McNemar's test.
  • t-test versus McNemar: McNemar's test does not make the kind of assumptions made by the t-test, and it compares well to it in terms of Type I error and power. However, McNemar's test can be applied only when the number of disagreements between the two classifiers is large (generally >= 20). If not, the Sign test can be applied instead (see next slide). A sketch follows below.
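A minimal R sketch, assuming the two classifiers have been evaluated on the same test set and their agreements and disagreements counted into a 2x2 table (the counts are hypothetical):

  # Rows: classifier A correct/wrong; columns: classifier B correct/wrong.
  tab <- matrix(c(60, 12,
                  25,  3),
                nrow = 2, byrow = TRUE,
                dimnames = list(A = c("correct", "wrong"),
                                B = c("correct", "wrong")))

  # McNemar's test uses only the off-diagonal (disagreement) counts;
  # here 12 + 25 = 37 >= 20, so the test is applicable.
  mcnemar.test(tab)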
Overview of the Different Tests III
• Comparison of two algorithms on multiple domains:
  • Non-parametric:
    • The Sign test
    • Wilcoxon's signed-rank test
  • The Sign test versus Wilcoxon's signed-rank test: Wilcoxon's signed-rank test is more powerful than the Sign test and is generally preferred. However, the Sign test is very simple (a sketch of both follows below).
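A minimal R sketch of both tests, assuming err_a and err_b hold the matched error rates of the two algorithms across 12 domains (hypothetical numbers):

  # Hypothetical matched error rates of algorithms A and B on 12 domains.
  err_a <- c(0.21, 0.15, 0.34, 0.08, 0.27, 0.19, 0.30, 0.12, 0.25, 0.18, 0.22, 0.16)
  err_b <- c(0.25, 0.14, 0.38, 0.11, 0.29, 0.24, 0.31, 0.15, 0.24, 0.21, 0.26, 0.20)

  # Sign test: a binomial test on the number of domains where A beats B
  # (domains with ties are dropped).
  wins_a <- sum(err_a < err_b)
  n      <- sum(err_a != err_b)
  binom.test(wins_a, n, p = 0.5)

  # Wilcoxon's signed-rank test also uses the magnitudes of the differences,
  # which is what makes it more powerful than the Sign test.
  wilcox.test(err_a, err_b, paired = TRUE)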

Overview of the Different Tests IV
• Comparison of multiple algorithms on multiple domains:
  • Parametric: one-way repeated-measures ANOVA, followed by an appropriate post-hoc test (e.g., Tukey, Dunnett, Bonferroni, or Bonferroni-Dunn).
  • Non-parametric: Friedman's test, followed by an appropriate post-hoc test (e.g., Nemenyi, Hommel, Holm, or Hochberg).
  • ANOVA versus Friedman: more often than not, the assumptions required by ANOVA cannot be ascertained. Therefore, the Friedman test is usually preferred (a sketch follows below).
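A minimal R sketch of the Friedman test, assuming a matrix of hypothetical accuracies with one row per domain and one column per algorithm; the post-hoc step shown here uses pairwise signed-rank tests with Holm's correction (one of the options listed above):

  # Hypothetical accuracies: 6 domains (rows) x 3 algorithms (columns).
  acc <- matrix(c(0.81, 0.78, 0.75,
                  0.92, 0.90, 0.88,
                  0.70, 0.73, 0.68,
                  0.85, 0.82, 0.80,
                  0.77, 0.74, 0.72,
                  0.88, 0.86, 0.83),
                nrow = 6, byrow = TRUE,
                dimnames = list(NULL, c("algoA", "algoB", "algoC")))

  # Friedman's test ranks the algorithms within each domain and tests
  # whether the mean ranks differ significantly.
  friedman.test(acc)

  # Post-hoc (only if the null is rejected): pairwise signed-rank tests
  # with Holm's correction for multiple comparisons.
  pairwise.wilcox.test(as.vector(acc), rep(colnames(acc), each = 6),
                       paired = TRUE, p.adjust.method = "holm")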
Practical Concerns
• Section 6.8 of my book (with M. Shah) shows how to use the freely downloadable R Statistical Software to compute most of the tests discussed on the previous slides.
