Applied Statistics Assignment 2
Group 1
Himalini Aklecha (1032670), Suraj Iyer (0866094), Pradyumna Majumder (1036151)
Technical University of Eindhoven

Problem Statement

There has been a release of toxic substances at a factory. The leak was measured with 2 sensors across 7 different sites for up to 4 hours after the accident, and once an hour before the accident.
Our main goal is to answer the following questions:
1. Are the measured PPM values at the different sites worse than the normal level?
2. If so, which sites are really affected?
The normal PPM levels are given by the “Before” site. To determine whether the PPM values from the other sites are worse than normal, we compared each hourly measurement made after the accident with the measurement from the hour before the accident. If most hours of a given site appear worse, we say the site is generally affected.

Overall Structure

The general order in which we proceeded to analyse the data is as follows.

Basic Normality test

We first generated normal probability plots (a graphical test) for each sample and checked whether the data points align with the estimated normal regression line.

Outlier detection

Tukey’s test:
This test looks for outliers by finding data points that are “far out” above the 75% quartile or below the 25% quartile. In the above example, Outlier = 1 means the observation is an outlier.

Hampel’s test:
Hampel’s test detects outliers based on the deviation of each observation from the median. We performed Hampel’s test per sample. In the above example the Z value is greater than 3.5, hence the observation is an outlier. We use the results of Hampel’s test for the rest of our analysis because it detects outliers better, and finds more of them, than Tukey’s test.

Advanced Normality tests

Most of the samples appeared to have outliers, so we performed the rest of the tests after removing the outliers. The above example shows the set of thorough normality tests we performed on each sample. If the p-value of a test is > α=0.05, that test does not reject normality and the sample could be normal. This gives rise to the issue that some tests may accept normality while others reject it. To resolve it, we prioritise the result of each test by its strength and draw the conclusion in that order: Shapiro-Wilk > Anderson-Darling > Cramer-von Mises > Kolmogorov-Smirnov. In the above example, all the tests accept normality, hence we treat the sample as normal. The result of these tests lets us decide whether to use parametric or non-parametric (distribution-free) tests for comparing each sample against the “Before” site.

Advanced Normality tests for LOG PPM

Normality tests on the PPM values revealed that the “Before” site itself was not normal. Consequently, to test for location and variance, we had to use non-parametric tests (K-S, Wilcoxon and Levene). However, where a sample meets all the requirements, a two-sample t-test is much more sensitive and thus more effective. So we took the log of the PPM values and retested for normality. We found that the “Before” site (above example) as well as a few other sites were log-normal, which was sufficient to compare them with a t-test.

K-S test

The K-S test tests the null hypothesis that the underlying distributions of two given samples are the same. The above example compares Site=Before Hour=-1 with Site=2 Hour=4. The p-value=0.4721 of the two-sided test is > α=0.05, so we cannot reject that the two samples follow the same distribution. One interesting thing to note here is that, for the one-sided statistics, D+ > D-, which suggests that the PPM values of Site=2 at Hour=4 are lower than before the accident.

Levene’s test

Levene’s test is a test for equality of variances that does not require normality. The null hypothesis is that the variances of the two samples are equal. The above figure is a table of the Site-Hour samples which have the same variance as the “Before” site, because their p-values are clearly > α=0.05. The rest of the samples have a different spread.

Wilcoxon rank sum test

The K-S test is sensitive to any difference in the underlying distributions of the two samples. A substantial difference in shape, spread or median results in a small p-value. Therefore we applied the Wilcoxon rank sum test to detect the actual location shift. It tests equality of the medians of two given samples. We found that only two samples (above table) have the same median as the “Before” site. One important thing to note is that this test makes the strong assumption that the two input samples have equal variance. For the samples in the table above this assumption holds, as the Levene’s test results show that they have the same variance as the “Before” site.

t-test

The t-test was performed on those samples which we found to have a log-normal distribution, which includes the “Before” site. The t-test tests equality of the means of the two given samples and assumes equality of variance, which it automatically checks with an F-test. All 6 samples we tested have unequal variance with the “Before” site. As such, the results of the pooled t-test cannot be trusted, and so we fall back to the results of the non-parametric tests for these samples. In the above example for Site=5 Hour=1, the p-value (Pr > F) rejects variance homogeneity. Furthermore, in the case of unequal variances, the p-value (Pr > |t|) rejects equality of means. Both can also be clearly seen from the shape (spread) of the curves and the difference in the position of their centers.

Are the measured PPM values at the different sites worse than normal level?
Yes, for all sites except Site 2. In the Wilcoxon test, wherever the two-sided null hypothesis is rejected, the one-sided null hypothesis is also rejected, implying that the median of the PPM values has increased. Also, in cases where the Wilcoxon test does not work, the EDF plot from the K-S test showed an increase in PPM values relative to the “Before” site, except for Site 2.

If so, which sites are really affected?
Based on the various test results, all sites except Site 2 seem to be seriously affected by the accident.
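
As an illustration of the two rules described under Outlier detection, the sketch below flags outliers in a single sample. It is not the code used for the poster. Several details are assumptions: the fence multiplier k=1.5 for Tukey's rule, the "inclusive" quartile method, and the scaling of the Hampel Z value as |x - median| / (1.4826 * MAD). Only the cutoff of 3.5 comes from the text. The sample values are made up and are not taken from the leak data.

```python
from statistics import median, quantiles


def tukey_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (k is an assumed default)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def hampel_outliers(values, cutoff=3.5):
    """Flag points whose scaled distance from the median exceeds the cutoff."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: more than half the values are identical,
        # so the scaled distance is undefined; flag nothing.
        return [False] * len(values)
    scale = 1.4826 * mad  # assumed scaling; makes MAD comparable to a standard deviation
    return [abs(v - med) / scale > cutoff for v in values]


# Made-up PPM readings for one hypothetical Site-Hour sample.
sample = [10, 11, 12, 13, 14, 15, 50]
print(tukey_outliers(sample))
print(hampel_outliers(sample))
```

Both rules use the median and quartiles instead of the mean and standard deviation, so a single extreme reading does not mask itself by inflating the scale estimate.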
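
To make the D+ and D- comparison from the K-S test section concrete, the sketch below computes the one-sided and two-sided K-S statistics from the empirical distribution functions (EDFs) of two samples. The sign convention is an assumption chosen for this example: D+ is the largest amount by which the EDF of the first sample lies above that of the second, and statistical packages may define it the other way round. No p-values are computed, and the input values are made up.

```python
from bisect import bisect_right


def edf(sample):
    """Return the empirical distribution function of a sample."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_statistics(a, b):
    """Return (D+, D-, D) for samples a and b.

    Assumed convention: D+ = max(F_a - F_b), D- = max(F_b - F_a),
    and D = max(D+, D-) is the two-sided statistic.
    """
    f_a, f_b = edf(a), edf(b)
    # The EDFs only change at observed values, so checking those is enough.
    points = sorted(set(a) | set(b))
    diffs = [f_a(x) - f_b(x) for x in points]
    d_plus = max(0.0, max(diffs))
    d_minus = max(0.0, -min(diffs))
    return d_plus, d_minus, max(d_plus, d_minus)


# Made-up example: every value in `low` is below every value in `high`.
low = [1, 2, 3]
high = [4, 5, 6]
print(ks_statistics(low, high))
```

Under this convention, a D+ that is larger than D- means the EDF of the first sample rises earlier, that is, the first sample tends to take lower values than the second.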