005 Chemometrics
• Data distributions
• Descriptive statistics
• Hypothesis testing, statistical tests
• Erroneous and missing data
Geven Piir
LOKT.08.005 Chemometrics 2
Rows
• Case, object, sample, observation, compound
Columns
• Variable, feature, measurement, descriptor,
parameter
– Not all columns are variables
Columns
• Identifier (not a variable)
• Dependent variable
• Independent variables
Variable types
• Numerical
– Continuous (1.2, 2.68, 0.15, ...)
• Obtained from measurements
• Can take any value within a range (infinitely many possible values)
– Discrete (1, 2, 3, ...)
• Obtained in counting
• It can only take on certain values
• Categorical
– Nominal (alcohols, esters, acids, ...)
• functional group, colour, apparatus
– Ordinal, including binary (grades A, B, C, ...)
• The categories have a meaningful order: one is better than another
Shape of the distribution
• Where are the data points located, and how
far do they spread?
– What are typical, as well as minimal and maximal,
values?
• How are the points distributed?
– Are they spread out evenly or do they cluster in
certain areas?
• How many points are there?
– Is this a large data set or a relatively small one?
Shape of the distribution
• Is the distribution symmetric or asymmetric?
– In other words, is the tail of the distribution much
larger on one side than on the other?
• Are the tails of the distribution relatively heavy?
– Do many data points lie far away from the central group of points?
• Are most of the points confined to a restricted
region?
Shape of the distribution
• If there are clusters, how many are there?
– Is there only one, or are there several?
– Approximately where are the clusters located, and
how large are they?
• Does the data set contain any significant
outliers?
– that is, data points that seem to be different from
all the others?
Shape of the distribution
• And lastly, are there any other unusual or
significant features in the data set
– gaps
– sharp cutoffs
– unusual values
– anything at all that we can observe
DATA DISTRIBUTIONS
Normal distribution
• Normal distribution – bell-shaped
– most “basic” continuous probability distribution
$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$$
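As a rough illustration, the density formula above can be evaluated directly. The function name `normal_pdf` is chosen here for illustration only; the standard-library class `statistics.NormalDist` serves as a cross-check.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std. dev. sigma at x."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))


# Height of the bell curve at its centre for mean = 0, std. dev. = 1
peak = normal_pdf(0.0)

# Cross-check against the standard library implementation
check = NormalDist(mu=0.0, sigma=1.0).pdf(0.0)

print(round(peak, 4), round(check, 4))
```

The peak of the standard normal curve is close to 0.4, and the curve is symmetric around the mean.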
Normal distribution
[Figure: density of the normal distribution with mean = 0 and std. dev. = 1; x-axis from -3 to 3, y-axis: density (0.0 to 0.4); legend entries: 10, 100, 1000, 10000, 100000]
Normal distribution
• For a normal distribution with mean µ and
standard deviation σ:
– ~68% of the population values lie within ±1σ of
the mean
– ~ 95% of the population values lie within ±2σ of
the mean
– ~ 99.7% of the population values lie within ±3σ
of the mean
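The three percentages can be checked numerically from the cumulative distribution function of the standard normal distribution. This is a minimal sketch using Python's standard library; the helper name `fraction_within` is an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def fraction_within(k):
    """Fraction of a normal population lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


coverage = {k: fraction_within(k) for k in (1, 2, 3)}

for k, fraction in coverage.items():
    print(f"within ±{k}σ: {100 * fraction:.2f} %")
```

Because the fractions depend only on the distance from the mean measured in units of σ, the same values hold for any mean and standard deviation.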
Log-normal Distribution
• A continuous distribution in which the logarithm of a variable has a normal distribution
– Y = log(X)
• Example: antibody concentration in human blood sera
[Figure: distribution of a) concentration, b) log(concentration)]
Other distributions
• Student’s T Distribution
– T-test
• comparison of two means
• Chi-Square Distribution
– Pearson's chi-squared test
• test of the independence of two nominal variables
• F-distribution
– F-test
• comparison of two variances
LOOKING AT DATA
Descriptive statistics
Example data
60.39 59.68 59.37 58.75 60.00 58.93 58.63 58.07 60.44 59.40
60.41 58.40 61.60 59.08 60.11 59.65 58.93 60.73 59.84 58.49
59.45 58.90 60.24 59.25 59.45 61.08 57.65 60.09 60.30 61.40
60.20 59.16 59.74 61.41 59.91 59.89 60.14 58.34 62.08 60.31
62.37 58.32 61.87 60.65 59.17 59.24 59.58 60.07 60.81 58.41
59.27 59.48 61.30 59.04 59.01 59.14 58.78 58.91 63.10 60.53
56.86 62.14 57.02 60.49 55.41 60.59 56.77 61.08 59.46 57.00
60.83 63.46 64.14 58.60 40.01 58.25 57.85 64.56 53.20 66.00
63.39 52.89 57.94 64.97 65.56 61.09 64.04 62.50 61.79 56.82
66.70 62.71 54.06 58.36 72.87 60.68 51.90 52.09 57.84 56.10
61.87 64.06 64.77 57.49 65.52 51.08 62.19 54.99 58.31 56.05
57.48 58.70 65.01 47.54 56.34 61.53 57.68 60.53 55.14 54.14
One-dimensional scatter plot
Histogram
Histogram gaps
Histogram gaps location
Probability density function
Empirical cumulative distribution function
The y-axis of the plot shows the number or percentage of data values
smaller than or equal to the x-axis value.
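The definition can be written out in a few lines. This is a minimal sketch that returns the fraction rather than the count; the function name `ecdf` is an illustrative choice, and the values are the first row of the example data.

```python
def ecdf(data, x):
    """Fraction of values in data that are smaller than or equal to x."""
    return sum(1 for value in data if value <= x) / len(data)


# First row of the example data
values = [60.39, 59.68, 59.37, 58.75, 60.00, 58.93, 58.63, 58.07, 60.44, 59.40]

at_59 = ecdf(values, 59.0)           # four of the ten values are <= 59.0
at_max = ecdf(values, max(values))   # every value is <= the maximum

print(at_59, at_max)
```

Plotting `ecdf(values, x)` against x gives a step function that rises from 0 to 1.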
Descriptive measures of the data
distribution
• Arithmetic mean (average)
– Describes the central location of the data
• Median
– Numeric value separating the higher half of a sample from the lower half
• Quantiles (quartiles, percentiles)
– Points taken at regular intervals from the data set. Percentile is the value of a
variable below which a certain percent of observations fall
• Mode
– The value that occurs most frequently in the data set
• Minimum
– Smallest value in the data set
• Maximum
– Largest value in the data set
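The measures listed above are all available in Python's standard library. The sketch below applies them to the third row of the example data, chosen here only because it contains a repeated value; the dictionary name `summary` is illustrative.

```python
from statistics import mean, median, multimode, quantiles

# Third row of the example data (the value 59.45 occurs twice)
values = [59.45, 58.90, 60.24, 59.25, 59.45, 61.08, 57.65, 60.09, 60.30, 61.40]

summary = {
    "mean": mean(values),
    "median": median(values),
    # [Q1, Q2, Q3]; quartile conventions differ between software packages
    "quartiles": quantiles(values, n=4),
    # Returned as a list, because a data set can have several modes
    "modes": multimode(values),
    "minimum": min(values),
    "maximum": max(values),
}

for name, value in summary.items():
    print(name, value)
```

For continuous measurements the mode is rarely informative, since most values occur only once.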
Boxplot
DISTRIBUTION PARAMETERS
Population vs sample
• A population consists of all objects that are
relevant in a particular study
• In chemometrics we usually deal with samples drawn from the population
[Figure: a population with two samples drawn from it]
Central value
• The normal distribution is the most widely
used distribution in statistics. It is described by two parameters:
– Mean
– Standard deviation
• These two parameters have to be estimated
from the data at hand
Arithmetic mean
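Population: $\mu = \frac{\sum x_i}{n}$
Sample: $\bar{x} = \frac{\sum x_i}{n}$
Example data sorted in ascending order (first four rows: STUDENT_1, last four rows: STUDENT_2):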
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87
               Example 1   Example 2
Measurement 1  0.2         0.42
Measurement 2  0.3         0.44
Measurement 3  0.25        4.3
Average        0.25        1.72
Median
• A robust measure for the central value
– much less influenced by outliers than the mean
• The median divides the data distribution into
two equal halves
– the number of data higher than the median is
equal to the number of data lower than the
median
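The robustness of the median can be seen with the two three-measurement examples from the arithmetic mean slide. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean, median

example_1 = [0.2, 0.3, 0.25]    # no outlier
example_2 = [0.42, 0.44, 4.3]   # the third value is an outlier

mean_1, median_1 = mean(example_1), median(example_1)
mean_2, median_2 = mean(example_2), median(example_2)

print(f"Example 1: mean = {mean_1:.2f}, median = {median_1:.2f}")
print(f"Example 2: mean = {mean_2:.2f}, median = {median_2:.2f}")
```

In Example 2 the single outlier pulls the mean to 1.72, while the median stays at 0.44, close to the two consistent measurements.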
Spread
• Range
– length of the smallest interval which contains all the data (xmax – xmin)
• Interquartile range
– Difference between the third and the first quartile
• Variance
– describes how far the numbers lie from the mean
• Standard deviation
– measure of the dispersion of a set of data from its mean
• Median absolute deviation (MAD)
– the median of the absolute deviations from the data's median
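The five measures of spread can be computed as follows. This is a sketch using the first row of the example data; the quartiles come from `statistics.quantiles` with its default convention, which is an assumption, since other software may use a slightly different rule.

```python
from statistics import median, quantiles, stdev, variance

# First row of the example data
values = [60.39, 59.68, 59.37, 58.75, 60.00, 58.93, 58.63, 58.07, 60.44, 59.40]

data_range = max(values) - min(values)

q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

var = variance(values)   # sample variance, divides by n - 1
sd = stdev(values)       # square root of the sample variance

center = median(values)
mad = median([abs(x - center) for x in values])

print(f"range = {data_range:.2f}, IQR = {iqr:.2f}")
print(f"variance = {var:.3f}, std. dev. = {sd:.3f}, MAD = {mad:.3f}")
```

The interquartile range and the MAD are based on order statistics, so they are less affected by outliers than the variance and the standard deviation.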
Descriptive statistics
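                    Population                                      Sample
Variance            $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$       $v = s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard deviation  $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$  $s = \sqrt{v} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Median absolute deviation: $MAD = \mathrm{median}(|X_i - \mathrm{median}(X)|)$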
Descriptive statistics
• The variance of a data set is calculated by taking
the arithmetic mean of the squared differences
between each value and the mean value (for a sample, the sum is divided by n − 1 instead of n):
$v = s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• Squaring makes each term positive so that
values above the mean do not cancel values
below the mean
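As a worked illustration, the sample variance can be calculated by hand for the three Example 1 measurements from the arithmetic mean slide (0.2, 0.3 and 0.25):

```latex
\bar{x} = \frac{0.2 + 0.3 + 0.25}{3} = 0.25
\qquad
s^2 = \frac{(0.2 - 0.25)^2 + (0.3 - 0.25)^2 + (0.25 - 0.25)^2}{3 - 1}
    = \frac{0.0025 + 0.0025 + 0}{2} = 0.0025
\qquad
s = \sqrt{0.0025} = 0.05
```

Without squaring, the deviations −0.05, +0.05 and 0 would sum to zero.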
• For the STUDENT_1 data: mean = 59.82, std. dev. = 1.12
HYPOTHESIS TESTING
Statistical hypothesis testing
• Hypotheses are assumptions about the data
that may or may not be true.
Steps for hypothesis testing
• State the hypotheses
– the null hypothesis and an alternative hypothesis
• Formulate an analysis plan
– choose the significance level (e.g. α = 0.05) and the test method
• Analyse sample data
• Interpret result
Statistical hypotheses
• H0 – null hypothesis
– a statistical hypothesis that states that there is no
difference between a parameter and a specific
value, or that there is no difference between two
parameters
• H1 – alternative hypothesis
– a statistical hypothesis that states the existence of
a difference between a parameter and a specific
value, or states that there is a difference between
two parameters.
Hypothesis testing
• We want to test whether H1 is “likely” true.
• Two possible outcomes:
– Reject H0 and accept H1 because of sufficient
evidence in the sample in favour of H1
– Do not reject H0 because of insufficient
evidence in the sample in favour of H1
Analysis plan
Analyse data
• p – significance probability (p-value)
– If the p-value is larger than (or equal to) the significance level α, the test is
said to be “not significant” and thus the null
hypothesis H0 cannot be rejected
Decision errors
                  H0 is true         H1 is true
Do not reject H0  Correct decision   Type II error
Reject H0         Type I error       Correct decision
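The meaning of a Type I error can be illustrated with a small simulation in which H0 is true by construction, so every rejection is a Type I error. This is a sketch under stated assumptions: it uses a two-sided z-test with known standard deviation instead of a t-test (so that only the standard library is needed), and the sample size, number of trials and random seed are arbitrary choices.

```python
import random
from statistics import NormalDist, mean

random.seed(1)   # arbitrary seed, for reproducibility only

ALPHA = 0.05     # significance level
N = 30           # observations per simulated sample (arbitrary)
TRIALS = 2000    # number of simulated experiments (arbitrary)

std_normal = NormalDist()


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value of a z-test for H0: mean == mu0, with known sigma."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - std_normal.cdf(abs(z)))


rejections = 0
for _ in range(TRIALS):
    # Data are generated with mean 0, so H0 (mean == 0) is true
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    if z_test_p_value(sample, mu0=0.0, sigma=1.0) < ALPHA:
        rejections += 1   # H0 rejected although it is true: Type I error

type_1_rate = rejections / TRIALS
print(f"observed Type I error rate: {type_1_rate:.3f}")
```

The observed rate should lie close to the chosen significance level, which is the probability of a Type I error that the analyst agrees to accept.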
Tests for distribution
• Shapiro-Wilk Test
– H0 : data distribution follows normal distribution
• Anderson-Darling Test
– H0 : data distribution follows given hypothetical
distribution
• Kolmogorov-Smirnov Test
– H0 : data distribution follows given hypothetical
distribution
If the p-value is less than the chosen alpha level (0.05),
then the null hypothesis is rejected and there is evidence
that the data tested are not from a normally distributed
population
Quantile-Quantile (Q-Q) plot
• A graphical tool that helps us
assess whether a set of data plausibly
came from some theoretical
distribution, such as the normal distribution
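The points of a normal Q-Q plot can be computed without any plotting library. This is a minimal sketch: the function name `qq_points` and the plotting positions (i − 0.5)/n are assumptions made here, and other conventions for the positions are in common use.

```python
from statistics import NormalDist, mean, stdev


def qq_points(data):
    """Pairs (theoretical normal quantile, observed value) for a normal Q-Q plot."""
    n = len(data)
    # Reference distribution with the same mean and std. dev. as the data
    reference = NormalDist(mean(data), stdev(data))
    ordered = sorted(data)
    # Plotting positions (i - 0.5) / n: one common convention among several
    return [(reference.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]


# First row of the example data
values = [60.39, 59.68, 59.37, 58.75, 60.00, 58.93, 58.63, 58.07, 60.44, 59.40]

points = qq_points(values)
for theoretical, observed in points:
    print(f"{theoretical:7.2f} {observed:7.2f}")
```

If the data are approximately normal, the pairs lie close to the line y = x; systematic curvature suggests skewness or heavy tails.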
Example
STUDENT_1 STUDENT_2
N of Cases 60 60
Minimum 57.65 40.01
Maximum 63.10 72.87
Arithmetic Mean 59.82 59.14
Standard Deviation 1.12 5.21
Variance 1.25 27.16
Shapiro-Wilk Statistic 0.968 0.958
Shapiro-Wilk p-value 0.1165 0.039
Tests for central value (mean)
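• One-sample t-test
– H0 : mean is equal to given value
• Two-sample t-test
– H0 : means of two distributions are equal
$t = \frac{\bar{x}_1 - \bar{x}_2}{s / \sqrt{n}}$ (in the one-sample test, the given value takes the place of $\bar{x}_2$)
Requires both groups to be normally distributed, but the test is not very
sensitive to departures from normality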
One-sample t-test (example)
Actual concentration of the sample was 60.0 ppm
STUDENT_1 STUDENT_2
N of Cases 60 60
Minimum 57.65 40.01
Maximum 63.10 72.87
Arithmetic Mean 59.82 59.14
Standard Deviation 1.12 5.21
t -1.2782 -1.2771
T-test p-value 0.2062 0.2066
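The t statistics can be approximately reproduced from the rounded summary values in the table. This is a sketch only: the function name is illustrative, and because the mean and standard deviation are rounded to two decimals, the results differ slightly from the tabulated values, which were presumably computed from the raw data. The p-value requires the t distribution, which is not computed here.

```python
import math


def one_sample_t(sample_mean, sample_sd, n, reference_value):
    """t statistic for H0: population mean == reference_value."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - reference_value) / standard_error


ACTUAL_CONCENTRATION = 60.0  # ppm, from the slide

t_student_1 = one_sample_t(59.82, 1.12, 60, ACTUAL_CONCENTRATION)
t_student_2 = one_sample_t(59.14, 5.21, 60, ACTUAL_CONCENTRATION)

print(f"STUDENT_1: t = {t_student_1:.3f}")
print(f"STUDENT_2: t = {t_student_2:.3f}")
```

Although the mean of STUDENT_2 is further from 60.0 ppm, the larger standard deviation gives a similar t statistic, which is why both p-values in the table are nearly equal.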
Two-sample t-test (example)
STUDENT_1 STUDENT_2
N of Cases 60 60
Minimum 57.65 40.01
Maximum 63.10 72.87
Arithmetic Mean 59.82 59.14
Standard Deviation 1.12 5.21
t 0.981
T-test p-value 0.330
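The two-sample statistic can be approximated from the same rounded summary values. The sketch below uses the unequal-variance (Welch) form of the standard error, which is an assumption, since the slide does not state which variant was used; with equal group sizes the pooled-variance form gives the same t statistic. The function name is illustrative.

```python
import math


def two_sample_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """t statistic for H0: the two population means are equal (Welch form)."""
    standard_error = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return (mean_1 - mean_2) / standard_error


# Rounded summary values for STUDENT_1 and STUDENT_2 from the table
t_value = two_sample_t(59.82, 1.12, 60, 59.14, 5.21, 60)

print(f"t = {t_value:.3f}")
```

The result is close to the tabulated 0.981; the small difference comes from using rounded means and standard deviations.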
Tests for variance
• One-sample F-test
– H0 : variance is equal to given value
• Two-sample F-test
– H0 : variances of two distributions are equal
$F = \frac{\sigma_1^2}{\sigma_2^2} = \frac{s_1^2}{s_2^2}$
Requires both groups to be normally distributed and is very sensitive
to departures from normality
Two-sample F-test (example)
STUDENT_1 STUDENT_2
N of Cases 60 60
Minimum 57.65 40.01
Maximum 63.10 72.87
Arithmetic Mean 59.82 59.14
Standard Deviation 1.12 5.21
F 0.046
F-test p-value < 2.2e-16
AB 2580
Ansari-Bradley test p-value 3.331e-15
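The F statistic in the table is simply the ratio of the two sample variances. A minimal sketch using the rounded standard deviations from the table (the p-value requires the F distribution and is not computed here):

```python
sd_student_1 = 1.12
sd_student_2 = 5.21

# Ratio of the sample variances, s1^2 / s2^2
f_value = sd_student_1 ** 2 / sd_student_2 ** 2

print(f"F = {f_value:.3f}")
```

A ratio this far from 1 indicates that the results of STUDENT_2 are much more widely scattered than those of STUDENT_1.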
ERRONEOUS AND MISSING DATA
Errors in data
When using statistics, one should always keep in mind that the
data contain errors
• Random error (noise)
– The most common type; always present
• Systematic error
– Imperfection in an experimental procedure
– Poorly calibrated instrument, incorrect use of volumetric
glassware
• Gross error
– For example caused by instrumental breakdown
– Negative concentrations, impossibly high concentrations (e.g. 100 M)
Data with gross error should be excluded from data analysis
Student  Results (ml)                        Average  Comment
A        10.08  10.11  10.09  10.10  10.12   10.10    Precise, biased
B         9.88  10.14  10.02   9.80  10.21   10.01    Imprecise, unbiased
C        10.19   9.79   9.69  10.05   9.78    9.90    Imprecise, biased
D        10.04   9.98  10.02   9.97  10.04   10.01    Precise, unbiased
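The comments in the table can be related to two numbers per student: the standard deviation (precision) and the difference between the average and the true value (bias). The sketch below assumes a true value of 10.00 ml, which is not stated in the table and is used here only for illustration.

```python
from statistics import mean, stdev

ASSUMED_TRUE_VALUE = 10.00  # ml; assumption, not given in the table

results = {
    "A": [10.08, 10.11, 10.09, 10.10, 10.12],
    "B": [9.88, 10.14, 10.02, 9.80, 10.21],
    "C": [10.19, 9.79, 9.69, 10.05, 9.78],
    "D": [10.04, 9.98, 10.02, 9.97, 10.04],
}

summary = {}
for student, volumes in results.items():
    average = mean(volumes)
    summary[student] = {
        "mean": average,
        "stdev": stdev(volumes),               # small value: precise
        "bias": average - ASSUMED_TRUE_VALUE,  # close to zero: unbiased
    }

for student, stats in summary.items():
    print(f"{student}: mean = {stats['mean']:.2f}, "
          f"std. dev. = {stats['stdev']:.3f}, bias = {stats['bias']:+.2f}")
```

Students A and D have small standard deviations (precise), while the averages of B and D lie closest to the assumed true value (unbiased).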
Random errors
• Affect precision – repeatability or
reproducibility
• Cause replicate results to fall on either side of
a mean value
• Can be estimated using replicate
measurements
• Can be minimised by good technique but not
eliminated
• Caused by both humans and equipment
Systematic errors
• Produce bias – an overall deviation of a result from
the true value even when random errors are very
small
• Cause all results to be affected in one sense only, all
too high or all too low
• Cannot be detected simply by using replicate
measurements
• Can be corrected, e.g. by using standard methods
and materials
• Caused by both humans and equipment
Missing data
• Data possess empty gaps
– Either an empty cell or an N/A entry
• Solutions
– Do nothing
– Substitute with variable mean
– Remove corresponding object or variable
(Multiple Imputation)

               MOA  B96h  C48h  -OH
Methanol       1    0.32  1.86  1
Phenol         2    3.60  3.16  1
Ethanol        1    1.47  0.66  1
propan-1-ol    1    2.11  2.53  1
Benzene        1    2.13
pentan-1-ol    1    2.25  2.76  1
2-Propanone    1    0.90  1.22
Butan-1-ol     1    2.42  N/A   1
Toluene        1    3.85  2.36
4-nitrophenol  2    4.22  3.75  1
2-butanone     1    2.15  1.38
hexan-1-ol     1    3.47  2.62  1
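Two of the listed solutions, substituting the variable mean and removing the corresponding object, can be sketched as follows. The example uses a few C48h values read from the table on this slide and assumes that the N/A entry belongs to Butan-1-ol; the variable names are illustrative.

```python
from statistics import mean

# A few C48h values from the table; None marks the missing (N/A) entry.
# Assumption: the N/A cell belongs to Butan-1-ol.
c48h = {
    "Methanol": 1.86,
    "Phenol": 3.16,
    "Ethanol": 0.66,
    "Butan-1-ol": None,
    "Toluene": 2.36,
}

# Mean of the values that are actually present
observed_mean = mean(v for v in c48h.values() if v is not None)

# Solution 1: substitute the missing value with the variable mean
imputed = {name: (observed_mean if v is None else v) for name, v in c48h.items()}

# Solution 2: remove the objects (compounds) with a missing value
complete_cases = {name: v for name, v in c48h.items() if v is not None}

print(f"variable mean = {observed_mean:.2f}")
print(imputed)
print(complete_cases)
```

Mean substitution keeps all objects but reduces the variance of the variable, while removing objects keeps the observed values unchanged at the cost of a smaller data set.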