Lecture02.Data Distribution

LOKT.08.005 Chemometrics
Data distributions and descriptive statistics
• Data distributions
• Descriptive statistics
• Hypothesis testing, statistical tests
• Erroneous and missing data
Geven Piir
Rows
• Case, object, sample, observation, compound
Columns
• Variable, feature, measurement, descriptor,
parameter
– Not all columns are variables
Columns
Identifier (not variable)
Dependent variable
Independent variables
Variable types
• Numerical
– Continuous (1.2, 2.68, 0.15, ...)
• Obtained from measurements
• It exists on an infinite range of values
– Discrete (1, 2, 3, ...)
• Obtained in counting
• It can only take on certain values
• Categorical
– Nominal (alcohols, esters, acids, ...)
• functional group, colour, apparatus
– Ordinal, including binary (A, B, C, ... – grades)
• They have meaning, one is better than another
Shape of the distribution
• Where are the data points located, and how
far do they spread?
– What are typical, as well as minimal and maximal,
values?
• How are the points distributed?
– Are they spread out evenly or do they cluster in
certain areas?
• How many points are there?
– Is this a large data set or a relatively small one?
• Is the distribution symmetric or asymmetric?
– In other words, is the tail of the distribution much
larger on one side than on the other?
• Are the tails of the distribution relatively
heavy?
– Do many data points lie far away from the central
group of points?
• Are most of the points confined to a restricted
region?
• If there are clusters, how many are there?
– Is there only one, or are there several?
– Approximately where are the clusters located, and
how large are they?
• Does the data set contain any significant
outliers?
– that is, data points that seem to be different from
all the others?
• And lastly, are there any other unusual or
significant features in the data set?
– gaps
– sharp cutoffs
– unusual values
– anything at all that we can observe
DATA DISTRIBUTIONS
Normal distribution
• Normal distribution – bell-shaped
– most “basic” continuous probability distribution
$y = \frac{e^{-(x-\mu)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}}$
• Most naturally occurring processes can be described
with a normal distribution
• Deviations are normal, especially with small samples
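The density formula above can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `normal_pdf` is chosen here for illustration and does not come from the lecture.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std. dev. sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Height of the standard normal curve at its mean
peak = normal_pdf(0.0)
print(round(peak, 4))  # 0.3989
```

The peak height of about 0.4 for mean 0 and standard deviation 1 agrees with the density axis of the plot on the next slide.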
Normal distribution
[Figure: density plot, x-axis from -3 to 3, y-axis: Density (0.0 to 0.4). Legend: 10, 100, 1000, 10000, 100000. Mean = 0, Std.dev = 1]
Normal distribution
• For a normal distribution with mean µ and
standard deviation σ:
– ~68% of the population values lie within ±1σ of
the mean
– ~ 95% of the population values lie within ±2σ of
the mean
– ~ 99.7% of the population values lie within ±3σ
of the mean
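The three percentages can be checked numerically. This sketch relies on the standard identity P(|X − µ| < kσ) = erf(k/√2) for a normal variable; the helper name `prob_within` is hypothetical.

```python
import math

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 1))
# prints:
# 1 68.3
# 2 95.4
# 3 99.7
```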
Normal distribution
Log-normal Distribution
• A continuous distribution in
which the logarithm of a
variable has a normal
distribution
– Y = log(X)
• Antibody concentration in
human blood sera
a) Concentration
b) Log(concentration)
Other distributions
• Student’s T Distribution
– T-test
• comparison of two means
• Chi-Square Distribution
– Pearson's chi-squared test
• test of the independence of two nominal variables
• F-distribution
– F-test
• comparison of two variances
Many more distributions exist
LOOKING AT DATA
Descriptive statistics
– Graphical representations (scatter plot, histogram,
probability density plot, boxplot, etc.)
– Mean, median, mode, quantiles, standard
deviation, variance, etc.
Example data
Sixty determinations of Cu content (in ppm) in a reference sample by each of two students (first six rows: STUDENT_1, last six rows: STUDENT_2)
60.39 59.68 59.37 58.75 60.00 58.93 58.63 58.07 60.44 59.40
60.41 58.40 61.60 59.08 60.11 59.65 58.93 60.73 59.84 58.49
59.45 58.90 60.24 59.25 59.45 61.08 57.65 60.09 60.30 61.40
60.20 59.16 59.74 61.41 59.91 59.89 60.14 58.34 62.08 60.31
62.37 58.32 61.87 60.65 59.17 59.24 59.58 60.07 60.81 58.41
59.27 59.48 61.30 59.04 59.01 59.14 58.78 58.91 63.10 60.53
56.86 62.14 57.02 60.49 55.41 60.59 56.77 61.08 59.46 57.00
60.83 63.46 64.14 58.60 40.01 58.25 57.85 64.56 53.20 66.00
63.39 52.89 57.94 64.97 65.56 61.09 64.04 62.50 61.79 56.82
66.70 62.71 54.06 58.36 72.87 60.68 51.90 52.09 57.84 56.10
61.87 64.06 64.77 57.49 65.52 51.08 62.19 54.99 58.31 56.05
57.48 58.70 65.01 47.54 56.34 61.53 57.68 60.53 55.14 54.14
One-dimensional scatter plot
Histogram
Histogram gaps
Histogram gaps location
Probability density function
Smoothed line tracing the histogram
Empirical cumulative distribution function
The y-axis of the plot shows the number or percentage of data values
smaller than or equal to the x-axis value.
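The definition translates into a one-line function. This is a minimal sketch; the function name `ecdf` and the five sample values are hypothetical and only illustrate the idea.

```python
def ecdf(data, x):
    """Fraction of the values in data that are smaller than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)

values = [58.9, 59.4, 59.4, 60.1, 61.3]  # hypothetical Cu results in ppm
print(ecdf(values, 59.4))  # 0.6
```

Evaluating `ecdf` at every observed value and plotting the results as a step function gives the empirical cumulative distribution plot.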
Descriptive measures of the data
distribution
• Arithmetic mean (average)
– Describes the central location of the data
• Median
– Numeric value separating the higher half of a sample from the lower half
• Quantiles (quartiles, percentiles)
– Points taken at regular intervals from the data set. Percentile is the value of a
variable below which a certain percent of observations fall
• Mode
– The value that occurs most frequently in the data set
• Minimum
– Smallest value in the data set
• Maximum
– Largest value in the data set
Boxplot
IQR (interquartile range) = Q3 − Q1

Boxplots
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
Minimum = 57.65
Maximum = 63.10
Mode = 59.45 (58.93 also occurs twice)
Mean = (57.65 + … + 63.10)/60 = 59.82
Q1 = (59.01 + 59.04)/2 = 59.025
Q2 (median) = (59.65 + 59.68)/2 = 59.665
Q3 = (60.39 + 60.41)/2 = 60.40
IQR = Q3 − Q1 = 60.40 − 59.025 = 1.375
1.5 × IQR = 1.5 × 1.375 = 2.0625
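The boxplot quantities can be reproduced with the standard library. This sketch assumes that quartiles are the medians of the lower and upper halves of the sorted data (which matches the hand calculation for n = 60) and that points beyond Q1 − 1.5·IQR or Q3 + 1.5·IQR are drawn as outliers, a common boxplot convention that the slide does not state explicitly. Other quantile definitions give slightly different quartiles.

```python
import statistics

# First data set (60 Cu determinations, ppm), as listed on the slide
S1 = sorted(float(v) for v in """
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
""".split())

half = len(S1) // 2
q1 = statistics.median(S1[:half])   # median of the lower 30 values
q2 = statistics.median(S1)          # overall median
q3 = statistics.median(S1[half:])   # median of the upper 30 values
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (assumed convention)
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in S1 if v < low_fence or v > high_fence]

print(q1, q2, q3, iqr, outliers)
```

Under these assumptions only the largest value, 63.10, falls outside the fences.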
Boxplots
Boxplots
DISTRIBUTION PARAMETERS
Population vs sample
• A population consists of all objects that are
relevant in a particular study
• In chemometrics we usually deal with samples
[Figure: a population and samples drawn from it]
Central value
• The normal distribution is the most widely
used distribution in statistics
– Mean
– Standard deviation
• These two parameters have to be estimated
using the data at hand
Arithmetic mean
Population: $\mu = \frac{\sum x_i}{n}$
Sample: $\bar{x} = \frac{\sum x_i}{n}$
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87
First data set: Mean = (57.65 + … + 63.10)/60 = 59.82
Second data set: Mean = (40.01 + … + 72.87)/60 = 59.14

                Example 1   Example 2
Measurement 1   0.2         0.42
Average         0.25        1.72
Median
• A robust measure for the central value
– much less influenced by outliers than the mean
• The median divides the data distribution into
two equal halves
– the number of data higher than the median is
equal to the number of data lower than the
median
Median
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87
First data set: Mean = (57.65 + … + 63.10)/60 = 59.82, Median = 59.66
Second data set: Mean = (40.01 + … + 72.87)/60 = 59.14, Median = 58.65
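The robustness of the median can be demonstrated on the example data. The sketch below recomputes mean and median for both data sets and then replaces the largest value of the first set with 631.0, a hypothetical gross error (a misplaced decimal point) that is not part of the lecture data.

```python
import statistics

S1 = [float(v) for v in """
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
""".split()]
S2 = [float(v) for v in """
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87
""".split()]

for name, data in (("first", S1), ("second", S2)):
    print(name, round(statistics.mean(data), 2), round(statistics.median(data), 3))

# Hypothetical gross error: the largest value 63.10 is recorded as 631.0
largest = max(S1)
corrupted = [631.0 if v == largest else v for v in S1]
print(round(statistics.mean(corrupted), 2), round(statistics.median(corrupted), 3))
```

The single corrupted value shifts the mean by more than 9 ppm, while the median stays exactly where it was.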
Spread
• Range
– length of the smallest interval which contains all the data (xmax – xmin)
• Interquartile range
– Difference between the third and the first quartile
• Variance
– describes how far the numbers lie from the mean
• Standard deviation
– measure of the dispersion of a set of data from its mean
• Median absolute deviation (MAD)
– the median of the absolute deviations from the data's median
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87
First data set: Range = 63.10 − 57.65 = 5.45; IQR = 60.40 − 59.025 = 1.375
Second data set: Range = 72.87 − 40.01 = 32.86; IQR = 62.605 − 56.555 = 6.05
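Range and interquartile range for the second data set can be recomputed as follows. This is a sketch under one assumption: quartiles are taken as the medians of the lower and upper halves of the sorted data, the same convention as the hand calculation on the boxplot slide. The helper name `spread` is hypothetical.

```python
import statistics

S2 = [float(v) for v in """
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87
""".split()]

def spread(data):
    """Return (range, IQR), with quartiles as medians of the two halves."""
    xs = sorted(data)
    half = len(xs) // 2
    q1 = statistics.median(xs[:half])
    q3 = statistics.median(xs[len(xs) - half:])
    return xs[-1] - xs[0], q3 - q1

data_range, iqr = spread(S2)
print(round(data_range, 2), round(iqr, 3))
```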
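Variance
Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$
Sample: $v = s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard deviation
Population: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$
Sample: $s = \sqrt{v} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Median absolute deviation
$MAD = \mathrm{median}(|X_i - \mathrm{median}(X)|)$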
• The variance of a data set is calculated by taking
the arithmetic mean of the squared differences
between each value and the mean value
$v = s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• Squaring makes each term positive so that
values above the mean do not cancel values
below the mean
• Because the differences are squared, the
units of variance are not the same as the units
of the data.
• Therefore, the standard deviation is reported
as the square root of the variance and the
units then correspond to those of the data set.
$s = \sqrt{v} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87
First data set: v = [(57.65 − 59.82)^2 + … + (63.10 − 59.82)^2]/(60 − 1) = 1.25; s = √v = 1.12
Second data set: v = [(40.01 − 59.14)^2 + … + (72.87 − 59.14)^2]/(60 − 1) = 27.16; s = √v = 5.21
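First data set: median(|57.65 − 59.66|, …, |63.10 − 59.66|) = 0.69; scaled by 1.4826, MAD = 1.02
Second data set: median(|40.01 − 58.65|, …, |72.87 − 58.65|) = 3.23; scaled by 1.4826, MAD = 4.79
(The factor 1.4826 makes the MAD comparable to the standard deviation for normally distributed data.)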
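The spread measures can be recomputed with the standard library. In this sketch `statistics.variance` and `statistics.stdev` use the sample formulas (division by n − 1). The `mad` helper is hypothetical; its optional `scale` argument is an assumption added here because statistical software often multiplies the MAD by 1.4826 by default, and the 1.02 reported for the first data set agrees with the scaled value rather than with the unscaled definition.

```python
import statistics

S1 = [float(v) for v in """
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
""".split()]
S2 = [float(v) for v in """
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87
""".split()]

def mad(data, scale=1.0):
    """Median of the absolute deviations from the median, optionally scaled."""
    med = statistics.median(data)
    return scale * statistics.median([abs(v - med) for v in data])

for name, data in (("first", S1), ("second", S2)):
    print(name,
          round(statistics.variance(data), 2),   # sample variance, n - 1
          round(statistics.stdev(data), 2),      # sample standard deviation
          round(mad(data), 2),                   # unscaled MAD
          round(mad(data, scale=1.4826), 2))     # scaled MAD (assumed convention)
```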
Mean = 59.82
Std. dev = 1.12
HYPOTHESIS TESTING
Statistical hypothesis testing
• Hypotheses are our assumptions about the data
which may or may not be true.
• Hypothesis testing is a statistical method that is used
in making statistical decisions using experimental
data
– Tests for distribution
– Tests for central value (mean)
– Tests for variance
Steps for hypothesis testing
• State the hypotheses
– the null hypothesis and an alternative hypothesis
• Formulate an analysis plan
– choose the significance level (commonly 0.05) and the test method
• Analyse sample data
• Interpret result
Statistical hypotheses
• H0 – null hypothesis
– a statistical hypothesis that states that there is no
difference between a parameter and a specific
value, or that there is no difference between two
parameters
• H1 – alternative hypothesis
– a statistical hypothesis that states the existence of
a difference between a parameter and a specific
value, or states that there is a difference between
two parameters.
Hypothesis testing
• We want to test whether H1 is “likely” true.
• Two possible outcomes:
– Reject H0 and accept H1 because of sufficient
evidence in the sample in favour of H1
– Do not reject H0 because of insufficient evidence
to support H1
Analysis plan
• α – significance level, commonly 0.05
– Means that a valid H0 will be rejected
in 5% (1 in 20) of the cases
• Pick a test method
Analyse data
• α – significance level, commonly 0.05
– Means that a valid H0 will be rejected in 5% (1 in 20) of the cases
• p – significance probability
– If the p-value is larger than (or equal to) α, the test is
said to be “not significant” and thus the null
hypothesis H0 cannot be rejected
– If the p-value is smaller than α, H0 has to be rejected
and the test is said to be “significant”
Note that failure to reject H0 does not mean the null hypothesis is
true.
There is no formal outcome that says “accept H0.”
It only means that we do not have sufficient evidence to support H1.
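The decision rule can be written as a small function. This is only a sketch; the function name `decide` and the two p-values in the usage lines are hypothetical.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only if p is smaller than alpha."""
    if p_value < alpha:
        return "reject H0"
    return "do not reject H0"

print(decide(0.20))  # do not reject H0
print(decide(0.01))  # reject H0
```

A p-value exactly equal to the significance level falls on the "do not reject" side, as stated in the rule.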
Decision errors
                   H0 is true         H1 is true
Do not reject H0   Correct decision   Type II error
Reject H0          Type I error       Correct decision
Decision errors
The acceptance of H1 when H0 is true is called a Type I error.

The probability of committing a type I error is called the level
of significance and is denoted by α.
Decision errors
Failure to reject H0 when H1 is true is called a Type II error.

The probability of committing a type II error is denoted by β.
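That the Type I error rate equals the significance level can be illustrated by simulation. The sketch below repeatedly draws samples for which H0 is true by construction and counts how often H0 is rejected at α = 0.05. For simplicity it uses a two-sided z-test with known σ, which is a stand-in chosen here and not one of the tests named in the lecture; the sample size, number of trials and random seed are arbitrary assumptions.

```python
import random
from statistics import NormalDist, mean

def z_test_p(sample, mu0, sigma):
    """Two-sided p-value of a z-test for the mean with known sigma."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

random.seed(1)
alpha, trials, n = 0.05, 2000, 20
rejections = 0
for _ in range(trials):
    # H0 is true here: the data really come from a population with mean 60
    sample = [random.gauss(60.0, 1.0) for _ in range(n)]
    if z_test_p(sample, 60.0, 1.0) < alpha:
        rejections += 1  # a Type I error

rate = rejections / trials
print(rate)
```

The observed rejection rate should be close to 0.05, with some run-to-run scatter.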
Tests for distribution
• Shapiro-Wilk Test
– H0 : data distribution follows normal distribution
• Anderson-Darling Test
– H0 : data distribution follows given hypothetical
distribution
• Kolmogorov-Smirnov Test
– H0 : data distribution follows given hypothetical
distribution
If the p-value is less than the chosen alpha level (0.05),
then the null hypothesis is rejected and there is evidence
that the data tested are not from a normally distributed
population
Quantile-Quantile (Q-Q) plot
• A graphical tool to help us
assess if a set of data plausibly
came from some theoretical
distribution such as the normal distribution
• A Q-Q plot is a scatterplot created
by plotting two sets of quantiles
against one another
• If both sets of quantiles came from
the same distribution, we should
see the points forming a line that’s
roughly straight
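The points of a normal Q-Q plot can be computed without a plotting library. In this sketch the theoretical quantiles are taken at the plotting positions (i − 0.5)/n, which is one common convention and an assumption here; the function name `qq_points` and the seven sample values are hypothetical.

```python
from statistics import NormalDist

def qq_points(data):
    """Pair each sorted observation with the matching standard normal quantile."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, xs))

points = qq_points([59.2, 60.4, 58.7, 59.9, 61.1, 59.5, 60.0])  # hypothetical data
for theo, obs in points:
    print(round(theo, 3), obs)
```

Plotting the observed values against the theoretical quantiles and checking whether the points follow a straight line gives the visual normality check described above.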
Example
STUDENT_1 STUDENT_2
N of Cases 60 60
Minimum 57.65 40.01
Maximum 63.10 72.87
Arithmetic Mean 59.82 59.14
Standard Deviation 1.12 5.21
Variance 1.25 27.16
Shapiro-Wilk Statistic 0.968 0.958
Shapiro-Wilk p-value 0.1165 0.039
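If the p-value is less than the chosen alpha level
(0.05), then the null hypothesis is rejected and
there is evidence that the data tested are not from
a normally distributed population.
Here the p-value for STUDENT_1 (0.1165) is above 0.05, so normality is not rejected; for STUDENT_2 (0.039) it is below 0.05, so normality is rejected.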
Tests for central value (mean)
• One sample t-test
– H0 : mean is equal to given value
• Two sample t-test
– H0 : means of two distributions are equal
Requires normal distribution of both groups, but it is not very
sensitive if they are not
$t = \frac{\bar{x}_1 - \bar{x}_2}{s / \sqrt{n}}$
• Wilcoxon Signed Ranks Test (nonparametric)
– H0 : medians of two distributions are equal
Requires continuous data
One-sample t-test (example)
Actual concentration of the sample was 60.0 ppm
STUDENT_1 STUDENT_2
N of Cases 60 60
Minimum 57.65 40.01
Maximum 63.10 72.87
t -1.2782 -1.2771
T-test p-value 0.2062 0.2066

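If the p-value is less than the chosen alpha level (0.05), then the null hypothesis is rejected and
there is evidence that the mean is different from
the hypothesized value.
Here both p-values (0.2062 and 0.2066) are above 0.05, so H0 is not rejected for either student.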
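The one-sample t-test can be reproduced with the standard library. This is a sketch, not the software used for the table: the t statistic is (mean − 60.0)/(s/√n), and the two-sided p-value is obtained by integrating the Student's t density with Simpson's rule, since the standard library has no t distribution. All function names are hypothetical.

```python
import math
import statistics

S1 = [float(v) for v in """
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
""".split()]
S2 = [float(v) for v in """
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87
""".split()]

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density from 0 to |t| (Simpson)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * total * h / 3.0

def one_sample_t(data, mu0):
    """Return (t, p) for H0: the population mean equals mu0."""
    n = len(data)
    t = (statistics.mean(data) - mu0) / (statistics.stdev(data) / math.sqrt(n))
    return t, two_sided_p(t, n - 1)

t1, p1 = one_sample_t(S1, 60.0)
t2, p2 = one_sample_t(S2, 60.0)
print(round(t1, 4), round(p1, 4))
print(round(t2, 4), round(p2, 4))
```

Both p-values come out near 0.21, well above 0.05, so neither data set gives evidence that the mean differs from 60.0 ppm.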
Two-sample t-test (example)
STUDENT_1 STUDENT_2
N of Cases 60 60
Minimum 57.65 40.01
Maximum 63.10 72.87
t 0.981
T-test p-value 0.330

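If the p-value is less than the chosen alpha level (0.05), then the null hypothesis is rejected and
there is evidence that the means of two
distributions are not equal.
Here p = 0.330 is above 0.05, so H0 is not rejected.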
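A two-sample comparison of the means can be sketched in the same way. The lecture does not say which variant of the two-sample t-test produced the table; the code below assumes Welch's version (unequal variances, approximate degrees of freedom), which is an assumption made here. The p-value is again obtained by numerical integration of the t density, and all function names are hypothetical.

```python
import math
import statistics

S1 = [float(v) for v in """
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
""".split()]
S2 = [float(v) for v in """
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87
""".split()]

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson integration of the density from 0 to |t|."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * total * h / 3.0

def welch_t(a, b):
    """Return (t, df) of Welch's two-sample t-test (assumed variant)."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t, df = welch_t(S1, S2)
p = two_sided_p(t, df)
print(round(t, 3), round(df, 1), round(p, 3))
```

With equal group sizes the pooled-variance t statistic is identical; only the degrees of freedom, and therefore the p-value, differ slightly between the two variants.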
Tests for variance
• One sample F-test
– H0 : variance is equal to given value
• Two sample F-test
– H0 : variances of two distributions are equal
Requires normal distribution of both groups and is very sensitive
if they are not
$F = \frac{\sigma_1^2}{\sigma_2^2} = \frac{s_1^2}{s_2^2}$
• Ansari-Bradley Test (nonparametric)
– H0 : variances of two distributions are equal
Does not require normal distribution
Two-sample F-test (example)
STUDENT_1 STUDENT_2
N of Cases 60 60
Minimum 57.65 40.01
Maximum 63.10 72.87
F 0.046
F-test p-value < 2.2e-16
AB 2580
Ansari-Bradley test p-value 3.331e-15

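If the p-value is less than the chosen alpha level (0.05), then the null hypothesis is rejected and
there is evidence that the variances of two
distributions are not equal.
Here both p-values are far below 0.05, so H0 is rejected.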
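The F statistic itself is just the ratio of the two sample variances. The sketch below recomputes it for the example data; the p-value is omitted because it requires the F distribution, which the Python standard library does not provide.

```python
import statistics

S1 = [float(v) for v in """
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
""".split()]
S2 = [float(v) for v in """
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87
""".split()]

# Ratio of the sample variances (each computed with n - 1 in the denominator)
F = statistics.variance(S1) / statistics.variance(S2)
print(round(F, 3))
```

A ratio this far from 1 for two groups of 60 values is consistent with the very small p-value in the table.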
Data: t-test and F-test example

name                       HC-OH         logL(n-hexane)   logL(water)
methanol                   alcohol       1.09             3.73
2-propanol                 alcohol       1.69             3.48
ethanol                    alcohol       1.92             3.69
...                        ...           ...              ...
p-bromophenol              alcohol       5.11             5.23
4-hydroxybenzaldehyde      alcohol       6.74             5.38
methyl 4-hydroxybenzoate   alcohol       8.42             6.84
methane                    hydrocarbon   -0.03            -1.43
ethene                     hydrocarbon   0.59             -0.97
propane                    hydrocarbon   1.37             -1.46
...                        ...           ...              ...
trans-stilbene             hydrocarbon   7.45             2.78
anthracene                 hydrocarbon   7.45             3.03
fluoranthene               hydrocarbon   8.42             3.44

Statistics           HC-OH of logL(hexane)   HC-OH of logL(water)
mean                 3.531   3.897           3.830   -0.303
standard deviation   1.413   1.726           0.948   1.639
variance             1.996   2.978           0.898   2.685
t value              -1.090                  -15.067
p-value of t-test    0.279 (same)            0.000 (not same)
F ratio              0.670                   2.685
p-value of F-test    0.221 (same)            0.001 (not same)
ERRONEOUS AND MISSING DATA
Errors in data
When using statistics one should always keep in mind that the
data contain errors
• Random error (noise)
– Most common and exists always
• Systematic error
– Imperfection in an experimental procedure
– Poorly calibrated instrument, incorrect use of volumetric
glassware
• Gross error
– For example caused by instrumental breakdown
– Negative concentrations, too high concentrations (100 molar
(M))
Data with gross error should be excluded from data analysis
Student   Results (ml)                        Average   Comment
A         10.08  10.11  10.09  10.10  10.12   10.10     Precise, biased
B          9.88  10.14  10.02   9.80  10.21   10.01     Imprecise, unbiased
C         10.19   9.79   9.69  10.05   9.78    9.90     Imprecise, biased
D         10.04   9.98  10.02   9.97  10.04   10.01     Precise, unbiased
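The comments in the table can be quantified: bias as the difference between the average and the true value, precision as the standard deviation of the replicates. The sketch below assumes a true value of 10.00 ml, which the slide does not state; it is inferred here from the "unbiased" label on averages of 10.01.

```python
import statistics

TRUE_VALUE = 10.00  # assumption: the slide does not state the true volume

results = {
    "A": [10.08, 10.11, 10.09, 10.10, 10.12],
    "B": [9.88, 10.14, 10.02, 9.80, 10.21],
    "C": [10.19, 9.79, 9.69, 10.05, 9.78],
    "D": [10.04, 9.98, 10.02, 9.97, 10.04],
}

summary = {}
for student, values in results.items():
    bias = statistics.mean(values) - TRUE_VALUE   # systematic error
    spread = statistics.stdev(values)             # random error
    summary[student] = (bias, spread)
    print(student, round(bias, 2), round(spread, 3))
```

Students A and D show a small standard deviation (precise), while A and C show a bias of about 0.1 ml in opposite directions.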
Random errors
• Affect precision – repeatability or
reproducibility
• Cause replicate results to fall on either side of
a mean value
• Can be estimated using replicate
measurements
• Can be minimised by good technique but not
eliminated
• Caused by both humans and equipment
Systematic errors
• Produce bias – an overall deviation of a result from
the true value even when random errors are very
small
• Cause all results to be affected in one sense only, all
too high or all too low
• Cannot be detected simply by using replicate
measurements
• Can be corrected, e.g. by using standard methods
and materials
• Caused by both humans and equipment
Missing data
• Data possess empty gaps
– Either empty cell or N/A
• Solutions
– Do nothing
– Substitute with variable mean
– Substitute with 0
– Remove corresponding object or variable
– Estimate with NIPALS, MI (Multiple Imputation)

name                 MOA   B96h   C48h   -OH
Methanol             1     0.32   1.86   1
Phenol               2     3.60   3.16   1
Ethanol              1     1.47   0.66   1
propan-1-ol          1     2.11   2.53   1
Benzene              1     2.13
pentan-1-ol          1     2.25   2.76   1
2-Propanone          1     0.90   1.22
Butan-1-ol           1     2.42   N/A    1
propan-2-ol          1     1.74   2.76   1
Aniline              2     3.28   2.57
Toluene              1     3.85   2.36
4-nitrophenol        2     4.22   3.75   1
2-butanone           1     2.15   1.38
hexan-1-ol           1     3.47   2.62   1
2,4-dichlorophenol   2     4.91   4.45   1
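Three of the simpler strategies can be compared on a few rows of the table. The sketch below uses five C48h values as read from the slide, with the missing entry for Butan-1-ol represented by `None`; the variable names are hypothetical, and NIPALS and multiple imputation are not covered.

```python
import statistics

# C48h values for five compounds as read from the table; None marks the N/A cell
c48h = {
    "Methanol": 1.86,
    "Phenol": 3.16,
    "Ethanol": 0.66,
    "propan-1-ol": 2.53,
    "Butan-1-ol": None,
}

observed = [v for v in c48h.values() if v is not None]
column_mean = statistics.mean(observed)

# Substitute with variable mean
mean_imputed = {k: (column_mean if v is None else v) for k, v in c48h.items()}
# Substitute with 0
zero_imputed = {k: (0.0 if v is None else v) for k, v in c48h.items()}
# Remove the corresponding object
dropped = {k: v for k, v in c48h.items() if v is not None}

print(round(column_mean, 4))
print(round(statistics.mean(zero_imputed.values()), 4))
```

Mean substitution leaves the column mean unchanged, whereas substituting 0 pulls it down, which is one reason zero substitution is rarely appropriate for measured quantities.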
