
LOKT.08.005 Chemometrics

Data distributions and descriptive statistics

• Data distributions
• Descriptive statistics
• Hypothesis testing, statistical tests
• Erroneous and missing data

Geven Piir

Rows
• Case, object, sample, observation, compound

Columns
• Variable, feature, measurement, descriptor,
parameter
– Not all columns are variables

• Identifier (not a variable)
• Dependent variable
• Independent variables

Variable types
• Numerical
– Continuous (1.2, 2.68, 0.15, ...)
• Obtained from measurements
• Can take any value within a range
– Discrete (1, 2, 3, ...)
• Obtained by counting
• Can take only certain values
• Categorical
– Nominal (alcohols, esters, acids, ...)
• e.g. functional group, colour, apparatus
– Ordinal, including binary (A, B, C, ... – grades)
• The categories have a meaningful order; one is better than another

Shape of the distribution
• Where are the data points located, and how
far do they spread?
– What are typical, as well as minimal and maximal,
values?
• How are the points distributed?
– Are they spread out evenly or do they cluster in
certain areas?
• How many points are there?
– Is this a large data set or a relatively small one?
• Is the distribution symmetric or asymmetric?
– In other words, is the tail of the distribution much
larger on one side than on the other?
• Are the tails of the distribution relatively heavy?
– That is, do many data points lie far away from the central group of points?
• Are most of the points confined to a restricted
region?
• If there are clusters, how many are there?
– Is there only one, or are there several?
– Approximately where are the clusters located, and
how large are they?
• Does the data set contain any significant
outliers?
– that is, data points that seem to be different from
all the others?

• And lastly, are there any other unusual or significant features in the data set?
– gaps
– sharp cutoffs
– unusual values
– anything at all that we can observe

DATA DISTRIBUTIONS

Normal distribution
• Normal distribution – bell-shaped
– most “basic” continuous probability distribution

$y = \frac{e^{-(x-\mu)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}}$

• Many naturally occurring processes can be described by a normal distribution
• Deviations from normality are common, especially with small samples

[Figure: Normal distribution, Mean = 0, Std.dev = 1. Density (y-axis, 0.0 to 0.4) plotted over the range −3 to 3; legend entries 10, 100, 1000, 10000, 100000]

• For a normal distribution with mean µ and
standard deviation σ:
– ~68% of the population values lie within ±1σ of the mean
– ~95% of the population values lie within ±2σ of the mean
– ~99.7% of the population values lie within ±3σ of the mean
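The three percentages can be checked numerically. The sketch below is only an illustration: it uses the identity P(|Z| < k) = erf(k/√2) for a standard normal variable Z, and the function name `coverage_within` is a label chosen here, not something from the course material.

```python
import math


def coverage_within(k: float) -> float:
    """Fraction of a normal population lying within ±k standard deviations of the mean."""
    # For a standard normal variable Z, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within ±{k}σ: {coverage_within(k):.4f}")
```

The result does not depend on µ or σ, because the interval is expressed in units of σ around the mean.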

Log-normal distribution
• A continuous distribution in which the logarithm of a variable has a normal distribution
– Y = log(X)
• Example: antibody concentration in human blood sera
[Figure panels: a) Concentration, b) Log(concentration)]
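A small simulation can illustrate the effect of the log transform. This is a sketch with simulated values, not the antibody data from the slide: the sample is drawn with `random.lognormvariate`, and the parameters (log-mean 0, log-sd 1), the sample size and the seed are arbitrary assumptions.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical "concentrations": log-normal with log-mean 0 and log-sd 1.
conc = [random.lognormvariate(0.0, 1.0) for _ in range(10_000)]
log_conc = [math.log(c) for c in conc]

# On the raw scale the long right tail pulls the mean above the median.
raw_gap = statistics.mean(conc) - statistics.median(conc)
# On the log scale mean and median nearly coincide (symmetric, normal-like shape).
log_gap = statistics.mean(log_conc) - statistics.median(log_conc)

print(f"mean - median, raw scale: {raw_gap:.3f}")
print(f"mean - median, log scale: {log_gap:.3f}")
```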

Other distributions
• Student’s t-distribution
– t-test
• comparison of two means
• Chi-square distribution
– Pearson's chi-squared test
• test of the independence of two nominal variables
• F-distribution
– F-test
• comparison of two variances

Many more distributions exist

LOOKING AT DATA

Descriptive statistics

– Graphical representations (scatter plot, histogram, probability density plot, boxplot, etc.)
– Mean, median, mode, quantiles, standard deviation, variance, etc.

Example data

Sixty determinations of Cu content (in ppm) in a reference sample

60.39 59.68 59.37 58.75 60.00 58.93 58.63 58.07 60.44 59.40
60.41 58.40 61.60 59.08 60.11 59.65 58.93 60.73 59.84 58.49
59.45 58.90 60.24 59.25 59.45 61.08 57.65 60.09 60.30 61.40
60.20 59.16 59.74 61.41 59.91 59.89 60.14 58.34 62.08 60.31
62.37 58.32 61.87 60.65 59.17 59.24 59.58 60.07 60.81 58.41
59.27 59.48 61.30 59.04 59.01 59.14 58.78 58.91 63.10 60.53

56.86 62.14 57.02 60.49 55.41 60.59 56.77 61.08 59.46 57.00
60.83 63.46 64.14 58.60 40.01 58.25 57.85 64.56 53.20 66.00
63.39 52.89 57.94 64.97 65.56 61.09 64.04 62.50 61.79 56.82
66.70 62.71 54.06 58.36 72.87 60.68 51.90 52.09 57.84 56.10
61.87 64.06 64.77 57.49 65.52 51.08 62.19 54.99 58.31 56.05
57.48 58.70 65.01 47.54 56.34 61.53 57.68 60.53 55.14 54.14

One-dimensional scatter plot

Histogram

Histogram gaps

Histogram gaps location

Probability density function

Smoothed line tracing the histogram

Empirical cumulative distribution function

The y-axis of the plot shows the number or percentage of data values less than or equal to the x-axis value.
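The definition translates directly into a short function. A minimal sketch, assuming the fraction (rather than the count) is wanted; the name `ecdf` and the four example values are made up for illustration.

```python
def ecdf(data, x):
    """Fraction of data values less than or equal to x."""
    return sum(1 for value in data if value <= x) / len(data)


values = [1.0, 2.0, 2.0, 3.0]  # hypothetical data
for x in (0.5, 2.0, 3.0):
    print(x, ecdf(values, x))
```

Multiplying the result by 100 gives the percentage form, and multiplying by `len(data)` gives the count form mentioned in the text.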

Descriptive measures of the data distribution
• Arithmetic mean (average)
– Describes the central location of the data
• Median
– Numeric value separating the higher half of a sample from the lower half
• Quantiles (quartiles, percentiles)
– Points taken at regular intervals from the ordered data set. A percentile is the value of a variable below which a certain percentage of observations fall
• Mode
– The value that occurs most frequently in the data set
• Minimum
– Smallest value in the data set
• Maximum
– Largest value in the data set
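These measures are all available in Python's standard library. The sketch below uses a small made-up list, not the Cu data; `statistics.multimode` is used because a data set can have more than one most frequent value.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical values, not from the slides

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "modes": statistics.multimode(data),  # a list, since ties are possible
    "minimum": min(data),
    "maximum": max(data),
}
print(summary)
```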

Boxplot

IQR (interquartile range) = Q3 − Q1


Boxplots
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10

Minimum = 57.65
Maximum = 63.10
Mode = 58.93 and 59.45 (each value occurs twice)
Mean = (57.65+…+63.10)/60 = 59.82

Q1 = (59.01+59.04)/2 = 59.025
Q2 (median) = (59.65+59.68)/2 = 59.665
Q3 = (60.39+60.41)/2 = 60.40
IQR = Q3 − Q1 = 60.40 − 59.025 = 1.375
1.5*IQR = 1.5*1.375 = 2.0625
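The quartile arithmetic can be reproduced in a few lines. This is a sketch under two assumptions: quartiles are taken as the medians of the lower and upper halves of the sorted data (the convention used in the calculation above; other software may use a different quantile rule), and points beyond 1.5*IQR from the box are flagged as outliers (the usual boxplot convention, not stated explicitly on the slide). The variable names are illustrative.

```python
import statistics

# First data set: sixty Cu determinations (ppm) from the example data slide.
cu = sorted([
    60.39, 59.68, 59.37, 58.75, 60.00, 58.93, 58.63, 58.07, 60.44, 59.40,
    60.41, 58.40, 61.60, 59.08, 60.11, 59.65, 58.93, 60.73, 59.84, 58.49,
    59.45, 58.90, 60.24, 59.25, 59.45, 61.08, 57.65, 60.09, 60.30, 61.40,
    60.20, 59.16, 59.74, 61.41, 59.91, 59.89, 60.14, 58.34, 62.08, 60.31,
    62.37, 58.32, 61.87, 60.65, 59.17, 59.24, 59.58, 60.07, 60.81, 58.41,
    59.27, 59.48, 61.30, 59.04, 59.01, 59.14, 58.78, 58.91, 63.10, 60.53,
])

half = len(cu) // 2
q1 = statistics.median(cu[:half])  # median of the lower half
q2 = statistics.median(cu)         # overall median
q3 = statistics.median(cu[half:])  # median of the upper half
iqr = q3 - q1

# Assumed boxplot convention: fences at 1.5*IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in cu if x < lower_fence or x > upper_fence]

print(f"Q1={q1:.3f} Q2={q2:.3f} Q3={q3:.3f} IQR={iqr:.3f}")
print("points outside the fences:", outliers)
```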

DISTRIBUTION PARAMETERS

Population vs sample
• A population consists of all objects that are relevant in a particular study
• In chemometrics we usually deal with samples

[Figure: a population and two samples drawn from it]

Central value
• The normal distribution is the most widely
used distribution in statistics
– Mean
– Standard deviation
• These two parameters have to be estimated
using the data at hand

Arithmetic mean
Population: $\mu = \frac{\sum x_i}{n}$
Sample: $\bar{x} = \frac{\sum x_i}{n}$

57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87

Data set 1 (first four rows): Mean = (57.65+…+63.10)/60 = 59.82
Data set 2 (last four rows): Mean = (40.01+…+72.87)/60 = 59.14

                Example 1   Example 2
Measurement 1   0.2         0.42
Measurement 2   0.3         0.44
Measurement 3   0.25        4.3
Average         0.25        1.72

Median
• A robust measure for the central value
– much less influenced by outliers than the mean
• The median divides the data distribution into
two equal halves
– the number of data higher than the median is
equal to the number of data lower than the
median
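The robustness of the median shows up already with three values. The sketch below reuses the three measurements of "Example 2" from the table of averages earlier in the slides (0.42, 0.44, 4.3), where the third value behaves like an outlier; the variable names are illustrative.

```python
import statistics

example_2 = [0.42, 0.44, 4.3]  # third measurement is far from the other two

mean_value = statistics.mean(example_2)      # pulled towards the outlier
median_value = statistics.median(example_2)  # stays with the bulk of the data

print(f"mean = {mean_value:.2f}, median = {median_value:.2f}")
```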

57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87

Data set 1 (first four rows): Mean = (57.65+…+63.10)/60 = 59.82, Median = 59.665
Data set 2 (last four rows): Mean = (40.01+…+72.87)/60 = 59.14, Median = 58.65

Spread
• Range
– length of the smallest interval which contains all the data (xmax – xmin)
• Interquartile range
– Difference between the third and the first quartile
• Variance
– describes how far the numbers lie from the mean
• Standard deviation
– measure of the dispersion of a set of data from its mean
• Median absolute deviation (MAD)
– the median of the absolute deviations from the data's median

Descriptive statistics
57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87

Data set 1 (first four rows): Range = 63.10 – 57.65 = 5.45, IQR = 60.40 – 59.025 = 1.375
Data set 2 (last four rows): Range = 72.87 – 40.01 = 32.86, IQR = 62.605 – 56.555 = 6.05

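Descriptive statistics

Variance
Population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$
Sample: $v = s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Standard deviation
Population: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$
Sample: $s = \sqrt{v} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Median absolute deviation
$MAD = \mathrm{median}(|X_i - \mathrm{median}(X)|)$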
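The difference between the population and sample formulas is only the divisor (n versus n − 1). A minimal sketch with made-up numbers, using the standard-library functions `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n − 1); the MAD is computed directly from its definition, without any scaling factor.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, not from the slides

pop_var = statistics.pvariance(data)   # sum of squared deviations / n
samp_var = statistics.variance(data)   # sum of squared deviations / (n - 1)
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

med = statistics.median(data)
mad = statistics.median([abs(x - med) for x in data])  # unscaled MAD

print(f"population: variance={pop_var}, sd={pop_sd}")
print(f"sample:     variance={samp_var:.3f}, sd={samp_sd:.3f}")
print(f"median={med}, MAD={mad}")
```

Some software multiplies the MAD by 1.4826 so that it estimates σ for normally distributed data; the sketch leaves that factor out.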
Descriptive statistics
• The variance of a data set is calculated by averaging the squared differences between each value and the mean value (for a sample, the sum is divided by n − 1)
$v = s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• Squaring makes each term positive so that values above the mean do not cancel values below the mean
• Because the differences are squared, the units of variance are not the same as the units of the data.
• Therefore, the standard deviation is reported as the square root of the variance, and the units then correspond to those of the data set.
$s = \sqrt{v} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

57.65 58.07 58.32 58.34 58.40 58.41 58.49 58.63 58.75 58.78 58.90 58.91 58.93 58.93 59.01
59.04 59.08 59.14 59.16 59.17 59.24 59.25 59.27 59.37 59.40 59.45 59.45 59.48 59.58 59.65
59.68 59.74 59.84 59.89 59.91 60.00 60.07 60.09 60.11 60.14 60.20 60.24 60.30 60.31 60.39
60.41 60.44 60.53 60.65 60.73 60.81 61.08 61.30 61.40 61.41 61.60 61.87 62.08 62.37 63.10
40.01 47.54 51.08 51.90 52.09 52.89 53.20 54.06 54.14 54.99 55.14 55.41 56.05 56.10 56.34
56.77 56.82 56.86 57.00 57.02 57.48 57.49 57.68 57.84 57.85 57.94 58.25 58.31 58.36 58.60
58.70 59.46 60.49 60.53 60.59 60.68 60.83 61.08 61.09 61.53 61.79 61.87 62.14 62.19 62.50
62.71 63.39 63.46 64.04 64.06 64.14 64.56 64.77 64.97 65.01 65.52 65.56 66.00 66.70 72.87

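Data set 1 (first four rows):
v = [(57.65 – 59.82)^2+…+(63.10 – 59.82)^2]/(60 – 1) = 1.25
s = √v = 1.12
MAD = 1.4826 × median(|57.65 – 59.665|, …, |63.10 – 59.665|) = 1.4826 × 0.69 = 1.02

Data set 2 (last four rows):
v = [(40.01 – 59.14)^2+…+(72.87 – 59.14)^2]/(60 – 1) = 27.16
s = √v = 5.21
MAD = 1.4826 × median(|40.01 – 58.65|, …, |72.87 – 58.65|) = 1.4826 × 3.23 = 4.79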

Mean = 59.82
Std. dev = 1.12

HYPOTHESIS TESTING

Statistical hypothesis testing
• Hypotheses are our assumptions about the data, which may or may not be true.
• Hypothesis testing is a statistical method used for making statistical decisions from experimental data
– Tests for distribution
– Tests for central value (mean)
– Tests for variance

Steps for hypothesis testing
• State the hypotheses
– the null hypothesis and an alternative hypothesis
• Formulate an analysis plan
– choose the significance level (commonly 0.05) and the test method
• Analyse sample data
• Interpret result

Statistical hypotheses
• H0 – null hypothesis
– a statistical hypothesis that states that there is no
difference between a parameter and a specific
value, or that there is no difference between two
parameters

• H1 – alternative hypothesis
– a statistical hypothesis that states the existence of
a difference between a parameter and a specific
value, or states that there is a difference between
two parameters.

Hypothesis testing
• We want to test whether H1 is “likely” true.
• Two possible outcomes:
– Reject H0 and accept H1 because of sufficient evidence in the sample in favour of H1
– Do not reject H0 because of insufficient evidence to support H1

Analysis plan

• α – significance level, commonly 0.05
– Means that in 5% (1 in 20) of the cases in which H0 is valid, it will nevertheless be rejected
• Pick a test method

Analyse data

• α – significance level, commonly 0.05
– Means that in 5% (1 in 20) of the cases in which H0 is valid, it will nevertheless be rejected
• p – significance probability
– If the p-value is larger than (or equal to) α, the test is said to be “not significant” and thus the null hypothesis H0 cannot be rejected
– If the p-value is smaller than α, H0 has to be rejected and the test is said to be “significant”

Note that failure to reject H0 does not mean the null hypothesis is true.
There is no formal outcome that says “accept H0.”
It only means that we do not have sufficient evidence to support H1.

Decision errors

                   H0 is true         H1 is true
Do not reject H0   Correct decision   Type II error
Reject H0          Type I error       Correct decision


The acceptance of H1 when H0 is true is called a Type I error.
The probability of committing a Type I error is called the level of significance and is denoted by α.


Failure to reject H0 when H1 is true is called a Type II error.
The probability of committing a Type II error is denoted by β.
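The meaning of the Type I error rate can be illustrated by simulation. The sketch below is an illustration only and makes several assumptions not in the slides: it uses a two-sided z-test with known σ = 1 (chosen because its critical value 1.96 needs no t-table), samples of 20 simulated values, 2000 repetitions and a fixed random seed. H0 is true in every repetition, so every rejection is a Type I error.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def rejects_h0(sample, mu0=0.0, sigma=1.0, z_crit=1.96):
    """Two-sided z-test with known sigma; True if H0: mean = mu0 is rejected."""
    z = (statistics.mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return abs(z) > z_crit


trials = 2000
# H0 is true in every trial: the data really come from N(0, 1).
false_rejections = sum(
    rejects_h0([random.gauss(0.0, 1.0) for _ in range(20)])
    for _ in range(trials)
)
type_i_rate = false_rejections / trials

print(f"observed Type I error rate: {type_i_rate:.3f} (alpha = 0.05)")
```

The observed rate fluctuates around α from run to run; it approaches 0.05 as the number of repetitions grows.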

Tests for distribution
• Shapiro-Wilk Test
– H0 : the data follow a normal distribution
• Anderson-Darling Test
– H0 : the data follow a given hypothetical distribution
• Kolmogorov-Smirnov Test
– H0 : the data follow a given hypothetical distribution

If the p-value is less than the chosen alpha level (0.05), then the null hypothesis is rejected and there is evidence that the data tested are not from a normally distributed population

Quantile-Quantile (Q-Q) plot
• A graphical tool that helps us assess whether a set of data plausibly came from some theoretical distribution, such as the normal distribution
• A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another
• If both sets of quantiles came from the same distribution, the points should form a roughly straight line
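The coordinates of a normal Q-Q plot can be computed without a plotting library. A minimal sketch, assuming plotting positions (i − 0.5)/n, which is one common convention among several; the function name and the five example values are made up. `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist


def normal_qq_points(data):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(data)))


points = normal_qq_points([60.1, 59.4, 61.0, 58.7, 59.9])  # hypothetical values
for theo, obs in points:
    print(f"{theo:+.3f}  {obs:.2f}")
```

Plotting the second coordinate against the first gives the Q-Q plot; points close to a straight line suggest that the data are compatible with a normal distribution.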

Example
                         STUDENT_1   STUDENT_2
N of Cases               60          60
Minimum                  57.65       40.01
Maximum                  63.10       72.87
Arithmetic Mean          59.82       59.14
Standard Deviation       1.12        5.21
Variance                 1.25        27.16
Shapiro-Wilk Statistic   0.968       0.958
Shapiro-Wilk p-value     0.1165      0.039

If the p-value is less than the chosen alpha level (0.05), then the null hypothesis is rejected and there is evidence that the data tested are not from a normally distributed population

Tests for central value (mean)
• One sample t-test
– H0 : mean is equal to given value
$t = \frac{\bar{x}_1 - \bar{x}_2}{s / \sqrt{n}}$
(for the one-sample test, $\bar{x}_1$ is the sample mean and $\bar{x}_2$ is the given value)
• Two sample t-test
– H0 : means of two distributions are equal
Requires both groups to be normally distributed, but it is not very sensitive to departures from normality
• Wilcoxon Signed Ranks Test (nonparametric)
– H0 : medians of two distributions are equal
Requires continuous data

One-sample t-test (example)
Actual concentration of the sample was 60.0 ppm

                     STUDENT_1   STUDENT_2
N of Cases           60          60
Minimum              57.65       40.01
Maximum              63.10       72.87
Arithmetic Mean      59.82       59.14
Standard Deviation   1.12        5.21
t                    -1.2782     -1.2771
T-test p-value       0.2062      0.2066

If the p-value is less than the chosen alpha level (0.05), then the null hypothesis is rejected and there is evidence that the mean is different from the hypothesized value
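The t value for STUDENT_1 can be recomputed from the raw data. The sketch below assumes that STUDENT_1 corresponds to the first sixty Cu determinations on the example data slide. It computes only the t statistic; instead of a p-value it compares |t| with a two-sided critical value of about 2.00 for α = 0.05 and 59 degrees of freedom, which is an approximate table value assumed here and not given in the slides.

```python
import math
import statistics

# Assumed to be the STUDENT_1 data: first sixty Cu determinations (ppm).
student_1 = [
    60.39, 59.68, 59.37, 58.75, 60.00, 58.93, 58.63, 58.07, 60.44, 59.40,
    60.41, 58.40, 61.60, 59.08, 60.11, 59.65, 58.93, 60.73, 59.84, 58.49,
    59.45, 58.90, 60.24, 59.25, 59.45, 61.08, 57.65, 60.09, 60.30, 61.40,
    60.20, 59.16, 59.74, 61.41, 59.91, 59.89, 60.14, 58.34, 62.08, 60.31,
    62.37, 58.32, 61.87, 60.65, 59.17, 59.24, 59.58, 60.07, 60.81, 58.41,
    59.27, 59.48, 61.30, 59.04, 59.01, 59.14, 58.78, 58.91, 63.10, 60.53,
]

mu0 = 60.0                       # actual concentration stated on the slide
n = len(student_1)
mean = statistics.mean(student_1)
s = statistics.stdev(student_1)  # sample standard deviation (divides by n - 1)
t = (mean - mu0) / (s / math.sqrt(n))

t_crit = 2.00                    # assumed approximate table value (alpha = 0.05, 59 d.f.)
reject_h0 = abs(t) > t_crit

print(f"mean={mean:.2f} s={s:.2f} t={t:.4f} reject H0: {reject_h0}")
```

Because |t| is smaller than the critical value, H0 is not rejected, in line with the p-value above 0.05 in the table.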

Two-sample t-test (example)

                     STUDENT_1   STUDENT_2
N of Cases           60          60
Minimum              57.65       40.01
Maximum              63.10       72.87
Arithmetic Mean      59.82       59.14
Standard Deviation   1.12        5.21
t                    0.981
T-test p-value       0.330

If the p-value is less than the chosen alpha level (0.05), then the null hypothesis is rejected and there is evidence that the means of two distributions are not equal

Tests for variance
• One sample F-test
– H0 : variance is equal to given value
• Two sample F-test
– H0 : variances of two distributions are equal
$F = \frac{\sigma_1^2}{\sigma_2^2} = \frac{s_1^2}{s_2^2}$
Requires both groups to be normally distributed and is very sensitive to departures from normality
• Ansari-Bradley Test (nonparametric)
– H0 : variances of two distributions are equal
No requirements

Two-sample F-test (example)
                              STUDENT_1   STUDENT_2
N of Cases                    60          60
Minimum                       57.65       40.01
Maximum                       63.10       72.87
Arithmetic Mean               59.82       59.14
Standard Deviation            1.12        5.21
F                             0.046
F-test p-value                < 2.2e-16
AB                            2580
Ansari-Bradley test p-value   3.331e-15

If the p-value is less than the chosen alpha level (0.05), then the null hypothesis is rejected and there is evidence that the variances of two distributions are not equal
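Both two-sample statistics can be recomputed from the raw data. The sketch below assumes that STUDENT_1 and STUDENT_2 are the two blocks of sixty Cu determinations on the example data slide. The t statistic is written in the unequal-variance (Welch) form, which is an assumption about the test variant; with equal group sizes it gives the same t value as the pooled form. Only the test statistics are computed, not the p-values.

```python
import math
import statistics

# Assumed STUDENT_1 and STUDENT_2 data (Cu content in ppm).
student_1 = [
    60.39, 59.68, 59.37, 58.75, 60.00, 58.93, 58.63, 58.07, 60.44, 59.40,
    60.41, 58.40, 61.60, 59.08, 60.11, 59.65, 58.93, 60.73, 59.84, 58.49,
    59.45, 58.90, 60.24, 59.25, 59.45, 61.08, 57.65, 60.09, 60.30, 61.40,
    60.20, 59.16, 59.74, 61.41, 59.91, 59.89, 60.14, 58.34, 62.08, 60.31,
    62.37, 58.32, 61.87, 60.65, 59.17, 59.24, 59.58, 60.07, 60.81, 58.41,
    59.27, 59.48, 61.30, 59.04, 59.01, 59.14, 58.78, 58.91, 63.10, 60.53,
]
student_2 = [
    56.86, 62.14, 57.02, 60.49, 55.41, 60.59, 56.77, 61.08, 59.46, 57.00,
    60.83, 63.46, 64.14, 58.60, 40.01, 58.25, 57.85, 64.56, 53.20, 66.00,
    63.39, 52.89, 57.94, 64.97, 65.56, 61.09, 64.04, 62.50, 61.79, 56.82,
    66.70, 62.71, 54.06, 58.36, 72.87, 60.68, 51.90, 52.09, 57.84, 56.10,
    61.87, 64.06, 64.77, 57.49, 65.52, 51.08, 62.19, 54.99, 58.31, 56.05,
    57.48, 58.70, 65.01, 47.54, 56.34, 61.53, 57.68, 60.53, 55.14, 54.14,
]


def welch_t(a, b):
    """Two-sample t statistic without assuming equal variances."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


t = welch_t(student_1, student_2)
f_ratio = statistics.variance(student_1) / statistics.variance(student_2)

print(f"t = {t:.3f}, F = {f_ratio:.3f}")
```

The means differ little compared with their standard error (small t), while the variances differ by a factor of more than 20 (F far from 1), which matches the conclusions of the two example tables.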
t-test and F-test example

Data
name                       HC-OH         logL(n-hexane)   logL(water)
methanol                   alcohol       1.09             3.73
2-propanol                 alcohol       1.69             3.48
ethanol                    alcohol       1.92             3.69
...                        ...           ...              ...
p-bromophenol              alcohol       5.11             5.23
4-hydroxybenzaldehyde      alcohol       6.74             5.38
methyl 4-hydroxybenzoate   alcohol       8.42             6.84
methane                    hydrocarbon   -0.03            -1.43
ethene                     hydrocarbon   0.59             -0.97
propane                    hydrocarbon   1.37             -1.46
...                        ...           ...              ...
trans-stilbene             hydrocarbon   7.45             2.78
anthracene                 hydrocarbon   7.45             3.03
fluoranthene               hydrocarbon   8.42             3.44

Statistics           HC-OH of logL(hexane)   HC-OH of logL(water)
mean                 3.531     3.897         3.830     -0.303
standard deviation   1.413     1.726         0.948     1.639
variance             1.996     2.978         0.898     2.685
t value              -1.090                  -15.067
p-value of t-test    0.279 (same)            0.000 (not same)
F ratio              0.670                   2.685
p-value of F-test    0.221 (same)            0.001 (not same)

ERRONEOUS AND MISSING DATA

Errors in data
When using statistics, one should always consider that the data contain errors
• Random error (noise)
– Most common and always present
• Systematic error
– Imperfection in an experimental procedure
– Poorly calibrated instrument, incorrect use of volumetric glassware
• Gross error
– For example, caused by instrumental breakdown
– Negative concentrations, too high concentrations (e.g. 100 M)
Data with gross errors should be excluded from data analysis

Student   Results (ml)                        Average   Comment
A         10.08  10.11  10.09  10.10  10.12   10.10     Precise, biased
B         9.88   10.14  10.02  9.80   10.21   10.01     Imprecise, unbiased
C         10.19  9.79   9.69   10.05  9.78    9.90      Imprecise, biased
D         10.04  9.98   10.02  9.97   10.04   10.01     Precise, unbiased
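The comments in the table can be made quantitative: bias is the distance of the average from the true value, and precision is the spread of the replicates. The sketch below assumes a true value of 10.00 ml, which the comments imply but the slide does not state; the variable names are illustrative.

```python
import statistics

true_value = 10.00  # assumed true volume in ml; not stated on the slide

results = {
    "A": [10.08, 10.11, 10.09, 10.10, 10.12],
    "B": [9.88, 10.14, 10.02, 9.80, 10.21],
    "C": [10.19, 9.79, 9.69, 10.05, 9.78],
    "D": [10.04, 9.98, 10.02, 9.97, 10.04],
}

summary = {}
for student, values in results.items():
    bias = statistics.mean(values) - true_value  # systematic error
    spread = statistics.stdev(values)            # random error (precision)
    summary[student] = (bias, spread)
    print(f"{student}: bias = {bias:+.2f} ml, std. dev = {spread:.3f} ml")
```

A small standard deviation with a large bias (student A) corresponds to "precise, biased"; a large standard deviation with a small bias (student B) corresponds to "imprecise, unbiased".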

Random errors
• Affect precision – repeatability or
reproducibility
• Cause replicate results to fall on either side of
a mean value
• Can be estimated using replicate
measurements
• Can be minimised by good technique but not
eliminated
• Caused by both humans and equipment

Systematic errors
• Produce bias – an overall deviation of a result from
the true value even when random errors are very
small
• Cause all results to be affected in one sense only, all
too high or all too low
• Cannot be detected simply by using replicate
measurements
• Can be corrected, e.g. by using standard methods
and materials
• Caused by both humans and equipment

Missing data
• Data possess empty gaps
– Either empty cell or N/A
• Solutions
– Do nothing
– Substitute with variable mean
– Substitute with 0
– Remove corresponding object or variable
– Estimate with NIPALS, MI (Multiple Imputation)

name                 MOA   B96h   C48h   -OH
Methanol             1     0.32   1.86   1
Phenol               2     3.60   3.16   1
Ethanol              1     1.47   0.66   1
propan-1-ol          1     2.11   2.53   1
Benzene              1     2.13
pentan-1-ol          1     2.25   2.76   1
2-Propanone          1     0.90   1.22
Butan-1-ol           1     2.42   N/A    1
propan-2-ol          1     1.74   2.76   1
Aniline              2     3.28   2.57
Toluene              1     3.85   2.36
4-nitrophenol        2     4.22   3.75   1
2-butanone           1     2.15   1.38
hexan-1-ol           1     3.47   2.62   1
2,4-dichlorophenol   2     4.91   4.45   1
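Two of the simpler options from the list, substitution with the variable mean and removal of incomplete objects, can be sketched as follows. This is an illustration with made-up values: missing entries are represented by `None`, and the function names are hypothetical. NIPALS and multiple imputation are not shown.

```python
import statistics


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in column]


def drop_missing(rows):
    """Remove every object (row) that contains a missing value."""
    return [row for row in rows if None not in row]


column = [1.0, None, 3.0]                      # hypothetical variable with one gap
rows = [(1, 0.32, 1.86), (1, 2.42, None)]      # hypothetical objects

print(impute_mean(column))
print(drop_missing(rows))
```

Mean substitution keeps every object but reduces the variance of the variable; removing objects keeps the observed values untouched but reduces the size of the data set.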
