Innovative Approaches to
Uncover Multivariate Analysis
Guest Speaker for this Seminar
Weight (kg)
15 16 16 18 14
16 17 15 17 15
16 14 12 16 12
15 16 17 16 16
17 17 16 16 28
17 15 17 19 15
16 16 15 14 16
17 15 18 14 16
13 17 15 16 14
16 14 13 15 15
Mean 15.82
Max 28
Min 12
Range 16
Std Deviation 2.273988
Variance 5.17102
Median 16
Q1 15
Q3 16.75
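The summary values above can be reproduced from the 50 weights. The following is a minimal sketch, assuming NumPy is available; it uses the sample (n - 1) formulas for variance and standard deviation and linear interpolation for the quartiles, which appear to match the listed figures.

```python
import numpy as np

# The 50 weights (kg) from the table above, row by row.
weights = np.array([
    15, 16, 16, 18, 14,
    16, 17, 15, 17, 15,
    16, 14, 12, 16, 12,
    15, 16, 17, 16, 16,
    17, 17, 16, 16, 28,
    17, 15, 17, 19, 15,
    16, 16, 15, 14, 16,
    17, 15, 18, 14, 16,
    13, 17, 15, 16, 14,
    16, 14, 13, 15, 15,
])

mean = weights.mean()
value_range = weights.max() - weights.min()
variance = weights.var(ddof=1)   # sample variance, n - 1 in the denominator
std_dev = weights.std(ddof=1)    # sample standard deviation
# Quartiles with NumPy's default linear interpolation between order statistics.
q1, median, q3 = np.percentile(weights, [25, 50, 75])

print(f"Mean {mean:.2f}  Max {weights.max()}  Min {weights.min()}  Range {value_range}")
print(f"Std Deviation {std_dev:.6f}  Variance {variance:.5f}")
print(f"Median {median}  Q1 {q1}  Q3 {q3}")
```

The single value of 28 kg lies far from the rest of the data, which is why the range is much larger than the distance between Q1 and Q3.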
Bivariate Data
Working with Dataset (Two Variables)
Weight (kg) Height (cm)
15 120
16 130
16 128
15 122
17 135
16 129
17 134
13 114
14 119
15 119
14 118
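With two variables, the natural first question is how strongly they vary together. A minimal sketch, assuming NumPy, that computes the sample covariance and the Pearson correlation coefficient for the weight and height pairs listed above:

```python
import numpy as np

# Weight (kg) and height (cm) pairs from the table above.
weight_kg = np.array([15, 16, 16, 15, 17, 16, 17, 13, 14, 15, 14])
height_cm = np.array([120, 130, 128, 122, 135, 129, 134, 114, 119, 119, 118])

# Off-diagonal entries of the 2 x 2 matrices hold the cross terms.
covariance = np.cov(weight_kg, height_cm)[0, 1]      # sample covariance (n - 1)
pearson_r = np.corrcoef(weight_kg, height_cm)[0, 1]  # between -1 and +1

print(f"covariance = {covariance:.2f} kg*cm, Pearson r = {pearson_r:.3f}")
```

A value of r close to +1 indicates that heavier subjects in this table also tend to be taller.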
Multivariate Data Set
Vegetable Oil 3600 3594 3588 3583 3577 3571 3565 3559 3554 3548 3542 3536 3530 3525 3519 3513 3507 3501 3496 3490 3484 3478 3472 3467
Corn 0 4.67 14 6.33 2.33 13 6.67 5.33 4.33 4 3.67 2 -3.67 -11 -12.3 -5.33 -9 -12 -3.33 3 6 14 11.7 6.33
Corn -0.333 3 14 3.67 2.67 13.3 8 3 3.33 5 -2 -6 -4 -11.7 -14.3 -11.3 -13 -10.7 -4.33 -0.333 1.67 11 10 2.33
Corn -1 -3 -3.67 9.67 -1 1.33 -1.67 2.67 1 3 -3 -5.33 -4 -15.7 -14 -12.3 -7.67 -8.33 -2.67 2.33 2 7 12.7 6.67
Corn 2.67 3.33 11.3 5.33 3.67 13.3 6 1.33 4 5.33 2 -0.333 -3.67 -13.3 -11.3 -5 -8.67 -8.67 -2.33 3.33 9.67 13.7 13.3 5.67
Corn -0.667 3.67 11.7 0.667 -0.667 12.3 4.67 -1 0.333 0.667 -1.67 -6.33 -8.67 -15 -19 -14.3 -14 -16.3 -9.67 0 1.67 6 7.67 2.33
Corn 1.67 3.33 13.7 7.33 5.33 12.3 7.33 5.33 -0.667 2.67 -2.33 -4.67 -7.33 -13.7 -15.7 -9.67 -11.3 -10.7 -3 0.667 1.33 8.33 10.7 5
Corn 1.67 6.67 14.7 5 2 13 6.33 3.33 4.67 4.67 -1 -2.33 -5.33 -11.3 -14.3 -7.33 -9 -8.67 -5 0.333 5.67 11.7 9 3.67
Corn -1.67 0.333 8.33 -2.33 -2.33 9.67 2.67 -0.333 -3 -1.67 -6.67 -7.67 -10 -18 -22.3 -18.7 -16.3 -17.3 -11.3 -3 -1.67 1.67 1.67 -3.67
Corn -0.333 5.33 13.3 4 2 14.7 8.33 3 1 3.67 -3.67 -5.33 -6.33 -11 -15 -10.7 -10.3 -8 -3.67 1.33 4.67 12.7 10 3.33
Olive -1 -0.667 5.33 1.33 -3.33 6.67 2.67 0.333 0.667 0.667 -2.67 -0.667 -4.67 -9.67 -12 -9.33 -10.3 -8 -3.67 0.333 8 10.7 9 5
Olive -1.67 1.67 12 3.67 -1 11.7 5 4.33 2 0.667 -3 -3 -4 -9.67 -10.3 -7.33 -8.67 -10.3 -8.33 4.33 5.33 8.67 8.33 3
Olive 0.333 -1 8 2.33 1 9 5.33 4.33 2 -0.333 -5.67 -2.33 -1.67 -8.33 -11.3 -8.33 -9 -7 -3.67 3.67 7 12.7 12 2.33
Olive 0.333 1.67 12.7 7.33 3.33 15 8 3.67 4 2.67 -1.33 -3.67 -2.33 -9.33 -11.3 -7.33 -9.33 -9.33 -1.67 2.67 9 11.7 10 5.67
Olive -3.33 -2.33 6.33 -0.667 -7 6 0 -5.67 -6.33 -6.67 -10.7 -13.7 -15.7 -21.7 -24.3 -21.7 -20 -20 -15.7 -5.67 -3 -1 -2.33 -4.67
Olive 0.667 -0.333 12 3 2.33 12.3 5.67 2 -1.33 -2 -9.33 -7.33 -11 -15.7 -19.3 -15 -14.3 -12 -9.67 2.33 3.33 5.67 8 2.67
Olive -1 4.67 12.7 6.67 4.67 12 9.67 6.67 5.33 6.67 2.67 3 -0.333 -6.67 -10.3 -6 -7.67 -8 -0.667 3.67 9.33 15.3 13 7
Olive -2.33 -2 10.7 2.67 -1.67 15.3 8.67 3.67 4.33 2 -3.33 -3 -3.33 -9.67 -13 -12.7 -10.7 -11.7 -7.67 3.67 7.67 9 8 1.67
Olive -0.333 0.333 9 2.67 -0.333 10.3 8.33 6 2 1.33 -4.67 -4.33 -5 -11.7 -11.3 -10.7 -10.3 -8 -5 4.33 7 10.3 7.67 4
Olive -0.333 2.33 13.7 4.33 3 12 9 3.67 4.67 5 -2.33 -0.667 -1 -8.67 -9 -7 -4.67 -8.67 -0.667 4 10.3 12.7 14.3 5.67
Olive -3 1 13.3 2 -2 12.7 6.67 6 0.667 2 -3 -4.33 -4.33 -12.3 -14.7 -11.7 -10.3 -11.3 -7.33 1.33 5.67 9 8.33 2.33
Olive -2 -0.333 11 1 -2.67 9 5.33 2.67 -0.667 -3.33 -7.33 -5.33 -7.67 -9.67 -16.7 -11 -9.67 -8.67 -6 2.33 6.33 11.7 7 0.333
The world is multivariate
Multifactorial problems, data collection, multivariate mathematics / chemometrics
Example: Weather
Factor: Sensor
• Wind: Anemometer
• Air pressure: Barometer
• Temperature: Thermometer
• Dew point: Hygrometer
• Season: Calendar
Application of MVA: Indirect observations
Indirect measurements often need an MVA approach
Types of multivariate data - examples
• Process data
– raw material
– process variables (e.g. pressure, temperature)
– End-product quality measurements
• Sensory data
– description by panel and instrumental measurement
– consumer preference
A simple illustration of why multivariate analysis is needed
[Figure: temperature plotted against time/sample no. with its upper and lower limits, and pH plotted against time/sample no. with its lower and upper limits; the two pairs of limits bound the allowed univariate region.]
Variables highly correlated except for ...
Only when viewed multivariately can the problem be isolated.
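The idea can be sketched numerically. The example below is illustrative only: the temperature and pH values are synthetic (an assumption, not the seminar's data), the univariate limits are taken as mean ± 3 standard deviations, and the multivariate check uses the squared Mahalanobis distance with an approximate 99% chi-square limit.

```python
import numpy as np

# Synthetic reference data (an assumption, not the seminar's data):
# temperature and pH both follow one hidden process state, so they correlate.
rng = np.random.default_rng(0)
state = rng.standard_normal(100)
temperature = 37.5 + 3.0 * state + 0.5 * rng.standard_normal(100)
ph = 7.0 + 0.8 * state + 0.15 * rng.standard_normal(100)
reference = np.column_stack([temperature, ph])

mean = reference.mean(axis=0)
std = reference.std(axis=0, ddof=1)
cov = np.cov(reference, rowvar=False)

# New sample: temperature 1.5 SD high while pH is 1.5 SD low,
# which contradicts the usual positive correlation.
sample = mean + np.array([1.5, -1.5]) * std

# Univariate check: each variable against its own mean +/- 3 SD limits.
inside_univariate = bool(np.all(np.abs(sample - mean) <= 3.0 * std))

# Multivariate check: squared Mahalanobis distance from the reference centre.
diff = sample - mean
d2 = float(diff @ np.linalg.solve(cov, diff))
limit = -2.0 * np.log(0.01)  # 99% chi-square quantile for 2 variables (about 9.21)

print(f"inside both univariate limits: {inside_univariate}")
print(f"Mahalanobis d^2 = {d2:.1f}, limit = {limit:.2f}, flagged = {d2 > limit}")
```

The sample passes both univariate charts yet lies far from the correlation structure of the reference data, so only the multivariate distance flags it.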
Application of MVA :
Raw materials (API and Excipients) → Blending → Tableting → Coating → Packaging → Distribution
Variable raw material input, fixed process, variable output
Process Analytical Technology (PAT)
Raw materials (API and Excipients) → Blending → Tableting → Coating → Packaging → Distribution
Prediction; Check the results; Results
Multivariate Analysis Tools
X = Model + Error
Data = Structure + Noise
Extraction of Principal Components (PCs)
[Figure: samples/objects plotted in the space of the original variables, with the directions PC 1, PC 2 and PC 3.]
New latent variables that are linear combinations of the original variables.
PC1 = a1 V1 + a2 V2 + a4 V4 + a6 V6 + a7 V7 + a8 V8 + a9 V9 + a10 V10 + a11 V11 + a12 V12
PC2 = a1 V1 + a3 V3 + a5 V5 + a8 V8 + a10 V10
PC3 = a3 V3 + a5 V5 + a11 V11
[X] = b1 PC1 + b2 PC2 + b3 PC3 + Error
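One common way to extract the PCs is a singular value decomposition of the mean-centred data. The sketch below is a hedged illustration on a small hypothetical data matrix (6 samples, 4 variables, generated from 2 hidden factors); the names `loadings`, `scores` and `error` are chosen here for illustration and correspond to the weights a_j, the PC values per sample, and the Error term above.

```python
import numpy as np

# Hypothetical data: 6 samples x 4 variables built from 2 hidden factors plus noise.
rng = np.random.default_rng(1)
hidden = rng.standard_normal((6, 2))
mixing = np.array([[1.0, 0.8, 0.0, 0.5],
                   [0.0, 0.3, 1.0, -0.7]])
X = hidden @ mixing + 0.05 * rng.standard_normal((6, 4))

Xc = X - X.mean(axis=0)                  # mean-centre each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = Vt.T                          # column k holds the weights of PC k+1
scores = U * s                           # PC values per sample, equal to Xc @ loadings
explained = s**2 / np.sum(s**2)          # fraction of total variance per PC

k = 2                                    # keep the first two PCs
model = scores[:, :k] @ loadings[:, :k].T
error = Xc - model                       # X = Model + Error

print("explained variance per PC:", np.round(explained, 3))
print("residual sum of squares:", float(np.sum(error**2)))
```

Each score column is a linear combination of the original (centred) variables, and keeping more PCs can only shrink the error term.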
Interpretation of PCA Model
[Figure: score and loading plot, PC1 vs PC2. Samples are labelled A1 to E4; the plotted variables include color1, color2, Hardness, Mealines, Pea_Flav, Sweet, Fruity, Off_flav, skin and Whitenes. RESULT1, X-expl: 74%, 13%.]
Example data set: McDonalds data
Product                    Energy (kJ/g)  Protein (%)  Carbohydrates (%)  Fat (%)  Saturated Fat (%)
Big Mac                     9.54          12.48        19.62              11.00    4.02
Cheeseburger               10.54          13.79        26.00              10.19    4.41
Grilled Chicken             8.14          12.68        14.74               9.38    0.09
Hamburger                  10.04          12.68        28.79               7.96    2.54
McChicken                   9.24          11.37        19.32              10.80    1.76
McFeast                     8.64          11.67        14.64              11.31    4.51
Quarter Pounder w/cheese   10.54          15.71        17.93              12.93    6.38
Filet-O-Fish               11.84          10.87        26.39              14.86    2.84
Pommes Frites              12.24           5.02        37.06              13.54    2.05
Dessert
• Loadings can be visualized to map which variables have contributed to the score plot.
• Variables far away from the center are well described and important to the model.
• Variables near the center are less important.
[Figure: loading plot with the annotations "Protein content and carbohydrate content are anti-correlated", "High contribution on PC 1", "High contribution on PC 2" and "Not contributing".]
1 g protein = 17 kJ
1 g carbohydrates = 17 kJ
1 g fat = 38 kJ
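The three energy factors can be checked against the McDonalds table above with simple arithmetic. This is a rough sketch under two assumptions: the percentages are grams of nutrient per 100 g of product, and the first numeric column (kJ/g) is the energy content.

```python
# Rows copied from the McDonalds table above:
# (energy kJ/g, protein %, carbohydrates %, fat %)
products = {
    "Big Mac": (9.54, 12.48, 19.62, 11.00),
    "Cheeseburger": (10.54, 13.79, 26.00, 10.19),
    "Grilled Chicken": (8.14, 12.68, 14.74, 9.38),
    "Hamburger": (10.04, 12.68, 28.79, 7.96),
    "McChicken": (9.24, 11.37, 19.32, 10.80),
    "McFeast": (8.64, 11.67, 14.64, 11.31),
    "Quarter Pounder w/cheese": (10.54, 15.71, 17.93, 12.93),
    "Filet-O-Fish": (11.84, 10.87, 26.39, 14.86),
    "Pommes Frites": (12.24, 5.02, 37.06, 13.54),
}


def estimated_energy(protein_pct, carb_pct, fat_pct):
    """Energy in kJ per gram of product from the 17 / 17 / 38 kJ per gram factors."""
    return (17.0 * protein_pct + 17.0 * carb_pct + 38.0 * fat_pct) / 100.0


differences = {}
for name, (energy, protein, carbs, fat) in products.items():
    estimate = estimated_energy(protein, carbs, fat)
    differences[name] = estimate - energy
    print(f"{name:26s} table {energy:5.2f}  estimate {estimate:5.2f} kJ/g")
```

For the Big Mac, for instance, (17 × 12.48 + 17 × 19.62 + 38 × 11.00) / 100 ≈ 9.64 kJ/g against a tabulated 9.54 kJ/g; the other products agree to a similar degree.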
Superimposition of Score and Loading Plot : McDonalds data
• Low fat desserts
• High in proteins, low in carbohydrates: meat and fish
• High in carbohydrates, low in protein
– Several classes;
– Or one class vs. rest of the world.
[Figure label: Adulterated]
• Projection method
• SIMCA (Soft Independent Modelling of Class Analogies)
• PLS-DA (Partial Least Squares Discriminant Analysis)
• LDA (Linear Discriminant Analysis)
Regression modeling
What is regression modeling?
Problem, Idea, Experiment, Modeling, Prediction
Mass, m   Length, l
1.0       15.0
1.5       17.1
2.0       18.0
2.5       19.5
3.0       21.0
[Figure: length l plotted against mass m with the fitted line.]
Model: m = a + b × l, with a = -4 and b = +0.33
Prediction: l = 20.7 gives m = 2.9
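The model m = a + b × l can be fitted to the five mass and length measurements by least squares. A minimal sketch, assuming NumPy; the fitted values come out close to the rounded a = -4 and b = +0.33 quoted on the slide, and the prediction for l = 20.7 rounds to the same 2.9.

```python
import numpy as np

# Mass and length measurements from the table above.
mass = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
length = np.array([15.0, 17.1, 18.0, 19.5, 21.0])

# Fit m = a + b * l; polyfit returns the highest-degree coefficient first.
b, a = np.polyfit(length, mass, deg=1)

# Use the model to predict the mass for a newly measured length.
predicted_mass = a + b * 20.7

print(f"a = {a:.2f}, b = {b:.3f}, m(l = 20.7) = {predicted_mass:.1f}")
# prints: a = -4.22, b = 0.343, m(l = 20.7) = 2.9
```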
Linear regression
Uni-variate regression
y = b0 + b1*x
Least squares criterion
x: predictor variable
y: response variable
b0: Intercept
b1: Slope
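For reference, the least squares criterion chooses b0 and b1 so that the sum of squared vertical distances between the observed and fitted y values is as small as possible. Written out for n samples, with x̄ and ȳ denoting the sample means (notation introduced here, not on the slide), the standard closed-form solution is:

```latex
\min_{b_0,\, b_1} \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2
\quad\Longrightarrow\quad
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```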
Multivariate regression
Examples:
Job satisfaction = b0 + b1*(type of work) + b2*(boss) + b3*(colleagues)
Approximation
Regression methods
MLR – Multiple Linear Regression:
A classical method still in use that relates one single response variable (Y) to a small number of variables (X).
PCR – Principal Component Regression:
Perform PCA on the X matrix, then use MLR to relate the Y-variable to the scores from the PCA of X.
PLS – Partial Least Squares Regression:
Model both X and Y - matrices simultaneously: PCy = f(PCx), u = f(t), where t are the X scores and u are the Y scores.
[Figures: Y against X1 and X2 (MLR); Y against PC1 and PC2 (PCR); score t in the X space (X1, X2, X3) related to score u in the Y space (Y1, Y2, Y3) (PLS).]
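The difference between MLR and PCR can be sketched with plain NumPy on the McDonalds table shown earlier. This is an illustrative sketch only: the variables are mean-centred but not scaled (an assumption), the first numeric column is taken as the response, and PLS itself is not implemented here.

```python
import numpy as np

# McDonalds table: energy (kJ/g), then protein, carbohydrates, fat, saturated fat (%).
data = np.array([
    [9.54, 12.48, 19.62, 11.00, 4.02],   # Big Mac
    [10.54, 13.79, 26.00, 10.19, 4.41],  # Cheeseburger
    [8.14, 12.68, 14.74, 9.38, 0.09],    # Grilled Chicken
    [10.04, 12.68, 28.79, 7.96, 2.54],   # Hamburger
    [9.24, 11.37, 19.32, 10.80, 1.76],   # McChicken
    [8.64, 11.67, 14.64, 11.31, 4.51],   # McFeast
    [10.54, 15.71, 17.93, 12.93, 6.38],  # Quarter Pounder w/cheese
    [11.84, 10.87, 26.39, 14.86, 2.84],  # Filet-O-Fish
    [12.24, 5.02, 37.06, 13.54, 2.05],   # Pommes Frites
])
y = data[:, 0]
X = data[:, 1:]
Xc = X - X.mean(axis=0)   # mean-centre predictors
yc = y - y.mean()         # mean-centre response


def mlr(Xc, yc):
    """MLR: least squares coefficients relating centred X directly to centred y."""
    return np.linalg.lstsq(Xc, yc, rcond=None)[0]


def pcr(Xc, yc, n_components):
    """PCR: PCA on X, MLR on the leading scores, coefficients mapped back to X."""
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T              # variables x components
    scores = Xc @ loadings                      # samples x components
    q = np.linalg.lstsq(scores, yc, rcond=None)[0]
    return loadings @ q


b_mlr = mlr(Xc, yc)
b_pcr2 = pcr(Xc, yc, 2)   # two components: a compressed, more stable model
b_pcr4 = pcr(Xc, yc, 4)   # all four components: reproduces the MLR solution

print("MLR coefficients:      ", np.round(b_mlr, 3))
print("PCR (2 PCs) coefficients:", np.round(b_pcr2, 3))
```

With all components retained, PCR gives the same coefficients as MLR; the methods differ only when some components are discarded.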
Interpreting Regression Model
[Figure: score plot, PC1 vs PC2 (both axes from -3 to 3); samples are labelled by diet: Salmon, Potato, Blue Mussel, Cod. RESULT2, X-expl: 24%, 19%; Y-expl: 76%, 9%.]
Sex and BMI (two X variables) are strongly correlated.
EPA (Y variable) is correlated with the diet Salmon.
The X and Y-loading plot is useful to understand the correlations between the explanatory and response variables.
[Figure: correlation loadings (X and Y), PC1 vs PC2, showing Age (X variable), BMI, Sex, Individual, EPA_min7, Hormon, Salmon, Potato, Blue Mussel and Cod. RESULT2, X-expl: 24%, 19%; Y-expl: 76%, 9%.]
[Figure: predicted Y vs measured Y, with the regression line (actual fit) and the target line (optimal fit).]
Interpreting Regression Model
Example data set: McDonalds data
Product                    Energy (kJ/g)  Protein (%)  Carbohydrates (%)  Fat (%)  Saturated Fat (%)
Big Mac                     9.54          12.48        19.62              11.00    4.02
Cheeseburger               10.54          13.79        26.00              10.19    4.41
Grilled Chicken             8.14          12.68        14.74               9.38    0.09
Hamburger                  10.04          12.68        28.79               7.96    2.54
McChicken                   9.24          11.37        19.32              10.80    1.76
McFeast                     8.64          11.67        14.64              11.31    4.51
Quarter Pounder w/cheese   10.54          15.71        17.93              12.93    6.38
Filet-O-Fish               11.84          10.87        26.39              14.86    2.84
Pommes Frites              12.24           5.02        37.06              13.54    2.05
Dessert
[Slide label: Validation set]
2 latent components: RMSEP = 0.18 kJ/g, R2 = 0.991
B coefficients
1 g protein = 17 kJ
1 g carbohydrates = 17 kJ
1 g fat = 38 kJ
Variable not to be taken into account
2 latent components: RMSEP = 0.12 kJ/g, R2 = 0.996
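How regression coefficients and a validation error of this kind are obtained can be sketched with NumPy. Note the assumptions: ordinary least squares is used as a stand-in for the PLS model, leave-one-out cross-validation replaces the slide's validation set, and saturated fat is assumed to be the variable left out. The slide's RMSEP and R2 values are therefore not expected to be reproduced exactly.

```python
import numpy as np

# McDonalds table: energy (kJ/g), protein, carbohydrates, fat (%).
# Saturated fat is omitted (assumed to be the variable not taken into account).
data = np.array([
    [9.54, 12.48, 19.62, 11.00],   # Big Mac
    [10.54, 13.79, 26.00, 10.19],  # Cheeseburger
    [8.14, 12.68, 14.74, 9.38],    # Grilled Chicken
    [10.04, 12.68, 28.79, 7.96],   # Hamburger
    [9.24, 11.37, 19.32, 10.80],   # McChicken
    [8.64, 11.67, 14.64, 11.31],   # McFeast
    [10.54, 15.71, 17.93, 12.93],  # Quarter Pounder w/cheese
    [11.84, 10.87, 26.39, 14.86],  # Filet-O-Fish
    [12.24, 5.02, 37.06, 13.54],   # Pommes Frites
])
y = data[:, 0]
X = np.column_stack([np.ones(len(y)), data[:, 1:]])  # intercept + 3 predictors

# Fit on all samples.
coef = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ coef
rmse_fit = np.sqrt(np.mean((y - fitted) ** 2))
r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

# Leave-one-out cross-validation: refit without sample i, then predict sample i.
loo_errors = []
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    coef_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo_errors.append(y[i] - X[i] @ coef_i)
rmse_cv = np.sqrt(np.mean(np.square(loo_errors)))

# Coefficients are per percentage point; multiply by 100 to get kJ per gram of nutrient.
kj_per_gram = 100.0 * coef[1:]

print("kJ per g (protein, carbohydrates, fat):", np.round(kj_per_gram, 1))
print(f"R2 = {r2:.4f}, RMSE (fit) = {rmse_fit:.3f}, RMSE (leave-one-out) = {rmse_cv:.3f} kJ/g")
```

The scaled coefficients can be compared with the 17, 17 and 38 kJ per gram listed above, and the cross-validated error is always at least as large as the error on the samples used for fitting.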
Validating the PLS model of Energy on the Unused McDonald’s data
Conclusions on PLS