
Seminar on

Innovative Approaches to
Uncover Multivariate Analysis
Guest Speaker for this Seminar

Mr. Ramnendra Mandloi is a Mechanical Engineer and Six Sigma Master Black Belt with a Master of Technology from IIT Delhi and over 23 years of industrial experience with large manufacturing organizations across the automotive, biotech and pharmaceutical industries. His work includes executing process improvement using Quality Management Systems, Six Sigma, Design of Experiments, Multivariate Data Analysis and the automotive core tools (FMEA, SPC, MSA, etc.).
Univariate Data

Weight (kg)
15 16 16 18 14
16 17 15 17 15
16 14 12 16 12
15 16 17 16 16
17 17 16 16 28
17 15 17 19 15
16 16 15 14 16
17 15 18 14 16
13 17 15 16 14
16 14 13 15 15

Mean 15.82
Max 28
Min 12
Range 16
Std Deviation 2.273988
Variance 5.17102
Median 16
Q1 15
Q3 16.75
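The summary statistics above can be reproduced with a few lines of NumPy; this is a sketch, with the 50 weights typed in row by row from the table:

```python
import numpy as np

# The 50 weight measurements (kg) from the table, row by row
weights = np.array([
    15, 16, 16, 18, 14,
    16, 17, 15, 17, 15,
    16, 14, 12, 16, 12,
    15, 16, 17, 16, 16,
    17, 17, 16, 16, 28,
    17, 15, 17, 19, 15,
    16, 16, 15, 14, 16,
    17, 15, 18, 14, 16,
    13, 17, 15, 16, 14,
    16, 14, 13, 15, 15,
], dtype=float)

print("Mean   ", weights.mean())                 # 15.82
print("Max    ", weights.max())                  # 28
print("Min    ", weights.min())                  # 12
print("Range  ", weights.max() - weights.min())  # 16
print("Std    ", weights.std(ddof=1))            # ~2.274 (sample std)
print("Var    ", weights.var(ddof=1))            # ~5.171
print("Median ", np.median(weights))             # 16
print("Q1     ", np.percentile(weights, 25))     # 15
print("Q3     ", np.percentile(weights, 75))     # 16.75
```

Note the use of `ddof=1` for the sample (rather than population) standard deviation and variance, which is what the table reports.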
Bivariate Data

Working with a Dataset (Two Variables)

Weight (kg)   Height (cm)
15 120
16 130
16 128
15 122
17 135
16 129
17 134
13 114
14 119
15 119
14 118
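With two variables we can quantify how strongly they move together; a minimal sketch using the weight/height pairs from the table above:

```python
import numpy as np

# Weight (kg) and height (cm) pairs from the table
weight = np.array([15, 16, 16, 15, 17, 16, 17, 13, 14, 15, 14], dtype=float)
height = np.array([120, 130, 128, 122, 135, 129, 134, 114, 119, 119, 118], dtype=float)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(weight, height)[0, 1]
print(f"correlation r = {r:.3f}")  # strongly positive: heavier goes with taller
```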
Multivariate Data Set: Vegetable Oil

Oil    3600 3594 3588 3583 3577 3571 3565 3559 3554 3548 3542 3536 3530 3525 3519 3513 3507 3501 3496 3490 3484 3478 3472 3467
Corn 0 4.67 14 6.33 2.33 13 6.67 5.33 4.33 4 3.67 2 -3.67 -11 -12.3 -5.33 -9 -12 -3.33 3 6 14 11.7 6.33
Corn -0.333 3 14 3.67 2.67 13.3 8 3 3.33 5 -2 -6 -4 -11.7 -14.3 -11.3 -13 -10.7 -4.33 -0.333 1.67 11 10 2.33
Corn -1 -3 -3.67 9.67 -1 1.33 -1.67 2.67 1 3 -3 -5.33 -4 -15.7 -14 -12.3 -7.67 -8.33 -2.67 2.33 2 7 12.7 6.67
Corn 2.67 3.33 11.3 5.33 3.67 13.3 6 1.33 4 5.33 2 -0.333 -3.67 -13.3 -11.3 -5 -8.67 -8.67 -2.33 3.33 9.67 13.7 13.3 5.67
Corn -0.667 3.67 11.7 0.667 -0.667 12.3 4.67 -1 0.333 0.667 -1.67 -6.33 -8.67 -15 -19 -14.3 -14 -16.3 -9.67 0 1.67 6 7.67 2.33
Corn 1.67 3.33 13.7 7.33 5.33 12.3 7.33 5.33 -0.667 2.67 -2.33 -4.67 -7.33 -13.7 -15.7 -9.67 -11.3 -10.7 -3 0.667 1.33 8.33 10.7 5
Corn 1.67 6.67 14.7 5 2 13 6.33 3.33 4.67 4.67 -1 -2.33 -5.33 -11.3 -14.3 -7.33 -9 -8.67 -5 0.333 5.67 11.7 9 3.67
Corn -1.67 0.333 8.33 -2.33 -2.33 9.67 2.67 -0.333 -3 -1.67 -6.67 -7.67 -10 -18 -22.3 -18.7 -16.3 -17.3 -11.3 -3 -1.67 1.67 1.67 -3.67
Corn -0.333 5.33 13.3 4 2 14.7 8.33 3 1 3.67 -3.67 -5.33 -6.33 -11 -15 -10.7 -10.3 -8 -3.67 1.33 4.67 12.7 10 3.33
Olive -1 -0.667 5.33 1.33 -3.33 6.67 2.67 0.333 0.667 0.667 -2.67 -0.667 -4.67 -9.67 -12 -9.33 -10.3 -8 -3.67 0.333 8 10.7 9 5
Olive -1.67 1.67 12 3.67 -1 11.7 5 4.33 2 0.667 -3 -3 -4 -9.67 -10.3 -7.33 -8.67 -10.3 -8.33 4.33 5.33 8.67 8.33 3
Olive 0.333 -1 8 2.33 1 9 5.33 4.33 2 -0.333 -5.67 -2.33 -1.67 -8.33 -11.3 -8.33 -9 -7 -3.67 3.67 7 12.7 12 2.33
Olive 0.333 1.67 12.7 7.33 3.33 15 8 3.67 4 2.67 -1.33 -3.67 -2.33 -9.33 -11.3 -7.33 -9.33 -9.33 -1.67 2.67 9 11.7 10 5.67
Olive -3.33 -2.33 6.33 -0.667 -7 6 0 -5.67 -6.33 -6.67 -10.7 -13.7 -15.7 -21.7 -24.3 -21.7 -20 -20 -15.7 -5.67 -3 -1 -2.33 -4.67
Olive 0.667 -0.333 12 3 2.33 12.3 5.67 2 -1.33 -2 -9.33 -7.33 -11 -15.7 -19.3 -15 -14.3 -12 -9.67 2.33 3.33 5.67 8 2.67
Olive -1 4.67 12.7 6.67 4.67 12 9.67 6.67 5.33 6.67 2.67 3 -0.333 -6.67 -10.3 -6 -7.67 -8 -0.667 3.67 9.33 15.3 13 7
Olive -2.33 -2 10.7 2.67 -1.67 15.3 8.67 3.67 4.33 2 -3.33 -3 -3.33 -9.67 -13 -12.7 -10.7 -11.7 -7.67 3.67 7.67 9 8 1.67
Olive -0.333 0.333 9 2.67 -0.333 10.3 8.33 6 2 1.33 -4.67 -4.33 -5 -11.7 -11.3 -10.7 -10.3 -8 -5 4.33 7 10.3 7.67 4
Olive -0.333 2.33 13.7 4.33 3 12 9 3.67 4.67 5 -2.33 -0.667 -1 -8.67 -9 -7 -4.67 -8.67 -0.667 4 10.3 12.7 14.3 5.67
Olive -3 1 13.3 2 -2 12.7 6.67 6 0.667 2 -3 -4.33 -4.33 -12.3 -14.7 -11.7 -10.3 -11.3 -7.33 1.33 5.67 9 8.33 2.33
Olive -2 -0.333 11 1 -2.67 9 5.33 2.67 -0.667 -3.33 -7.33 -5.33 -7.67 -9.67 -16.7 -11 -9.67 -8.67 -6 2.33 6.33 11.7 7 0.333
The world is multivariate

All real processes are multivariate

Multifactorial problems → Data collection → Multivariate mathematics / chemometrics

Example: Weather
Factors            Sensors
• Wind             • Anemometer
• Air pressure     • Barometer
• Temperature      • Thermometer
• Dew point        • Hygrometer
• Season           • Calendar
Application of MVA : Indirect observations

How do we measure the temperature of a furnace load?

Direct measurement with a thermometer, or indirect measurement with an IR emission spectrometer.

→ Indirect measurements often need an MVA approach
Types of multivariate data - examples

• Process data
– raw material
– process variables (e.g. pressure, temperature)
– End-product quality measurements

• Spectral and image data


– near-infrared (NIR)
– excitation-emission matrix (EEM)
fluorescence
– magnetic resonance imaging (MRI)

• Sensory data
– descriptive analysis by a panel and by instruments
– consumer preference
A simple illustration of why multivariate analysis is needed

[Figure: two univariate control charts, Temperature vs. time/sample no. and pH vs. time/sample no., each staying within its upper and lower limits, plus a bivariate plot of the allowed univariate region.]

The two variables are highly correlated except for one sample. That sample lies within each univariate limit yet outside the correlated region; only when viewed multivariately can the problem be isolated.
Application of MVA :

Manufacturing Process Environment

o Manufacturing processes are complex and composed of unit operations
o Each unit operation is associated with different quality issues
o Each unit operation has its own quality monitoring and improvement measures
o Traditional measures:
  • Check the quality of raw materials
  • Check the quality of the final product (dosage forms)
An Example: Pharmaceutical Manufacturing: Tablet

Raw materials (API and excipients) → Blending → Tableting → Coating → Packaging → Distribution

Traditional methods usually involve quality checks at the end of each unit operation.

What's wrong?
• Long lead times due to breaks at the end of unit operations for quality checks
• High potential for non-conforming products
• Leads to high inventories and unwanted warehouse capacity
• Same procedures for processes using different batches of raw materials (e.g., fixed blending time, drying time). Consequently, this leads to variability in the quality of the final product.

Variable raw material input → Fixed process → Variable output
Process Analytical Technology (PAT)

Raw materials (API and excipients) → Blending → Tableting → Coating → Packaging → Distribution

PAT is a system for designing, analyzing, and controlling manufacturing through timely measurements (i.e., during processing) of critical quality and performance attributes of raw and in-process materials and processes, with the goal of ensuring final product quality.

Variable raw material input → Variable (adapted) process → Consistent output

Continuous Quality monitoring

An initiative by the U.S. Food and Drug Administration, 2004


MVA tools and purposes

Exploratory data analysis (PCA, cluster analysis)
– Relations among one block of variables
– Map of samples
– Groups, patterns, outliers

→ Describe the quality and structure of X
MVA tools and purposes

Classification / Discrimination
• Classify the samples into groups

[Diagram: samples in X assigned to Group A or Group B]
MVA tools and purposes

Regression methods (MLR, PCR, PLS)
– Relations between two blocks of variables
– Map of samples (PCR, PLS)
– Model: Y = XB
• Relate measurements to reference values
• Indirect measurements

Calibration and prediction:
• Calibrate a model: X, Y → Model
• Predict new values for new samples: Xnew, Model → Ŷ
Multivariate Analysis workflow

1. Collect the data (sampling, acquisition)
2. Check the data (statistics, PCA)
3. Make a model (MVA, calibration)
4. Check the model (validation)
5. Apply the model to new samples (prediction)
6. Check the results
Multivariate Analysis Tools

• Principal Component Analysis
• Classification
• Multivariate Regression
Principal Component Analysis
Use of Principal Component Analysis (PCA)

• Exploratory data analysis
• Information extraction
• Noise removal
• Dimensionality reduction

X = Model + Error
Data = Structure + Noise
Extraction of Principal Components (PCs)

[Diagram: a data matrix (samples/objects × variables) decomposed into PC 1, PC 2 and PC 3.]

The PCs are new latent variables that are linear combinations of the original variables, e.g.:
PC1 = a1·V1 + a2·V2 + a4·V4 + a6·V6 + a7·V7 + a8·V8 + a9·V9 + a10·V10 + a11·V11 + a12·V12
PC2 = a1·V1 + a3·V3 + a5·V5 + a8·V8 + a10·V10
PC3 = a3·V3 + a5·V5 + a11·V11
[X] = b1·PC1 + b2·PC2 + b3·PC3 + Error
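The decomposition of X into scores, loadings and error can be sketched with plain NumPy (SVD on the mean-centered matrix); the small random matrix here is a stand-in for real data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))          # 20 samples x 6 variables (stand-in data)

Xc = X - X.mean(axis=0)               # mean-center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3                                 # keep 3 principal components
scores = U[:, :k] * s[:k]             # sample coordinates on the PCs (T)
loadings = Vt[:k].T                   # variable weights, one column per PC (P)

X_hat = scores @ loadings.T           # rank-3 model of X ("Structure")
error = Xc - X_hat                    # residual ("Noise")

explained = s**2 / np.sum(s**2)       # variance explained by each PC
print("explained variance per PC:", np.round(explained[:k], 3))
```

The loadings columns are orthonormal and the explained-variance ratios over all PCs sum to 1, which is a quick sanity check on any PCA implementation.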
Interpretation of PCA Model

[Bi-plot of PC1 vs. PC2 (RESULT1; X-explained: 74%, 13%) showing sample scores (A1–E4) superimposed with variable loadings: Hardness, Mealiness, Pea_Flav, Sweet, Fruity, Off_flav, Skin, Slonk, Whiteness, color1, color2, color3.]
Example data set: McDonalds data

13 products found at McDonald's restaurants (objects)
5 nutrition values (variables)

Products / Nutrition   Energy (kJ/g)   Protein (%)   Carbohydrates (%)   Fat (%)   Saturated Fat (%)
Hamburgers with meat
Big Mac 9.54 12.48 19.62 11.00 4.02
Cheeseburger 10.54 13.79 26.00 10.19 4.41
Grilled Chicken 8.14 12.68 14.74 9.38 0.09
Hamburger 10.04 12.68 28.79 7.96 2.54
McChicken 9.24 11.37 19.32 10.80 1.76
McFeast 8.64 11.67 14.64 11.31 4.51
Quarter Pounder w/cheese 10.54 15.71 17.93 12.93 6.38
Filet-O-Fish 11.84 10.87 26.39 14.86 2.84
Pommes Frites 12.24 5.02 37.06 13.54 2.05
Desserts
Apple Pie 11.54 2.70 32.17 15.06 4.31
Sundae Chocolate 7.84 4.31 28.39 6.14 4.41
Sundae Strawberry 6.84 3.60 28.29 3.50 2.45
Sundae Caramel 7.84 4.01 31.68 4.62 3.13
Score plot from McDonalds data

• McDonald's fast food products displayed as a score plot.
• The purpose is to describe the products according to their nutrition content.
• The relative positions of the products reflect their similarities or differences; the origin of the plot corresponds to the average sample.
Loading plot from McDonalds data

• Loadings can be visualized to map which variables have contributed to the score plot.
• Variables far away from the center are well described and important.
• Variables near the center are less important.

From the plot: protein content and carbohydrate content are anti-correlated and contribute strongly to PC 1; fat has a high contribution on PC 2; saturated fat does not contribute to the model.

(For reference: 1 g protein = 17 kJ, 1 g carbohydrate = 17 kJ, 1 g fat = 38 kJ.)
Superimposition of Score and Loading Plot : McDonalds data
[Score and loading plots superimposed, revealing the groups:]
– Low-fat desserts
– High in protein, low in carbohydrates: meat and fish
– High in carbohydrates, low in protein
– High in fat and energy

We can explain the grouping of the samples thanks to the information from the loading plot.
Conclusions on PCA

• McDonald's products can be grouped by nutritional composition:
– Low-fat desserts: sundaes
– High-carbohydrate, high-fat products: apple pie and French fries
– High protein content: burgers with meat
– High fat content: burger with fried fish
• Some variables are strongly correlated:
– Positively: fat content and energy
– Negatively: carbohydrate and protein content
• One variable does not discriminate the samples: saturated fat
CLASSIFICATION
Classification Examples

• Recognize adulterated wine samples


– Sugar is added in order to increase ethanol content;
– But IR spectroscopy can detect the change in wine composition.
– Can we train a model to distinguish between “natural” wines and adulterated
samples?
• Detect whether meat has been frozen
– In many countries, regulations distinguish between fresh meat and frozen-then-thawed meat; yet some retailers cheat.
– Can we detect whether a meat sample has been frozen, with simple and rapid measurements?
• Classify raw material samples according to their origin
– Depending on the geographical origin of mineral, vegetable or animal raw materials, their composition is likely to vary a lot, thus influencing the properties of the end-product;
– If these variations are important, we should adjust the recipe or processing conditions to take these differences into account, so as to level out the final product quality.
– Do all geographical sources differ? Is it possible to identify a few major groups and recognize future samples from these sources?
Classification

• Predict a category variable (class membership)
– Several classes;
– Or one class vs. the rest of the world.

• Classify new samples
– For each class: recognize members; reject non-members.

[Diagram: a model built on "natural" and adulterated wines assigns a new wine sample to either the Natural or the Adulterated class.]
Classification Methods

• Projection methods:
• SIMCA (Soft Independent Modelling of Class Analogies)
• PLS-DA (Partial Least Squares – Discriminant Analysis)
• LDA (Linear Discriminant Analysis)
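The "recognize members / reject non-members" idea can be illustrated with a deliberately simple nearest-centroid classifier. This is a stand-in sketch, not SIMCA, PLS-DA or LDA themselves, and the two Gaussian clusters are invented data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Invented training data: two classes in a 2-variable space
class_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))  # e.g. "natural"
class_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(30, 2))  # e.g. "adulterated"

# One centroid (mean vector) per class
centroids = {"A": class_a.mean(axis=0), "B": class_b.mean(axis=0)}

def classify(x, centroids):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(classify(np.array([0.2, -0.1]), centroids))  # lands near class A
print(classify(np.array([2.8, 3.1]), centroids))   # lands near class B
```

The real methods listed above refine this idea: SIMCA builds a separate PCA model per class, while LDA and PLS-DA find directions that best separate the classes.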
Regression modeling
What is regression modeling?

Problem → Idea → Experiment → Modeling → Prediction

Experimental data:
Mass, m   Length, l
1.0       15.0
1.5       17.1
2.0       18.0
2.5       19.5
3.0       21.0

Modeling: fit m = a + b × l, giving a = -4 and b = +0.33.
Prediction: for a new length l = 20.7, the model gives m = a + b × l ≈ 2.9.
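The mass/length fit above can be reproduced with NumPy's least squares polynomial fit; the coefficients come out close to the slide's rounded a = -4, b = 0.33:

```python
import numpy as np

mass = np.array([1.0, 1.5, 2.0, 2.5, 3.0])          # m
length = np.array([15.0, 17.1, 18.0, 19.5, 21.0])   # l

# Fit m = a + b * l by least squares (polyfit returns slope first)
b, a = np.polyfit(length, mass, 1)
print(f"a = {a:.2f}, b = {b:.3f}")   # roughly a = -4.2, b = 0.343

m_pred = a + b * 20.7                # predict mass for l = 20.7
print(f"predicted m = {m_pred:.2f}") # ~2.9, as on the slide
```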
Linear regression

Univariate regression:

y = b0 + b1·x   (fitted using the least squares criterion)

x: predictor variable
y: response variable
b0: intercept
b1: slope
Multivariate regression

A general regression model:

y = b0 + b1·x1 + b2·x2 + ... + bk·xk + f

In matrix form: y = Xb + f, with least squares solution b = (XᵀX)⁻¹Xᵀy (full rank required!).

Example:
Job satisfaction = b0 + b1·(type of work) + b2·(boss) + b3·(colleagues)

A multivariate regression model relates several x-variables (X) to one or several y-variables (Y).
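The matrix solution b = (XᵀX)⁻¹Xᵀy can be written directly in NumPy. This sketch uses synthetic data with known coefficients so the recovery can be checked; in practice np.linalg.lstsq is numerically safer than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
X = np.column_stack([np.ones(n),              # intercept column (b0)
                     rng.normal(size=n),      # x1
                     rng.normal(size=n)])     # x2
b_true = np.array([1.0, 2.0, -0.5])           # b0, b1, b2
y = X @ b_true + 0.01 * rng.normal(size=n)    # small error term f

# Normal-equation solution (requires full-rank X)
b_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(np.round(b_hat, 2))                     # close to [1.0, 2.0, -0.5]

# Equivalent but numerically preferable:
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```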
Modeling stages

• Calibration data: data used to build the model between predictors (Xcal) and responses (Ycal); both are known.
• Validation data: data used to test how the model works for new data (Xval, Yval).
• Prediction data: data without known response values; the model produces an approximation Ŷ from Xnew.
Regression methods

MLR – Multiple Linear Regression:
A classical method still in use that relates one single response variable (Y) to a small number of X-variables (X1, X2, ...).

PCR – Principal Component Regression:
Perform PCA on the X matrix, then use MLR to relate the Y-variable to the scores from the PCA of X: PCy = f(PCx).

PLS – Partial Least Squares Regression:
Model both the X and Y matrices simultaneously, relating the Y-scores (u) to the X-scores (t): u = f(t).
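PCR's two steps (PCA on X, then MLR on the scores) fit in a few lines of NumPy. This is a sketch on synthetic low-rank data, not a full implementation with preprocessing and validation:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, k = 40, 10, 3                       # samples, variables, PCs kept

# Synthetic low-rank X: 3 latent factors observed through 10 noisy variables
T_true = rng.normal(size=(n, k))
P_true = rng.normal(size=(k, p))
X = T_true @ P_true + 0.05 * rng.normal(size=(n, p))
y = T_true @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=n)

# Step 1: PCA on mean-centered X
x_mean, y_mean = X.mean(axis=0), y.mean()
U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
T = U[:, :k] * s[:k]                      # scores on the first k PCs

# Step 2: MLR of centered y on the scores
q, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)

def predict(X_new):
    """Project new samples onto the PCs, then apply the score regression."""
    T_new = (X_new - x_mean) @ Vt[:k].T
    return y_mean + T_new @ q

rmse = np.sqrt(np.mean((predict(X) - y) ** 2))
print("fit RMSE:", rmse)
```

Because the scores are orthogonal, the MLR step is well conditioned even when the original X-variables are strongly correlated, which is the main motivation for PCR over plain MLR.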
Interpreting Regression Model

[Score plot of PC1 vs. PC2 (RESULT2; X-explained: 24%, 19%; Y-explained: 76%, 9%) showing individuals grouped by diet: Salmon, Cod, Blue Mussel and Potato.]

[Correlation loadings plot of PC1 vs. PC2 (X and Y; RESULT2; X-explained: 24%, 19%; Y-explained: 76%, 9%) with X-variables Sex, BMI, Age and Hormone, Y-variable EPA, and diet groups Salmon, Cod, Blue Mussel and Potato. A predicted-vs-measured plot compares the regression line (actual fit) with the target line (optimal fit).]

Sex and BMI (two X-variables) are strongly correlated with the diet; EPA (the Y-variable) is correlated with salmon in the diet. The X- and Y-loading plot is useful for understanding the correlations between the explanatory and the response variables.
Interpreting Regression Model
Example data set: McDonalds data

13 products found at McDonald's restaurants (objects)
5 nutrition values (variables)

Products / Nutrition   Energy (kJ/g)   Protein (%)   Carbohydrates (%)   Fat (%)   Saturated Fat (%)
Hamburgers with meat
Big Mac 9.54 12.48 19.62 11.00 4.02
Cheeseburger 10.54 13.79 26.00 10.19 4.41
Grilled Chicken 8.14 12.68 14.74 9.38 0.09
Hamburger 10.04 12.68 28.79 7.96 2.54
McChicken 9.24 11.37 19.32 10.80 1.76
McFeast 8.64 11.67 14.64 11.31 4.51
Quarter Pounder w/cheese 10.54 15.71 17.93 12.93 6.38
Filet-O-Fish 11.84 10.87 26.39 14.86 2.84
Pommes Frites 12.24 5.02 37.06 13.54 2.05
Desserts
Apple Pie 11.54 2.70 32.17 15.06 4.31
Sundae Chocolate 7.84 4.31 28.39 6.14 4.41
Sundae Strawberry 6.84 3.60 28.29 3.50 2.45
Sundae Caramel 7.84 4.01 31.68 4.62 3.13

(Validation set: McChicken, McFeast, Sundae Caramel.)
Choose the number of Latent variables

10 samples in the training set
3 samples to predict:
• Sundae Caramel
• McChicken
• McFeast

Only 2 latent variables (LVs) are needed to predict the "Energy" response.
Loading and Score plots of the PLS on McDonald’s data
PLS regression on McDonald’s data

2 latent components
RMSEP = 0.18 kJ/g
R² = 0.991

Checking for outliers
Refining the PLS model on McDonald’s data

B coefficients
(1 g protein = 17 kJ, 1 g carbohydrate = 17 kJ, 1 g fat = 38 kJ)

Saturated fat: variable not to be taken into account.

2 latent components
RMSEP = 0.12 kJ/g
R² = 0.996
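The two figures of merit quoted above, RMSEP and R², are simple to compute once validation-set predictions are available. The numbers below are invented for illustration, not the actual slide predictions:

```python
import numpy as np

# Hypothetical measured vs. predicted Energy values (kJ/g) for held-out samples
y_meas = np.array([7.84, 9.24, 8.64])   # reference values
y_pred = np.array([7.95, 9.10, 8.70])   # model predictions (invented)

# Root Mean Square Error of Prediction
rmsep = np.sqrt(np.mean((y_meas - y_pred) ** 2))

# Coefficient of determination
ss_res = np.sum((y_meas - y_pred) ** 2)
ss_tot = np.sum((y_meas - y_meas.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSEP = {rmsep:.3f} kJ/g, R2 = {r2:.3f}")
```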
Validating the PLS model of Energy on the unused McDonald's data
Conclusions on PLS

• The model for predicting the Energy is validated.
• It uses only 3 variables:
– Fat,
– Protein and
– Carbohydrates.

• It is not necessary to measure the saturated fat content in further measurements → a gain of time.
Thank you for your attention
Q & A Session
Please fill in the feedback form
Our upcoming training workshops are as follows:
Uncover Multivariate Data Analysis on 10 & 11 JAN
Uncover Design of Experiments on 12 & 13 JAN
