Kerneldensityweighted PCA

Combustion and Flame 159 (2012) 2844–2855
Contents lists available at SciVerse ScienceDirect
Combustion and Flame

j o u r n a l h o m e p a g e : w w w . e l s e v i e r . c o m / l o c a t e / c o m b u s t fl a m e
Kernel density weighted principal component analysis of combustion processes

Axel Coussement a,b,c,⇑, Olivier Gicquel a,b,c, Alessandro Parente a
a
Université Libre de Bruxelles, Aero-Thermo-Mechanics Departement, Avenue F.D. Roosevelt 51, CP 165/41, 1050 Bruxelles, Belgium
b
Ecole Centrale Paris, Grande Voie des Vignes, 92295 Chatenay-Malabry, France
c
CNRS, UPR 288 Laboratoire d’Energétique Moléculaire et Macroscopique, Combustion, Grande Voie des Vignes, 92295 Chatenay-Malabry, France
a r t i c l e i n f o a b s t r a c t
Article history: Principal component analysis (PCA) has been successfully applied to the analysis of combustion data-sets.
Received 28 October 2011 However using PCA on a raw direct numerical simulation or an experimental data-set is not straightfor-
Received in revised form 15 February 2012 ward. Indeed, those data-sets usually show non-homogenous data density, hot and cold zones being
Accepted 5 April 2012
generally over represented. This can introduce bias in the PCA reconstruction, especially when strong
Available online 30 April 2012
non-linear relationships characterize the data sample. To tackle this problem, a combination of the kernel
density method and PCA is introduced here. This new PCA algorithm, called Temperature BAsed KErnel
Keywords:
Density weighted PCA (T-BAKED PCA) allows to enhance the PCA accuracy especially in the flame front
Principal component analysis
Combustion
zone, which is the principal zone of interest. The performance of this new approach is benchmarked
Tabulated chemistry against classical PCA. Moreover, a new method called Hybrid T-BAKED PCA or HT-BAKED PCA, combining
both classical and T-BAKED PCA, is proposed to provide an optimal representation of all flame regions.
Ó 2012 The Combustion Institute. Published by Elsevier Inc. All rights reserved.
1. Introduction state-space parameterization: where it is assumed that the

combustion processes can be parametrized by a small num-
Principal component analysis has been applied, since its intro- ber of variables defining a reduced manifold in the state
duction [1], to a wide range of problem requiring dimension reduc- space.
tion. Recently, the application of PCA for size-reduction in
turbulent-reacting systems was investigated by Frouzakis et al. [2] Numerous models have so been proposed to identify this man-
and Danby and Echekki [3]. Indeed, a primary challenge in turbulent ifold: Steady Laminar Flamelet Method (SLFM) [6–8], Flamelet-
combustion modeling lies in the broad range of scales which are Generated Manifold (FGM) [9] and Flamelet-Prolongation of ILDM
inherently coupled, in space and time, through thermo-chemical model (FPI) [10–12] are just few examples. However, Sutherland
and fluid dynamic interactions. However, in many applications and Parente [13] and Parente et al. [14] showed that the selection
[4], several chemical time scales are smaller than the fluid dynamic of the parameters defining the low dimensional manifold is not
time scales, allowing to reduce the dimensionality of the problem. straightforward, even though some parameters are known to be
This reduction is of critical importance in Computational Fluid optimal for specific combustion regimes, such as the mixture frac-
Dynamics (CFD): if one reduces the number of equations needed tion (f) for non-premixed combustion.
by detailed chemical mechanisms, the computing power needed Another method which allows identifying a chemical manifold
will decrease accordingly. Indeed, even simple fuels such as meth- is Principal Component Analysis (PCA). In particular, PCA offers
ane are characterized by chemical mechanisms involving a large the possibility to automate the parameter identification process
number of species and chemical reactions: 53 species and 325 reac- [13,14], controlling at the same time the error induced by the
tion for the GRI-3.0 mechanism [5]. reduction. PCA can provide an optimal representation of a system
Two methods can be used to perform a dimensionality reduction: based on q Principal Components (PC) which are linear correlations
of the Q original variables. The error can be controlled by choosing
mechanism reduction: the chemical mechanism is modified the number of PCs ensuring an a priori defined accuracy. Suther-
to reduce the number of species and its stiffness. land and Parente [13] showed that a transport equation can be
solved for the score of each PC leading to a low dimensional repre-
sentation of the combustion process. Moreover, Parente et al. [14]
⇑ Corresponding author at: Université Libre de Bruxelles, Aero-Thermo-Mechan-
have demonstrated an improvement of the PCA-based model if
ics Departement, Avenue F.D. Roosevelt 51, CP 165/41, 1050 Bruxelles, Belgium.
local PCA is used. However, the actual implementation of a model
Fax: +32 2 650 27 10.
E-mail address: axcousse@ulb.ac.be (A. Coussement). based on local PCA is still to be established. Finally, PCA was also
0010-2180/$ - see front matter Ó 2012 The Combustion Institute. Published by Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.combustflame.2012.04.004
A. Coussement et al. / Combustion and Flame 159 (2012) 2844–2855 2845
exploited to investigate the main features of combustion regimes The main feature of PCA is dimension reduction. Indeed, if one
[15]. only uses the first q PCs (q < Q) in Eq. (5), an approximation of
However, the application of PCA to reacting flows still requires the original data-set, Xq (n P), is obtained:
further developments. The present work focuses on the influence
of data sampling on the derivation of the principal components
X Xq ¼ Zq Atq ð6Þ
and on the parameterization error. As it will be discussed below, where Zq (n q) and Atq ðq Q Þ are respectively the scores and PCs
the PCA results and the PCs are strongly dependent on the data- obtained by retaining only the first q PCs. Note that since the eigen-
set set conditioning. In particular, without an appropriate condi- vectors are ordered in decreasing magnitude, the marginal informa-
tioning of the data-set, the extracted PCs can be biased. tion recovered by increasing q decreases with q. This is therefore
After a brief description of the PCA reduction process, the prob- particularity well suited for dimensional reduction since Xq is the
lem of data sampling for PCA in combustion process will be best approximation of X in a space of dimension q.
emphasized and a weighting function based on the kernel density The PCs can thus be linked to the modeling proposed in Eq. (1):
method [16,17] will be introduced. The improvement of the PCs in zi are the components of the Zq matrix and ai the lines vectors of
the flame front description using the kernel density method with the Atq matrix:
respect to classic PCA will be demonstrated using a 1-D premixed 0 1
laminar flamen, a 2-D flame flame-vortex interaction and a diffu- z1;1 z1;2 . . . z1;q
sion jet flame as data-sets. Finally, a hybrid PCA method that com-
B z2;1 z2;2 . . . z2;q C
B C
bines the advantages of the two methods will be presented. Zq ¼ B
B .. .. .. .. C
C ð7Þ
@ . . . . A
zn;1 zn;2 . . . zn;q
2. Principal component analysis, centering and scaling
0 1 0
a1 a1;1 a1;2 . . . a1;Q 1
The dimension of the chemical state space associated to a reac- Ba C B a2;1 a2;2 . . . a2;Q C
tion mechanism including Nsp species is equal to Q = Nsp + 2. This B 2C B C
Atq ¼ B C B
B .. C ¼ B .. .. ..
C
.. C ð8Þ
state space can be parameterized for instance by the Nsp species @ . A @ . . . . A
mass fractions, Yi, temperature, T and pressure p. This means that
Q transport equations have to be solved to characterize the com-
aq aq;1 aq;2 . . . aq;Q
bustion processes leading to a large CPU cost. It should be stressed that PCA is affected by the relative difference of
To tackle this problem, previous studies [13] used the Principal magnitude between variables contained in the data-set [14]: if, for
Component Analysis (PCA) to identify q (q < Q) linear correlations example, the magnitude of one variable in the data-set is one (or
between the state space variables to reduce the degree of freedom more) order of magnitude higher than the other for all observations,
characterizing the system. Employing those correlations, the the first principal component will be aligned with this particular
approximated combustion system reads at each point c: variable. Therefore, the data should be centered and scaled before
X
q applying PCA [14]. Centering is achieved by subtracting the mean
½Y 1 ; . . . ; Y nsp ; T; pc ½Y 1 ; . . . ; Y nsp ; T; pc;q ¼ zc;i ai ð1Þ value of each Q original variable to their respective observation:
i¼1
X0 ¼ X X ð9Þ
where ai = [ai,1, . . . , ai,Q] and the ai,j are coefficients of linear correla-
tion ai. If q = Q, Eq. (1) describes the full state space, while when where X ðn Q Þ is a matrix containing the mean of each variable
q = 1 chemistry evolutions are limited to a simple linear subspace over the n observations:
in the state space. 0 x2 1
x1 . . . xQ
To identify the most accurate linear correlations, ai, a represen- B x1 x2 . . . xQ C
B C
tative sample of n observations, ½Y 1 ; . . . ; Y nsp ; T; pc ; c 2 ½1; n are X¼B
B .. .. ..
C
.. C ð10Þ
used. This sample could result from 1D laminar flame simulation, @ . . . . A
DNS of a 3D turbulent flame or experimental results [13,14], defin- x1 x2 . . . xQ
ing the matrix X (n Q). Then, the covariance matrix S is
computed: where xj is the mean of the jth variable over the n observations.
Scaled data can be defined as:
1
S¼ Xt X ð2Þ e ¼ X0 D1
n1 X ð11Þ
where the superscript t indicates the transpose matrix. with D (Q Q) being a matrix containing the scaling parameter for
If the covariance matrix is non-singular, it can be decomposed: each variable:
0 1
S ¼ A L At ð3Þ d1 0 ... 0
B0 d2 ... 0 C
where A (Q Q) and L (Q Q) contain respectively the Q eigenvec- B C
D¼B
B .. .. .. .. C
C ð12Þ
tors called principal component (PC) and the eigenvalues of S, in @ . . . . A
decreasing order of magnitude.
0 0 . . . dQ
The principal component scores, Z (n Q), are then computed
using the eigenvectors: While various scaling methods can be applied [18], the classic choice
is to use the ‘‘auto-scaling’’ [14]: the standard deviation of the jth
Z ¼ XA ð4Þ
variable computed from the data-set, r ^ j , is used as dj. After this scal-
One of the main advantage of PCA is that the original set of data, X, ing the variance of each original variable is one, meaning that the
can be uniquely recovered using the PCs and their associated analysis is based on correlations among the state variables. Applying
scores: centering and scaling, the data-set used to perform the PCA is thus:
X ¼ ZA1 ð5Þ e ¼ ðX XÞD1

X ð13Þ
1 t
where it should be noted that A =A. This method is used throughout the present work.
2846 A. Coussement et al. / Combustion and Flame 159 (2012) 2844–2855

3. Problem statement x1;n ¼ 1 þ 20R
x2;n ¼ 5x1;n þ 100R ð14Þ
Centering and scaling are crucial for the effectiveness of PCA. for n 2 ½1; 1000
However, this does not address the problem of the data sampling.
Indeed, if one considers a combustion data-set to be analyzed 8
with PCA, special attention should be paid to the over-representa- < x1;n ¼ 420 þ 8ðn 1001Þ
tion of the hot and cold zone, especially when dealing with : x2;n ¼ 5000 ðx1;n 400Þ þ 500R ð15Þ
numerical data. For example, if one considers the temperature 200
field from a vortex-flame interaction (see Section 7.1 for details) for n 2 ½1001; 1021
represented on Fig. 1, the relevant zone to be considered for

PCA is the grey highlighted region, corresponding to the flame x1;n ¼ 1000 þ 20R
front. It is clear that if one considers the whole domain as a x2;n ¼ 5x1;n þ 100R ð16Þ
data-set the hot and cold zone, where no reaction takes place,
for n 2 ½1021; 2021
will be over represented. This over-representation is emphasized
in Fig. 2 where the density of the normalized temperature is
with R being an uniformly distributed random noise which varies
shown.
between 0 and 1. This data-set, plotted on Fig. 3, consists in two lin-
To characterize the problem of over-sampling and to show its
ear functions, with variable data densities. Eqs. (16) and (15) define
influence on PCA, let’s consider the following data-set:
a linear relationship for low and high values of x1, with a large den-
sity of points, and Eq. (14) provides another linear relationship for
intermediate x1 values, with a lower density of points. The point
density along x1 and x2 was chosen to mimic the behavior of the
temperature field shown in Fig. 2.
6000
5000
4000
3000
x2
2000
1000
0
Fig. 1. Temperature field during a flame vortex interaction.
−200 0 200 400 600 800 1000 1200
x1
Fig. 3. x2 as a function of x1.
6000
5000
4000
Recovered X2
3000
2000
1000
−1000
0 2000 4000 6000
Observed X2
Fig. 2. Density of the normalized temperature for the temperature field shown in
Fig. 1. Fig. 4. x2 recovered as a function of x2 observed with 1 PC.
Processing this data-set using the PCA, yields the results plotted
6000 on Fig. 4, where the recovered x2 considering one PC is plotted as a
function of the observed x2. As expected, the first PC is ‘‘attracted’’
5000 by the over represented extreme values. If the relevant information
is contained in the over represented zone, this does not yield any
4000 problem. However, if most of the relevant information is contained
Recovered X2
in the intermediate region, the PCA fails to capture that

3000 information.
Note that, if data-set had contained only one linear function the
2000 over-representation would not have yielded any problem. It is
therefore not only the sample density that raise an issue, but the
1000 combination of non-linearity with the uneven sampling.
Applying PCA to combustions data-sets will suffer from theses
0 issue, as cold and hot zones are generally over represented, espe-
cially in numerical data-sets. PCA will thus be able to represent
−1000 the hot and cold zone but not the flame front zone which is the
0 2000 4000 6000
main zone of interest. Indeed if one want to accurately reproduce
Observed X2
the source terms or the flame speed using the results of PCA, the
set of PCs chosen should be able to accurately reconstruct the
Fig. 5. x2 recovered as a function of x2 observed with 1 PC, using the kernel density
flame front.
weighting.
To solve this problem, a pretreatment should be applied to the
data-set, to remove the effects of uneven sampling. The most obvi-
ous approach is to suppress the over represented zone prior to the
2500
PCA. While simple, this process has two main disadvantages: first,
one should find a criterion to distinguish the relevant zones from
the non-relevant ones, which is not an easy task mainly because
2000 this criterion should be robust. Second, even if such a criterion
can be found, the PCA process will no longer be fully automatic,
and as stated in the work of Sutherland and Parente [13], the main
1500 advantage of the PCA is the automatic identification of relevant
T (K)
variables.
In the following section, the kernel density method is proposed
to solve this problem: the sample density can be computed and
1000
used to lower the influence of the over represented zone in the
covariance matrix.
500
4. Weighting based on kernel density method
−0.002 0.002 0.006 0.01 0.014 The kernel density method was introduced by Rosenblatt [16]
x (m) and Parzen [17]. The general idea of this method is to compute
the density of the statistical sample in a distribution using a pre-
Fig. 6. Temperature, in K, as a function x for the 1D-laminar flame data-set. sumed normal distribution.
Fig. 7. Density (gray) and wieghting (black) functions as a function of the temperature, T in K for: (A) multi-variable kernel density, (B) mono-variable kernel density based on T.
4.1. Single-variable case: where h is the bandwidth associated with the variable. An estima-
tion of the density at each point c can be then obtained by summing
For each observation the distance, dc;c0 , between the current the Gaussian kernels:
observation c and an observation c0 is computed: c0 ¼n
X 1
Kc ¼ K c;c0 ð19Þ
dc;c0 ¼ jxc0 xc j ð17Þ c0 ¼1
n
xc being the value of the variable at the cth observation. Then a Using the variable density, it is possible to define a normalized
Gaussian kernel distribution K c;c0 is defined, for each point c, as: weighting Wc, for each observation, that varies between 0 and 1:
sffiffiffiffiffiffiffiffiffiffiffi 2
! 1
Kc
1 dc;c0 Wc ¼ ð20Þ
K c;c0 ¼ 2
exp 2
ð18Þ max 1
2ph 2h Kc
0,25 0.25
0,2 0.2
Recovered H2O
Recovered H2O
0,15 0.15
0,1 0.1
0,05 0.05
0 0
0 0,05 0,1 0,15 0,2 0,25 0 0.05 0.1 0.15 0.2 0.25
Observed H2O Observed H2O
−5 −5
x 10 x 10
4 4
3 3
Recovered H2O2
Recovered H2O2
2 2
1 1
0 0
0 1 2 3 4 0 1 2 3 4
−5 −5
Observed H2O2 x 10 Observed H2O2 x 10
0,01
0,01
Recovered OH
Recovered OH
0,005 0,005
0 0
0 0,005 0,01 0 0,005 0,01

Observed OH Observed OH
A B
Fig. 8. Recovered Y H2 O ; Y H2 O2 ; Y OH as a function of their observed counterparts with 4 PCs, using a multi-variable kernel density (A) and a mono-variable kernel density (B).
4.2. Multi-variables case b ¼W X

X e ð29Þ
Previous formulas can be extended to the multi-variable case where W is the matrix containing the weighting for each observa-
including n variables for each observations. The distance dc;c0 ;k , for tion. Note that the weightings are computed using the centered-
variable k, reads: e and that X
scaled data-set, X, b is only used for the computation of
dc;c0 ;k ¼ jxc0 ;k xc;k j ð21Þ the covariance matrix.
Figure 5 shows the recovered x2 as of function of the observed x2
xc,k being the value of the variable k at the cth observation. Then the using one PC. The weighting was performed using the multi-
normal distribution K c;c0 ;k is given by: variable kernel density method, and h was computed using Eq.
sffiffiffiffiffiffiffiffiffiffiffiffiffi !
2
dc;c0 ;k (26). Figure 5 emphasizes that the weighting behaves as expected:
1
K c;c0 ;k ¼ 2
exp 2
ð22Þ it emphasizes the importance of the region originally under-
2 p hk 2 hk represented in the data-set, yielding a much accurate representa-
where hk is the bandwidth associated with variable k. The variable tion of this region but a worse representation of the originally
density associated with variable k is then: over-represented regions. It can be concluded that sampling is cru-
cial for the analysis of non-linear data set, and an appropriate
c0 ¼n
X 1 choice of the weighting function can emphasize forward the main
Kc;k ¼ K c;c0 ;k ð23Þ
n features of interest for the investigated system.
c0 ¼1
The characteristics kernel density weighted PCA will now be
From these relations one can obtain that the global density associ- studied on the basis of a 1-D premixed hydrogen/air flame. Note
ated to the observation c: that, in Section 5.3, a method to solve the issue related to
Y
k¼Q the reconstruction of the originally over-represented regions will
Kc ¼ Kc;k ð24Þ be proposed.
k¼1
and the global weighting

5. Characteristics of the kernel density weighted PCA
1
Kc
Wc ¼ ð25Þ
max 1 The performances of the kernel density weighted PCA are stud-
Kc
ied for a 1-D premixed hydrogen/air flame.
This data-set was obtained from a DNS computation using the
4.3. Choice of hk compressible Navier–Stokes solver YWC developed at the EM2C
laboratory. The thermochemical properties are computed using
The density, and so the weighting, is strongly dependent of the the REGATH package developed at the EM2C laboratory which is
bandwidth value, hk, as it will be shown in Section 7.1. However, based on the CHEMKIN [20] formalism (i.e differential diffusion ef-
for a given data-set an optimal value for hk can be found (see fects are taken into account).
[19]). This optimum value can be either found by an iterative pro- The case is a H2-Air 1-D laminar flame at an equivalence ratio of
cess, that is changing hk until the error is smaller than a given / = 0.88 and a temperature of 300 K. The pressure is set to
threshold, or from the data-set characteristics. Using the latter 101,325 Pa, velocity of the unburned gases is 1.25 m/s, and the
method the bandwidth can be computed for a mono-variable chemical reaction mechanism by Katta et al. [21] was employed.
weighting as [19]: The domain consists of 1281 points with a constant grid spacing
15 of Dx = 1.25 105 m. A view of the temperature field of the
4r
^ data-set is shown in Fig. 6. Nine variables are included in the
h¼ ð26Þ
3n data-set: Y H ; Y H2 ; Y O ; Y O2 ; Y OH ; Y N2 ; Y H2 O ; Y HO2 and Y H2 O2 .
where r^ is the standard deviation of the considered variable within
the data-set.
−5
For multi-variable kernel density a different bandwidth can be x 10
4
used for each variable [19], but since the data used to compute the
weighting are centered and scaled by the standard deviation of all
the variables distributions, r
^ k , is therefore be equal to one, so that:
r^ k ¼ r^ 8k ð27Þ 3
Recovered H2O2
Using a global bandwidth or a variable dependent bandwidth is

therefore equivalent:
2
!15 15
4r
^k 4r
^
hk ¼ ¼ ¼h ð28Þ
3n 3n
1
where r
^ is the mean of the standard deviations rk.
4.4. Application
0
To demonstrate the ability of the proposed weighting to tackle 0 1 2 3 4
the problem exposed in Section 3, the method will be applied on −5
Observed H2O2 x 10
the data-set defined by Eqs. (14)–(16). Each centered-scaled obser-
vation is multiplied by its associated weighting before applying the Fig. 9. Recovered Y H2 O2 as a function of the observed Y H2 O2 with 4 PCs, T-BAKED PCA
PCA, that is: using h = 104 K. Using h = 3000 K yields the same results.
5.1. T-BAKED PCA Figure 7A and B show the density and the weighting function for
the multi-variable kernel density method and the T-BAKED method,
The multi-variable kernel density computation, presented in respectively. The bandwidths were computed using Eq. (26). The
Section 4.2, yields two issues when applied to a combustion comparison of Fig. 7A and B shows some major differences. Figure
data-set. First computing cost can be high when large dataset are 7A displays a local maximum for T 1700 K, whereas Fig. 7B
considered. To reduce this cost a tree-based algorithm could be displays a more intuitive behavior, that is a low value of the weight-
used but a mono-variable kernel density, that leads to a lower ing function in the over represented zone (i.e. the hot and cold
CPU cost, was preferred. Second, as shown further in this section, zones) and a high value in the under represented zones (i.e the
a multi-variable kernel density can lead to erroneous evaluation flame front). The behavior of Fig. 7A can be explained by the
of the weights due to the very nature of combustion processes coupling among the thermo-chemical state variables. Indeed, the
data-sets. This is why it is of great interest to investigate if a radical species do not appear simultaneously at the same location.
mono-variable kernel density can be used. Some, like H2O2, appear at the beginning of the flame front whereas
If a mono-variable kernel density is computed, a variable must others, like H, at the end of the reaction zone. Moreover, stable spe-
be chosen. The most efficient constraint is to require that this var- cies display monotonic increase (H2O) or decrease (H2 or O2) and
iable must describe the progress of the combustion process. The transient species are only present in the flame front. Individually,
most relevant choice is therefore the temperature, yielding a Tem- all those variables have very different kernel densities and therefore
perature BAsed KErnel Density weighted PCA or T-BAKED PCA. the weighting function can display non-intuitive behavior.
0.25 0.25 0.25
0.2 0.2 0.2

Recovered H2O
Recovered H2O
Recovered H2O
0.15 0.15 0.15
0.1 0.1 0.1
0.05 0.05 0.05
0 0 0
0 0.05 0.1 0.15 0.2 0.25 0 0.05 0.1 0.15 0.2 0.25 0 0.05 0.1 0.15 0.2 0.25
Observed H2O Observed H2O Observed H2O
x 10−5 x 10−5 x 10−5

4 4 4
Recovered H2O2
3 3 3
Recovered H2O2
Recovered H2O2
2 2 2
1 1 1
0 0 0
0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
Observed H2O2 x 10−5 Observed H2O2 x 10−5 Observed H2O2 x 10−5
0,01 0,01
0,01
Recovered OH
Recovered OH
Recovered OH
0,005 0,005 0,005
0 0 0
0 5 10 0 0,005 0,01 0 5 10
Observed OH x 10−3 Observed OH Observed OH x 10−3
A B C
Fig. 10. Recovered Y H2 O ; Y H2 O2 ; Y OH as a function of their observed counterparts with 4 PCs, using a classical PCA (A), a T-BAKED PCA (B) and a HT-BAKED PCA (C).
On the contrary, the temperature profile (Fig. 7B) displays a and cold regions is worsened and left in the minor PCs. A hybrid
much more predictable behavior: it provides high weight to sam- PCA relying on the following idea can be adopted: if q PCs should
ples contained in the flame front, while lowering the weight of be retrieved from the data-set, qw PCs will be obtained using the
the hot and cold zones. The behavior of T-BAKED density function T-BAKED PCA and qnw PCs with the classical, non-weighted (nw),
appears better suited for detecting and analyzing the flame front PCA (q = qw + qnw). In our combustion problem, the qw PCs will ac-
via PCA. Moreover, its computing cost is lower. count for the flame front and qnw PCs for the hot and cold regions.
Reconstruction results for the multi-variable kernel density The resulting approach, called n Hybrid T- BAKED PCA or HT-
weighted and T-BAKED PCA are shown on Fig. 8A and B, for BAKED PCA, can be summarized as follows: from the centered-
Y H2 O ; Y H2 O2 and YOH. Those reconstructions are achieved using 4 e q PCs are computed using the T-BAKED PCA
scaled data-set, X; w
PCs. Those species were chosen because Y H2 O and YOH are present e w is reconstructed with those PCs:
and the data-set, X
in the flame front and in the post flame zone and Y H2 O2 is only
e w ¼ Zw At
X ð30Þ
present in the flame front. From this, one can conclude that the w
T-BAKED PCA, yields more accurate results than the multi-variable

kernel density weighted PCA with a much lower computing
cost. Indeed, only one density function must be computed for the Table 2
T-BAKED PCA against 9, or more generally Q, for the multi-variable R2 statistics for the reconstruction of 1-D premixed flame using 4PCs using HT-BAKED
PCA and T-BAKED PCA.
kernel density.
Variable R2 classical R2 T-BAKED R2 HT-BAKED
5.2. Optimal bandwidth YH 0.98612 0.9561 0.99861

Y H2 0.99841 0.5095 0.99722
YO 0.98478 0.7722 0.99804
For all the above analysis the T-BAKED PCA was performed Y O2 0.99940 0.5318 0.99756
using a bandwidth computed according to Eq. (26), while this for- YOH 0.95047 0.9494 0.96956
mula is known to gives the optimal value of the bandwidth in case Y H2 O 0.99514 0.4938 0.99693
of normal distribution [19], its optimality is not a priori verified for Y HO2 0.95477 0.9977 0.99062
the temperature distribution extracted from the data-set: the tem- Y H2 O2 0.96671 0.9729 0.98635
perature distribution having a high density of extreme values. The

existence of an optimum is ensured by the fact that if the band-
width is too small, the density of all the samples will be equal. In-
deed, in this case only the current sample will contribute to its
density (see Eq. (18)). On the contrary, if the bandwidth is too high,
all the samples will contribute to the density of the current sample.
This is illustrated by Fig. 9, where results for the reconstruction of
Y H2 O2 are plotted for a T-BAKED PCA using bandwidth values of
h = 104 K, note that using h = 3000 K yields the same results.
Those values are much higher (or lower) than the bandwidth
computed from Eq. (26), which is equal to 237 K. As predicted,
the results of the T-BAKED PCA using a too high or too low band-
width are identical to those of classical PCA (Fig. 10). From this,
it can be concluded that an optimum should exist between those
value. To find it T-BAKED PCAs were performed with variable
bandwidth until the optimum PCA reconstruction of Y H2 O ; Y H2 O2
and YOH in the flame front region was found. Values found are re-
ported in Table 1.
From Table 1, the following conclusions can be drawn: first, the Fig. 11. Description conditions for the flame vortex ring.
optimal bandwidth is dependent on the variable on which the opti-
mum criteria is computed. Second, Eq. (26) provides a value close
to the optimum obtained.
T-BAKED PCA allows recovering the information contained in
the flame front region, and the optimal bandwidth can be com-
puted from Eq. (26). However, as shown in Fig. 8, the T-BAKED
PCA yields inaccurate reconstruction of Y H2 O in the hot and cold
zones.
5.3. HT-BAKED PCA
The T-BAKED PCA approach allows improving the representa-

tion of the flame front region, while the representation of the hot
Table 1
Value of the bandwidth for Y H2 O ; Y H2 O2 and YOH allowing an optimal reconstruction
using a T-BAKED PCA with 4 PCs.
Species h from Eq. (26) Optimal h

Y H2 O (K) 237 280
Y H2 O2 (K) 237 220
YOH (K) 237 240
Fig. 12. Description conditions for the flame vortex ring.
The data-set Xe w containing the information relative to the low sam- following: first the classical PCA is performed retaining one compo-
ple density regions is then substracted from the original data-set, nent, then the T-BAKED PCA is performed retaining one component.
e :
yielding X
e ¼ X
X eX
ew ð31Þ
e will only contains information relevant to the over repre- Table 3
Thus, X
R2 statistics for the reconstruction of 2-D flame-vortex using 4PCs using HT-BAKED
sented zones, on which a classical PCA can be performed to extract PCA and T-BAKED PCA.
the remaining qnw PCs. The data-set recovered using those qnw PCs
Variable R2 classical R2 T-BAKED R2 HT-BAKED
is:
YH 0.9862 0.9729 0.9972
e ¼ Znw At
X ð32Þ Y H2 0.9962 0.7749 0.9892
nw nw
YO 0.9451 0.7153 0.9696
So that the data-set recovered using the q PCs computes as: Y O2 0.9924 0.9654 0.9992
YOH 0.9874 0.9885 0.9938
e¼X
X e w ¼ Zw At þ Znw At
e þ X ð33Þ Y H2 O 0.9847 0.9157 0.9947
nw w nw
Y HO2 0.9999 0.9998 0.9999
From a numerical study performed on the datasets presented Y H2 O2 0.9999 0.9976 0.9999
below, the optimal choice for retaining the PCs of each type is the
0.5 0.5
0.5
0.4 0.4 0.4

Recovered H2O
Recovered H2O
Recovered H2O
0.3 0.3 0.3
0.2 0.2 0.2
0.1 0.1 0.1
0 0 0
0 0.1 0.2 0.3 0.4 0.5 0 0,1 0,2 0,3 0,4 0,5 0 0.1 0.2 0.3 0.4 0.5
Observed H2O Observed H2O Observed H2O
x 10−4 x 10−4 x 10−4

3 3 3
Recovered H2O2
Recovered H2O2
Recovered H2O2
2 2 2
1 1 1
0 0 0
0 1 2 3 0 1 2 3 0 1 2 3
−4 −4
Observed H2O2 x 10 Observed H2O2 x 10 Observed H2O2 x 10−4
0.06 0.06 0.06

Recovered OH
Recovered OH
Recovered OH
0.04 0.04 0.04
0.02 0.02 0.02
0 0 0
0 0.01 0.02 0.03 0.04 0.05 0.06 0 0.02 0.04 0.06 0 0.02 0.04 0.06
Observed OH Observed OH Observed OH
A B C
Fig. 13. Recovered Y H2 O ; Y H2 O2 ; Y OH as a function of their observed counterparts with 4 PCs, using a classical PCA (A), a T-BAKED PCA (B) and a HT-BAKED PCA (C).
This process is then repeated until the total number of PCs retained 6. Results for the 1-D flame data-set
is reached. Centering and scaling and the inverse operations are
performed at the beginning and the end of the algorithm to avoid The reconstruction results for a classical T-BAKED and HT-
introducing a bias in the second PCA. BAKED PCA are plotted on Fig. 10A, B and C respectively for
Results of the T-BAKED and HT-BAKED PCA will now be bench- Y H2 O ; Y H2 O2 ; Y OH . The weighting was computed using the band-
marked against the classical PCA for the 1-D flame data-set. width from Eq. (26) and for each reconstruction 4 PCs where used.
Fig. 14. Recovered Y CO2 ; Y O ; Y O2 and YOH as a function of their observed counterparts with 4 PCs, using a classical PCA (A), a T-BAKED PCA (B) and a HT-BAKED PCA (C).
This number of PCs allows to recover more than 98% of the vari- The temperature field and the distribution of YH for the data-set
ance contained in the original data-set. are shown in Figs. 1 and 12, respectively. For the sake of clarity, the
For the HT-BAKED, 2 PCs of each type where used and the band- y-axis extent has been limited to 2 103 m in the two figures.
width was also computed using Eq. (26). As for the 1D-premixed flame 4 PCs were used for recovering
Table 2 shows the R2 statistic of the species reconstruction for the original data-set, which allows to recover more than 98% of
the three methods. Before any interpretation is made, it should the original information.The T-BAKED and HT-BAKED PCA were
be stressed that the R2 square statistic is strongly influenced by performed using Eq. (26) for the bandwidth h. Note that numerical
the over-represented regions (i.e. the hot and cold zones). Keeping study also indicates that the most accurate reconstruction results
that in mind, it is possible to explain why R2 values for classical are obtained using 2 PCs of each type for the HT-BAKED PCA.
PCA and HT-BAKED PCA are higher than those obtained for T- As for the 1D-premixed flame data-set, results for the recon-
BAKED PCA. T-BAKED PCA provides an optimized representation struction of Y H2 O ; Y H2 O2 ; Y OH are plotted on Fig. 13 and the R2 sta-
of the flame front regions where the data density is lower. This re- tistics for the three methods are shown in Table 3. For Y H2 O and YOH
sults in a penalization of the R2 values which are strongly affected results are similar to those of the 1-D flame: in the flame front the
by the reconstruction in the hot and cold regions. Such consider- T-BAKED PCA yields more accurate results, while from a global
ations make the use of R2 statistic to compare the three methods point of view the HT-BAKED PCA is more accurate than the T-
possibly misleading. Similarly, the use of a density weighted R2 BAKED PCA.
metric would logically favor T-BAKED PCA by definition. Therefore An comparison of the plotted reconstruction of Y H2 O and YOH,
the R2 statistic cannot be used to compare the different methods. shows that the HT-BAKED PCA is superior to the classical PCA. As
The only relevant analysis is a systematic error analysis, which for the 1-D flame, HT-BAKED PCA appears then superior to classical
could be done using plots of the reconstructions. or T-BAKED PCA.
Figure 10 shows that HT-BAKED PCA offers an optimal compro- The two proposed approaches are thus efficient to improve the
mise between the classical and T-BAKED PCA: the method allows a performance of the premixed flame PCA. However, a large part of
globally optimized reconstruction, combining the performances of the combustion processes used relies on diffusion flame. Therefore
T-BAKED and classical PCA in the flame front, and in the cold and the proposed methods will be applied on a DNS data-set of a tur-
hot regions, respectively. This demonstrate the ability of the HT- bulent CO/H2/N2-air diffusion flame in the next section.
BAKED PCA to give a better description of the data-set using the
same number of PCs. However, in the flame front the results for
the T-BAKED PCA are more accurate than the HT- BAKED or classi- 7.2. Diffusion flame
cal PCA.
To fully validate the proposed T-BAKED and HT-BAKED method,
7. Results for multi-dimensional data-set a turbulent diffusion flame data-set is now considered. This
data-set comes from a 3-D DNS of CO/H2/N2-air jet diffusion flame
7.1. 2-D flame vortex interaction data-set using detailed chemistry, details about the computation set-up can
be found in [23]. Twelve variables are included in this data-set:
HT-BAKED and T-BAKED PCA will now be applied on a more Y CO ; Y CO2 ; Y H ; Y H2 O ; Y H2 O2 ; Y HCO ; Y HO2 ; Y H2 O2 ; Y O ; Y O2 ; Y N2 and YOH.
complex test case, a two dimensional flame vortex interaction. In As for the other data-sets 4 PCs where retained for the three
this case the vortex distort the flame front, yielding a more dis- methods, allowing to recover at least 91% of the information, and
persed data-set, more representative of what could be found in the bandwidth was computed using Eq. (26). Figure 14 gives the
DNS or experimental data-set. reconstruction results for Y CO2 ; Y O ; Y O2 and YOH. Unlike the pre-
The configuration of the computational case used to generate mixed cases, the T-BAKED and HT-BAKED PCA gives similar results,
the data-set is presented in Fig. 11. It is a premixed hydrogen except for Y O2 where an improvement in the reconstruction results
flame-vortex interaction at a temperature of 300 K and a pressure can be seen for the HT-BAKED. Moreover both methods displays an
of 101,325 Pa. The unburned gases are composed of hydrogen, improvement with respect to the classical PCA case especially for
nitrogen and oxygen with the following mass fractions: Y H ¼ transient species like YO and YOH.
0:05; Y O2 ¼ 0:4 and Y N2 ¼ 0:55. The grid consist of 161 321
nodes with a constant grid spacing in all directions Dx = Dy =
Dz = 2.5 106 m which results in a box of 2 103 m by 4
103 m. The premixed hydrogen flame chemistry is described
using the mechanism of Katta et al. [21].
As in the work of Renard et al. [22] the flow is initialized by a
one dimensional premixed flame, extended along the y-axis on
which two counter-rotating vortices are superimposed in the un-
burnt gases. The vortices are initialized using Eqs. (34)–(36) with
Rc = 0.2 103 m, s = 20 103, u1 0 = 2 m/s and p0 = 101,325 Pa.
The distance between the vortices center is equal to Rc. This result
in a maximum x-axis velocity of around 122 m/s and so a convec-
tion speed of the ring of 61 m/s.
!
s ðx2 þ y2 Þ
u1 ðx; yÞ ¼ u1 0 þ y exp ð34Þ
R2c 2 R2c
!
s ðx2 þ y2 Þ
u2 ðx; yÞ ¼ x exp ð35Þ
R2c 2 R2c
!!
c s 2 ðx2 þ y2 Þ
pðx; yÞ ¼ p0 exp exp ð36Þ Fig. 15. Density (gray) and weighting (black) functions as a function of the
2 c Rc R2c
temperature, T in K.
The fact that T-BAKED and HT-BAKED PCA yield comparable re- remarkable performances; on the other hand, when a globally opti-
sults can be linked to the weighting profile displayed on Fig. 15. In mized reconstruction is of interest, HT-BAKED PCA provides the
contrast with premixed flame, the reaction zone of a diffusion most accurate results for a fixed number of PCs.
flame in wider, yielding a lower weighting of the reaction zone.
This explain the limited improvement of the T-BAKED and HT- Acknowledgments
BAKED with respect to classical PCA for Y CO2 , however improve-
ment is still noticed for transient species. The first author was supported by a fellowship from the Fonds
National de la Recherche Scientifique, FRS-FNRS (Communauté
8. Conclusions Française de Belgique). Computing resources were provided by
the IDRIS under the allocation 2009-i2009020164 made by GENCI
In this work the kernel density method was introduced into the (Grand Equipement National de Calcul Intensif). The authors wish
PCA process to address the issue of uneven sampling in highly non- to tank Prof. James C. Sutherland from the University of Utah for
linear multi-dimensional data sets, as in combustion systems. The gently providing the diffusion flame data-set.
method introduces a density metric within the PCA, which allows a
better reconstruction of the flame region, usually under-repre- References
sented in high-fidelity numerical (and experimental) data sets.
[1] K. Pearson, Philos. Mag. Ser. 6 (2) (1901) 559–572.
The approach was first demonstrated for a two variables data- [2] C. Frouzakis, Y. Kevrekidis, J. Lee, K. Boulouchos, A. Alonso, Proc. Combust. Inst.
set, and then tested on a 1-D laminar flame, on a 2-D flame vortex 28 (2000) 75–81.
interaction data-set and on a 3-D turbiulent diffusion flame data- [3] S. Danby, T. Echekki, Combust. Flame 144 (2006) 126–138.
[4] U. Maas, S. Pope, Combust. Flame 88 (1992) 239–264.
set. [5] M. Frenklach, H. Wang, C. Yu, M. Goldenberg, C. Bowman, R. Hanson, D.
The 1-D laminar flame was employed to perform a benchmark Davidson, E. Chang, G. Smith, D. Golden, et al., Gri-mech 1.2-an optimized
of the mono-variable vs. multi-variable kernel density method, detailed chemical reaction mechanism for methane combustion, GRI Tech.
Report GRI-95/0058, 1995.
showing that temperature can be used for computing the kernel
[6] N. Peters, Symp. (Int.) Combust. 21 (1988) 1231–1250.
density. The resulting T-BAKED PCA allows a better precondition- [7] H. Pitsch, N. Peters, Combust. Flame 114 (1998) 26–40.
ing of the data-set for the PCA at lower computational cost than [8] N. Peters, Prog. Energy Combust. Sci. 10 (1984) 319–339.
the multi-variable kernel density method. Moreover, an optimal [9] J. Van Oijen, L. De Goey, Combust. Sci. Technol. 161 (2000) 113–137.
[10] B. Fiorina, R. Baron, O. Gicquel, D. Thévenin, S. Carpentier, N. Darabiha,
value of the bandwidth was also found (Eq. (26)). Combust. Theory Modell. 7 (2003) 449–470.
Results on the 1-D laminar flame showed that T-BAKED PCA al- [11] O. Gicquel, N. Darabiha, D. Thévenin, Proc. Combust. Inst. 28 (2000) 1901–
lows to significantly improve the description of the flame. Due to 1908.
[12] B. Fiorina, O. Gicquel, S. Carpentier, N. Darabiha, Combust. Sci. Technol. 176 (5)
the very nature of the method, however, the information contained (2004) 785–797.
in the hot and cold regions disappear from the first PCs, leading to [13] J. Sutherland, A. Parente, Proc. Combust. Inst. 32 (2009) 1563–1570.
higher reconstruction errors in those zones. [14] A. Parente, J. Sutherland, L. Tognotti, P. Smith, Proc. Combust. Inst. 32 (2009)
1579–1586.
To solve the problem, the HT-BAKED PCA method was intro- [15] A. Parente, J. Sutherland, B. Dally, L. Tognotti, P. Smith, Proc. Combus. Inst. 33
duced: it combines classic and T-BAKED PCA, to optimize the (2011) 3333–3341.
PCA reconstruction in all regions providing the best reconstruction [16] M. Rosenblatt, Ann. Math. Stat. (1956) 832–837.
[17] E. Parzen, Ann. Math. Stat. 33 (1962) 1065–1076.
of the original data-set for a fixed number of PCs. This conclusion [18] J. Gower, D. Hand, Biplots, vol. 54, Chapman & Hall/CRC, 1996.
was drawn both from the analysis of the 1-D flame data-set and [19] M. Wand, M. Jones, Kernel Smoothing, Chapman & Hall/CRC, 1995.
more representative 2-D flame vortex interaction and 3-D turbu- [20] R. Kee, F. Rupley, E. Meeks, J. Miller, CHEMKIN-III: A FORTRAN chemical
kinetics package for the analysis of gas-phase chemical and plasma kinetics,
lent diffusion flame data-sets. Finally, the best strategy for choos-
Sandia National Laboratories Report SAND96-8216, 1996.
ing the number of PCs of each type appears to be an equi- [21] V. Katta, L. Goss, W. Roquemore, Combust. Flame 96 (1994) 60–74.
repartition. [22] P. Renard, D. Thévenin, J. Rolon, S. Candel, Prog. Energy Combust. Sci. 26 (2000)
In conclusion the T-BAKED PCA and HT-BAKED are powerful 225–282.
[23] E. Hawkes, R. Sankaran, J. Sutherland, J. Chen, Direct numerical simulation of
and robust tools to study combustion processes from high-fidelity turbulent combustion: fundamental insights towards predictive models, in:
numerical data-set or experimental data-set: if the main interest is Journal of Physics: Conference Series, vol. 16, IOP Publishing, 2005, p. 65.
in the flame region, T-BAKED PCA can be employed due to its

Kerneldensityweighted PCA

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Kerneldensityweighted PCA

Uploaded by

Copyright:

Available Formats

Combustion and Flame 159 (2012) 2844–2855

Contents lists available at SciVerse ScienceDirect

Combustion and Flame

Kernel density weighted principal component analysis of combustion processes

1. Introduction state-space parameterization: where it is assumed that the

X ¼ ZA1 ð5Þ e ¼ ðX  XÞD1

Fig. 3. x2 as a function of x1.

in the intermediate region, the PCA fails to capture that

0 0,005 0,01 0 0,005 0,01

4.2. Multi-variables case b ¼W X

and the global weighting

Using a global bandwidth or a variable dependent bandwidth is

0.25 0.25 0.25

0.2 0.2 0.2

0.1 0.1 0.1

0.05 0.05 0.05

x 10−5 x 10−5 x 10−5

0,005 0,005 0,005

T-BAKED PCA, yields more accurate results than the multi-variable

5.2. Optimal bandwidth YH 0.98612 0.9561 0.99861

perature distribution having a high density of extreme values. The

5.3. HT-BAKED PCA

The T-BAKED PCA approach allows improving the representa-

Species h from Eq. (26) Optimal h

0.4 0.4 0.4

0.3 0.3 0.3

0.2 0.2 0.2

0.1 0.1 0.1

x 10−4 x 10−4 x 10−4

0.06 0.06 0.06

0.04 0.04 0.04

0.02 0.02 0.02

You might also like

X ¼ ZA1 ð5Þ e ¼ ðX XÞD1