
Trends in Analytical Chemistry 135 (2021) 116157

Contents lists available at ScienceDirect

Trends in Analytical Chemistry


journal homepage: www.elsevier.com/locate/trac

A guide to good practice in chemometric methods for vibrational spectroscopy, electrochemistry, and hyphenated mass spectrometry

Manuel David Peris-Díaz*, Artur Krężel**
Department of Chemical Biology, Faculty of Biotechnology, University of Wrocław, F. Joliot-Curie 14a, 50-383, Wrocław, Poland

Article info

Article history:
Available online 22 December 2020

Keywords:
Chemometrics
Vibrational spectroscopy
Electrochemistry
Mass spectrometry
Signal processing
Design of experiments
Exploratory analysis
Predictive modelling
Validation

Abstract

Chemometric methods are powerful tools used in analytical sciences and beyond. Their application is continuously increasing due in part to the technological advances that allow exploration of new issues. This review presents the most common chemometric tools utilized in the most recent research articles (2018 to now), which emerge at the interface between several disciplines and instrumental techniques such as vibrational spectroscopy, electrochemistry, and hyphenated mass spectrometry techniques. The review is divided into several sections: statistical design of experiments, signal pre-processing, exploratory data analysis, predictive modelling of the data, and statistical validation. In each section, we review the main mathematical models, then we examine the trends observed in the research articles, and finally discuss the potential pitfalls to avoid during the application of the methods. We believe that bringing together the main mathematical models with the possible experimental challenges will be of great use for non-expert chemometricians.

© 2020 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

The technological developments in analytical chemistry are accompanied by the generation of abundant and complex chemical data. The introduction of new methods and instrumentation is crucial for solving many fundamental questions. However, it also presents a big challenge for data analysis. Understanding what chemical information is contained in all of the data obtained is in itself a fundamental issue. The continuous development in transmission technology, mass analyzers or activation methods in mass spectrometry has permitted their application in new areas [1,2]. Also, the development of Raman spectroscopy methods enables their use in biomedical sciences [3]. Such advances require more time and effort to comprehensively analyze the data and extract the information of interest. Here, bioinformatics, systems biology and chemometric disciplines overlap in order to develop mathematical and statistical methods for the analysis of the data. The latter discipline is especially useful for the design of experiments, the mathematical treatment of the signal response, the exploration of the data, or for predictive modelling [4], and will be the subject of this review (Scheme 1). Therefore, chemometrics plays a key role in understanding the data and reaching the right conclusions. For example, an early article that presented a meta-analysis of genomics studies on schizophrenia concluded that “gene association studies are typically wrong” [5]. Although this referred to genomics studies, statistical issues are still present in the post-genome era. This was exemplified in a recent meta-analysis of metabolomics studies, which concluded that a third of them did not make use of proper statistics [6]. Another study reported how often methods cause bias in peptide identification [7].

This review summarizes the advances (2018 to present) in the use of chemometrics in three main fields of analytical chemistry: spectroscopy, electrochemistry and hyphenated mass spectrometry. This period is rather narrow but clearly indicates trends in chemometric tools and high interest in their application in multiple physicochemical methods (Scheme 2). By clustering and creating a bibliometric map, we quickly identified the main fields of applications, methods and/or instrumental techniques. For example, gas chromatography is clustered in blue with volatile organic compounds, biomarkers and metabolomics. This indicates the relationship between the technique and the field of application. On the other hand, we see that food chemistry, such as the analysis of adulteration of olive oil, and traditional Chinese medicines are fields of application of great interest (Scheme 2).

* Corresponding author.
** Corresponding author.
E-mail addresses: manuel.perisdiaz@uwr.edu.pl (M.D. Peris-Díaz), artur.krezel@uwr.edu.pl (A. Krężel).

https://doi.org/10.1016/j.trac.2020.116157
0165-9936/© 2020 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Abbreviations

DOE       statistical design of the experiments
OFAT      one-factor-at-a-time
RSM       response surface method
PB        Plackett-Burman
CCD       central composite design
BBD       Box-Behnken designs
ANOVA     analysis of variance
ML        machine learning
MOO       multi-objective optimization
DPC       N-dodecylpyridinium chloride
QU        quercetin
TA        tannic acid
NIR       near-infrared
MIR       middle infrared
SNV       standard normal variate
MSC       multiplicative scatter correction
EEM       excitation-emission matrix
PARAFAC   parallel factor analysis
MDR       missing data recovery
SG        Savitzky-Golay polynomial filters
NW        Norris-Williams derivation
FFT       fast Fourier transform
CWT       continuous wavelet transform
DWT       discrete wavelet transform
AsLSSR    asymmetric least squares splines regression
COW       correlation optimized warping
icoshift  interval correlation optimized shifting
GC        gas chromatography
LC        liquid chromatography
MS        mass spectrometry
TIC       total ion chromatogram
Q2        test set validation coefficient
RMSECV    root mean square error of cross validation
RMSEC     root mean square error of calibration
PLS-DA    partial least squares-discriminant analysis
PLS       partial least squares
PCA       principal component analysis
PC        principal components
HCA       hierarchical clustering analysis
LDA       linear discriminant analysis
SVMs      support vector machines
RF        Random Forest
NIPLS     nonlinear iterative PLS
SIMPLS    statistically inspired modification of PLS
LV        latent variable
CART      classification and regression tree
OOB       out-of-bag
SRM       structural risk minimization
VIP       variable importance in projection
CV        cross-validation

In the first part of this review, we focus on the use of the statistical design of the experiments (DOE) and their optimization. Here, we discuss and present the main DOE used throughout the research analyzed. In the second part, we describe the main signal pre-processing algorithms and strategies according to the aforementioned three main experimental techniques. This section, and the rest that are presented, are concluded by exposing the potential pitfalls that one may encounter during application of these algorithms and by presenting the current trends. In the third part, we address the exploratory analysis of the data, giving a brief introduction to the main techniques and concluding with the main pitfalls and trends. The fourth section constitutes the predictive modelling of the data and it is followed by the model validation. The trends for each section are drawn directly from the manuscripts examined, which were retrieved from the Scopus citation database. The search was focused on articles’ titles and keywords that contained chemometrics in combination with “electrochemistry”, or “spectroscopy” or “mass spectrometry”. The large number of articles (almost 400) published in the last two and a half years confirms how widely chemometrics is used along with analytical chemistry. Therefore, it is of utmost importance to reveal the possible pitfalls at every step of the chemometrics pipeline, and to extract conclusions derived from the trends.

2. Experimental design and mathematical optimization of experimental responses

Optimization of the obtained signal response may be addressed with a one-factor-at-a-time (OFAT) approach or with an experimental design that accounts for multiple factors simultaneously. OFAT is a simple experimental design in which one factor is changed while keeping the rest of the factors fixed [8]. Commonly adopted by non-expert users, OFAT presents multiple disadvantages [9]. Mathematically, OFAT usually requires more experimental runs than factorial designs (either full or fractional factorial designs) to obtain similar factor precision. Moreover, the experimental design depends on the project lead, which can easily lead to false optima. The second main disadvantage is that OFAT cannot estimate whether there are interactions between the factors studied. Because one factor is changed while the rest are kept fixed, information about how factors influence each other is lost. Therefore, OFAT can only be used for screening of the main factors and informs about the main effect of a factor on the output. This relates to the third disadvantage, finding sub-optimal solutions for the optimization problem.

All of these issues can be solved by incorporating the DOE methods [10,11]. DOE presents multiple advantages: (i) multiple factors can be simultaneously studied and thus their interactions estimated, which results in higher precision when estimating the effects of a factor; (ii) the experiments are designed in factorial designs (either full or fractional) and thus do not depend on the project lead; (iii) DOE provides better optimization than OFAT, finding the optimum solution; (iv) DOE offers the possibility to develop a regression model that captures the relationship between the significant factors and their interactions. Therefore, DOE is a much more reliable and complete methodology for screening or optimization than OFAT experiments [11]. There are several types of DOE, which according to their purpose are factorial designs for screening and response surface method (RSM) designs for optimization [12,13]. Factorial designs generally attempt to identify the factors that influence the response (output), and depending on the experimental design also to identify the effect of the interactions (Table 1) [14,15]. The main screening designs are full factorial designs, two-level full factorial designs, two-level fractional factorial designs, Plackett-Burman (PB) and Taguchi's orthogonal array. Full factorial designs completely study all of the factors, evaluating the main factors and all of the interactions between them. However, the designs quickly become too large when incorporating multiple factors or levels. For instance, selecting five factors (K) at three levels (n) results in $3^5 = 243$ experimental runs (calculated

Scheme 1. Overview of the chemometrics pipeline followed in this review. If the experimental data set (n) is large enough (a threshold of 40 is indicated in the scheme), the data can be split into a training set and a test set. On the other hand, for small data sets (n < 40), resampling algorithms such as cross-validation are preferred to split the data. If the data set is divided into a training and a test set, the preprocessing algorithms should be applied separately to these two data sets. The data preprocessing schemes have been separated according to the experimental techniques considered herein, namely electrochemistry, spectroscopies and hyphenated mass spectrometry, because each experimental technique requires specific preprocessing steps in order to remove the noise from the true signal and prepare it for the exploratory and/or predictive modelling treatment. After the predictive model is built, it should be validated by means of the validation set, and the previous steps iteratively followed until the optimal model is achieved. At the end, the optimal model can be evaluated using a test set that is independent of the collected experimental data. Abbreviations: NW: Norris-Williams derivation; FFT: fast Fourier transform; CWT: continuous wavelet transform; DWT: discrete wavelet transform; AsLSSR: asymmetric least squares splines regression; COW: correlation optimized warping; icoshift: interval correlation optimized shifting; SNV: standard normal variate; MSC: multiplicative scatter correction.
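
The splitting logic described in the caption of Scheme 1 can be illustrated with a minimal Python sketch. It is only one possible reading: the function names, the 25% test fraction, the five folds and the synthetic data are assumptions made for illustration, and "applying the preprocessing separately" is interpreted here as estimating the preprocessing parameters (column means) on the training set only and then reusing them on the held-out set.

```python
import numpy as np

def choose_validation_scheme(n_samples, threshold=40):
    """Pick the data-splitting strategy from the sample count (threshold from Scheme 1)."""
    return "train/test split" if n_samples >= threshold else "cross-validation"

def train_test_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and hold out a fraction as the test set (fraction is assumed)."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    return order[n_test:], order[:n_test]

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i, val in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, val

def mean_center(train, other):
    """Center both blocks with the column means estimated on the training block only."""
    mu = train.mean(axis=0)
    return train - mu, other - mu

# Synthetic example: 48 "spectra" with 10 variables each (made-up data).
X = np.random.default_rng(1).normal(size=(48, 10))
scheme = choose_validation_scheme(len(X))
train_idx, test_idx = train_test_indices(len(X))
X_train, X_test = mean_center(X[train_idx], X[test_idx])
```

For a data set below the threshold, `kfold_indices` would be used instead, repeating the same train-only estimation of the preprocessing inside every fold.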

according to $n^K$), which is prohibitive to perform in terms of cost and effort. Two-level full factorial designs ($2^K$) are widely popular for screening when the number of factors is lower than 4–5. Approaching this number of factors increases the complexity of the matrix ($2^4 = 16$ or $2^5 = 32$ runs). One way to mitigate the large number of experimental runs required is with fractional factorial designs. They assume that higher-order factor interactions (three or more factors) are not significant, so they are substituted with new factors, thus reducing the number of runs required. The standard notation for this design is $2^{K-p}$, where K is the number of factors and p is the number of design generators. We may calculate the fraction of experiments required with respect to the full factorial design as $1/2^p$. For example, considering an experiment with three factors (K = 3) and one design generator (p = 1) leads to a $2^{3-1}$ design with $1/2^1$ or half of the runs required from a full factorial design ($2^3$). So, the design is reduced from eight to four experiments.

Scheme 2. Network topology generated from the documents retrieved in this review. The yellow, blue, and green colors represent the three main clusters that correspond to chemometric methods, instrumental techniques, and applications, respectively. The sizes of the labels are scaled according to the occurrence of the word in the retrieved documents.

However, half fractional factorial designs can be practically used only when at least three factors are studied. This relates to the resolution of the design, that is, its ability to separate the main effects from the interaction effects. Formally, the resolution is defined as the length of the shortest word in the defining relation. Our previous $2^{3-1}$ design with three factors (e.g., factor “A”, factor “B” and factor “C”) has resolution III because there is only one possible design generator with the word “ABC”. A design with at least resolution III is needed to estimate the main effects. However, here one main factor is confounded with a two-factor interaction. An easy way to interpret it is by splitting the resolution number (III) into the sum of two integer numbers. Resolution III would be 3 = 1 + 2. So, the main effects (1) and the two-factor interactions (2) can be confounded. A $2^{4-1}$ design with four factors has resolution IV and may estimate the main effects, confounded or aliased with three-factor interactions. This design can also estimate the two-factor interactions but they are aliased with other two-factor interactions. Similarly, splitting IV into two integers results in two
possibilities, 4 = 1 + 3 or 2 + 2. Therefore, the main factors are aliased with three-factor interactions, or the two-factor interactions are aliased with other two-factor interactions. Increasing to resolution IV thus allows the main effects to be estimated without confounding by two-factor interactions, but the two-factor interactions are aliased with each other. A $2^{5-1}$ design with 5 factors would have resolution V. Here, the main effects and two-factor interactions are estimated with no aliasing to any other main effect or two-factor interaction. However, the two-factor interactions are aliased with three-factor interactions. As explained above, V can be split as 1 + 4 or 3 + 2, meaning that main factors are aliased with four-factor interactions or two-factor interactions are aliased with three-factor interactions. The higher the resolution, the lower the confounding, but the more experimental runs are required.

Plackett-Burman designs are two-level fractional factorial designs with resolution III, and thus aimed at investigating the main effects (Table 1). This design allows one to include up to N-1 factors with N experiments, where N is equal to a multiple of four. PB designs offer a lower number of runs than fractional factorial designs for the screening of the main effects. Thus, for cases where we do not expect interactions between factors, PB is desired. On the other hand, if we assume that there will be an interaction between the factors, we might want to choose fractional factorial designs. The last design for screening is the Taguchi orthogonal array, a fractional factorial design where only the main effects and two-factor interactions are considered. Thus, prior knowledge of the system is required to identify which interactions might be significant for the design.

Once the important factors have been screened, the optimization and finding the optimum value of the response can be performed with the RSM designs [16]. There are two main types of RSM designs, central composite design (CCD) and Box-Behnken design (BBD). CCD is often used in optimization designs because it can account for up to five levels for each factor and also it can include runs from fractional factorial designs. BBD is a three-level design that results in a less expensive design and thus is preferred to CCD when three factors are involved in the optimization. However, BBD cannot include runs from fractional factorial designs and thus does not allow sequential experiments. The screening designs presented, since they have only two levels for each factor, only allow modelling a linear response surface. The responses obtained are fitted to a linear function (Equation (1)):

$$y = b_0 + \sum_{i=1}^{k} b_i x_i + \varepsilon \qquad (1)$$

where $k$ is the number of variables, $b_0$ is a constant term or intercept that represents the overall mean effect, $b_i$ are the linear coefficients, $x_i$ are the variables and the residual error is $\varepsilon$. That is, $b_i$ is the effect of the factor $x_i$. On the other hand, both CCD and BBD designs can estimate the interaction between the experimental variables through interaction terms (Equation (2)):

$$y = b_0 + \sum_{i=1}^{k} b_i x_i + \sum_{1 \le i < j}^{k} b_{ij} x_i x_j + \varepsilon \qquad (2)$$

where $b_{ij}$ represents the coefficients for the interaction between the factor $x_i$ and the factor $x_j$. The terms $k$, $b_0$, $\varepsilon$ and $b_i$ are the same as for Equation (1).

These models may incorporate quadratic terms in a regression model to account for the curvature in the response. This implies that they are able to determine critical points in the surface (i.e., maximum, minimum, and saddle points). In order to do so, the response is fitted to a polynomial function that contains quadratic terms (Equation (3)):

$$y = b_0 + \sum_{i=1}^{k} b_i x_i + \sum_{1 \le i < j}^{k} b_{ij} x_i x_j + \sum_{i=1}^{k} b_{ii} x_i^2 + \varepsilon \qquad (3)$$

where $k$, $b_0$, $\varepsilon$, $b_i$, $b_{ij}$, $x_i$ and $x_j$ are the same as for Equation (2), and $b_{ii}$ are the coefficients of the quadratic terms.

Thus, modelling the results obtained can range from using the simplest linear function (Equation (1)) to a full model that accounts for linear, interaction, and quadratic terms (Equation (3)). Generally, the mathematical model should be selected based on the application of analysis of variance (ANOVA). Based on the p-values obtained from ANOVA for each regression coefficient, the linear (A, B, C), interaction (AB, BC, AC), or quadratic terms ($A^2$, $B^2$, $C^2$) are included or excluded in the final model. Once the terms to include in the equation have been selected, the regression problem can be solved with a least-squares estimator or with machine learning (ML) algorithms [17–20]. The multiple regression model is assessed based on the coefficient of determination ($R^2$). The $R^2$ increases when adding new terms to the model. Therefore, one should examine instead the adjusted $R^2$, which takes into account the number of terms in the regression model. The adjusted $R^2$ will increase only if the new term added to the regression model improves its explanatory power. Another parameter that should be checked is the lack-of-fit of regression, which assesses the variability

Table 1
Overview of the most commonly used statistical design of experiments (DOE) according to the main objective, either screening or optimization.

Objective | DOE | Features
Screening | Full factorial design | Allows all factors and levels ($n^K$); prohibitive in terms of cost and effort
Screening | Two-level full factorial design | Factors < 4-5; two-level: $2^K$
Screening | Two-level fractional factorial design $2^{3-1}$ | 3 factors; Resolution III: one main factor is confounded with two-factor interaction
Screening | Two-level fractional factorial design $2^{4-1}$ | 4 factors; Resolution IV: estimate the main effects with no confounding, but the two-factor interactions are aliased with each other
Screening | Two-level fractional factorial design $2^{5-1}$ | 5 factors; Resolution V: main effects and two-factor interactions are estimated with no aliasing to any other main effect or two-factor interaction
Screening | Plackett-Burman | N-1 factors with N experiments, where N is equal to a multiple of four with eight as starting point; two levels per factor; Resolution III: one main factor is confounded with two-factor interaction, thus it is desired for cases where we do not expect interactions between effects and we aim to investigate main effects
Screening | Taguchi's orthogonal array | Estimate main effects and two-factor interactions; prior knowledge of what interactions might be significant
Optimization | CCD | Up to five levels each factor; it can include runs from fractional factorial designs; estimation of main effects, interaction and quadratic terms
Optimization | BBD | Three levels each factor; estimation of main effects, interaction and quadratic terms

CCD: Central composite design; BBD: Box-Behnken design.
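
The resolution and aliasing patterns summarized in Table 1 can be checked numerically. The following Python sketch is an illustration under stated assumptions, not a procedure taken from the reviewed articles: the helper names are hypothetical, factors are coded as -1/+1, and the half fractions are built with the positive generators C = AB (defining relation I = ABC) and D = ABC (I = ABCD). Two effects are reported as aliased when their contrast columns are identical.

```python
import itertools
import numpy as np

def two_level_full_factorial(k):
    """All 2**k runs of a two-level design, coded -1/+1, one column per factor."""
    return np.array(list(itertools.product([-1, 1], repeat=k)))

def effect_columns(design, names, max_order):
    """Contrast column for every main effect and interaction up to max_order."""
    cols = {}
    for order in range(1, max_order + 1):
        for combo in itertools.combinations(range(len(names)), order):
            word = "".join(names[i] for i in combo)
            cols[word] = np.prod(design[:, list(combo)], axis=1)
    return cols

def aliased_pairs(cols):
    """Pairs of effects whose contrast columns are identical (fully aliased)."""
    return [(a, b) for a, b in itertools.combinations(cols, 2)
            if np.array_equal(cols[a], cols[b])]

# 2^(3-1) half fraction: A and B from a 2^2 full factorial, C generated as AB.
base = two_level_full_factorial(2)
half_3 = np.column_stack([base, base[:, 0] * base[:, 1]])
aliases_res3 = aliased_pairs(effect_columns(half_3, "ABC", max_order=2))

# 2^(4-1) half fraction: A, B, C from a 2^3 full factorial, D generated as ABC.
base = two_level_full_factorial(3)
half_4 = np.column_stack([base, base.prod(axis=1)])
aliases_res4 = aliased_pairs(effect_columns(half_4, "ABCD", max_order=2))

print(aliases_res3)  # main effects paired with two-factor interactions
print(aliases_res4)  # only two-factor interactions paired with each other
```

In the four-run design every main effect shares its column with a two-factor interaction (resolution III), whereas in the eight-run design the main effects are free of two-factor interactions and only pairs of two-factor interactions coincide (resolution IV).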

Table 2
Overview of the DOE employed for the different analytical techniques, that is, vibrational spectroscopy, electrochemical and hyphenated mass spectrometric techniques.

Analytical method | DOE | Objective | Experimental aim | Reference
Spectroscopies | Fractional factorial design resolution IV | Screening | Synthesis of silica nanoparticles | [32]
Spectroscopies | Full factorial design | Screening | Drug development | [29]
Electrochemistry | $3^3$ full factorial design | Screening | Drug quantification, phenol quantification | [30,35,36]
Electrochemistry | Mixture design | Optimization | Synthesis biosensor | [33,34]
Electrochemistry | L36 fractional factorial design | Screening | Determination metals in soils and plants | [161]
Electrochemistry | $2^3$ full factorial design | Screening | Determination flavonol in drink, determination of folic acid in urine | [37]
Electrochemistry | CCD | Optimization | Detection of Sm(III) | [38,39]
Electrochemistry | PB | Screening | Determination of Sm(III) | [40]
Electrochemistry | BBD | Optimization | Drug quantification | [40]
Electrochemistry | Four-level orthogonal array | Screening | | [31]
LC-GC/MS | CCD | Optimization | Optimization chromatographic and MS conditions | [25,162,163]
LC-GC/MS | Full factorial design | Screening | Optimization chromatographic conditions | [26]
LC-GC/MS | BBD | Optimization | Optimization MS parameters, compound extraction | [17,18,27,28]
LC-GC/MS | Statistical mixture design | Optimization | Compound extraction | [164]

CCD: Central composite design; BBD: Box-Behnken design; PB: Plackett-Burman; LC: liquid chromatography; GC: gas chromatography; MS: mass spectrometry.
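
To make the CCD entries of Table 2 and the quadratic model of Equation (3) more concrete, the sketch below builds a coded two-factor CCD and fits the full quadratic model by ordinary least squares, reporting $R^2$ and adjusted $R^2$. Everything specific in it is an assumption chosen for illustration: the function names, the rotatable axial distance $\alpha = (2^k)^{1/4}$, the three center replicates, and the noise-free synthetic response with made-up coefficients.

```python
import itertools
import numpy as np

def central_composite_design(k, n_center=3):
    """Coded CCD: 2**k factorial corners, 2k axial points at +/- alpha, center replicates."""
    corners = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))
    alpha = (2 ** k) ** 0.25  # rotatable axial distance (an assumed design choice)
    axial = np.vstack([sign * alpha * np.eye(k)[i]
                       for i in range(k) for sign in (-1.0, 1.0)])
    center = np.zeros((n_center, k))
    return np.vstack([corners, axial, center])

def quadratic_model_matrix(X):
    """Columns: intercept, x_i, x_i*x_j (i < j), x_i**2, in the order of Equation (3)."""
    n, k = X.shape
    cols = [np.ones(n)]
    cols += [X[:, i] for i in range(k)]
    cols += [X[:, i] * X[:, j] for i, j in itertools.combinations(range(k), 2)]
    cols += [X[:, i] ** 2 for i in range(k)]
    return np.column_stack(cols)

def fit_least_squares(M, y):
    """Ordinary least squares fit; returns coefficients, R^2 and adjusted R^2."""
    coef, *_ = np.linalg.lstsq(M, y, rcond=None)
    resid = y - M @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    n, p = M.shape  # p counts the intercept as one of the model terms
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    return coef, r2, adj_r2

# Made-up coefficients (b0, b1, b2, b12, b11, b22) and a noise-free response.
design = central_composite_design(k=2)
true_coef = np.array([5.0, 2.0, -1.0, 0.5, -1.5, -0.8])
M = quadratic_model_matrix(design)
y = M @ true_coef
coef, r2, adj_r2 = fit_least_squares(M, y)
```

With two factors the design has eleven runs and each factor takes five coded levels, in line with the "up to five levels" noted for CCD. On real data the response would contain noise, and the term-by-term ANOVA and lack-of-fit checks described in the text would still be needed before accepting the model.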

due to random error. In order to be able to calculate it, the design must incorporate replicates or central points. If the error provided by the lack-of-fit of regression is lower than the random pure error (p-value > 0.05), the regression model is well fitted. ML algorithms have shown higher accuracy, handle complex nonlinear relations, and automatically pick up the interaction effects between the variables [17–20].

When there are multiple responses to simultaneously optimize, finding the optimum solution is not a trivial task. Multi-objective optimization (MOO) strategies represent an approach for considering all of the responses simultaneously in order to provide a solution. Several approaches such as Bayesian models, genetic algorithms, Pareto optimality, and desirability functions have been used for solving MOO [11,17,18,21–24]. The most often used criterion is the desirability function or Derringer function, in which each response is assigned a desirability function that ranges from 0 (undesirable response) to 1 (fully desired response). Overall desirability is obtained as the weighted geometric average of the individual desirability functions. The individual desirability function can be constructed based on minimization/maximization criteria establishing target values for the response. An interesting approach that we have recently employed is the Pareto front approach [17]. The Pareto approach identifies a set of optimal solutions according to a selected trade-off (i.e., maximization of a particular response without deterioration of the rest of the responses).

2.1. Applications, trends and potential pitfalls in DOE

DOE has been used for the screening and optimization of chromatographic conditions [25,26], optimization of extraction methods [28,29], in screening stages during drug development [29–31], optimization of chemical synthesis [32–34], and for determination of other compounds of interest [35–40]. DOE has become a preferred choice for rational pharmaceutical development. For example, a full factorial design and a four-level orthogonal array have recently been used in the screening of ramipril and prediction of drugs used to treat hypertension [29,33]. Another common application is in the screening of instrumental parameters affecting signal acquisition. In this sense, PB has been used for screening the parameters influencing the electrochemical detection of Sm(III) in its complex with diethylenetriaminepentaacetic acid. Furthermore, the authors identified significant factors that were then optimized with a BBD design [40]. Another application of BBD designs is the optimization of mass spectrometry conditions [17,18]. In both reports, the authors employed BBD for the optimization of preselected instrumental factors related to mass spectrometry instruments. Multiple responses were simultaneously optimized, solving the MOO problem through Pareto optimality. Generally, we may observe that BBD is more commonly employed than CCD for optimization purposes in hyphenated MS techniques (Table 2). Herein, the focus will be set on using CCD and BBD for optimization purposes.

A recent application in the development of analytical methods, based on supercritical fluid chromatography coupled to time-of-flight mass spectrometry in order to identify lipid markers from traditional Chinese medicine, used a CCD to optimize the chromatographic separation [25]. In chromatographic separations one is usually interested in the evaluation of several responses, such as the separation between peaks, the analysis time, or characteristics of the peaks such as peak width, symmetry or resolution. To do so, multiple response optimization through a desirability function or Pareto efficiency is of great interest. The authors studied the effect of the solvent type, flow rate, column temperature and back pressure at five levels, which required 30 experiments. The experimental design included six repetitions of the central point to estimate the variance of the model, and the experiments were randomized [25]. Several responses were simultaneously studied: resolution between peaks, analysis time, total number of peaks, and number of peaks between triglycerides and diglycerides. The four responses obtained from the CCD were evaluated using a Derringer function, which showed for each response the best set of conditions. The individual desirability values were combined into the global desirability function that ranged from 0 to 0.22. Then, a quadratic model with an $R^2 = 0.78$ was determined and subsequently evaluated with ANOVA. Only the flow rate and the interaction between back pressure and column temperature were found to be significant factors (p < 0.10). The investigated procedure allowed the best settings of the experimental variables to be determined for investigation of the lipid profile of coix seeds [25].

Several comments and conclusions can be made based on this report. If using desirability functions, the authors should report the criteria for the individual desirability (i.e., maximization of the resolution while minimizing the analysis time). The report showed a low global desirability function of around 0.2, which may be due to very restrictive criteria for the individual desirability, although this was not explained. An interesting question is whether the authors should use a screening design instead of an optimization design in order to determine the experimental factors and interactions that have a significant influence. The answer is in fact yes. During analytical method development, which involves investigating the effect of particular factors and later their optimization, screening designs are executed in order to determine whether a factor should be later included in an optimization design. Therefore, a recommendation would be to remove those factors that are
not significant from the experimental design, as they reduce the accuracy of the predictors, and to obtain a simplified model.

Another application in extraction procedures includes the use of ultrasound-assisted extraction of RAs (RA-V, RA-VII and RA-XII) in Rubia plants [28]. RSM combined with a BBD design with four variables was applied in this study to evaluate the optimal methanol concentration, liquid to solid ratio, extraction time and frequency. Then, ANOVA was performed to test whether the models were significant. The regression models were significant (p < 0.05) for all three analytes, and the lack of fit was not significant (p > 0.05) for RA-V and RA-VII. Herein, the authors provided an adjusted $R^2$ of 0.96 and a low coefficient of variation (<10%), which shows high precision and repeatability. Moreover, the regression coefficients were also found to be significant (p < 0.05). Finally, the set of optimized extraction conditions obtained was used to predict the maximum content of the RAs (2.46, 0.35 and 3.68 mg/g). Five additional experiments obtained 2.47, 0.35 and 3.57 mg/g, which is in excellent agreement with the predicted values, and validates the model [28].

Mollaei et al. recently developed an electrochemical method for the simultaneous determination of folic and folinic acid based on the application of N-dodecylpyridinium chloride (DPC) as a modifier [38]. Here, a four-factor CCD and a two-factor CCD at five levels were applied to optimize several variables (the first model included scan rate, DPC, step potential and pre-concentration time, and the second design included the pulse height and pH). Quadratic models were fitted and the response surfaces were obtained and evaluated. No significant lack of fit was obtained for any of the models, with a good adjusted $R^2$ > 0.91. Finally, the set of optimal values was elucidated and evaluated using real samples [38].

Mosleh et al. proposed a method for the determination of quercetin (QU) in the presence of tannic acid (TA) in drinks based on a nano-structured sensor of carbon nanotubes for achieving better detection limits [39]. CCD and RSM were used to evaluate the effect of pH, scan rate, step potential and amount of multi-walled carbon nanotubes on the determination of QU in the presence of TA with 27 experiments. The model was evaluated

processing methods are reference-independent and they do not require reference values. Reference-independent pre-processing methods can basically be divided into two groups: scattering correction methods and spectral derivatives (Scheme 1). Light scatter is a common effect for all analytical techniques using light, such as NIR or IR. When the size of the particles in the sample matches the magnitude of the spectroscopic wavelengths, baseline shifts and non-linearities affect the signal. Thus, by applying proper scatter correction techniques, the physical variability in the samples and the baseline shifts can be removed. Normalization methods such as mean-centering or Euclidean distances, baseline correction methods, standard normal variate (SNV) [41,42] and multiplicative scatter correction (MSC) [43,44] are commonly used for this purpose. SNV does auto-scaling on the rows of the matrix (samples in rows and wavenumbers in columns) whereas MSC performs background correction and normalization simultaneously using a reference spectrum or using the entire data set if the reference spectrum is not available. Then, the spectrum is corrected by regressing each spectrum against a reference spectrum or the average spectrum from the data set. Although these two techniques often provide similar results, if SNV is followed by baseline correction it will likely give higher baseline correction due to the fact that MSC performs a simultaneous correction. Particular attention has been paid to reducing Rayleigh and Raman scattering from fluorescence excitation-emission matrix (EEM) [45–47]. In fluorescence experiments, part of the incident energy is absorbed and converted to vibrational and rotational energy, producing scattering bands. Therefore, it is important to remove such artifacts prior to any chemometric analysis, commonly parallel factor analysis (PARAFAC) [47]. One simple approach is to remove these scattering bands and add missing values [48]. Another approach consists in the use of interpolation methods to reduce the scattering effects [46]. Notwithstanding, a weighted PARAFAC may model the EEM fluorescence spectra in the presence of Raman scattering [49]. Among the current
with visualization of the normality probability plot, which shows strategies reported, one made use of a missing data recovery
the normal distribution of the residuals. The straight line ob- (MDR) coupled to principal component analysis or PARAFAC to
tained indicates that the residuals follow a normal distribution, correct Rayleigh scattering [48]. Extended information is pre-
which is a requirement for the validity of ANOVA. Then, the sented in a recent review article [50].
authors evaluated the homogeneity of the variance with the On the other hand, baseline correction techniques subtract the
visualization of the residuals of the response, randomly distrib- baseline offset or slope. Detrending is one of the most common
uted. Once it was verified that ANOVA could be applied, it was techniques in spectroscopy for baseline correction, especially in NIR
used to evaluate the quadratic model, although the results were or IR, although other algorithms such as wavelet transformations or
not presented in the manuscript. A final adjusted R2 of 0.98 polynomial baseline fittings are sometimes used.
indicated a good agreement between predicted and observed Spectral derivatives have been commonly used to reduce the
values. Finally, the method was evaluated using real samples and additive and multiplicative effects on spectral data, including finite
showed good performance [39]. differences, Savitzky-Golay polynomial filters [51] (SG), and Norris-
Williams (NW) derivation. The first one, finite differences, calcu-
3. Signal pre-processing lates a difference spectrum between adjacent points for the first
derivate, but in practice it is not used due to the inflation of the
3.1. Pre-processing spectroscopic data noise. SG and NW, on the other hand, are popular choices that
include a previous smoothing step before derivation avoiding
The application of pre-processing methods is of vital impor- reduction of the signal-to-noise ratio. The use of a derivate provides
tance to prepare the data for the subsequent analysis (explor- a means to remove the baseline but also the scattering effects. NW
atory, regression, or classification analysis). Generally, one is performs a moving average over the data and calculates a finite
seeking to transform the data in a way that will better follow difference on the smoothed spectrum. Although NW might provide
Beer's law. Depending on the spectroscopic method, different similar estimates as SG, the latter one is more widely used, giving
physical and chemical phenomena may deviate the signal from better results.
the linear relationship between the concentration and the Generally, to find the optimal pre-processing method, one may
absorbance. For instance, scatter is more pronounced in near- apply a trial-and-error approach where one combines different
infrared (NIR) than in middle infrared (MIR) spectra, although pre-processing techniques and selects those that give the best
baseline and noise are common issues for spectroscopic tech- model performance [52,53]. However, other two approaches are
niques. The pre-processing method aims to improve this linear commonly employed to select the proper pre-processing method.
relationship. In spectroscopy the most commonly used pre- These are the simple visual inspection of the data before and after
applying a pre-processing method, and the use of quality parameters that assess the quality of the pre-processing. All of these approaches will be explained in more detail in section 3.4. Because the diverse spectroscopic techniques are based on different phenomena, their background signals are different and require particular pre-processing (Scheme 1). Although it is true that there are predetermined combinations that work sufficiently well for a particular spectroscopic technique, the investigation of other pre-processing methods might be worthwhile. For example, NIR spectroscopy exhibits pronounced scattering effects that might be reduced enough with a derivative [52,53]; IR exhibits less scattering effects than NIR and thus pre-processing might not be necessary in some cases; Raman spectroscopy is strongly influenced by the intrinsic fluorescence background, which might be removed by baseline correction [54]. These effects need to exceed the detector noise and the intensity fluctuation of the radiation source. Moreover, the scattering algorithms are designed to be applied over the raw spectrum and thus should be applied before differentiation or baseline correction. Whilst these only represent basic guidelines, in practice, one iteratively applies a set of pre-processing methods and checks the performance achieved (Scheme 1). As we may observe, the proper pre-processing technique is intrinsic to the particular spectra, and many combinations are found in the literature (Table S1). Approximately 15% of the manuscripts published in the time period analyzed used a combination of pre-processing methods (Fig. 1A). Savitzky-Golay differentiation filters represent the most often used method, providing a means to remove not only the baseline but also the scattering effects. Briefly, SG infers derivatives from local polynomials fitted around each data point.

3.2. Pre-processing electrochemical data

In contrast to spectroscopic methods, the linear relationship between signal and concentration is not a feature of electrochemical methods, in which the signal is interpreted in terms of electrochemical processes rather than chemical species. Another main difference with respect to spectrophotometric methods is the dynamic character of many electroanalytical methods, which adds an extra dimension of complexity. The aforementioned reasons have for a long time delayed the use of chemometric methods, and still nowadays the number of publications reporting their use is small in comparison with spectroscopy. Typical signal processing steps are noise removal, baseline correction, potential shift correction (alignment of the voltammograms), and separation of overlapping signals (Scheme 1). Each one of these steps targets a particular artefact, and is applied to the data in isolation or consecutively. As for other techniques, there exists no specific order that may be generally applied; the order in which the steps are used is particular to the experimental data and should be carefully inspected. The noise affecting electrochemical data is commonly classified into three types: 1) quasi-random high-frequency noise; 2) spikes, which are short pulses with large amplitude; and 3) baseline drifts or low-frequency signals. In order to remove the noise and enhance the signal, algorithms based on smoothing are commonly used. These include Savitzky-Golay [55], splines, fast Fourier transform (FFT), continuous wavelet transform (CWT), and discrete wavelet transform (DWT). Baseline correction is a rather difficult issue in electrochemistry, due to the complexity of developing a theoretical mathematical model that describes the baseline. The background current is constituted by the capacitive and faradaic components from electrolytes, solvents, or side redox reactions. Moreover, the background current is affected by the pH, oxygen, and electro-sensitive components. Experimental methods such as subtractive anodic stripping voltammetry have been designed to reduce or eliminate the background current. On the other hand, common mathematical algorithms are widely accepted and employed. Among them are splines [56–59], asymmetric least squares splines regression (AsLSSR) [60–63] and CWT [64–66] or wavelet-based [67–71] methods (Scheme 1). Another issue with electrochemical data is the horizontal or potential shift. The signals obtained from electrochemical techniques usually suffer from shifts in potential values. These nonlinearity effects result in deviations from linearity that hinder the application of linear data processing algorithms. Therefore, the alignment of voltammograms is often required. In order to correct the potential shift, correlation optimized warping (COW) and interval correlation optimized shifting (icoshift) are popular algorithms. Table S2 and Fig. 1B show the algorithms used for the different electrochemical techniques in the latest research (2018–2020).
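
As an illustration of mathematical baseline correction for a voltammogram, the following is a minimal sketch of a plain asymmetric least squares smoother with a second-difference penalty. It is a simpler relative of the spline-based AsLSSR cited above, not that algorithm itself. The function name `asls_baseline` and the default values of `lam` (smoothness), `p` (asymmetry) and `n_iter` are assumptions chosen for illustration and would need tuning for real data.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve


def asls_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Estimate a smooth baseline under the peaks of a 1-D signal.

    Hypothetical helper: lam, p and n_iter are illustrative defaults.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-order difference operator, shape (n - 2, n)
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    z = y
    for _ in range(n_iter):
        W = sparse.diags(w)
        # Weighted, penalized least squares fit of the baseline
        z = spsolve((W + penalty).tocsc(), w * y)
        # Points above the fit (peaks) get a small weight, points below a large one
        w = np.where(y > z, p, 1.0 - p)
    return z


# Synthetic example: a sloped background plus one Gaussian peak
x = np.linspace(0.0, 1.0, 200)
background = 0.5 + 0.3 * x
signal = background + np.exp(-((x - 0.5) / 0.03) ** 2)
corrected = signal - asls_baseline(signal)
```

The asymmetry parameter is what keeps the fit under the peaks: with `p` close to zero, points above the current estimate barely influence the next iteration.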

Fig. 1. Trends in the number of publications observed for the pre-processing techniques used in spectroscopic (A) and electrochemical methods (B). Frequency is calculated as the
percentage of publications where a particular method has been used, considering the total sum of all of them. SG: Savitzky-Golay; SNV: standard normal variate; MSC: multi-
plicative scatter correction; ALS: asymmetric least squares; Combination stands for any combination of pre-processing methods. BC Poly: polynomial baseline correction. MC: mean
centering; COW: correlation optimized warping; FFT: fast Fourier transform; DWT: discrete wavelet transform; AsLssr: asymmetric least squares splines regression.
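
The three spectroscopic pre-processing methods abbreviated in the caption of Fig. 1 and described in Section 3.1 (SNV, MSC and SG derivatives) can be sketched in a few lines. This is a minimal sketch, assuming spectra are stored as rows of a NumPy array. The function names and the default window length and polynomial order are assumptions for illustration, not the settings of any study cited here.

```python
import numpy as np
from scipy.signal import savgol_filter


def snv(X):
    """Standard normal variate: center and scale each spectrum (row)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std


def msc(X, reference=None):
    """Multiplicative scatter correction against a reference spectrum.

    If no reference is given, the mean spectrum of X is used.
    The reference is returned so that it can be reused on other spectra.
    """
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, x in enumerate(X):
        # Fit x = intercept + slope * ref, then undo offset and scaling
        slope, intercept = np.polyfit(ref, x, deg=1)
        corrected[i] = (x - intercept) / slope
    return corrected, ref


def sg_derivative(X, window_length=11, polyorder=3, deriv=1):
    """Savitzky-Golay smoothing and differentiation along each spectrum."""
    return savgol_filter(X, window_length=window_length,
                         polyorder=polyorder, deriv=deriv, axis=1)
```

Returning the reference from `msc` is a design choice: a reference estimated from calibration spectra can then be passed unchanged when correcting spectra that were held out for validation.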

3.3. Pre-processing hyphenated mass spectrometry data

Common hyphenated mass spectrometry experiments, particularly gas chromatography (GC)- and liquid chromatography (LC)-mass spectrometry (MS), have become indispensable tools in all branches of science [72]. The chromatography step that separates components of a mixture based on physical or chemical properties, coupled with the mass measurement at specific time points, generates a large amount of data, which usually requires several pre-processing steps. Moreover, the use of GC-MS and LC-MS in omics studies where multiple samples are analyzed increases the complexity of the pre-processing setup. For example, due to the several hours or days that the analysis of a batch of samples might take, the instrument signal may fluctuate, and thus one needs to consider this and correct it in the post-processing [72]. Systematic variation in the signal appearing in time can be modelled and corrected through the use of appropriate experimental design [72]. Typical signal processing steps are noise removal, baseline correction, deisotoping or charge-state deconvolution, peak alignment, and peak detection [73–75] (Scheme 1). The filtering of the noise highly depends on the analytical platform and instrument used. Generally, the noise can be classified as chemical noise from the buffers and solvents or random noise generated from the detector electronics. As for spectroscopies and electrochemical methods, Savitzky-Golay or wavelet-based algorithms are commonly employed as a smoothing filter [55]. Correction of the baseline is required since it may affect further quantification or chemometric analysis [76]. Usually, a baseline offset is calculated and then subtracted from the original spectrum. This can be achieved through derivatives with an SG filter or by applying asymmetric least squares [77] to approximate the baseline. Another aspect is peak detection, which aims to distinguish the true signals from the background. A large number of peak detection algorithms have been released (e.g., XCMS Matched filter [78], centWave within XCMS [79], centroidPicker in MZmine [80], enviPick standalone [81] or R-MetaboList [82] and MAVEN [83]). Deisotoping or deconvolution identifies isotopic peaks that correspond to the same compound, removing redundant information and obtaining a simplified data matrix [84]. Here, the members of an isotopic envelope should be first identified in order to reduce it to the monoisotopic peak with an intensity that results from the intensities of the whole envelope. Certainly, a great number of algorithms have been developed, running either forwards using a theoretical model or backwards from the experimental mass spectra. They are mainly classified as peak assignment or simulation algorithms (e.g., CHAMP [85], Massign [86], UniDec [87], PeakSeeker [88] or MetaOdysseus [89]). Peak assignment algorithms are based on performing a prior peak detection and assigning a specific charge state to each peak based on multiple charge states (e.g., MaxEnt [90,91], AutoMass [92], Z-Score [93]) or isotope peak spacing (e.g., TRASH [94], Z-Score [93]). Although these algorithms are relatively fast, difficulties in peak picking of complex spectra might hamper their use. On the other hand, the simulation algorithms simulate multiple hypothetical mass and charge state distributions and select the combination that fits the data best. Simulation algorithms produce quantitative results, but they are computationally intensive. Peak alignment is often required in omics studies in order to correct the retention time differences between runs. The alignment methods may or may not use the retention time to correct these differences. There are several main alignment strategies: i) the direct alignment of the total ion chromatogram (TIC) with another used as a reference. The correlation optimized warping method has been widely used for this purpose [95]; ii) aligning detected peaks by clustering the chromatographic peaks [96]. Although this method does not require retention time correction, multivariate analysis is performed in order to cluster the peaks; iii) two-step alignment in

Fig. 2. Trends observed for the exploratory (A), the predictive modelling (B) and the validation techniques (C) used in spectroscopy, electrochemistry and hyphenated mass spectrometry methods. PCA: principal component analysis; HCA: hierarchical cluster analysis; PLS-DA: partial least squares-discriminant analysis; PLS: partial least squares; LDA: linear discriminant analysis; MCR-ALS: multivariate curve resolution-alternating least squares; SIMCA: soft independent modelling of class analogies; SVM: support vector machines; RF: random forest; PCA-LDA: principal component analysis-linear discriminant analysis; PCA-SVM: principal component analysis-support vector machines.

which first retention times are corrected and then the detected peaks are clustered [97]. To correct the retention time deviation, a reference sample is used, and then the rest of the samples are non-linearly fitted; iv) two-step alignment in which first the detected peaks are clustered and then the retention times corrected [78]. Regarding hyphenated MS methods (Table S3), there is a widely dispersed set of combinations and we also observed the lack of appropriate description in many research articles, which complicates the evaluation of the existence of any trend (Scheme 1).

3.4. Applications, trends and potential pitfalls in signal pre-processing

Data pre-processing is a substantial challenge in a chemometrics workflow. Several steps are performed sequentially and thus the number of possible combinations grows exponentially. The individual steps may influence each other, which complicates the development of a general protocol that will always work. One of the possible pitfalls is bad data pre-processing. Once this has happened, the subsequent analysis of the data will be influenced and thus wrong conclusions may be obtained. Three approaches have mainly been used to select the best pre-processing strategy: (1) trial and error; (2) visual inspection of the data; and (3) the use of an objective function that measures the quality of the pre-processed data. In the first one, trial and error, several pre-processing methods are applied to the data and subsequently the pre-processed data are used as an input for classification or regression modelling [98]. The pre-processing method that gives the best performance is selected. For example, Bhavana et al. performed a comparative study by Raman spectroscopy, NIR and MIR of pre-processing methods aimed at improving the signal-to-noise ratio [99]. That is, they followed a trial-and-error approach in order to select the best pre-processing approach. In their study, the spectra were divided into spectral regions and submitted to a set of pre-processing techniques and combinations. These were MSC, first derivative and MSC followed by first derivative. Afterwards, a quantitative model was built by using PLS regression and the quality of the model was evaluated using R², test set validation coefficient (Q²), root mean square error of cross validation (RMSECV) and root mean square error of calibration (RMSEC). All of the pre-processing methods assayed produced high values, e.g., R² in the range 0.963–0.995, while MSC yielded slightly lower RMSEC and RMSECV. Another report employed NIR and MIR spectroscopy to characterize the varietal origin of olive oil, where SNV provided better performance of the partial least squares-discriminant analysis (PLS-DA) model [100]. Another approach used to select the pre-processing method is visual inspection of the data before and after applying any pre-processing technique. The pre-processed data, besides showing fewer artifacts, something that might be difficult to assess, should show more spectral overlap than the raw data. The possibility of assessing the pre-processing by visual inspection definitely depends on the type of the data and the pre-processing scheme. Looking at an SNV-processed NIR spectrum is not as complicated as evaluating pre-processed mass spectrometry data. In any case, it is rather complicated and inaccurate to rely only on visual inspection to evaluate the signal pre-processing. One approach to improve the visual inspection is the use of a dimensional reduction technique. In other words, instead of looking at the raw spectra, the data are reduced to a lower dimensional space, for example by applying a principal component analysis. Then, visualizing a PCA score plot and/or loading plots can determine the optimal pre-processing approach [101,102]. However, it has been demonstrated that the use of PCA does not always determine whether or not the pre-processing method is optimum [103]. In any case, the visual inspection of the data before and after applying a pre-processing strategy is a good idea. The relevant information should be kept while the pre-processing should remove unwanted variations in the signal. The use of dimensionality reduction tools such as PCA can also help to assess the pre-processed data. The third and last way to assess the quality of the pre-processed data is through the use of an objective function that measures the quality of the pre-processed data. Although it does not yet seem common for spectroscopic or electrochemical data, the situation is different in chromatography and mass spectrometry. Pearson's correlation coefficient or a scoring system has been used to assess the deconvolution quality of mass spectra. A second major pitfall that may occur relates to applying pre-processing methods to a whole data set and then splitting the data set into a training and validation set. As will be discussed further in this review, the validation of the predictive model is an essential step. If the data set is divided into training and test sets, the pre-processing algorithms should be applied separately to these two data sets.

Certainly, the most used algorithm is a stand-alone SG filter, covering almost 30% of the research (Fig. 1A, Tables S1–2). The use of SG filters as a signal smoothing step reduces the enhancement of the high-frequency noise coming from the derivation. Simply, a local polynomial is fitted through the signal within a moving window. Here, the polynomial degree (usually set to 2 or 3) and the number of data points in the moving window are the crucial parameters to select. Decreasing the polynomial degree increases the extent of the smoothing. A simple rule is to keep a 3rd degree polynomial fixed and vary the window size, inspecting the final results, which will depend on the total number of signal points, the S/N and the resolution.

In comparison with spectroscopy, the use of chemometrics in electrochemistry still remains rare, as we can observe from the number of research articles that have used it (Table S2). The low number of reports using chemometrics makes a proper statistical analysis difficult. However, the analysis of these data suggests that normalization is frequently employed along with electrochemical methods (Fig. 1B). For example, Baldo et al. developed a method to determine the acidity of extra-virgin olive oil by voltammetry combined with partial least squares (PLS) regression [104]. Because of the complexity of the electrochemical signal, no pre-processing of the raw signal was applied and the data were only mean-centered prior to PLS regression.

No trends can be extracted from hyphenated MS methods due to both the widely dispersed set of combinations and the lack of appropriate description of the methods employed. Notwithstanding, we may easily conclude, observing the data, that transformation of the data through normalization prior to multivariate regression analysis is common (Table S3). Here, it is interesting to note the use of wavelet transforms to filter the noise and reduce the data size [105–107]. This is mainly achieved by the use of CWT or DWT, the latter being more efficient than CWT. Removal of uninformative data reduces the computational time required for subsequent chemometric analysis.

4. Exploratory analysis

The exploratory data analysis represents the first step for chemometric processing after the data have been pre-processed, if needed (Scheme 1) [108]. Without the need of any formulated hypothesis, the exploratory analysis gives an overview of the data and of its variation. Through this, one can observe well-defined substructures of the data, diagnose outliers, and also identify which variables describe the system.

4.1. Principal component analysis

Principal component analysis (PCA) is probably the most widely used exploratory method in the chemical and physical sciences. It defines new variables, named principal components (PCs), as linear combinations of the original ones, allowing patterns to be found in a two-dimensional space [109,110]. PCA can be applied where the number of variables exceeds the number of samples without misleading results. The original data matrix (X) is decomposed into a matrix of scores (t) and a matrix of loadings (p). The scores (t) allow one to find patterns in the samples while the loadings (p) give the weights of the original variables in the PCs, thus giving information about which variables are part (or not) of the trend. PCA also has the aim of dimensionality reduction, removing redundant information while retaining all nonzero PCs [111]. A dimension reduction tool of this kind builds its new variables from the total variability of the data. This approach reduces and compacts the original data, expressed in terms of the scores (t), which can be used for further modelling. This might, for instance, help to improve the classification performance. However, the number of PCs retained should be considered carefully since there is a risk that useful information will be neglected [109].
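
The decomposition of X into scores and loadings can be sketched with a singular value decomposition of the mean-centered data. This is a minimal sketch, assuming samples in rows and variables in columns. The function name `pca` and its return values are assumptions for illustration, and dedicated chemometrics packages add scaling options and diagnostics that are omitted here.

```python
import numpy as np


def pca(X, n_components=2):
    """Return scores (t), loadings (p) and explained variance ratios."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)            # mean-center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T     # one column per principal component
    explained = s ** 2 / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]


# More variables (100) than samples (20) poses no problem for the decomposition
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
scores, loadings, explained = pca(X, n_components=2)
```

Inspecting `explained` before interpreting a score plot is a quick way to see how much of the total variance the retained PCs actually capture.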

4.2. Cluster analysis

Although PCA represents the most used exploratory method, it does not explicitly define clusters in the data; this is left to the judgement of the user. More formal methods that define specific groupings of the data are clustering methods [112]. Commonly, hierarchical clustering analysis (HCA) methods are employed because very often the data follow a hierarchical structure [113]. Importantly, the choice of the distance function used to calculate the distance between the clusters and observations can dramatically alter the results. If there is a correlated group of variables in the data, these variables may get too large a weight while uncorrelated variables may fail to be recognized. HCA can be approached from the bottom up, or agglomeratively, where each observation initially forms its own cluster and the clusters are then successively joined. The second HCA type is a divisive or top-down strategy, where all observations start in one cluster, followed by division into smaller clusters as the hierarchy moves down [114]. Currently, one runs several clustering routines and selects the one that seems to give the most logical results.
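
The agglomerative strategy, and the practice of running it with several distance functions, can be sketched with SciPy. This is a minimal sketch under stated assumptions: the helper name `hca_labels`, the average-linkage default and the synthetic two-group data are illustrative choices, not taken from the studies reviewed here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def hca_labels(X, n_clusters, metric="euclidean", method="average"):
    """Agglomerative clustering cut into at most n_clusters groups."""
    distances = pdist(np.asarray(X, dtype=float), metric=metric)
    tree = linkage(distances, method=method)   # bottom-up merge history
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Two well-separated synthetic groups of five observations each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(5, 3)),
               rng.normal(5.0, 0.1, size=(5, 3))])

# Compare the partitions obtained with different distance functions
for metric in ("euclidean", "cityblock"):
    print(metric, hca_labels(X, n_clusters=2, metric=metric))
```

On real data the partitions from different metrics and linkage methods will not always agree, which is the reason for comparing them before interpreting any cluster.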

Fig. 3. Chemometric analysis for case study 1, which dealt with the regression task of near-infrared spectra from 618 soil samples. The parameter total nitrogen (Nt) was measured for all of the samples and constituted the y vector for the regression. A) NIR spectra for the first ten samples colored according to their intensity; B) Multiplicative scatter correction (MSC)-transformed spectra; C) The spectra pre-processed by Savitzky-Golay polynomial filters (SG) and second derivative; D) PCA score plot for the first and second PC applied to the MSC-transformed spectra. The scores are colored according to the Nt content in the samples. The two sample outliers are observed in the score plot; E) PCA score plot for the first and second PC applied to the spectra pre-processed by SG (2nd der). The spectra are colored by the Nt content in the samples; F) Loading plots for the first component obtained from the PLS regression applied on the MSC-preprocessed spectra. We see how the information contained in the loadings about the original variables in A is not lost; G) Loading plots for the first component obtained from the PLS model applied on the spectra pre-processed by SG (2nd der). The information in the loadings about the original variables in the raw spectra is retained. We may see how the positive derivative gives a positive contribution to the loadings. H) The predictive quality obtained for the validation data pre-processed by MSC (30% of the training data set) with an R² of 0.60. I) The predictive quality obtained for the validation data pre-processed by SG (2nd der) (30% of the training data set) with an R² of 0.60.

4.3. Applications, trends and potential pitfalls in exploratory analysis

The selection of PCs in PCA for exploratory purposes is not critical; however, when PCA is used to identify outliers or as a reduction step prior to predictive modelling the situation is different. Here, the number of PCs, and thus how much variance is captured, should be carefully chosen [115]. One approach could use the number of components that provide the best classification model or select PCs by cross-validation. Another relevant point to consider is the visualization of the score plot. If one is interested in observing the natural grouping of the data, the PCs should capture a large proportion of the variance. PCs that capture a low percentage of the variance do not explain the raw data. Another issue is the need for pre-processing the data before applying PCA. The selection of the pre-processing method depends on the nature of the data and should be investigated. Three pitfalls that have been covered in depth in a recent review are [116]: (1) use of wrong measurement units for the exploratory analysis. In order to avoid misinterpretations, the units used should directly be related to the concentration; for example, convert from transmittance to absorbance; (2) misinterpretation of the PCA loadings from a first- or
second-derivative pre-processed signal. The classical PCA rule dictates that a sample with a high score value at a particular PC is characterized by variables that are positively correlated with high loading values for the same PC. In the case of derivative spectra, this is not always true. Correct interpretation of the PCA loadings from the first or second derivative is not straightforward, and requires inspection of the derivative signal and the loadings. One proposed solution is to apply an antiderivative function to the PCA loadings [116]; (3) normalization methods for pre-processing the signal such as SNV or MSC can produce artifacts that lead to misinterpretation of the PCA loadings. The normalization can remove not only unwanted variations in the signal but also valuable information from the original spectra.

Potential pitfalls regarding clustering methods are mostly related to the selection of the distance metric and the clustering method. Because of that, clustering can sometimes lead to misleading results without knowledge about the data. Notwithstanding, the effects of these parameters on the clustering results should be inspected and the clusters validated using appropriate metrics [117]. For example, the use of ensemble clustering can remove uncorrelated noise from the clustering results [117,118]. Table S4 lists the exploratory methods used in the research articles for the three techniques explored in this review (spectroscopies, electrochemistry, and hyphenated MS). These are grouped and analyzed in a bar plot in Fig. 2A. As expected, PCA is by far the most frequently used exploratory method among all of the techniques presented. Approximately 80% of the research articles used PCA. One of the issues is illustrated by Tfaili et al., who investigated the lipid modifications in J774 macrophages after the addition of eicosapentaenoic acid into cells, loaded or not with cholesterol [119]. The changes in the cell membrane upon addition of lipids were studied through Raman spectroscopy and IR. The visualization of the PCA loadings showed a negatively correlated band at 2824–2865 cm⁻¹ assigned to the νsym(CH2) of lipids, which is more present in the cytoplasm than in the nucleus. The clear separation in component 1, which captures most of the variance (75.8%) in the PCA score plot, was attributed to this band. Another popular use of PCA is in dimensional reduction [120–127]. Zukovskaja et al. employed UV-Raman spectroscopy for the identification of fungal spores that are implicated in respiratory diseases worldwide [124]. The UV-Raman spectra were background corrected and vector normalized prior to dimensional reduction with PCA. The optimal number of principal components was selected according to the maximum accuracy, and the score matrix (t) was used as an input for further modelling [124]. Another application recently reported is the use of Raman and FT-IR spectroscopy to follow up immunoadsorption therapy treatment of dilated cardiomyopathy [125]. The authors reported that the first principal component (PC1) separated the data according to the time points, capturing about 19.9% of the spectral variance. However, the results clearly show that TP2 is different from the rest (TP1, TP3-5). Moreover, the PCA model obtained is rather uninformative, since it captures a low amount of variance (19.9% for PC1 and 10% for PC2) [125].

Clustering methods, in particular HCA, are used more frequently in hyphenated MS studies, mainly because heat maps are especially useful in omics studies to observe the correlation between samples and variables simultaneously (Fig. 2A). For example, Negrão et al. used a mass spectrometry-based proteomics approach in order to identify proteins related to infection with Leishmania spp. [128].

5. Predictive modelling

The main goal of predictive modelling is to build a model with predictive capabilities for new data. Usually, predictive modelling is performed after an exploratory analysis of the data (Scheme 1). Depending on the type of variable that we want to predict, continuous or discrete, the problem can be classified as regression or classification, respectively. In regression, we are interested in constructing a model that predicts continuous variables, while classification aims to assign a determined class

Fig. 4. Support vector machines (SVM) regression for case study 1. The NIR spectra were MSC preprocessed. A) Optimization of the penalty error (C) and the spread of the
Here HCA was used for clustering the samples according to the 50 kernel (s) for the non-linear kernel radial basis function (RBF). The blue dots are for
most significant proteins combined with a heatmap for visualiza- s ¼ 0.0001 the red triangles are for the s ¼ 0.001. The RBF achieved the lowest root
mean square error (RMSE) for C ¼ 6 and s ¼ 0.001; B) The prediction quality obtained
tion of the proteins and samples. To conclude, we observed that by the SVM-RBF for the calibration data (70% of the training data set) with an R2 of
other clustering analysis tools (e.g., k-means, k-NN) are employed 0.80. C) The prediction quality obtained by the SVM-RBF for the validation data (30% of
in spectroscopic studies (Fig. 2A). the training data set) with an R2 of 0.74.
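
The 70/30 calibration/validation partition referred to in the Fig. 4 caption was obtained with the Kennard-Stone algorithm (Section 7). A minimal sketch of that selection in plain Python is given here, assuming Euclidean distances between row vectors; the function name `kennard_stone` and the toy data are illustrative and are not taken from the authors' R code.

```python
import math

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by Kennard-Stone (Euclidean)."""
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(X)")
    dist = [[math.dist(a, b) for b in X] for a in X]
    # Start with the pair of samples that are farthest apart.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda ij: dist[ij[0]][ij[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]
    while len(selected) < n_select:
        # For each candidate, take the distance to its closest selected sample,
        # then pick the candidate for which that distance is largest.
        best = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy one-variable "spectra" (hypothetical values).
X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
calibration = kennard_stone(X, 3)
validation = [k for k in range(len(X)) if k not in calibration]
print(calibration, validation)
```

Running the selection on PCA scores instead of the raw variables, as described in Section 7, only changes the rows passed in as `X`.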


label to a particular sample. Herein, our focus will be set on parametric classification algorithms (linear discriminant analysis (LDA) and PLS-DA) and two machine learning algorithms that are nonparametric (support vector machines (SVMs) and Random Forest (RF)).

5.1. Linear discriminant analysis

Linear discriminant analysis [129] is a method that aims to find a linear separator between classes of objects and to use this decision function for assigning class membership to unknown samples [130]. Other more complex decision surfaces (e.g., the non-linear surfaces of quadratic discriminant analysis) also exist but require more parameters to tune. The classification may be expressed either in the Bayesian form or in terms of the Mahalanobis distance. First, the Mahalanobis distance from each sample to the class centers is calculated using the pooled covariance matrix, and each sample is assigned to the class with the smallest distance [131]. LDA is a well-established technique that works well even when the data distribution departs from normality. However, LDA should not be applied to matrices where the number of variables exceeds the number of samples, especially in the presence of correlated variables. To overcome this limitation, LDA

Fig. 5. Chemometric analysis for case study 2, which dealt with the classification of mid-infrared spectra from the 60 authenticated extra virgin olive oils from three different countries (Spain, Italy and Greece). A) MIR spectra for the first ten samples; B) PCA score plot for the first and second principal components; C) PLS-DA score plot for the first and second latent variables; D) Histogram of the Q²Y values obtained by 1000 permutation tests of the response variable vector (y) from the PLS-DA model. The red dot is the original Q²Y (0.73) obtained in the PLS-DA model along with its confidence interval. All of the permuted Q²Y values are lower than the original Q²Y, which indicates the validity of our PLS-DA model with an empirical p-value of 0.001. E) PLS-DA loadings plot for the first and second latent variables. The loadings are colored according to their variable importance in projection (VIP) values.
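
The empirical p-value quoted for panel D can be illustrated with a small permutation test. The sketch below is a generic version under stated assumptions: the statistic is the absolute difference between two class means rather than the Q²Y of a PLS-DA model, the p-value uses the common (b + 1)/(n + 1) estimate, and `mean_difference`, `permutation_p_value` and the toy data are hypothetical.

```python
import random

def mean_difference(values, labels):
    """Absolute difference between the means of class 1 and class 0."""
    a = [v for v, lab in zip(values, labels) if lab == 1]
    b = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def permutation_p_value(values, labels, statistic, n_perm=1000, seed=0):
    """Compare the observed statistic with statistics from permuted labels."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the class labels, keep the data fixed
        if statistic(values, shuffled) >= observed:
            at_least_as_large += 1
    # The +1 terms count the unpermuted data as one of the arrangements.
    return observed, (at_least_as_large + 1) / (n_perm + 1)

values = [i / 10 for i in range(20)]   # toy responses 0.0 ... 1.9
labels = [0] * 10 + [1] * 10           # two well-separated classes
observed, p_value = permutation_p_value(values, labels, mean_difference)
print(observed, p_value)
```

With this estimate, 1000 permutations that all fall below the observed statistic give 1/1001, that is about 0.001, which is consistent with the value reported in the caption.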


can be applied to the score (t) matrix obtained after applying PCA and retaining all nonzero principal components. Another significant disadvantage is that LDA assumes that all classes share the same pooled covariance matrix. Thus, it is not appropriate for groups with different variance structures [132]. In this case, it is more appropriate to use non-linear methods.

5.2. Partial least squares

Partial least squares discriminant analysis (PLS-DA), in the PLS1 version, is a linear two-class method that aims to find a linear separation between the groups of samples, using an algorithm in which the y vector takes the values +1 and −1 (or 0) for each class, as opposed to PLS regression, where the y vector takes continuous values [133]. Other PLS versions, such as modified PLS1 or PLS2 algorithms, allow modelling of more than two classes [134]. In this sense, a range of algorithms such as nonlinear iterative partial least squares (NIPALS) [134], kernel algorithms [135] and the statistically inspired modification of PLS (SIMPLS) have been developed [136]. Of note, kernel algorithms work with the variance-covariance matrices and are more advantageous when there is a much lower number of samples than variables [135]. Given a set of independent variables (matrix X), we are interested in building a mathematical model that allows the prediction of dependent variables (vector y). Briefly, the PLS1 model works using a matrix $X_{ij}$ and a vector $y_{ik}$, where i is a sample, j is a variable and k is the coded class. The PLS components are obtained sequentially, where the first one is obtained from the cross-product $w = X^{T}y$. The X-scores t can be computed as $t = Xw$, which is a linear combination of the weights (w) and the matrix X. In the next step, the normalized loadings (p) for X are calculated as $p = X^{T}t/(t^{T}t)$. Similarly, the normalized loadings (q) for y are calculated as $q = y^{T}t/(t^{T}t)$. To estimate the second component or latent variable (LV), the variation related to the previous component is first removed from X and y. To do so, the outer products $tp^{T}$ and $tq^{T}$ are subtracted from the current X and y data matrices, respectively. Then, the next component is computed from these deflated matrices [137]. Sometimes the Y-scores (u) are used for model interpretation, calculated as $u = yq/(q^{T}q)$. The optimal number of components or LVs is usually determined by cross-validation techniques in a way that maximizes the covariance between X and y. After every iteration, the vectors (w, t, p, q, u) are gathered in matrices. The prediction that a sample belongs to a particular class is calculated as $y = tA + F$, where A is the regression coefficient and F is the abscissa vector. To calculate A, we use the scores t: $A = (t^{T}t)^{-1}t^{T}y$. Values above 0 are assigned to one class (e.g., “A”) and values below 0 to the other class (e.g., “B”), following the +1/−1 coding. The deviation between experimental and modelled values may be assessed by calculating the y-residuals from $y = Tq^{T} + F$. The normality plot of the y-residuals gives insights into the quality of the model and helps in the identification of outliers [138]. The common way to show the PLS-DA results is with the score plot, although it requires a proper validation to guarantee that the samples did not group by chance [132]. As for LDA, PLS-DA is a linear method and therefore does not consider that each group has a different structure.

5.3. Random forest

Random Forest, introduced by Breiman and Cutler (2001) [139], is a machine learning technique based on the combination of bagging and tree-based methods. There are several RF algorithms, which differ in how the tree is constructed or how the data are preliminarily divided. The most widely used algorithm is the classification and regression tree (CART) [140], which uses Gini impurity as a splitting criterion [141]. First, each tree is built from a bootstrap sample of the data, and for each bootstrap sample a tree is grown until it is complete, choosing at each node the best split among a random subset of variables. The process is repeated until a collection of trees is obtained. RF can be applied when the number of variables is much larger than the number of samples without leading to overfitting, in non-linear situations, and for both two-class and multi-class problems [139]. Another great feature of Random Forest is that it has only a few parameters to tune, the most important being the number of trees in the forest and the number of variables in the random subset at each node. Random Forest provides an error-rate metric that is considered a good estimator of the error, based on predicting the data that were not in the bootstrap sample, called out-of-bag (OOB) data, using the tree grown with the bootstrap sample. Finally, the OOB error is the average error frequency obtained [142]. The algorithm also includes several measures of variable importance, such as the Gini index and classification accuracy [143].

5.4. Support vector machines

Support vector machines are a classification and regression method derived from statistical learning theory [144,145]. The method attempts to form either a linear or a non-linear boundary between the groups following the structural risk minimization principle (SRM, Equation (4)):

$$R_e \le R_{emp} + \sqrt{\frac{d_{VC}\left(\log\frac{2N}{d_{VC}} + 1\right) - \log\frac{\eta}{4}}{N}} \qquad (4)$$

where $d_{VC}$ is the Vapnik-Chervonenkis dimension related to the complexity of the classifier, $R_e$ is the complete error and $R_{emp}$ is just the error using the N samples of the training set, that is, the misclassification obtained during model building. This principle reduces the risk of overfitting since it does not focus only on the empirical error $R_{emp}$, as other classification methods do, but on the true error $R_e$. The difference between both errors ($R_e$ and $R_{emp}$) depends on the number of samples used in the training set (N) and the complexity of the boundary of the model ($d_{VC}$). Higher model complexity leads to a larger difference between the errors, and a larger number of samples to a smaller one. Thus, it is essential to control the complexity of the model according to the number of samples [146]. SVMs aim to find a separating hyperplane that maximizes the distance or margin between the samples; this lowers $d_{VC}$ and hence reduces the general error $R_e$. A generic hyperplane can be written as Equation (5):

$$w^{T}x + b = 0 \qquad (5)$$

where x represents a point or sample, w the related weight or normal vector to the hyperplane, and b a bias parameter. Since many hyperplanes can be obtained, the optimal one is chosen in such a way that the margin between the samples is maximal and the empirical error is minimized. The optimization results in the following function (Equation (6)):

$$L_D = 0.5\sum_{i,j}\alpha_i\alpha_j y_i y_j x_i^{T}x_j - \sum_{i}\alpha_i \qquad (6)$$

where $y_i$ is the class label, either +1 or −1, and α is a vector of Lagrange multipliers that are used to calculate the weight vector w and thus to determine the optimal hyperplane. The optimization of $L_D$ is a quadratic programming problem, in which it is minimized with respect to the coefficients α (Equation (6)). It should be noted that the solution is reproducible since the minimum is unique. The scalar

Table 3
Comparison between several preprocessing methods that were applied to the data from case study 1. Subsequently, the preprocessed data were used as an input to PLS regression modelling.

                             MSC    SG (1st der)   SG (2nd der)   SNV    Autoscaling   MSC + Autoscaling   SG (2nd der) + Autoscaling
Variance explained (%)       81     81             78             73     81            83                  83
RMSEP                        0.70   0.63           0.73           0.84   0.63          0.66                0.65
Adjusted R²                  0.71   0.70           0.65           0.60   0.80          0.70                0.65
N. comp                      12     6              7              12     7             13                  5
RMSEP validation set         0.48   0.44           0.36           0.52   0.77          0.68                0.82
Adjusted R² validation set   0.62   0.46           0.60           0.48   0.66          0.53                0.59

SNV: standard normal variate; MSC: multiplicative scatter correction; SG: Savitzky-Golay; RMSEP: root mean square error prediction; N. comp: number of components.
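
The PLS regression models summarized in Table 3 follow the PLS1 steps of Section 5.2 (w = XᵀY, t = Xw, p = Xᵀt/(tᵀt), q = yᵀt/(tᵀt), then deflation). The following is a minimal sketch of those steps in plain Python, assuming mean-centred data, weights normalized to unit length, and toy numbers; `pls1_fit`, `pls1_predict` and `rmse` are illustrative names and not the authors' R code.

```python
import math

def pls1_fit(X, y, n_comp):
    """Extract up to n_comp PLS1 components from X (rows = samples) and y."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                 # centred y
    comps = []
    for _ in range(n_comp):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]  # w = X^T y
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:          # nothing left in y that X can explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # t = X w
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflation: subtract t p^T from X and t q from y.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - t[i] * q for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps

def pls1_predict(model, x):
    """Predict y for one sample by repeating the score/deflation steps."""
    x_mean, y_mean, comps = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in comps:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += t * q
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

# Toy data (hypothetical): y = 2*x1 - x2 + 1 without noise.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y = [1, 4, 3, 6, 5]
rmse_1 = rmse(y, [pls1_predict(pls1_fit(X, y, 1), row) for row in X])
rmse_2 = rmse(y, [pls1_predict(pls1_fit(X, y, 2), row) for row in X])
print(rmse_1, rmse_2)
```

On this noise-free toy set the second component removes the remaining error, which mirrors how the number of components in Table 3 trades off against RMSEP; on real spectra the number of components would be chosen by cross-validation rather than by the calibration error.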

product $x_i^{T}x_j$ makes the method suitable when the number of variables exceeds the number of samples. The boundary between the classes is determined by only a limited number of those samples that lie close to the margin, called support vectors. Only for these points will the coefficient α be non-zero, so that they are involved in the optimal boundary. The samples that are far away from the margin will have a coefficient α of 0; they do not influence the separating hyperplane, although they are used in the model building. The model includes a penalty error term C that determines the importance of the margin errors or deviations, which can become training errors when they have a value higher than 1. The smaller the C value, the larger the deviations obtained and the more importance margin maximization has in the computation. Conversely, a high C value gives smaller deviations and prioritizes minimizing the training error $R_{emp}$. This leads to no training error but a narrow margin, and thus a poor model. Thus the C value can be used to obtain the optimal trade-off between the complexity of the model and the training error of the SRM equation (Equation (4)). The Lagrange multiplier α is also subject to the penalty error (Equation (7)):

$$0 \le \alpha_i \le C \qquad (7)$$

The classification function to separate two classes includes the support vectors $s_i$ by means of the following Equation (8):

$$y = \operatorname{sgn}\left(\sum_{i=1}^{N_s}\alpha_i y_i s_i^{T}x + b\right) \qquad (8)$$

The prediction of the class a sample belongs to is determined by the sign of the y value obtained. When dealing with non-linear situations, SVMs return more satisfactory separations than linear discriminant functions such as PLS-DA [146]. Substituting the scalar product $x_i^{T}x_j$ with a kernel function $K(x_i, x_j) = \phi(x_i)\phi(x_j)$, it is possible to determine more complex class boundaries [29]. For such an aim, the data are mapped into a higher-dimensional space where the samples are projected by means of a feature function $\phi(x)$ and a suitable hyperplane is found. Then the complex class boundary is translated into the original space.

5.5. Applications, trends and potential pitfalls in predictive modelling

One of the main misuses that we may observe in the literature is the use of PLS-DA without considering what the structure of the data looks like. Often, PLS-DA is applied to matrices with small sample sizes and a large number of variables, or with groups with different variance. One needs to bear in mind that the linear separator is the same for LDA as for PLS-DA with all non-zero components retained. Therefore, the same results are achieved with LDA or PLS-DA when the class sizes are equal, but LDA requires fewer decisions to be made about the parameters and thus carries a lower risk of equivocal results [132]. We should emphasize that PLS-DA has the same main disadvantage as LDA: it considers that each class has a similar variance structure [132]. In this case, other non-linear methods such as Random Forest or SVM are more appropriate. For example, Li et al. attempted to identify potential biomarkers of cerebral infarction using GC-MS and chemometrics [147]. Random Forest was used to discriminate among healthy, cerebral infarction patients and quality control groups. In another example, we recently optimized the mass spectrometry instrumental settings by using a BBD, and modelling the signal response by a set of regression models [17]. Of all the algorithms assayed, SVM achieved the best metrics, with an R² of 0.95 and an RMSEP of 0.05. Therefore, superior results were achieved with nonlinear alternatives.

One disadvantage of using LDA versus PLS-DA is that if the number of variables exceeds the number of samples, LDA requires a prior dimension reduction step with, for example, PCA. In the aforementioned study performed by Zukovskaja et al., the authors applied LDA to the scores (t) matrix obtained from PCA while retaining the optimum number of principal components [124]. Out of 1079 spectra, only 27 were misclassified, which resulted in an accuracy of 97.5%. The largest number of misclassifications comes from the lack of separation between Aspergillus species and Penicillium. Another advantage of PLS-DA over PCA-DA is that it may provide insights into the variables via the loading weights. Direct information about the variables is often desired in the omics technologies, where PLS-DA predominates. Here, a common variable selection method is the variable importance in projection (VIP). Briefly, it accumulates the importance of each variable reflected by the weight from each component. Usually, the variables with a VIP value greater than one are selected as important. For example, Luo et al. performed an untargeted metabolomics study by LC-MS in order to find biomarkers for early diagnosis of lung cancer [148]. The authors applied orthogonal PLS-DA in order to identify potential metabolic biomarkers. Selection of the ions with a VIP value greater than one revealed 31 key metabolites involved in seven metabolic pathways [148]. A last, common pitfall is presenting the PLS-DA score plot of the training sets for data visualization. It can give misleading results, especially when the number of variables exceeds the number of samples. Moreover, because there are many variables in the data, correlation between them may occur just by chance. This can lead to a false separation between groups in a PLS-DA score plot. Therefore, it is recommended to assess the natural grouping of the data through a PCA score plot, since it can be used when there are many more variables than samples. Yan et al. differentiated traditional Chinese medicines by using gas chromatography-mass spectrometry. Wisely, the authors used PCA in order to visualize differences between different samples, obtaining two clear clusters that represent two plant species. To get insights into the variables, in their case the volatile compounds that are responsible for the difference between both species, they employed a PLS-DA variant named orthogonal PLS-DA [149]. Another way to assess the reliability of the classification model is the use of permutation tests over the classifier. The group label is randomly permuted, and

Table 4
Comparison between several preprocessing methods that were applied to the data from case study 2. Subsequently, the preprocessed data were used as an input to PLS-DA classification modelling. 3-group modelling.

                 R²X     R²Y     Q²Y     RMSEP   No. LV   MC calibration set   MC validation set
MSC              0.774   0.723   0.702   0.250   2        4                    0
SNV              0.774   0.723   0.702   0.250   2        4                    0
SG (1st der)     0.702   0.759   0.699   0.233   3        6                    0
SG (2nd der)     0.651   0.729   0.649   0.248   3        6                    0
Autoscaling      0.802   0.712   0.667   0.255   3        4                    0
Mean centering   0.888   0.780   0.730   0.224   4        4                    0

SNV: standard normal variate; MSC: multiplicative scatter correction; SG: Savitzky-Golay; RMSEP: root mean square error prediction; No. LV: number of latent variables; MC: number of misclassifications.
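
Q²Y in Table 4 is a cross-validated statistic. The sketch below shows, under stated assumptions, how k-fold cross-validation (Section 6) can yield Q² = 1 − PRESS/TSS: the model is a deliberately trivial one-variable least-squares line rather than PLS-DA, folds are assigned by interleaving indices, TSS is taken around the overall mean, and `k_fold_indices`, `fit_line` and `q2_cv` are hypothetical names.

```python
def k_fold_indices(n, k):
    """Split range(n) into k folds by interleaving (fold f gets f, f+k, ...)."""
    return [list(range(f, n, k)) for f in range(k)]

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    b = sum((xv - mx) * (yv - my) for xv, yv in zip(x, y)) / sxx
    return my - b * mx, b

def q2_cv(x, y, k):
    """Cross-validated Q2: each fold is predicted by a model built without it."""
    press = 0.0
    for fold in k_fold_indices(len(x), k):
        held_out = set(fold)
        train = [i for i in range(len(x)) if i not in held_out]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        press += sum((y[i] - (a + b * x[i])) ** 2 for i in fold)
    my = sum(y) / len(y)
    tss = sum((v - my) ** 2 for v in y)
    return 1.0 - press / tss

# Toy data (hypothetical): an exact line and a version with a fixed disturbance.
x = [float(i) for i in range(14)]
y_exact = [1.0 + 2.0 * v for v in x]
y_noisy = [v + (0.5 if i % 2 else -0.5) for i, v in enumerate(y_exact)]
q2_exact = q2_cv(x, y_exact, 7)
q2_noisy = q2_cv(x, y_noisy, 7)
print(q2_exact, q2_noisy)
```

Because every sample is predicted by a model that never saw it, Q² cannot exceed 1 and falls as the held-out predictions degrade, which is why it is reported next to R²Y when comparing preprocessing options.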

statistics calculated. After a large number of permutations, the ensemble of statistics obtained from the permutations is compared against the statistic obtained from the unpermuted data set.

Table S5 lists the predictive modelling methods used in spectroscopic, electrochemical, and hyphenated MS methods for 2018–2020. A bar plot containing this information clearly indicates that PLS-DA is by far the most frequently used method in hyphenated MS studies (Fig. 2B). It is not surprising that the development of mass spectrometry-based omics sciences has been accompanied by the abuse of PLS-DA. When properly used, PLS-DA can give great insights into the discovery of potential biomarkers in metabolomics studies [150]. On the other hand, PLS has been widely used in electrochemical methods for regression (Fig. 2B). This finding suggests that electrochemical methods have rarely been used for classification purposes but extensively for building regression models for quantification. In spectroscopic studies, the situation is intermediate: PLS-DA and PLS in its regression mode are employed about equally.

6. Model validation

The performance of the mathematical model used for modelling the data may be influenced by several factors (e.g., the number of samples in contrast to the number of variables, the representativeness, extreme outliers) [4]. Overfitting of the model can give the wrong idea that the model can predict the properties of interest better than it does. It is common to apply a set of models to a specific problem and select the one that gives the best performance (Scheme 1). However, a proper validation of the model is needed to assess how well it can predict. For example, selecting too many components or LVs in a PCA-DA, PLS-DA or PLS can lead to overfitting [4]. The chemometric strategies that address the assessment of the quality of a predictive model are referred to as validation [151]. To evaluate the quality of the model, a diagnostic metric should be defined. It could be based on either a model parameter (e.g., variance explained) or the calculation of residuals. One needs to bear in mind that the predictive model will only be relevant when the data used are representative of the system studied. In this context, validation of the model is usually based on splitting the sample into a training set and a test set or on using resampling methods (bootstrapping, jack-knifing and cross-validation) (Scheme 1). When dealing with large data sets, dividing the data into a training set and a test or validation set is the most conservative approach. Here, the Kennard-Stone algorithm [152] provides a way to divide the sample. On the other hand, for small datasets (smaller than 40 samples) [151], resampling algorithms are preferred over splitting the data. However, it is assumed that the data used for resampling algorithms are representative of the data that will be used in the future. The most popular resampling algorithm is cross-validation (CV) [153]. CV divides the data into k folds; one fold is left out of the data as a validation set, and the model is built using the remaining k-1 folds. The predictive performance of the model is evaluated using the validation set [154]. All of this is repeated k times so that each fold is used once as a validation set, and the predictive performance is calculated as the average [151].

6.1. Applications, trends and potential pitfalls in model validation

We can observe in the literature a disagreement over the definition of external validation [96,97]. Some authors defined external validation as simply the case when the validation of the model was carried out by using objects that were not used for building the model. These samples were randomly pulled out from the data set [155]. However, the strict definition of external validation implies the use of a new and external set of data that is independent but representative [4,156]. For instance, the new external data set has been measured in a different period than the training set, so it was not available at the time [4]. On the other hand, splitting data that have been collected at once into a training and a validation set falls into the category of internal validation. External validation does not always guarantee better model validation than using resampling methods [156,157]. That is because it requires a large number of samples, which must be representative and independent. We collected and analyzed the validation protocols implemented in recent years (2018–2020) for the three experimental techniques (Table S6). Hyphenated MS approaches mostly used resampling algorithms for validation of the predictive model (Fig. 2C). We may argue that this is due to the low number of samples available for the experiments in omics studies, because of the difficulty of obtaining biological samples. In those cases, the use of resampling methods is preferable to a split-sample protocol. In the case of electrochemical methods, validation using external data sets appears to be the most frequent. Traditionally, electrochemical methods have been used for regression modelling of pharmaceutical or chemical compounds with PLS regression, as we commented above. Because of the ease of collecting such samples and the nature of the studies, validation of the models is performed with a test set. Finally, spectroscopic studies fall into an intermediate situation (Fig. 2C). Split-sample, resampling and test set are almost equally employed across these studies. This may be due to the fact that spectroscopies have been used for different aims, from building classification models in metabolomics studies to calibration of pharmaceutical compounds. The trends observed in the validation protocols according to the experimental techniques correspond to their field of application. It is easy to understand and find a correlation between the predictive modelling and the validation protocols (Fig. 2B–C). Hyphenated mass spectrometry has been mostly used for applications aimed at building classification models, and because of this PLS-DA is the most used. The main drawback of these studies is the low number of samples available, and this is reflected in the validation technique. Resampling is the most employed approach. The opposite

scenario occurs with electrochemical methods. Traditionally used for calibration purposes, PLS regression dominates over the rest of the methods (Fig. 2B) and the ease of collecting samples allows for a more robust and independent validation through the test set (Fig. 2C).

7. Case study 1: A regression model

We have incorporated a case study that consisted of an NIR dataset of 618 soil samples provided in the “Chimiométrie 2006” meeting [158]. Here we aim to build a predictive model for the Nt parameter (total nitrogen in g/kg of dry soil) that has been determined for each sample. Thus, the Nt parameter is our vector y, and the NIR dataset is the matrix X. To follow the case study, the R code can be found in the Supplementary information.

In the first step, the 618 training samples were divided into 70% for the calibration set (433 samples) and 30% for the validation set (185 samples) by means of the Kennard-Stone algorithm. In brief, KS starts by selecting the pair of samples that are the farthest apart. These points are assigned to the calibration set and removed from the original training set. Then, the algorithm assigns the remaining samples to the calibration set by computing the Euclidean or Mahalanobis distance between each unassigned sample and the samples already selected. The sample that is farthest from its closest selected neighbor is removed from the training set and assigned to the calibration set. The Mahalanobis distance can be used by performing a PCA and computing the Euclidean distance on the score matrix. The number of PCs retained can be set at the point where the increase in the cumulative explained variance from the next components is lower than some percentage. Sample selection based on the Kennard-Stone algorithm in the principal component space is presented in Fig. S1. The red solid points are the selected samples for the calibration set (70%) and the empty black points correspond to the validation set (30%). Following the pipeline in Fig. 1, the calibration set was then subjected to several preprocessing algorithms. These included first and second derivatives with SG filters, MSC and SNV and a combination with column scaling (Fig. S2 and Fig. 3 and 4A-C). In the first step, we attempt to select the best pre-processing method by trial and error. This is based on the performance for building a predictive model, in this case by PLS for the Nt parameter (total nitrogen in g/kg of dry soil). To examine the natural grouping of the data and identify possible outliers, PCA was used. PCA scores of the MSC-transformed spectra revealed the presence of two outliers (samples no. 313 and 402), while applying SG (2nd der) resulted in the appearance of two clusters along PC1 (Fig. 3D–E, Fig. S3). However, the grouping was not related to the Nt content or any of the parameters available.

Next, we consider building a PLS regression model for the Nt parameter as the real-valued dependent variable. The number of PLS components was selected by 7-fold CV, and the validation results are the root mean square error of prediction (RMSEP) (Figs. S4A–B). The decision of how many components to retain remains subjective and therefore several strategies have been developed to deal with that. Here, we used a permutation approach, which tests whether the addition of a new component improves the model [159]. The algorithm starts backwards and continues by removing components until the model deteriorates. Table 3 lists the number of components selected according to this strategy along with the RMSEP, the adjusted R² for the prediction based on CV, and the RMSEP and adjusted R² for the validation set (30% of the training data). The use of either MSC or SG (2nd der) gave an adjusted R² of 0.62 and 0.60, respectively (Table 3, Fig. S4).

The PLS score plot did not show any clear grouping or outliers (Figs. S5A–B). Plotting the loadings on component 1 showed agreement with the profile of both the MSC-transformed spectra and those preprocessed by SG (2nd der) (Fig. 3F–G). We can see how the information contained in the loadings about the original variables is not lost. For example, the positive derivatives gave a positive contribution of the peaks on component 1 (Fig. 3A, F-G).

The validation set (30% of the training data) was preprocessed as for the calibration set, and the constructed PLS model was used to predict the Nt parameter. The predictions of the validation set achieved an RMSEP of 0.48 and 0.36 with an adjusted R² of 0.6 for MSC and SG (2nd der), respectively (Table 3, Fig. 3H–I). These are similar values to those obtained at the “Chimiométrie 2006” meeting [158]. Another source of information is the relationship between the X-scores and Y-scores, which did not follow a linear relation, as observed in Figs. S5C–D (see section 5.2). Therefore, we decided to use SVM regression in order to check whether we would obtain a lower RMSEP and a higher R² for the predictions. The linear kernel and the non-linear radial basis function (RBF) kernel were tested with appropriate optimization of the main parameters by means of a grid of tuning values and through 7-fold CV, that is, the penalty error (C) and the spread of the kernel (σ) for the RBF kernel and C for the linear kernel. The RBF kernel achieved the lowest cross-validation RMSE (0.56) and the highest R² (0.80) with optimal values of C = 6 and σ = 0.001 (Fig. 4A–B). When using the model to predict the unseen data (validation set), an adjusted R² and RMSEP of 0.74 and 0.28 were obtained, respectively (Fig. 4C).

In conclusion, we have shown how to follow the pipeline presented in Scheme 1, and how to interpret the results obtained for each step. We have discussed several of the main pitfalls that may occur, such as outlier detection, the interpretation of derivative signals and the use of linear or non-linear regression methods. Moreover, the example can be easily reproduced using the R code included in the Supplementary information.

8. Case study 2: A classification task

In the second case study, we used 120 mid-infrared spectra from 60 authenticated extra virgin olive oils (EVOO) from three different countries (Spain, Italy and Greece) (Fig. 5) [160]. The main objective is to build a classification model that will be able to distinguish between the EVOO provenances. Also, the R code is available in the Supplementary information.

Similar to case study 1, sample selection was based on the Kennard-Stone algorithm and the best pre-processing technique was selected via a trial-and-error approach by means of PLS-DA modelling. It is evident that no single pre-processing method stood out over the rest (Table 4). Among them, mean centering provided 88% accuracy for the calibration set with four misclassified samples. We can observe that the first and second PCA components captured ca. 70% of the variance and could show some trend for clustering according to their provenance (Fig. 5B). For example, samples from Spain and Italy or Greece are “separated” along PC1, while Greece was separated from them along PC2. To determine whether the three groups of samples can be distinguished we employed a PLS-DA classification model, the second most often used algorithm (Fig. 2B). As previously discussed, presenting the PLS-DA score plot can give misleading
provided the models with the highest prediction capabilities, that is results. Visualizing a separation of the samples does not mean
the lowest RMSEP and highest adjusted R2 (Table 3). Twelve and 7 that there is a chemical reason behind it. Simply the correlation
components were selected with this strategy, yielding ca 80% of the between many variables is a simple reason why a PLS-DA score
variance explained, which gives an RMSEP of 0.48 and 0.36, and an plot can show separation between groups [132]. To visualize the
natural grouping of the data, one should instead use a PCA score plot. In our example, the score plot obtained by PLS-DA modelling (Fig. 5C) closely agrees with the PCA score plot (Fig. 5B). Therefore, a pattern exists that is visualized by both score plots. The model achieved a cumulative predictive ability (Q²Y) of 0.73. The metric evaluates the error between the predicted and measured response variable vector (y) from the regression model. However, the metric is not a robust diagnostic criterion, since the model can be overfitted when the number of variables greatly exceeds the number of samples [132]. A solution can be permutation testing of the response variable vector (y) followed by recalculation of Q²Y. We performed 1000 permutations, obtaining a set of permuted Q²Y values that could be compared with the original Q²Y (Fig. 5D). The maximum Q²Y for the permuted sets is 0.04, which is considerably lower than the original Q²Y of 0.73 and thus validates the metric. As a simple estimate of the p-value, none of the n = 1000 permuted models reached the original Q²Y, which corresponds to the 0.001 (1/n) level. To characterize the wavenumbers that contribute most to the separation observed in the PLS-DA score plot, we extracted the loadings for the first and second LV and colored them according to the VIP score (Fig. 5E). VIP reflects the importance of each variable, here each wavenumber, in the model. The loading plot follows a triangular shape similar to that of the PLS-DA score plot, where LV1 discriminates Spain from the rest and LV2 discriminates between Greece and Italy. Examination of the VIP values shows that the interval 1650–1800 cm⁻¹ contributes strongly to the model, and that its wavenumbers contribute positively to both LV1 and LV2 (Fig. 5C and E). Therefore, this interval is important for the discrimination between Greece and Italy. The second interval, 1110–1200 cm⁻¹, matches the position of the Spain cluster in the LV space. This interval contributes less to the model according to the VIP values but still contributes to the Spain cluster (Fig. 5E). To conclude, this second study dealt with a classification task that was solved using the popular PLS-DA algorithm. We have presented the main steps that should be followed to construct, interpret and validate the model, discussing at each step the possible pitfalls and their solutions.

9. Conclusions

Chemometrics provides plenty of techniques for experimental design, exploratory analysis, and the building of calibration and classification models, as well as for their validation. The quantitative and qualitative prediction of responses based on the experimental signals acquired from the samples is of utmost importance. For example, chemometrics coupled to various spectroscopies represents a highly versatile tool for pharmaceutical analysis. In the current omics era, chemometrics has grown rapidly and captured the attention of a broad scientific community. Now more than ever, there is an awareness of what chemometrics or data analysis can offer. As the development of chemical instruments advances and they spread to all fields of science, the need to incorporate chemometric methods also increases.
We have reviewed the latest research in which chemometrics has been employed in electrochemistry, spectroscopy and hyphenated mass spectrometry over the three-year period 2018–2020. In particular, we focused on a pipeline that starts with experimental design, followed by exploratory analysis, then predictive modelling, and finishes with validation of the optimal model. Our search resulted in more than 300 research articles. This vast number represents an ongoing revolution in which chemometrics takes part and cannot be disregarded. We envisage that chemometrics will remain close to the current technological revolution.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work and AK were supported by the Polish National Science Centre (NCN) under Opus grant no. 2019/33/B/ST4/02428 (to AK) and Preludium grant no. 2018/31/N/ST4/01909 (to MDPD). Publication of this article in open access was financially supported by the Excellence Initiative - Research University (IDUB) program for the University of Wrocław.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.trac.2020.116157.

References

[1] J. Gault, I. Liko, M. Landreh, D. Shutin, J.R. Bolla, D. Jefferies, M. Agasid, H.Y. Yen, M.J.G.W. Ladds, D.P. Lane, S. Khalid, C. Mullen, P.M. Remes, R. Huguet, G. McAlister, M. Goodwin, R. Viner, J.E.P. Syka, C.V. Robinson, Combining native and ‘omics’ mass spectrometry to identify endogenous ligands bound to membrane proteins, Nat. Methods 17 (2020) 505–508. https://doi.org/10.1038/s41592-020-0821-0.
[2] R.J. Rose, E. Damoc, E. Denisov, A. Makarov, A.J.R. Heck, High-sensitivity Orbitrap mass analysis of intact macromolecular assemblies, Nat. Methods 9 (2012) 1084–1086. https://doi.org/10.1038/nmeth.2208.
[3] D. Kurouski, A. Dazzi, R. Zenobi, A. Centrone, Infrared and Raman chemical imaging and spectroscopy at the nanoscale, Chem. Soc. Rev. 49 (2020) 3315–3347. https://doi.org/10.1039/c8cs00916c.
[4] R.G. Brereton, J. Jansen, J. Lopes, F. Marini, A. Pomerantsev, O. Rodionova, J.M. Roger, B. Walczak, R. Tauler, Chemometrics in analytical chemistry – part II: modeling, validation, and applications, Anal. Bioanal. Chem. 410 (2018) 6691–6704. https://doi.org/10.1007/s00216-018-1283-4.
[5] J. Lucentini, Gene association studies typically wrong: reproducible gene-disease associations are few and far between, Sci. 18 (2004) 20–21.
[6] S. Mutter, C. Worden, K. Paxton, V.P. Mäkinen, Statistical reporting of metabolomics data: experience from a high-throughput NMR platform and epidemiological applications, Metabolomics 16 (2020) 5. https://doi.org/10.1007/s11306-019-1626-y.
[7] Y. Danilova, A. Voronkova, P. Sulimov, A. Kertész-Farkas, Bias in false discovery rate estimation in mass-spectrometry-based peptide identification, J. Proteome Res. 18 (2019) 2354–2358. https://doi.org/10.1021/acs.jproteome.8b00991.
[8] V. Czitrom, One-Factor-at-a-Time versus designed experiments, Am. Statistician 53 (1999) 126. https://doi.org/10.2307/2685731.
[9] L.S. Riter, O. Vitek, K.M. Gooding, B.D. Hodge, R.K. Julian, Statistical design of experiments as a tool in mass spectrometry, J. Mass Spectrom. 40 (2005) 565–579. https://doi.org/10.1002/jms.871.
[10] M.A. Bezerra, R.E. Santelli, E.P. Oliveira, L.S. Villar, L.A. Escaleira, Response surface methodology (RSM) as a tool for optimization in analytical chemistry, Talanta 76 (2008) 965–977. https://doi.org/10.1016/j.talanta.2008.05.019.
[11] L. Vera Candioti, M.M. De Zan, M.S. Cámara, H.C. Goicoechea, Experimental design and multiple response optimization. Using the desirability function in analytical methods development, Talanta 124 (2014) 123–138. https://doi.org/10.1016/j.talanta.2014.01.034.
[12] E.S. Hecht, A.L. Oberg, D. Muddiman, W.M. Keck, D.C. Muddiman, Optimizing mass spectrometry analyses: a tailored review on the utility of design of experiments, J. Am. Soc. Mass Spectrom. 27 (2016) 767–785. https://doi.org/10.1021/acs.analchem.5b01609.
[13] D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte, L. Kaufman, Exploration of response surfaces, in: Data Handl. Sci. Technol., Elsevier, 2003, pp. 271–291. https://doi.org/10.1016/S0922-3487(08)70230-2.
[14] G. Hanrahan, J. Zhu, S. Gibani, D.G. Patil, Chemometrics and statistics, Experimental design, in: Encyclopedia of Analytical Science, Elsevier, 2005, pp. 8–13. https://doi.org/10.1016/B0-12-369397-7/00079-0.
[15] D.K. Lloyd, J. Bergum, Application of quality by design (QbD) to the development and validation of analytical methods, in: Specif. Drug Subst. Prod. Dev. Valid. Anal. Methods, Elsevier, 2013, pp. 29–72. https://doi.org/10.1016/B978-0-08-098350-9.00003-5.
[16] S.L.C. Ferreira, R.E. Bruns, E.G.P. da Silva, W.N.L. dos Santos, C.M. Quintella, J.M. David, J.B. de Andrade, M.C. Breitkreitz, I.C.S.F. Jardim, B.B. Neto, Statistical designs and response surface techniques for the optimization of
chromatographic systems, J. Chromatogr., A 1158 (2007) 2–14. https://doi.org/10.1016/j.chroma.2007.03.051.
[17] M.D. Peris-Díaz, M.A. Sentandreu, E. Sentandreu, Multiobjective optimization of liquid chromatography–triple-quadrupole mass spectrometry analysis of underivatized human urinary amino acids through chemometrics, Anal. Bioanal. Chem. 410 (2018) 4275–4284. https://doi.org/10.1007/s00216-018-1083-x.
[18] M.D. Peris-Díaz, O. Rodak, S.R. Sweeney, A. Krężel, E. Sentandreu, Chemometrics-assisted optimization of liquid chromatography-quadrupole-time-of-flight mass spectrometry analysis for targeted metabolomics, Talanta 199 (2019) 380–387. https://doi.org/10.1016/j.talanta.2019.02.075.
[19] M. Pastell, L. Frondelius, M. Järvinen, J. Backman, Filtering methods to improve the accuracy of indoor positioning data for dairy cows, Biosyst. Eng. 169 (2018) 22–31. https://doi.org/10.1101/186353.
[20] M. Aceves-Fernandez, Artificial Intelligence: Emerging Trends and Applications, IntechOpen, 2018.
[21] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2002) 182–197. https://doi.org/10.1109/4235.996017.
[22] P. Ranjan, R. Haynes, R. Karsten, A computationally stable Approach to Gaussian process interpolation of deterministic computer simulation data, Technometrics 53 (2011) 366–378. https://doi.org/10.1198/TECH.2011.09141.
[23] N. Costa, J. Lourenço, A comparative study of multiresponse optimization criteria working ability, Chemometr. Intell. Lab. Syst. 138 (2014) 171–177. https://doi.org/10.1016/j.chemolab.2014.08.004.
[24] L. Lu, C.M. Anderson-Cook, T.J. Robinson, Optimization of designed experiments based on multiple criteria utilizing a pareto frontier, Technometrics 53 (2011) 353–365. https://doi.org/10.1198/TECH.2011.10087.
[25] J.-J. Hou, C.-M. Cao, Y.-W. Xu, S. Yao, L.-Y. Cai, H.-L. Long, Q.-R. Bi, Y.-Y. Zhen, W.-Y. Wu, D.-A. Guo, Exploring lipid markers of the quality of coix seeds with different geographical origins using supercritical fluid chromatography mass spectrometry and chemometrics, Phytomedicine 45 (2018) 1–7. https://doi.org/10.1016/j.phymed.2018.03.010.
[26] A.A. D'Archivio, F.D. Donato, M. Foschi, M.A. Maggi, F. Ruggieri, UHPLC analysis of saffron (Crocus sativus L.): optimization of separation using chemometrics and detection of minor crocetin esters, Molecules 23 (2018) 1851. https://doi.org/10.3390/molecules23081851.
[27] X. Zhang, Q. Bi, X. Wu, Z. Wang, Y. Miao, N. Tan, Systematic characterization and quantification of Rubiaceae-type cyclopeptides in 20 Rubia species by ultra performance liquid chromatography tandem mass spectrometry combined with chemometrics, J. Chromatogr., A 1581–1582 (2018) 43–54. https://doi.org/10.1016/j.chroma.2018.10.049.
[28] H. Jiang, L. Yang, X. Xing, M. Yan, X. Guo, B. Yang, Q.-H. Wang, H.-X. Kuang, Chemometrics coupled with UPLC-MS/MS for simultaneous analysis of markers in the raw and processed Fructus Xanthii, and application to optimization of processing method by BBD design, Phytomedicine 57 (2019) 191–202. https://doi.org/10.1016/j.phymed.2018.12.020.
[29] S.M. Dadou, Z. Senta-Loys, A. Almajaan, S. Li, D.S. Jones, A.M. Healy, Y. Tian, G.P. Andrews, The development and validation of a quality by design based process analytical tool for the inline quantification of Ramipril during hot-melt extrusion, Int. J. Pharm. 584 (2020) 119382. https://doi.org/10.1016/j.ijpharm.2020.119382.
[30] D. Ortiz-Aguayo, M. Bonet-San-Emeterio, M. Del Valle, Simultaneous voltammetric determination of acetaminophen, ascorbic acid and uric acid by use of integrated array of screen-printed electrodes and chemometric tools, Sensors 19 (2019) 3286. https://doi.org/10.3390/s19153286.
[31] Y. Zhang, Y. Zhou, S. Chen, Y. You, P. Qiu, Y. Ni, Analysis of the overlapped electrochemical signals of hydrochlorothiazide and pyridoxine on the ethylenediamine-modified glassy carbon electrode by use of chemometrics methods, Molecules 24 (2019) 2536. https://doi.org/10.3390/molecules24142536.
[32] M.S. Elazazy, A.A. Issa, M. Al-Mashreky, M. Al-Sulaiti, K. Al-Saad, Application of fractional factorial design for green synthesis of cyano-modified silica nanoparticles: chemometrics and multifarious response optimization, Adv. Powder Technol. 29 (2018) 1204–1215. https://doi.org/10.1016/j.apt.2018.02.012.
[33] C.A.R. Salamanca-Neto, G.G. Marcheafave, J. Scremin, E.C.M. Barbosa, P.H.C. Camargo, R.F.H. Dekker, I.S. Scarminio, A.M. Barbosa-Dekker, E.R. Sartori, Chemometric-assisted construction of a biosensing device to measure chlorogenic acid content in brewed coffee beverages to discriminate quality, Food Chem. 315 (2020) 126306. https://doi.org/10.1016/j.foodchem.2020.126306.
[34] M.R. Moghaddam, J.B. Ghasemi, P. Norouzi, F. Salehnia, Simultaneous determination of dihydroxybenzene isomers at nitrogen-doped graphene surface using fast Fourier transform square wave voltammetry and multivariate calibration, Microchem. J. 145 (2019) 596–605. https://doi.org/10.1016/j.microc.2018.11.009.
[35] E. Dinç, S. Dermiş, S. Can Akcasoy, Z. Ceren Ertekin, A new chemometric strategy in electrochemical method optimization for the quantification of cefdinir in tablets, effervescent tablets and suspension samples, Electroanalysis 32 (2020) 613–619. https://doi.org/10.1002/elan.201900574.
[36] C. Kalinke, P.R. de Oliveira, M. Bonet San Emeterio, A. González-Calabuig, M. del Valle, A. Salvio Mangrich, L. Humberto Marcolino Junior, M.F. Bergamini, Voltammetric electronic tongue based on carbon paste electrodes modified with biochar for phenolic compounds stripping detection, Electroanalysis 31 (2019) 2238–2245. https://doi.org/10.1002/elan.201900072.
[37] J.R. de Carvalho, E.L. Reis, C. Reis, O.I.C. Damasceno, A.A. Neves, A.A. Matias, Chemometric optimization of the methodology for determination of molybdenum in soils and plants by square wave adsorptive stripping voltammetry, J. Braz. Chem. Soc. 31 (2020) 716–723. https://doi.org/10.21577/0103-5053.20190235.
[38] M. Mollaei, S.M. Ghoreishi, A. Khoobi, Multivariate optimization and validation of a new procedure for simultaneous determination of folic acid and folinic acid based on enhancement effect of n-dodecylpyridinium chloride, Microchem. J. 154 (2020) 104653. https://doi.org/10.1016/j.microc.2020.104653.
[39] M. Mosleh, S.M. Ghoreishi, S. Masoum, A. Khoobi, Determination of quercetin in the presence of tannic acid in soft drinks based on carbon nanotubes modified electrode using chemometric approaches, Sensor. Actuator. B Chem. 272 (2018) 605–611. https://doi.org/10.1016/j.snb.2018.05.172.
[40] S. Wyantuti, U. Pratomo, Y.W. Hartati, D. Hendrati, H.H. Bahti, A study of green electro-analysis conducted by experimental design method for detection of Samarium as complex with diethylenetriaminepentaacetic acid (DTPA), in: AIP Conf. Proc., American Institute of Physics Inc., 2018.
[41] T. Fearn, C. Riccioli, A. Garrido-Varo, J.E. Guerrero-Ginel, On the geometry of SNV and MSC, Chemometr. Intell. Lab. Syst. 96 (2009) 22–26. https://doi.org/10.1016/j.chemolab.2008.11.006.
[42] R.J. Barnes, M.S. Dhanoa, S.J. Lister, Standard normal variate transformation and detrending of near-infrared diffuse reflectance spectra, Appl. Spectrosc. 43 (1989) 772–777. https://doi.org/10.1366/0003702894202201.
[43] M.S. Dhanoa, S.J. Lister, R. Sanderson, R.J. Barnes, The link between multiplicative scatter correction (MSC) and standard normal variate (SNV) transformations of NIR spectra, J. Near Infrared Spectrosc. 2 (1994) 43–47. https://doi.org/10.1255/jnirs.30.
[44] I.S. Helland, T. Næs, T. Isaksson, Related versions of the multiplicative scatter correction method for preprocessing spectroscopic data, Chemometr. Intell. Lab. Syst. 29 (1995) 233–241. https://doi.org/10.1016/0169-7439(95)80098-T.
[45] P.H.C. Eilers, P.M. Kroonenberg, Modeling and correction of Raman and Rayleigh scatter in fluorescence landscapes, Chemometr. Intell. Lab. Syst. 130 (2014) 1–5. https://doi.org/10.1016/j.chemolab.2013.09.002.
[46] M. Bahram, R. Bro, C. Stedmon, A. Afkhami, Handling of Rayleigh and Raman scatter for PARAFAC modeling of fluorescence data using interpolation, J. Chemom. 20 (2006) 99–105. https://doi.org/10.1002/cem.978.
[47] C.A. Stedmon, R. Bro, Characterizing dissolved organic matter fluorescence with parallel factor analysis: a tutorial, Limnol Oceanogr. Methods 6 (2008) 572–579. https://doi.org/10.4319/lom.2008.6.572.
[48] R.G. Zepp, W.M. Sheldon, M.A. Moran, Dissolved organic fluorophores in southeastern US coastal waters: correction method for eliminating Rayleigh and Raman scattering peaks in excitation-emission matrices, in: Mar. Chem., Elsevier, 2004, pp. 15–36.
[49] R.D. Jiji, K.S. Booksh, Mitigation of Rayleigh and Raman spectral interferences in multiway calibration of excitation-emission matrix fluorescence spectra, Anal. Chem. 72 (2000) 718–725. https://doi.org/10.1021/ac990418j.
[50] H.L. Wu, T. Wang, R.Q. Yu, Recent advances in chemical multi-way calibration with second-order or higher-order advantages: multilinear models, algorithms, related issues and applications, TrAC, Trends Anal. Chem. 130 (2020) 115954. https://doi.org/10.1016/j.trac.2020.115954.
[51] A. Savitzky, M.J.E. Golay, Smoothing and differentiation of data by simplified least squares procedures, Anal. Chem. 36 (1964) 1627–1639. https://doi.org/10.1021/ac60214a047.
[52] Å. Rinnan, Pre-processing in vibrational spectroscopy – when, why and how, Anal. Methods 6 (2014) 7124–7129. https://doi.org/10.1039/c3ay42270d.
[53] Å. Rinnan, F. van den Berg, S.B. Engelsen, Review of the most common pre-processing techniques for near-infrared spectra, TrAC Trends Anal. Chem. 28 (2009) 1201–1222. https://doi.org/10.1016/j.trac.2009.07.007.
[54] P. Lasch, Spectral pre-processing for biomedical vibrational spectroscopy and microspectroscopic imaging, Chemometr. Intell. Lab. Syst. 117 (2012) 100–114. https://doi.org/10.1016/j.chemolab.2012.03.011.
[55] K. Vidović, A. Kroflič, P. Jovanović, M. Sala, I. Grgić, Electrochemistry as a tool for studies of complex reaction mechanisms: the case of the atmospheric aqueous-phase Aging of catechols, Environ. Sci. Technol. 53 (2019) 11195–11203. https://doi.org/10.1021/acs.est.9b02456.
[56] Y. Bonfil, M. Brand, E. Kirowa-Eisner, Characteristics of subtractive anodic stripping voltammetry of Pb and Cd at silver and gold electrodes, Anal. Chim. Acta 464 (2002) 99–114. https://doi.org/10.1016/S0003-2670(02)00489-0.
[57] M.B. Gholivand, A.R. Jalalvand, H.C. Goicoechea, T. Skov, Chemometrics-assisted simultaneous voltammetric determination of ascorbic acid, uric acid, dopamine and nitrite: application of non-bilinear voltammetric data for exploiting first-order advantage, Talanta 119 (2014) 553–563. https://doi.org/10.1016/j.talanta.2013.11.028.
[58] M.B. Gholivand, A.R. Jalalvand, H.C. Goicoechea, R. Gargallo, T. Skov, G. Paimard, Combination of electrochemistry with chemometrics to introduce an efficient analytical method for simultaneous quantification of five opium alkaloids in complex matrices, Talanta 131 (2015) 26–37. https://doi.org/10.1016/j.talanta.2014.07.053.
[59] J. Veerbeek, A. Méndez-Ardoy, J. Huskens, Electrochemistry of redox-active guest molecules at β-cyclodextrin-functionalized silicon electrodes, ChemElectroChem 4 (2017) 1470–1477. https://doi.org/10.1002/celc.201600872.
[60] P.H.C. Eilers, I.D. Currie, M. Durbán, Fast and compact smoothing on large multidimensional grids, Comput. Stat. Data Anal. 50 (2006) 61–76. https://doi.org/10.1016/j.csda.2004.07.008.
[61] G. Mohammadi, K. Rashidi, M. Mahmoudi, H.C. Goicoechea, A.R. Jalalvand, Exploiting second-order advantage from mathematically modeled voltammetric data for simultaneous determination of multiple antiparkinson agents in the presence of uncalibrated interference, J. Taiwan Inst. Chem. Eng. 88 (2018) 49–61. https://doi.org/10.1016/j.jtice.2018.04.007.
[62] A.R. Jalalvand, M.B. Gholivand, H.C. Goicoechea, Å. Rinnan, T. Skov, Advanced and tailored applications of an efficient electrochemical approach assisted by AsLSSR-COW-rPLS and finding ways to cope with challenges arising from the nature of voltammetric data, Chemometr. Intell. Lab. Syst. 146 (2015) 437–446. https://doi.org/10.1016/j.chemolab.2015.06.017.
[63] A.R. Jalalvand, M.B. Gholivand, H.C. Goicoechea, Multidimensional voltammetry: four-way multivariate calibration with third-order differential pulse voltammetric data for multi-analyte quantification in the presence of uncalibrated interferences, Chemometr. Intell. Lab. Syst. 148 (2015) 60–71. https://doi.org/10.1016/j.chemolab.2015.09.003.
[64] İ. Süslü, E. Dinç, S. Altinöz, An application of continuous wavelet transform to electrochemical signals for the quantitative analysis, in: Math. Methods Eng., Springer, Netherlands, 2007, pp. 303–313.
[65] L. Nie, S. Wu, J. Wang, L. Zheng, X. Lin, L. Rui, Continuous wavelet transform and its application to resolving and quantifying the overlapped voltammetric peaks, Anal. Chim. Acta 450 (2001) 185–192. https://doi.org/10.1016/S0003-2670(01)01374-5.
[66] S. Wu, L. Nie, J. Wang, X. Lin, L. Zheng, L. Rui, Flip shift subtraction method: a new tool for separating the overlapping voltammetric peaks on the basis of finding the peak positions through the continuous wavelet transform, J. Electroanal. Chem. 508 (2001) 11–27. https://doi.org/10.1016/S0022-0728(01)00526-5.
[67] M. Cocchi, J.L. Hidalgo-Hidalgo-De-Cisneros, I. Naranjo-Rodríguez, J.M. Palacios-Santander, R. Seeber, A. Ulrici, Multicomponent analysis of electrochemical signals in the wavelet domain, Talanta 59 (2003) 735–749. https://doi.org/10.1016/S0039-9140(02)00615-X.
[68] M. Jakubowska, Inverse continuous wavelet transform in voltammetry, Chemometr. Intell. Lab. Syst. 94 (2008) 131–139. https://doi.org/10.1016/j.chemolab.2008.07.003.
[69] X. Zou, J. Mo, Spline wavelet analysis for voltammetric signals, Anal. Chim. Acta 340 (1997) 115–121. https://doi.org/10.1016/S0003-2670(96)00458-8.
[70] X.P. Zheng, J.Y. Mo, The coupled application of the B-spline wavelet and RLT filtration in staircase voltammetry, in: Chemom. Intell. Lab. Syst., Elsevier, 1999, pp. 157–161.
[71] X. Shao, C. Pang, S. Wu, X. Lin, Development of wavelet transform voltammetric analyzer, Talanta 50 (2000) 1175–1182. https://doi.org/10.1016/S0039-9140(99)00227-1.
[72] H.H. Maurer, Hyphenated mass spectrometric techniques - indispensable tools in clinical and forensic toxicology and in doping control, J. Mass Spectrom. 41 (2006) 1399–1413. https://doi.org/10.1002/jms.1112.
[73] R. Vettukattil, Preprocessing of raw metabonomic data, in: J.T. Bjerrum, Metabonomics (Editors), Springer protocols, 2015, pp. 123–136.
[74] M. Katajamaa, M. Orešič, Data processing for mass spectrometry-based metabolomics, J. Chromatogr., A 1158 (2007) 318–328. https://doi.org/10.1016/j.chroma.2007.04.021.
[75] L. Martens, Data management in mass spectrometry-based proteomics, in: R.J. Simpson, D.W. Greening (Editors), Serum/Plasma Proteomics, 2011, pp. 321–332.
[76] M. Daszykowski, B. Walczak, Use and abuse of chemometrics in chromatography, TrAC Trends Anal. Chem. 25 (2006) 1081–1096. https://doi.org/10.1016/j.trac.2006.09.001.
[77] P.H.C. Eilers, A perfect smoother, Anal. Chem. 75 (2003) 3631–3636. https://doi.org/10.1021/ac034173t.
[78] C.A. Smith, E.J. Want, G. O'Maille, R. Abagyan, G. Siuzdak, XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification, Anal. Chem. 78 (2006) 779–787. https://doi.org/10.1021/ac051437y.
[79] R. Tautenhahn, C. Bottcher, S. Neumann, Highly sensitive feature detection for high resolution LC/MS, BMC Bioinf. 9 (2008) 504. https://doi.org/10.1186/1471-2105-9-504.
[80] M. Katajamaa, M. Orešič, Processing methods for differential analysis of LC/MS profile data, BMC Bioinf. 6 (2005) 179. https://doi.org/10.1186/1471-2105-6-179.
[81] M. Loos, H. Singer, Nontargeted homologue series extraction from hyphenated high resolution mass spectrometry data, J. Cheminf. 9 (2017) 12. https://doi.org/10.1186/s13321-017-0197-z.
[82] M.D. Peris-Díaz, S.R. Sweeney, O. Rodak, E. Sentandreu, S. Tiziani, R-MetaboList 2: A flexible tool for metabolite annotation from high-resolution data-independent acquisition mass spectrometry analysis, Metabolites 9 (2019) 187. https://doi.org/10.3390/metabo9090187.
[83] E. Melamud, L. Vastag, J.D. Rabinowitz, Metabolomic analysis and visualization engine for LC-MS data, Anal. Chem. 82 (2010) 9818–9826. https://doi.org/10.1021/ac1021166.
[84] S. Castillo, P. Gopalacharyulu, L. Yetukuri, M. Orešič, Algorithms and tools for the preprocessing of LC-MS metabolomics data, Chemometr. Intell. Lab. Syst. 108 (2011) 23–32. https://doi.org/10.1016/j.chemolab.2011.03.010.
[85] F. Stengel, A.J. Baldwin, M.F. Bush, G.R. Hilton, H. Lioe, E. Basha, N. Jaya, E. Vierling, J.L.P. Benesch, Dissecting heterogeneous molecular chaperone complexes using a mass spectrum deconvolution approach, Chem. Biol. 19 (2012) 599–607. https://doi.org/10.1016/j.chembiol.2012.04.007.
[86] N. Morgner, C.V. Robinson, Massign: an assignment strategy for maximizing information from the mass spectra of heterogeneous protein assemblies, Anal. Chem. 84 (2012) 2939–2948. https://doi.org/10.1021/ac300056a.
[87] M.T. Marty, A.J. Baldwin, E.G. Marklund, G.K.A. Hochberg, J.L.P. Benesch, C.V. Robinson, Bayesian deconvolution of mass and ion mobility spectra: from binary interactions to polydisperse ensembles, Anal. Chem. 87 (2015) 4370–4376. https://doi.org/10.1021/acs.analchem.5b00140.
[88] J. Lu, M.J. Trnka, S.H. Roh, P.J.J. Robinson, C. Shiau, D.G. Fujimori, W. Chiu, A.L. Burlingame, S. Guan, Improved peak detection and deconvolution of native electrospray mass spectra from large protein complexes, J. Am. Soc. Mass Spectrom. 26 (2015) 2141–2151. https://doi.org/10.1007/s13361-015-1235-6.
[89] M.D. Peris-Díaz, R. Guran, O. Zitka, V. Adam, A. Krężel, Mass spectrometry-based structural analysis of cysteine-rich metal-binding sites in proteins with MetaOdysseus R software, J. Proteome Res. (2020). https://doi.org/10.1021/acs.jproteome.0c00651.
[90] B.B. Reinhold, V.N. Reinhold, Electrospray ionization mass spectrometry: deconvolution by an Entropy-Based algorithm, J. Am. Soc. Mass Spectrom. 3 (1992) 207–215. https://doi.org/10.1016/1044-0305(92)87004-I.
[91] M. Mann, C.K. Meng, J.B. Fenn, Interpreting mass spectra of multiply charged ions, Anal. Chem. 61 (1989) 1702–1708. https://doi.org/10.1021/ac00190a023.
[92] Y.H. Tseng, C. Uetrecht, S.C. Yang, A. Barendregt, A.J.R. Heck, W.P. Peng, Game-theory-based search engine to automate the mass assignment in complex native electrospray mass spectra, Anal. Chem. 85 (2013) 11275–11283. https://doi.org/10.1021/ac401940e.
[93] Z. Zhang, A.G. Marshall, A universal algorithm for fast and automated charge state deconvolution of electrospray mass-to-charge ratio spectra, J. Am. Soc. Mass Spectrom. 9 (1998) 225–233. https://doi.org/10.1016/S1044-0305(97)00284-5.
[94] D.M. Horn, R.A. Zubarev, F.W. McLafferty, Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules, J. Am. Soc. Mass Spectrom. 11 (2000) 320–332. https://doi.org/10.1016/S1044-0305(99)00157-9.
[95] N.P.V. Nielsen, J.M. Carstensen, J. Smedsgaard, Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping, J. Chromatogr., A 805 (1998) 17–35. https://doi.org/10.1016/S0021-9673(98)00021-1.
[96] Progressive peak clustering in GC-MS Metabolomic experiments applied to Leishmania parasites, Bioinformatics 22 (2006) 1391–1396. https://doi.org/10.1093/bioinformatics/btl085.
[97] M. Bellew, M. Coram, M. Fitzgibbon, M. Igra, T. Randolph, P. Wang, D. May, J. Eng, R. Fang, C. Lin, J. Chen, D. Goodlett, J. Whiteaker, A. Paulovich, M. Mcintosh, A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS, Bioinformatics 22 (2006) 1902–1909. https://doi.org/10.1093/bioinformatics/btl276.
[98] J. Engel, J. Gerretzen, E. Szymańska, J.J. Jansen, G. Downey, L. Blanchet, L.M.C. Buydens, Breaking with trends in pre-processing? TrAC Trends Anal. Chem. 50 (2013) 96–106. https://doi.org/10.1016/j.trac.2013.04.015.
[99] V. Bhavana, R.B. Chavan, M.K.C. Mannava, A. Nangia, N.R. Shastri, Quantification of niclosamide polymorphic forms – a comparative study by Raman, NIR and MIR using chemometric techniques, Talanta 199 (2019) 679–688. https://doi.org/10.1016/j.talanta.2019.03.027.
[100] A. Maléchaux, Y. Le Dreau, J. Artaud, N. Dupuy, Control chart and data fusion for varietal origin discrimination: application to olive oil, Talanta 217 (2020). https://doi.org/10.1016/j.talanta.2020.121115.
[101] M.M.W.B. Hendriks, L. Cruz-Juarez, D. De Bont, R.D. Hall, Preprocessing and exploratory analysis of chromatographic profiles of plant extracts, Anal. Chim. Acta 545 (2005) 53–64. https://doi.org/10.1016/j.aca.2005.04.026.
[102] R.A. van den Berg, H.C.J. Hoefsloot, J.A. Westerhuis, A.K. Smilde, M.J. van der Werf, Centering, scaling, and transformations: improving the biological information content of metabolomics data, BMC Genom. 7 (2006) 1–15. https://doi.org/10.1186/1471-2164-7-142.
[103] R.J.O. Torgrip, K.M. Åberg, E. Alm, I. Schuppe-Koistinen, J. Lindberg, A note on normalization of biofluid 1D 1H-NMR data, Metabolomics 4 (2008) 114–121. https://doi.org/10.1007/s11306-007-0102-2.
[104] M.A. Baldo, P. Oliveri, S. Fabris, C. Malegori, S. Daniele, Fast determination of extra-virgin olive oil acidity by voltammetry and Partial Least Squares regression, Anal. Chim. Acta 1056 (2019) 7–15. https://doi.org/10.1016/j.aca.2018.12.050.
[105] E. Peré-Trepat, S. Lacorte, R. Tauler, Alternative calibration approaches for LC-MS quantitative determination of coeluted compounds in complex environmental mixtures using multivariate curve resolution, Anal. Chim. Acta 595 (2007) 228–237. https://doi.org/10.1016/j.aca.2007.04.011.
[106] M. Cocchi, C. Durante, G. Foca, D. Manzini, A. Marchetti, A. Ulrici, Application of a wavelet-based algorithm on HS-SPME/GC signals for the classification of balsamic vinegars, Chemometr. Intell. Lab. Syst. 71 (2004) 129–140. https://doi.org/10.1016/j.chemolab.2004.01.004.
[107] I.S. Pérez, M.J. Culzoni, G.G. Siano, M.D. Gil García, H.C. Goicoechea, M.M. Galera, Detection of unintended stress effects based on a metabonomic
study in tomato fruits after treatment with carbofuran pesticide. Capabilities [131] M.G. Davies, The expectation of Mahalanobis' generalized distance, Ann. Inst.
of MCR-ALS applied to LC-MS three-way data arrays, Anal. Chem. 81 (2009) Stat. Math. 24 (1972) 111e125. https://doi.org/10.1007/BF02479743.
8335e8346. https://doi.org/10.1021/ac901119h. [132] R.G. Brereton, G.R. Lloyd, Partial least squares discriminant analysis: taking
[108] J.W. Tukey, Exploratory data analysis, in: The Concise Encyclopedia of Sta- the magic away, J. Chemom. 28 (2014) 213e225. https://doi.org/10.1002/
tistics, Springer, New York, 2008, p. 7. cem.2609.
[109] R. Bro, A.K. Smilde, Principal component analysis, Anal. Methods 6 (2014) [133] S. Wold, M. Sjostrom, L. Eriksson, PLS-regression: a basic tool of chemo-
2812e2831. https://doi.org/10.1039/c3ay41907j. metrics, Chemometr. Intell. Lab. Syst. 58 (2001) 109e130. https://doi.org/
[110] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometr. 10.1016/s0169-7439(01)00155-1.
Intell. Lab. Syst. 2 (1987) 37e52. https://doi.org/10.1016/0169-7439(87) [134] M. Andersson, A comparison of nine PLS1 algorithms, J. Chemom. 23 (2009)
80084-9. 518e529. https://doi.org/10.1002/cem.1248.
[111] K. Keerthi Vasan, B. Surendiran, Dimensionality reduction using Principal [135] S. Wol, H. Martens, H. Wol, The multivariate calibration problem in chem-
Component Analysis for network intrusion detection, Perspect. Sci. 8 (2016) istry solved by the PLS method, in: B. Kågstro €m, A. Ruhe (Editors), Matrix
510e512. https://doi.org/10.1016/j.pisc.2016.05.010. Pencils. Lecture Notes in Mathematics, Springer, Berlin, Heidelberg, 1983,
[112] S.C. Johnson, Hierarchical clustering schemes, Psychometrika 32 (1967) pp. 286e293.
241e254. https://doi.org/10.1007/BF02289588. [136] S. de Jong, SIMPLS: an alternative approach to partial least squares regres-
[113] Z. Zhang, F. Murtagh, S. Van Poucke, S. Lin, P. Lan, Hierarchical cluster sion, Chemometr. Intell. Lab. Syst. 18 (1993) 251e263. https://doi.org/
analysis in clinical research with heterogeneous study population: high- 10.1016/0169-7439(93)85002-X.
lighting its visualization with R, Ann. Transl. Med. 5 (2017) 75. https:// [137] P. Geladi, B.R. Kowalski, Partial least-squares regression: a tutorial, Anal.
doi.org/10.21037/atm.2017.02.05. Chim. Acta 185 (1986) 1e17. https://doi.org/10.1016/0003-2670(86)80028-
[114] F. Nielsen, Hierarchical Clustering. Introduction to HPC with MPI for Data 9.
Science, Springer, Cham, 2016, pp. 195e211. [138] M. Barker, W. Rayens, Partial least squares for discrimination, J. Chemom. 17
[115] Choosing a subset of principal components or variables, in: Princ. Compon. (2003) 166e173. https://doi.org/10.1002/cem.785.
Anal., Springer-Verlag, 2006, pp. 111e149. [139] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5e32. https://doi.org/
[116] P. Oliveri, C. Malegori, R. Simonetti, M. Casale, The impact of signal 10.1023/A:1010933404324.
pre-processing on the final interpretation of analytical outcomes e a tutorial, [140] M. Krzywinski, N. Altman, Classification and regression trees, Nat. Methods
Anal. Chim. Acta 1058 (2019) 9e17. https://doi.org/10.1016/j.aca.2018.10. 14 (2017) 757e758. https://doi.org/10.1038/nmeth.4370.
055. [141] P.O. Gislason, J.A. Benediktsson, J.R. Sveinsson, Random forests for land cover
[117] T. Ronan, Z. Qi, K.M. Naegle, Avoiding common pitfalls when clustering classification, in: Pattern Recognit. Lett., North-Holland, 2006, pp. 294e300.
biological data, Sci. Signal. 9 (2016). https://doi.org/10.1126/scisigna- [142] S. Janitza, R. Hornung, On the overestimation of random forest's out-of-bag error,
l.aad1932. re6. PloS One 13 (2018), e0201904. https://doi.org/10.1371/journal.pone.0201904.
[118] T. Alqurashi, W. Wang, Clustering ensemble method, Int. J. Mach. Learn. [143] B.H. Menze, B.M. Kelm, R. Masuch, U. Himmelreich, P. Bachert, W. Petrich,
Cybern. 10 (2019) 1227e1246. https://doi.org/10.1007/s13042-017-0756-7. F.A. Hamprecht, A comparison of random forest and its Gini importance with
[119] S. Tfaili, A. Al Assaad, N. Fournier, F. Allaoui, J.-L. Paul, P. Chaminade, A. Tfayli, standard chemometric methods for the feature selection and classification of
Investigation of lipid modifications in J774 macrophages by vibrational spectral data, BMC Bioinf. 10 (2009) 213. https://doi.org/10.1186/1471-
spectroscopies after eicosapentaenoic acid membrane incorporation in 2105-10-213.
unloaded and cholesterol-loaded cells, Talanta 199 (2019) 54e64. https:// [144] B.E. Boser, I.M. Guyon, V.N. Vapnik, Training algorithm for optimal margin
doi.org/10.1016/j.talanta.2019.01.122. classifiers, in: Proc. Fifth Annu. ACM Work. Comput. Learn. Theory, ACM,
[120] P. Duan, B. Liu, C.L.M. Morais, J. Zhao, X. Li, J. Tu, W. Yang, C. Chen, M. Long, New York, USA, 1992, pp. 144e152.
X. Feng, F.L. Martin, C. Xiong, 4-Nonylphenol effects on rat testis and sertoli [145] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines
cells determined by spectrochemical techniques coupled with chemometric and Other Kernel-Based Learning Methods, University Pres Cambridge,
analysis, Chemosphere 218 (2019) 64e75. https://doi.org/10.1016/ Cambridge, UK, 2013.
j.chemosphere.2018.11.086. [146] R.G. Brereton, G.R. Lloyd, Support vector machines for classification and
[121] D.C. Caixeta, E.M.G. Aguiar, L. Cardoso-Sousa, L.M.D. Coelho, S.W. Oliveira, F. regression, Analyst 135 (2010) 230e267. https://doi.org/10.1039/
S. Espindola, L. Raniero, K.T.B. Crosara, M.J. Baker, W.L. Siqueira, W.L. b918972f.
Siqueira, R. Sabino-Silva, Salivary molecular spectroscopy: a sustainable, [147] M.-J. Li, H. Xiao, Y.-X. Qiu, J.-H. Huang, R.-Y. Man, Y. Qin, G.-H. Xiong, Q.-
rapid and non-invasive monitoring tool for diabetes mellitus during insulin H. Peng, Y.-Q. Jian, C.-Y. Peng, W.-N. Zhang, W. Wang, Identification of po-
treatment, PloS One 15 (2020). https://doi.org/10.1371/journal.pone. tential diagnostic biomarkers of cerebral infarction using gas
0223461. chromatography-mass spectrometry and chemometrics, RSC Adv. 8 (2018)
[122] S. Akyuz, F. Guliyev, S. Celik, A.E. Ozel, V. Alakbarov, Investigations of the 22866e22875. https://doi.org/10.1039/C8RA03132K.
Neolithic potteries of 6th millennium BC from Go € ytepe-Azerbaijan by [148] W. Luo, J.-W. Zhang, L.-J. Zhang, W. Zhang, High-throughput untargeted
vibrational spectroscopy and chemometric techniques, Vib. Spectrosc. 105 metabolomics and chemometrics reveals pharmacological action and
(2019). https://doi.org/10.1016/j.vibspec.2019.102980. molecular mechanism of chuanxiong by ultra performance liquid chro-
[123] L.M.M. Le, B. Ke gl, A. Gramfort, C. Marini, D. Nguyen, M. Cherti, S. Tfaili, matography combined with quadrupole-time-of-flight-mass spectrom-
A. Tfayli, A. Baillet-Guffroy, P. Prognon, P. Chaminade, E. Caudron, Optimi- etry, RSC Adv. 9 (2019) 39025e39036. https://doi.org/10.1039/
zation of classification and regression analysis of four monoclonal antibodies c9ra06267j.
from Raman spectra using collaborative machine learning approach, Talanta [149] X. Yan, W. Wang, Z. Chen, Y. Xie, Q. Li, Z. Yu, H. Hu, Z. Wang, Quality
184 (2018) 260e265. https://doi.org/10.1016/j.talanta.2018.02.109. assessment and differentiation of Aucklandiae Radix and Vladimiriae Radix
[124] 
O. Zukovskaja, S. Kloß, M.G. Blango, O. Ryabchykov, O. Kniemeyer, A.A. based on GC-MS fingerprint and chemometrics analysis: basis for clinical
Brakhage, T.W. Bocklitz, D. Cialla-May, K. Weber, J. Popp, UV-Raman spec- application, Anal. Bioanal. Chem. 412 (2020) 1535e1549. https://doi.org/
troscopic identification of fungal spores important for respiratory diseases, 10.1007/s00216-019-02380-2.
Anal. Chem. 90 (2018) 8912e8918. https://doi.org/10.1021/acs.analchem. [150] J.C. García-Can ~ averas, M.D. Peris-Díaz, M.I. Alcoriza-Balaguer, M. Cerd an-
8b01038. Calero, M.T. Donato, A. Lahoz, A lipidomic cell-based assay for studying drug-
[125] J. Huang, A. Ramoji, S. Guo, T. Bocklitz, V. Boivin-Jahns, J. Mo € ller, induced phospholipidosis and steatosis, Electrophoresis 38 (2017)
M. Kiehntopf, M. Noutsias, J. Popp, U. Neugebauer, Vibrational spectroscopy 2331e2340. https://doi.org/10.1002/elps.201700079.
as a powerful tool for follow-up immunoadsorption therapy treatment of [151] F. Westad, F. Marini, Validation of chemometric models - a tutorial, Anal.
dilated cardiomyopathy-a case report, Analyst 145 (2020) 486e496. https:// Chim. Acta 893 (2015) 14e24. https://doi.org/10.1016/j.aca.2015.06.056.
doi.org/10.1039/c9an01696a. [152] R.W. Kennard, L.A. Stone, Computer aided design of experiments, Techno-
[126] H. Dies, J. Raveendran, C. Escobedo, A. Docoslis, Rapid identification and metrics 11 (1969) 137e148. https://doi.org/10.1080/00401706.1969.
quantification of illicit drugs on nanodendritic surface-enhanced Raman 10490666.
scattering substrates, Sensor. Actuator. B Chem. 257 (2018) 382e388. [153] P. Filzmoser, B. Liebmann, K. Varmuza, in: J. Chemom (Editor), Repeated
https://doi.org/10.1016/j.snb.2017.10.181. Double Cross Validation, John Wiley & Sons, Ltd, USA, 2009,
[127] S. He, S. Fang, W. Xie, P. Zhang, Z. Li, D. Zhou, Z. Zhang, J. Guo, C. Du, J. Du, pp. 160e171.
J. Du, D. Wang, Assessment of physiological responses and growth phases of [154] Y. Xu, R. Goodacre, On splitting training and validation set: a comparative
different microalgae under environmental changes by Raman spectroscopy study of cross-validation, bootstrap and systematic sampling for estimating
with chemometrics, Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 204 the generalization performance of supervised learning, J. Anal. Test. 2 (2018)
(2018) 287e294. https://doi.org/10.1016/j.saa.2018.06.060. 249e262. https://doi.org/10.1007/s41664-018-0068-2.
[128] F. Negr~ ao, J.K. Diedrich, S. Giorgio, M.N. Eberlin, J.R. Yates, Tandem mass tag [155] V. Consonni, D. Ballabio, R. Todeschini, Evaluation of model predictive ability
proteomic analysis of in vitro and in vivo models of cutaneous leishmaniasis by external validation techniques, J. Chemom. 24 (2010) 194e201. https://
reveals parasite-specific and nonspecific modulation of proteins in the host, doi.org/10.1002/cem.1290.
ACS Infect. Dis. 5 (2019) 2136e2147. https://doi.org/10.1021/ [156] D. Baumann, K. Baumann, Reliable estimation of prediction errors for QSAR
acsinfecdis.9b00275. models under model uncertainty using double cross-validation, J. Cheminf. 6
[129] R.G. Brereton, Chemometrics for Pattern Recognition, John Wiley & Sons, Ltd, (2014) 47. https://doi.org/10.1186/s13321-014-0047-1.
New York, USA, 2009. [157] E.W. Steyerberg, Validation in prediction research: the waste by data split-
[130] J.H. Friedman, Regularized discriminant analysis, J. Am. Stat. Assoc. 84 (1989) ting, J. Clin. Epidemiol. 103 (2018) 131e133. https://doi.org/10.1016/
165e175. https://doi.org/10.1080/01621459.1989.10478752. j.jclinepi.2018.07.010.

[158] J.A. Fern


andez Pierna, P. Dardenne, Soil parameter quantification by NIRS [162] C.M. Teglia, M. Guin~ ez, H.C. Goicoechea, M.J. Culzoni, S. Cerutti, Enhancement
as a Chemometric challenge at “Chimiome trie 2006”, Chemometr. Intell. of multianalyte mass spectrometry detection through response surface
Lab. Syst. 91 (2008) 94e98. https://doi.org/10.1016/j.chemolab.2007.06. optimization by least squares and artificial neural network modelling,
007. J. Chromatogr., A 1611 (2020) 460613. https://doi.org/10.1016/
[159] H. van der Voet, Comparing the predictive accuracy of models using a simple j.chroma.2019.460613.
randomization test, Chemometr. Intell. Lab. Syst. 25 (1994) 313e323. https:// [163] C. Reymond, A. Le Masle, C. Colas, N. Charon, A rational strategy based on
doi.org/10.1016/0169-7439(94)85050-X. experimental designs to optimize parameters of a liquid chromatography-
[160] H.S. Tapp, M. Defernez, E.K. Kemsley, FTIR spectroscopy and multivariate mass spectrometry analysis of complex matrices, Talanta 205 (2019)
analysis can distinguish the geographic origin of extra virgin olive oils, 120063. https://doi.org/10.1016/j.talanta.2019.06.063.
J. Agric. Food Chem. 51 (2003) 6110e6115. https://doi.org/10.1021/ [164] P.R.N. Rocha, F.A. de Freitas, C.F.F. Angolini, L.-S.F. Vasconcelos, A.L.B. da Silva,
jf030232s. E.V. Costa, F.M.A. da Silva, M.N. Eberlin, G.A. Bataglion, P.K. Soares,
[161] M. Bonet-San-Emeterio, A. Gonza lez-Calabuig, M. del Valle, Artificial P.K. Soares, H.H.F. Koolen, Statistical mixture design investigation for
neural networks for the resolution of dopamine and serotonin extraction and quantitation of aporphine alkaloids from the leaves of Uno-
complex mixtures using a graphene-modified carbon electrode, nopsis duckei R.E. Fr. by HPLCeMS/MS, Phytochem. Anal. 29 (2018)
Electroanalysis 31 (2019) 390e397. https://doi.org/10.1002/elan. 569e576. https://doi.org/10.1002/pca.2768.
201800525.
