
4.18 Chemometrics Analysis of Big Data


José Camacho, Network Engineering and Security Group, Signal Theory, Networking and Communications Department, University
of Granada, Granada, Spain
Edoardo Saccenti, Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Wageningen, The Netherlands
© 2020 Elsevier B.V. All rights reserved.

4.18.1 Introduction 437
4.18.2 Modeling Massive Volumes of Multivariate Data With PCA and PLS 438
4.18.2.1 Big Data in the Observations 438
4.18.2.2 Big Data in the Variables 439
4.18.2.3 Big Data Streams 440
4.18.3 Visualizing the Big Mode 441
4.18.3.1 Clustering to Visualize Big Data 442
4.18.3.2 Compressed Score/Loading Plots 443
4.18.4 Software: The Multivariate Exploration Data Analysis Toolbox 446
4.18.5 Case Study I: Cybersecurity Data 447
4.18.6 Case Study II: DNA Methylation Data 453
4.18.7 Challenges and the Future 456
Acknowledgment 458
References 458

4.18.1 Introduction

The technological advances of the last decades, especially the development of (i) new data transport technologies at the core of the
Internet, (ii) new Internet access technologies (Wi-Fi, 4G/5G) and devices (smartphones, wearables, sensors) and (iii) new digital
services (e.g., eHealth, multimedia), have led to the so-called Big Data era. Companies are finding new ways to optimize their oper-
ations by making the most of data: we can measure (almost) anything (almost) everywhere, take the measurements to a data center
and analyze the data in order to improve our understanding of the underlying process, regardless of whether it is an industrial line or
a sociological or biological phenomenon. In this regard, the development of Big Data is closely related to the advent of the Internet of
Things (IoT): the connection of an increasing diversity of sensing devices to the Internet. The IoT creates an avatar of the physical
world in the digital world, making everything reachable (and thus analyzable) from everywhere.
As of the third quarter of 2019, estimates of Internet traffic1,2 approach 100 terabytes (TB, 10^12 bytes) of data per second
and 7 exabytes (EB, 10^18 bytes) per day. This explosion of data includes data generated by humans and machines3: from Instagram
photos, YouTube videos and Facebook entries to Google searches and automated economic transactions. The increase of data
production is exponential: 90% of the total volume of data was generated in the last 2 years.4
The availability of new sources of data has led to the appearance of a large number of Big Data applications.5 Big Data has been
defined in different ways. The renowned consulting firm Gartner6 proposes the following definition: "Big Data is
high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information pro-
cessing that enable enhanced insight, decision making, and process automation". An aspect that is common to most Big Data defi-
nitions, also present in the previous one, is the need for new processing tools to handle the so-called V's7:

• Variety: Big Data is diverse in nature. Different sources, including unstructured and structured information, need to be properly
combined in order to make the most of the analysis. Structured information is composed of records with a fixed structure (e.g.,
an Excel data sheet) while unstructured information is the opposite (e.g., email contents).
• Volume: Exabytes, zettabytes, and even higher amounts of data are described in Big Data applications. This amount of data
needs to be handled simultaneously and requires parallel processing means.
• Velocity: In Big Data problems, a high rate of sampling is common. This further complicates the analysis and makes parallel
processing even more necessary.
Initiatives for Big Data analysis like the open software Apache projects Hadoop (http://hadoop.apache.org/), Mahout (http://
mahout.apache.org/) or Spark (http://spark.apache.org/), among others, and the companies supporting them, have fostered the
appearance of a wide ecosystem of Big Data solutions8 ranging from data storage to processing to analysis. Within this landscape,
machine learning, visual analytics and data analysis tools are principal resources. The promising results of early Big Data applica-
tions have led to a golden age for data-driven methods. However, after the typical period of inflated expectations, the integration of
machine learning in successful mainstream use cases has become a major concern.9
How has this trend impacted chemometrics? For some reason, chemometric applications have not benefited from the Big
Data approach, where Volume and Velocity are typically defined in terms of massive amounts of observations. In chemometrics,


data can be massive, but typically in terms of variables, except for industrial process applications and the like. Still, approaches to Big
Data like distributed (parallel) processing can be of interest for chemometrics, as we try to illustrate with the examples in this article.
There have been a limited number of contributions related to Big Data within the chemometrics literature. In 2014, Qin10 dis-
cussed what the Big Data model can contribute to industrial process applications. The same year, Camacho11 introduced the
Compressed Score Plots (CSPs), to visualize score plots with an unlimited number of observations. With the CSPs, we can extend
the exploratory data analysis approach in chemometrics to Big Data problems. Some months later, in 2015, Camacho
and coworkers presented the Multivariate Exploratory Data Analysis (MEDA) Toolbox for Matlab,12 with a Big Data processing
module using Principal Component Analysis (PCA) and Partial Least Squares (PLS) models. The same year, Martens13 discussed
the contribution of the chemometrics approach to Big Data. In 2016, Offroy and Duponchel14 discussed the application of Topo-
logical Data Analysis to Big Data in chemistry applications. In 2017, Vitale et al.15 presented an approach to compute multivariate
models on-the-fly for exploratory analysis, very similar to the approach in the MEDA Toolbox.
In this article, we describe a methodological extension of both PCA and PLS modeling and associated visualizations to the Big
Data scenario based on the original approaches of Refs. 11,12. The rest of the article is organized as follows. Sections "Modeling Massive
Volumes of Multivariate Data With PCA and PLS” and “Visualizing the Big Mode” present the modeling and visualization
approaches followed, respectively. In “Software: The Multivariate Exploration Data Analysis Toolbox” section we introduce the
MEDA Toolbox, a free package of Matlab routines that can be used to analyze and visualize Big Data. Sections “Case Study I: Cyber-
security Data” and “Case Study II: DNA Methylation Data” illustrate these approaches on two case studies. The first case study
relates to the application of chemometric tools to computer network security data: while this is not strictly chemometrics, it is
a nice example of what chemometrics can contribute to the Big Data arena; this case study can be fully reproduced by the reader
using the MEDA Toolbox. The second case study pertains to the analysis of a large data set containing DNA methylation measurements.
Section “Challenges and the Future” offers prospective and directions for future work and extensions of the approaches presented.

4.18.2 Modeling Massive Volumes of Multivariate Data With PCA and PLS

Most algorithms for PCA (or PLS) model fitting take the N × M data matrix X (and the N × O response matrix Y for PLS) as input.
Due to limited computer resources, in particular limited memory, this approach is infeasible when N or M grow beyond a certain
size, like in the case of Big Data sets. For instance, a 10^9 × 10^3 matrix, containing 10^12 data points, would require 8 × 10^12 bytes
(8 TB) of RAM, if working at double precision. That requirement cannot be met with common hardware. The challenge then is to
compute the model out-of-core,16 that is, without loading and retaining the whole data set in the computer memory.
A viable solution to tackle this problem is to use the cross-product matrices. For instance, in the previous example, for X of dimen-
sions 10^9 × 10^3, only 8 MB are needed to store X'X, which has dimensions 1000 × 1000, and this matrix can be computed iteratively,
so that the complete X does not need to be stored in memory. Substituting X by its cross-product removes one of the dimensions, and
this can be conveniently used to hide one single Big mode (observations or variables), making it possible to deploy chemometric tools,
like PCA and PLS, on Big Data applications without the need for high-performance hardware. Thus, the loading vectors of PCA can be
identified using the eigendecomposition (ED) of the cross-product matrix X'X for any size of N. Similarly, the loadings and weights in
PLS regression can be identified from matrices X'X and X'Y using the kernel algorithm.17–19 Conversely, if M is huge, we can compute
the PCA scores from the ED of the cross-product matrix XX', and PLS scores can be computed from XX' and YY'.
The computation of cross-product matrices has been a common choice for the iterative computation of PCA20 and it is suitable
for parallelization,21 a must in Big Data analysis. It is also a suitable approach for continuous model updating. An obvious alter-
native is data sampling, which results in an approximate model rather than an exact one.
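As a concrete illustration, the following is a minimal MATLAB sketch (toy random data, assumed variable names, not MEDA Toolbox code) of both routes: loadings from the ED of X'X when the observations are the Big mode, and scores from the ED of XX' when the variables are the Big mode.

% Toy illustration of obtaining PCA factors from cross-product matrices
% instead of from X itself (assumes X is already auto-scaled).
X = randn(1000, 36);                   % toy data: N = 1000 observations, M = 36 variables
A = 2;                                 % number of principal components

% Big N: loadings from the ED of the small M x M matrix X'X
[P, D] = eig(X' * X);
[~, order] = sort(diag(D), 'descend');
P = P(:, order(1:A));                  % loadings of the first A PCs
T = X * P;                             % scores require one extra pass over the data

% Big M: scores from the ED of the small N x N matrix XX'
[U, D2] = eig(X * X');
[d2, order2] = sort(diag(D2), 'descend');
T2 = U(:, order2(1:A)) * diag(sqrt(d2(1:A)));   % scores (up to sign) of the first A PCs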
In the following, we discuss the algorithmic details to fit a PCA/PLS model whether the Big Data mode is the observations or the
variables. Without loss of generality, we will assume the data is pre-processed by auto-scaling (i.e., mean centered and scaled to unit
variance), but other pre-processing methods may be derived following a similar procedure.

4.18.2.1 Big Data in the Observations


The iterative computation of a PCA/PLS model from Big Data in the observations is illustrated in Fig. 1. For brevity, only the x-block
processing will be discussed. If a PLS model is fitted, the y-block is treated equivalently.
To avoid problems with RAM memory limitation, the data is first split observation-wise into batches {X_1, X_2, ..., X_t, ..., X_T}, with
corresponding numbers of observations {B_1, B_2, ..., B_t, ..., B_T} and M variables. Then, a 1 × M cumulative sum is iteratively
computed following

$M^x_t = M^x_{t-1} + \sum_{i=1}^{B_t} x^i_t$    (1)

where $x^i_t$ represents the i-th observation in $X_t$ and $M^x_0$ is a vector of 0's. After all the T batches are considered, the total mean, of
dimension 1 × M, is computed as

$m^x = (1/N) \cdot M^x_T$    (2)

Fig. 1 Illustration of the iterative approach for Big Data sets in the observations.

with

$N = \sum_{t=1}^{T} B_t$    (3)

If auto-scaling is desired, the standard deviation of the blocks needs to be computed. First, a 1 × M estimate of the accumulated
variance is obtained as

$(s^x_t)^2 = (s^x_{t-1})^2 + \sum_{i=1}^{B_t} (x^i_t - m^x)^2$    (4)

for $(s^x_0)^2$ a vector of 0's. From this estimate, computed from the T batches, the standard deviation, of dimension 1 × M, can be
derived as

$s^x = \sqrt{(1/(N-1)) \cdot (s^x_T)^2}$    (5)

The auto-scaled version of the i-th observation of $X_t$ is given by

$\tilde{x}^i_t = (x^i_t - m^x) \oslash s^x$    (6)

where $\oslash$ denotes the Hadamard (element-wise) division.

The $\tilde{x}^i_t$ are arranged in a $B_t \times M$ matrix $\tilde{X}_t$, and the cross-product matrix is then updated as

$(X'X)_t = (X'X)_{t-1} + \tilde{X}'_t \cdot \tilde{X}_t$    (7)

for $(X'X)_0$ equal to the suitable matrix of 0's.

The same procedure is followed for a response data matrix to obtain the cross-product (X'Y):

$(X'Y)_t = (X'Y)_{t-1} + \tilde{X}'_t \cdot \tilde{Y}_t$    (8)

for $(X'Y)_0$ equal to the suitable matrix of 0's.

Finally, an exact PCA model of the data can be fitted with the ED of $(X'X)_T$, and an exact PLS model with the kernel algorithm from
$(X'X)_T$ and $(X'Y)_T$.
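A minimal MATLAB sketch of Eqs. (1)-(8) follows. The batches are held in a cell array purely for illustration (in practice each one would be read from disk in turn), the variable names are assumptions, and this is not the MEDA Toolbox implementation (its 'update_iterative' routine, used in the case study below, covers this scenario).

% Sketch of Eqs. (1)-(8) with toy data; assumed names, not MEDA Toolbox code.
batches = {randn(5000, 36), randn(5000, 36), randn(3000, 36)};   % {X1, ..., XT}
M = size(batches{1}, 2);

Mx = zeros(1, M); N = 0;                         % Eq. (1): cumulative sum
for t = 1:numel(batches)
    Mx = Mx + sum(batches{t}, 1);
    N  = N + size(batches{t}, 1);                % Eq. (3)
end
mx = Mx / N;                                     % Eq. (2): total mean

sx2 = zeros(1, M);                               % Eq. (4): accumulated variance
for t = 1:numel(batches)
    sx2 = sx2 + sum((batches{t} - mx).^2, 1);
end
sx = sqrt(sx2 / (N - 1));                        % Eq. (5): standard deviation

XX = zeros(M, M);                                % Eq. (7): cross-product matrix
for t = 1:numel(batches)
    Xt = (batches{t} - mx) ./ sx;                % Eq. (6): auto-scaling
    XX = XX + Xt' * Xt;
end

[P, D] = eig(XX);                                % exact PCA from the ED of (X'X)_T
[~, order] = sort(diag(D), 'descend');
P = P(:, order(1:2));                            % loadings of the first two PCs
% For PLS, (X'Y)_T would be accumulated as in Eq. (8) and the kernel algorithm
% of Refs. 17-19 applied to (X'X)_T and (X'Y)_T.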

4.18.2.2 Big Data in the Variables


The iterative computation of a PCA and PLS model from a data matrix X that is Big in the variables mode is illustrated in Fig. 2.
The data matrix X is first split variable-wise into T batches {X_1, X_2, ..., X_t, ..., X_T}, each one containing N observations and {B_1, B_2,
..., B_t, ..., B_T} variables, and each batch is independently preprocessed, e.g., applying auto-scaling.
The mean $m^x_t$ for the t-th batch, of size 1 × B_t, is

$m^x_t = \frac{1}{N} \sum_{i=1}^{N} x^i_t$    (9)

where $x^i_t$ is the i-th row in batch $X_t$. The corresponding 1 × B_t vector of standard deviations $s^x_t$ is given by

Fig. 2 Illustration of the iterative approach for Big Data sets in the variables.

$s^x_t = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x^i_t - m^x_t)^2}$    (10)

and the auto-scaled version $\tilde{x}^i_t$ of $x^i_t$ is given by

$\tilde{x}^i_t = (x^i_t - m^x_t) \oslash s^x_t$    (11)

Then, the cross-product matrices are computed as

$(XX')_t = (XX')_{t-1} + \tilde{X}_t \cdot \tilde{X}'_t$    (12)

$(YY')_t = (YY')_{t-1} + \tilde{Y}_t \cdot \tilde{Y}'_t$    (13)

for $(XX')_0$ and $(YY')_0$ equal to the suitable matrices of 0's.

4.18.2.3 Big Data Streams


The problem of fitting a PCA or PLS model when dealing with high velocity data (i.e., a Big Data stream) is similar to the problem of
fitting such models to a data matrix X which is Big in the observation mode, which has been discussed in “Big Data in the Obser-
vations” section. The main difference is that usually a data stream is not even stored in external memory (e.g., a hard drive, storage
server or data center), and therefore the model needs to be fitted on-the-fly from incoming data.22 The solution based on the compu-
tation of cross-product matrices can also be applied to this scenario, but in this case the pre-processing parameters need to be
updated as the data flows.
In order to account for the non-stationarity in the data, the update strategy can follow an Exponentially Weighted Moving
Average (EWMA) procedure, as proposed in Ref. 23. (Please note that the EWMA approach in Ref. 23 is slightly different from the
more traditional EWMA in which weights for past and current information are normalized to 1.) The procedure makes use of
a forgetting factor λ in the interval [0, 1], such that for λ = 0 the model is fitted using only the current data, while for λ = 1 all
past data at any time are used with the same relevance to fit the model (no forgetting factor).
The iterative computation of a PCA and PLS model from a Big Data stream is illustrated in Fig. 3. For a batch $X_t$ containing $B_t$
observations of M variables at time t, the EWMA update of the mean is computed in two steps. First, the 1 × M cumulative sum is
calculated as:

$M^x_t = \lambda \cdot M^x_{t-1} + \sum_{i=1}^{B_t} x^i_t$    (14)

for $M^x_0$ a vector of 0's. Then, the actual mean is computed as:

$m^x_t = \frac{1}{N_t} M^x_t$    (15)

with $N_t$ also computed using an EWMA, starting from $N_0 = 0$:

Fig. 3 Illustration of the EWMA approach for Big Data streams.

$N_t = \lambda \cdot N_{t-1} + B_t$    (16)

Following Eq. (16), if data batches are of the same size (i.e., $B_t = B$), $N_t$ converges to $B/(1-\lambda)$.
If data are to be auto-scaled, the standard deviation of the blocks needs to be computed. First, an EWMA estimate of the 1 × M
vector of accumulated variance is calculated as:

$(s^x_t)^2 = \lambda \cdot (s^x_{t-1})^2 + \sum_{i=1}^{B_t} (x^i_t - m^x_t)^2$    (17)

with $(s^x_0)^2$ a vector of 0's. From this estimate, the standard deviation is calculated:

$s^x_t = \sqrt{\frac{1}{N_t - 1} (s^x_t)^2}$    (18)

The auto-scaled version $\tilde{x}^i_t$ of the i-th observation of $X_t$ is given by

$\tilde{x}^i_t = (x^i_t - m^x_t) \oslash s^x_t$    (19)

The $\tilde{x}^i_t$ are arranged in a $B_t \times M$ matrix $\tilde{X}_t$, and the cross-product matrix is computed after preprocessing from:

$(X'X)_t = \lambda \cdot (X'X)_{t-1} + \tilde{X}'_t \cdot \tilde{X}_t$    (20)

Similarly, matrix $(X'Y)_t$ is computed to obtain a PLS model. Note that in this case we obtain T different PCA/PLS models, which
are updated on-the-fly once each batch of data is available.
An alternative to handle projection models from non-stationary data streams is the use of recursive model fitting approaches like Ref. 24.
However, this is less suitable for visualization considering that it may be necessary to compute several variants of a model throughout
a data analysis, e.g., after outlier removal or variable selection. For this, computing intermediate parameters like cross-product
matrices has the advantage that different models, with different subsets of variables and/or observations, may be obtained straight-
forwardly. A similar approach to the one discussed here for handling Big Data streams was developed independently in Ref. 15.
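The EWMA update of Eqs. (14)-(20) can be condensed into a single routine. The sketch below (assumed function and field names, not the MEDA Toolbox 'update_ewma' routine mentioned in the case study below) keeps the pre-processing parameters and the cross-product matrix in a small state structure refreshed with every incoming batch.

function state = ewma_update(state, Xt, lambda)
% Sketch of Eqs. (14)-(20) for one incoming batch Xt (Bt x M) of a data stream.
% state fields, all initialized to zeros of the appropriate size:
%   Mx (1 x M), sx2 (1 x M), N (scalar), XX (M x M). Assumed names.
Bt = size(Xt, 1);
state.Mx  = lambda * state.Mx + sum(Xt, 1);               % Eq. (14)
state.N   = lambda * state.N + Bt;                        % Eq. (16)
mx        = state.Mx / state.N;                           % Eq. (15)
state.sx2 = lambda * state.sx2 + sum((Xt - mx).^2, 1);    % Eq. (17)
sx        = sqrt(state.sx2 / (state.N - 1));              % Eq. (18)
Xs        = (Xt - mx) ./ sx;                              % Eq. (19)
state.XX  = lambda * state.XX + Xs' * Xs;                 % Eq. (20)
end

At any time t, an up-to-date PCA model follows from the ED of state.XX, so a refreshed model of the stream is available after every batch; for PLS, an analogous field holding $(X'Y)_t$ would be maintained.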

4.18.3 Visualizing the Big Mode

In the previous section we have discussed how to compute PCA/PLS models from Big Data sets using cross-product matrices. The
trick is to hide the Big Data mode (columns or rows) within the cross-product. A limitation is that we cannot obtain the correspond-
ing factors in the Big Data mode from the cross-product: either the scores for Big Data in the observations or loadings for Big Data in
the variables. One solution is to perform another iteration through the T batches of data to compute these factors. However, with
this solution the number of factors (scores or loadings) per component remains Big, and in practice they cannot be visualized.
We illustrate this limitation using a data set from an industrial continuous process collected by Perceptive Engineering LTD
(http://www.perceptiveapc.com/).25 The data set was collected during a period of more than 4 days of continuous operation of
a fluidized bed reactor fed with four reactants. The collection rate is one observation every 20 s. The data consist of 18,887 observations on 36 process
variables including feed flows, temperatures, pressures, vent flow and steam flow; the observations cover 22 different operational
points of the process.
Fig. 4 shows the score plot of the first two principal components. The plot contains 18,887 points, which, strictly speaking,
cannot be considered Big Data. However, this number is already too large for proper visualization. There are clouds of dots for
each operational point, some of them overlapping and hiding part of the others: this makes the plot poorly interpretable. In
a true Big Data scenario, with millions or billions of dots, the computer cannot even render the plot. A solution to this visu-
alization problem is the use of Compressed Scatter Plots,11 which will be illustrated in the following section.
Fig. 4 Score plot for the first two PCs of the PCA model of the Perceptive data set.

4.18.3.1 Clustering to Visualize Big Data


The visualization problem shown in Fig. 4 can be tackled using clustering techniques,26,27 density plots28 or other forms of bivariate
histograms (see Ref. 29 and references therein). In this article, we employ clustering techniques, since we can use them for visual-
ization while retaining access to, and understanding of, each single element in the data. Density plots substitute data by density distri-
butions, and therefore the link with the original elements is lost. Bivariate histograms are based on tessellating a bivariate subspace
with a regular grid of bins of a specific form (e.g., rectangles or hexagons) and counting the number of points contained in each
bin30; these methods are faster than clustering but are less visually representative of the data, especially when several classes
have to be visualized. Furthermore, clustering is complementary to the model calibration approach in the previous section, as will
be shown.
Clustering has been widely considered in exploratory analysis.31 Let us illustrate the clustering approach with the Perceptive data
set. The Big Data mode is the observations. The resulting Compressed Scores Plot (CSP) in the PCA subspace is shown in Fig. 5. For
the sake of interpretability, multiplicity (number of elements represented by each centroid) is included in the plot by using different
marker sizes. This CSP design makes an efficient use of three visualization components32:
1. Location: To represent the data distribution
2. Marker Size: Also to represent the data distribution
3. Color: To distinguish among classes.
The procedure used to visualize the multiplicity is a principal element of a CSP design. An alternative CSP is shown in Fig. 6, where the multiplicity
is shown on the Z-axis. Comparing the original score plot in Fig. 4 with the CSPs, it can be seen that the distribution of the scores of
the different operational points is clearer in the latter visualization, particularly in the cluttered zone. From the CSP, we can also
visualize the distribution of the original observations belonging to a cluster in the projection sub-space, as illustrated in Fig. 7.

Fig. 5 Compressed score plot of the Perceptive data set and PCA. Multiplicity is shown in the size of the markers. Camacho, J. Visualizing Big Data
With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.
Fig. 6 Compressed score plot of the Perceptive data set and PCA. Multiplicity is shown in a third dimension. Camacho, J. Visualizing Big Data With
Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.

Fig. 7 Recovering the 52 original observations corresponding to a cluster in the compressed score plot of the Perceptive data set and PCA. Cama-
cho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.

For comparison purposes, in Fig. 8 two views of a bivariate histogram based on a regular grid are shown. A histogram is much
faster to compute than a clustered plot. However, the fidelity of the former is also reduced in comparison to the latter. Furthermore,
a principal limitation is how to represent class information, in particular when there is a high number of classes. Neither the marker
form (Fig. 8A) nor a third dimension (Fig. 8B) are adequate choices for that purpose. The most popular forms of bivariate plots are
density plots and hexagonal binning plots. An example of the corresponding plots for the PCA subspace of the Perceptive data
set, without class information, is shown in Fig. 9. Again, the main shortcomings are the limited fidelity and the difficulty of including
the class information.

4.18.3.2 Compressed Score/Loading Plots


The application of clustering methods to compress score/loading plots is not straightforward. These plots are used for data interpre-
tation. Therefore, the clustering should not destroy significant details or introduce artifacts in the data distribution. Also, since the
data is massive, a fast clustering approach is needed. Jain34 defines five categories of efficient clustering techniques for large data sets:

• Efficient nearest neighbor search.
• Data summarization.
• Distributed computing.
• Incremental clustering.
• Sampling-based methods.
Fig. 8 Bivariate histogram of the Perceptive data set and PCA. Multiplicity is shown in the size of the markers. Classes are shown in the form of the
markers and in a third dimension: (A) X-axis vs. Y-axis and (B) rotated view to inspect the classes. Camacho, J. Visualizing Big Data With
Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.

Fig. 9 Classless density plot (top) and hexagonal binning plot computed with the EDA toolbox (bottom) of the Perceptive data set and PCA.
Camacho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.

The clustering algorithm defined in Ref. 11, which we use here, is a variant of the algorithm of Bradley et al.35 It belongs to the
incremental clustering category and only requires one scan of the data set. Moreover, it can be easily extended to distributed
computing (parallelization), or combined with sampling-based or summarization methods.
To define the grouping criterion, a measure of the similarity between observations/variables is employed, which is usually
distance-based. Common distance metrics are the Euclidean distance and the Mahalanobis distance, and the metric should be consistent
with the subspace of the plot, in order to minimize the distortion due to the clustering. In general, the quadratic distance can be
expressed as:

$d_K(x_i, x_j) = \sqrt{(x_i - x_j)' R K^{-1} R' (x_i - x_j)}$    (21)

where $x_i$ and $x_j$ can either refer to observations or variables, depending on the data mode that is Big. R contains the parameters of the
model fitted following the previous section. These can be either loadings or scores, again depending on the data mode that is Big:
observations or variables, respectively. The matrix K is an A × A identity matrix if the Euclidean distance is considered, or the suitable
covariance matrix in the case of the Mahalanobis distance. The choice between Euclidean and Mahalanobis distances depends on
the procedure to depict the score/loading plots. When both dimensions in the plot are of similar size regardless of the magnitude in the

margins (i.e., of the variance of the components), the Mahalanobis distance is a more suitable choice. In Fig. 5, where we illustrated the
clustering approach with the Perceptive data set, the Mahalanobis distance in the PCA subspace corresponding to the first 2 PCs was
used to perform the clustering. We also accommodate the display aspect ratio.
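In MATLAB terms, Eq. (21) can be evaluated as in the following hypothetical helper (the function and variable names are assumptions, not MEDA Toolbox code):

function d = subspace_dist(xi, xj, R, K)
% Quadratic distance of Eq. (21) in the model subspace.
% xi, xj : 1 x M rows (observations, or variables when the variables are Big)
% R      : M x A model parameters (loadings or scores, see text)
% K      : A x A matrix; eye(A) for Euclidean, score covariance for Mahalanobis
dx = (xi - xj) * R;        % project the difference onto the A-dimensional subspace
d  = sqrt(dx * (K \ dx')); % quadratic form with K^-1
end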
The features of the clustering are summarized as follows:

• Incremental clustering of the data set, previously arranged into batches of data and pre-processed.
• Mahalanobis or Euclidean distance in the model subspace following (21).
• When the data sets contain different classes (in the Big Data mode), the clustering is applied for each one separately.
The clustering algorithm is presented in Algorithm 1. In coherence with the model building approach in the previous section, the
input data set X is divided into T batches of $B_t$ observations or variables, the data is pre-processed and the model is fitted. Then, the
distance is selected. From the batches of data, a set of centroids is iteratively computed using the merge() routine. The computation
of centroids is optimized for a given definition of K. The number of original elements (observations or variables in the Big Data
mode) represented by each centroid, i.e., its multiplicity, is stored in m. The multiplicity of individual elements is 1. Therefore,
each time a new batch of data $X_t$ is joined to the previously computed clusters, a vector of $B_t$ ones, $1_{B_t}$, is included in the vector
of multiplicities m. An illustration of the merge() procedure is given in Fig. 10.

Algorithm 1 Clustering algorithm.

[X_1, ..., X_T] ← partition(X)
[X̃_1, ..., X̃_T] ← preprocess(X_1, ..., X_T)
R ← model(X̃_1, ..., X̃_T)
Select K^{-1}
C = [ ]
m = [ ]
for each batch X_t,
    C ← [C, X_t]
    m ← [m, 1_{B_t}]
    [C, m] ← merge(C, m, R K^{-1} R')
end

The merge() routine in the clustering algorithm is presented in Algorithm 2. In the routine, the pair of elements/centroids with
minimum distance in C are iteratively replaced by their centroid, and the multiplicities are conveniently recomputed. For elements/
centroids in different classes, the distance is set to infinity. This replacement operation is repeated until only Lend elements remain
in C.
When all the batches of data have already been processed by the clustering algorithm, a reduced data set of centroids is provided,
along with the associated multiplicities. The number of remaining centroids is Lend. Lend is user-defined and should be chosen so that
the visualization of the plot is adequate. Typically, a total of 100–300 points are adequately visualized in a compressed plot.

Fig. 10 Illustration of the merge procedure: the two closest elements are combined in a cluster of multiplicity 2. Camacho, J. Visualizing Big Data
With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.

Algorithm 2 Merge routine in the clustering algorithm.

[C, m] ← merge(C, m, K̃^{-1}):
    L ← #(C)
    C ← [c_1, ..., c_L]
    m ← [m_1, ..., m_L]
    while (#(C) > Lend),
        (c_i, c_j) ← min_dist(C, K̃^{-1})
        c_i ← centroid(m_i, c_i, m_j, c_j)
        C ← [c_1, ..., c_{j-1}, c_{j+1}, ..., c_L]
        m_i ← m_i + m_j
        m ← [m_1, ..., m_{j-1}, m_{j+1}, ..., m_L]
    end
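A compact MATLAB sketch of the merge step is given below. It is a deliberately simplified, single-class stand-in for Algorithms 1-2 (no per-class handling, no EWMA discounting, and the pairwise distances are recomputed from scratch at every merge), with assumed names; the MEDA Toolbox implementation differs.

function [C, m] = merge_sketch(C, m, R, K, Lend)
% Greedily merge the closest pair of elements/centroids in the model subspace,
% following Eq. (21), until only Lend centroids remain.
% C: L x M centroids, m: L x 1 multiplicities, R: M x A model parameters,
% K: A x A matrix (eye(A) for Euclidean, score covariance for Mahalanobis).
U  = chol(K);                               % K = U'*U
TW = (C * R) / U;                           % whitened scores: Euclidean distance here equals Eq. (21)
while size(C, 1) > Lend
    s = sum(TW.^2, 2);
    D = s + s' - 2 * (TW * TW');            % squared pairwise distances
    D(1:size(D, 1) + 1:end) = Inf;          % ignore self-distances
    [~, k] = min(D(:));
    [i, j] = ind2sub(size(D), k);
    C(i, :)  = (m(i)*C(i, :) + m(j)*C(j, :)) / (m(i) + m(j));  % multiplicity-weighted centroid
    m(i)     = m(i) + m(j);
    TW(i, :) = (C(i, :) * R) / U;
    C(j, :) = []; m(j) = []; TW(j, :) = []; % drop the absorbed element
end
end

The multiplicity weighting in the centroid update keeps each centroid equal to the average of all the original elements it has absorbed, which is what allows the original distribution to be approximated by a few hundred points.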

When the input is a Big Data stream, an EWMA approach for the compressed plots can also be used. The clustering algorithm can
be straightforwardly extended to the EWMA law in the updating of the multiplicities. For each batch of data, the merge() is launched
after the following update of the multiplicities:

$m \Leftarrow [\lambda \cdot m, 1_{B_t}]$    (22)

Also, centroids with multiplicities below a given value (say 0.1) are discarded. The EWMA update of a CSP model is illustrated in
Fig. 11. In Fig. 12, the EWMA CSP for the Perceptive data set with λ equal to 0.9 is shown. Only the last 10 operational points of the
process remain in the plot, while the others have been forgotten by the EWMA law.

4.18.4 Software: The Multivariate Exploration Data Analysis Toolbox

The Multivariate Exploratory Data Analysis (MEDA) Toolbox12 is a set of multivariate analysis tools for data exploration written in
Matlab. In the MEDA Toolbox, traditional exploratory plots based on PCA or PLS(-DA), such as score, loading and residual plots,
are combined with methods like MEDA,36 oMEDA,37 SVI plots,38 sparse PCA39 and sparse PLS,40 ASCA41,42 or the Group-wise
models,43–45 a recent class of methods to calibrate sparse component models. Moreover, other useful tools such as cross-
validation and double cross-validation algorithms, Multivariate Statistical Process Control (MSPC) charts and data simulation/
approximation algorithms (ADICOV, SimuleMV) are included in the toolbox. Finally, several of the aforementioned exploratory
tools are extended for use with Big Data with an unlimited number of observations. In this article, for the first time, we also extend
some of the functionality to an unlimited number of variables, a functionality that will be incorporated in version 1.3 of the toolbox. A
view of the MEDA Toolbox is shown in Fig. 13.
There are two ways to work with the MEDA Toolbox: using the GUI (for starting users) and using the commands (for expert users). The
GUI is self-explanatory. The commands provide more functionality. Each command includes help information, which is dis-
played by typing "help <command>" in the command line of Matlab. This help information includes examples of use.
Also, in the "Examples" directory within the toolbox, several real data examples are included.
The toolbox is freely available on GitHub at https://github.com/josecamachop/MEDA-Toolbox. You may download the latest
version or track any code change using 'git', a version-control system for tracking changes in source code, whose
use is straightforward with the GitHub desktop application. Tracking changes is recommended since the MEDA Toolbox is in
continuous evolution. Installation instructions and tutorial information are provided with the Toolbox. The MEDA Toolbox is
free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version 3. Contributions
to the Toolbox are always welcome.

Fig. 11 Illustration of three consecutive CSPs from PCA using the EWMA law. Camacho, J. Visualizing Big Data With Compressed Score Plots:
Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.
Fig. 12 Exponentially weighted moving average compressed score plot of the Perceptive data set and PCA. Multiplicity is shown in the size of the
markers and λ = 0.9.

Fig. 13 Illustration of the MEDA Toolbox for Matlab: the main GUIs at the left and some examples of visualization at the right: a score plot and
a diagnosis plot.

4.18.5 Case Study I: Cybersecurity Data

The data set considered in this first Case Study was generated by the 1998 DARPA Intrusion Detection evaluation Program, prepared
and managed by MIT Lincoln Labs.46,47 The objective of this program was to survey and evaluate research in networking intrusion
detection to improve the security of communication networks (e.g., the Internet). For that, a large data set with network traffic simu-
lated in a military network environment, including a wide variety of intrusions, was provided. While this data set is not related to
chemometrics, the data is highly multivariate and Big in the observations mode, providing a good illustration of our approach. Besides,
there is recent interest in the application of chemometric tools in the area of cybersecurity.48,49

The data set includes 4,844,253 observations. The observations belong to 22 different classes, one class for normal traffic and the
remaining for different types of network attacks. For illustrative purposes, the analysis will be restricted to two types of attacks, smurf
and neptune, and normal traffic. These three classes represent 99.3% of the total traffic in the data set. For each connection, 42
variables are computed, including numerical and categorical variables. To consider categorical variables, one dummy variable per
category is included in the data set. The resulting data set is of dimension 4,844,253 × 122. This was split into 489 batches of data, of
10,000 × 122 per batch, except for the last batch of 4,253 × 122.
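The dummy coding of a categorical variable can be sketched as follows (toy values and assumed variable names, shown only for illustration):

% Toy sketch of the dummy coding used for categorical variables: one 0/1
% column per category, appended to the numerical variables.
protocol = {'tcp'; 'udp'; 'icmp'; 'tcp'; 'udp'};   % toy categorical variable
[cats, ~, code] = unique(protocol);                % categories and integer codes
D = double(code == 1:numel(cats));                 % 5 x 3 block of dummy variables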
The data is included as an example in the MEDA Toolbox, so reproducibility of this case study is straightforward. A glimpse of the
code needed to run the example is shown in Fig. 14. This code can be found under the example folder of the MEDA Toolbox, in
Networkmetrics/KDD/run.m. The first part of the code sets the principal choices of the analysis, namely: type of model (PCA/PLS),

Fig. 14 Glimpse of the code for the section “Case Study I: Cybersecurity Data”. The code can be found under the example folder of the MEDA
Toolbox, in Networkmetrics/KDD/run.m. Camacho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Che-
mom. Intell. Lab. Syst. 2014, 135, 110–125.

type of data (iterative for Big Data in the observations, EWMA for Big Data streams), number of latent variables, preprocessing
method and number of clusters in the compressed plots. After this, the main code is the part for Model Building. We can see
that only a few lines of code are needed. The routine ’update_iterative’ is used for Big Data in the observations, and ’update_ewma’
for Big Data streams. The remainder of the code is used to visualize the models and data.
Let us go through the code in detail, so that the interested reader can reproduce the example. We will start with ’update_iterative’,
for which 'Lmodel.update' (line 4 in the code in Fig. 14) should be set to 2. For command line help information, type:
>> help update_iterative
As for the other parameters in the Lmodel, we leave them as in Fig. 14. The code will fit a PLS model of the data. We also add a line in
the first part of the script with:
Lmodel.path = 'out/'
in order to specify the output directory. (The directory needs to exist, so we need to create it before the analysis is started.)
The first argument of 'update_iterative' is 'short_list'. This includes a subset of the data, so that the computational time is reduced
in the example. We will change it to 'list', so that the complete data will be considered. The second argument is the path for the input
files, which we leave as ' '. The third argument is the Large Model ('Lmodel') we have initialized in the first part of the code. The fourth
argument is the updating step in each file, and we set it to 1% (0.01). We will modify the fifth argument to 1 in order to create a file
system with the cluster information (note this will require around 7 GB of free storage). This is necessary to inspect data in detail,
e.g., to investigate outliers. Finally, the last argument controls the amount of debugging info. We leave it at 1.
Therefore, the call should look as follows:
>> Lmodel = update_iterative(list, ' ', Lmodel, step, 1, 1);
This computation takes 90 min on a regular computer (Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz, 16 GB of RAM and Windows
10) when the fifth argument is set to 0, and more than 5 h if the new file system is created. The most computationally intensive parts
are the file system creation and the clustering. Future improvements of this approach include parallelization.48 The resulting
Lmodel structure is depicted in Fig. 15. The parameters are:

• Lmodel.centr: [N × M], centroids of the clusters of observations.
• Lmodel.centrY: [N × L], responses of the centroids of the clusters of observations.
• Lmodel.nc: scalar, number of clusters in the model.
• Lmodel.multr: [Lmodel.nc × 1], multiplicity of each cluster.
• Lmodel.class: [Lmodel.nc × 1], class associated to each cluster.
• Lmodel.vclass: [M × 1], class associated to each variable.
• Lmodel.N: scalar, number of effective observations in the model.
• Lmodel.type: scalar, PCA (1) or PLS (2).
• Lmodel.update: scalar, EWMA (1) or ITERATIVE (2).
• Lmodel.XX: [M × M], sample cross-product matrix of X.
• Lmodel.lvs: scalar, number of latent variables (e.g., lvs = 1:2 selects the first two LVs).
• Lmodel.prep: scalar, preprocessing of the data (0: no preprocessing, 1: mean centering (default), 2: auto-scaling).
• Lmodel.av: [1 × M], sample average according to the preprocessing method.
• Lmodel.sc: [1 × M], sample scale according to the preprocessing method.
• Lmodel.weight: [1 × M], weight applied after the preprocessing method.
• Lmodel.updated: [Lmodel.nc × 1], specifies whether a data point is new.
• Lmodel.obs_l: {Lmodel.nc strings}, label of each cluster.
• Lmodel.var_l: {Lmodel.nc strings}, label of each variable.
• Lmodel.mat: [M × A], distance matrix.
• Lmodel.prepy: scalar, preprocessing of the data (0: no preprocessing, 1: mean centering (default), 2: auto-scaling).
• Lmodel.avy: [1 × L], sample average according to the preprocessing method.
• Lmodel.scy: [1 × L], sample scale according to the preprocessing method.
• Lmodel.weighty: [1 × L], weight applied after the preprocessing method.
• Lmodel.XY: [M × L], sample cross-product matrix of X and Y.
• Lmodel.YY: [L × L], sample cross-product matrix of Y.
• Lmodel.index_fich: {Lmodel.nc strings}, file system for ITERATIVE models.
• Lmodel.path: string, path to the file system for ITERATIVE models.
From the PLS-DA Lmodel structure, we can compute a number of visualizations. The Compressed Score Plot (CSP) of the first 2 LVs
is shown in Fig. 16. The code to obtain this plot is:
>> Lmodel = scores_Lpls(Lmodel);
We can see that the three classes are located in different parts of the plot, indicating potential differences among them. Note most
labels are not shown in order to avoid blurring the plot. Those clusters with more than one observation are identified by big circles,
Fig. 15 Elements of the Lmodel structure in the KDD data.

the bigger the more observations in the cluster, and by labels starting with 'MEDA' followed by '<#batch>o<#observation>c<
#class>', where <#batch> is the index of the batch for the first observation of the cluster, <#observation> is the index of that
observation and <#class> the class. For example, the largest green circle has the label 'MEDA52o5940c11', meaning that it was orig-
inally started by observation 5940 in batch 52, and that it belongs to class 11. We can also identify individual observations, like
'1756' or '7776'.
To get more detail on cluster 'MEDA52o5940c11', we can go to the corresponding files in the file system, as illustrated in Fig. 17.
The first file, in the upper left corner, contains pointers to a set of secondary files, which store the original observations that make up
the cluster. Each file contains three numbers in the first line: the type of content (0: raw observations, 1: file list), the number of
elements stored and the class. Since the file with pointers contains more than 8 K of them, and each
secondary file contains 100 observations, the cluster in the figure represents more than 800 K observations. Looking at the detail
of the secondary file, we can see that observations '5940' and '5959' belong to the cluster. With this structure, we retain the original
information but organized in an optimal way, following the clustering performed, so that we can easily retrieve the observations/
clusters with interesting behavior for further inspection, e.g., using figures of cluster scores like Fig. 18. Individual observations that
were not added to any cluster during the iterative modeling phase are directly stored in 'Lmodel.centr' (see Fig. 15).
In Fig. 19 we show the MEDA plot36 of the PLS-DA Lmodel. With MEDA we can inspect the relationships among variables. The
code to obtain this plot is:
>> [map,ind,ord] = meda_Lpls(Lmodel,[],111);
where the second argument is selected by default (’[]’ means ’by default’ in the MEDA Toolbox) and the third argument specifies the
plotting options: reorder variables and plot only the most relevant variables. Please, refer to the command line help for more
Fig. 16 PLS Compressed Score Plot of the first 2 LVs in the KDD data.

Fig. 17 Illustration of the file system for cluster “MEDA52o5940c11.”



Fig. 18 Score plot for cluster “MEDA52o5940c11.”

Fig. 19 MEDA plot of the first 2 LVs in the KDD data.

information. The plot shows only one fourth of the 122 variables, and two groups of variables are highlighted. We can use this
information to derive sparse models.43,44 This grouping is also useful to interpret the loading plot. The output of MEDA provides
the complete MEDA map (with all 122 variables), the variables selected and the new ordering.
The loading plot of the first 2 LVs is shown in Fig. 20, with the groups found in MEDA annotated. Combining this with the score
plot in Fig. 16, we can see that the green group of features marks the difference between the green class and the red class of scores.
Similarly, the blue group of features marks the difference between the blue class of scores and the red class.
Let us show how we can also use the MEDA output to derive group-sparse models. First, we apply the variable reordering and
selection obtained in MEDA to both the map and the Lmodel:
>> map2 = map(ind(ord), ind(ord));
>> Lmodel2 = select_vars(Lmodel, ind(ord));
Then we apply the GIA algorithm43 over the map, which identifies the groups of variables, and then GPCA:
>> [bel,states] = gia(map,0.5);
>> [P,T,bel,E] = Lgpca(Lmodel,states);
Fig. 20 PLS loading plot of the first 2 LVs in the KDD data.

The result is shown in Fig. 21, from which we can conclude the same as before but in a clearer way. The first sparse PC makes a good
separation of the blue class from the rest, and the corresponding loading contains only 6 variables out of the 122 related to that
separation. The second sparse PC does the same for the green class, selecting 10 variables. The benefit of sparse models is that
they are easier to interpret than the combination of loading/score plots. Yet, both methodologies are available for Big Data in
the MEDA Toolbox.

4.18.6 Case Study II: DNA Methylation Data

The second case study is a DNA methylation dataset extracted from the Cancer Genome Atlas for Breast Invasive Carcinoma (BRCA).
Data was originally collected with the Illumina Infinium Human DNA Methylation 450 platform (HumanMethylation450),
including the status of 450K CpG sites.50 The data set contains 897 observations of 485,812 variables, which can be considered
a Big Data set in the variables.
Observations and variables with more than 30% of missing data were discarded (first observations and then variables). The rest
of the missing elements were imputed with unconditional mean replacement, and the data were auto-scaled. Data was split into 49 batches, of
897 × 10,000 per batch, except for the last batch of 897 × 5,812. The residual variance in terms of the number of PCs is shown in
Fig. 22. We can see that the variance is distributed across the PCs, as often found in massive data.
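A minimal MATLAB sketch of this pre-processing is given below; the file name is a placeholder and the code is only an illustration of the steps described above, not the pipeline actually used.

% Sketch of the pre-processing: drop observations and then variables with
% more than 30% missing values, impute the rest with the column
% (unconditional) mean, and auto-scale. 'brca_methylation.csv' is a placeholder.
X = readmatrix('brca_methylation.csv');
X = X(mean(isnan(X), 2) <= 0.30, :);                % discard observations first
X = X(:, mean(isnan(X), 1) <= 0.30);                % then variables
X = fillmissing(X, 'constant', mean(X, 'omitnan')); % unconditional mean replacement
X = (X - mean(X)) ./ std(X);                        % auto-scaling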
The Compressed Loading Plot (CLP) of the first 2 PCs is shown in Fig. 23. The ≈500 K variables are homogeneously distributed
across the subspace with a fish shape. The 2 PCs only account for 19% of the variance. The corresponding score plot is shown in
Fig. 24. We see a group of individuals deviating from the rest towards the left side of the plot. We would like to shed some light on
this deviation. The plot shows the specific disease sub-type of the individuals by using different colors. A clear dominance of one
disease sub-type, 'Ductal and Lobular Neoplasms', is apparent. We can see that the separation between the group of individuals
and the rest cannot be attributed to disease sub-type, since both groups share the same dominant sub-type. A similar conclusion is
reached if we color the individuals by gender or ethnicity (not shown). Thus, the separation cannot be attrib-
uted to them either.
We can use contribution plots to identify the variables (composite elements) related to the deviation between groups of indi-
viduals. For that we use oMEDA.37 It is a bar plot of the variables, built to compare two groups of observations. Each bar represents
the contribution of the variable to the difference between both groups. A positive bar implies that the first group of observations
presents a higher value in the corresponding variable than the second group. A negative bar reflects the opposite. A bar close to zero
for a variable means that both groups of observations have a similar value in that variable. The MEDA Toolbox includes the oMEDA
routine and an extension to Big Data, which is the one we use here.
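To make the reading of such a bar plot concrete, the toy snippet below uses a naive per-variable group-mean difference; this is only a simplified stand-in for illustration and is not the actual oMEDA computation of Ref. 37.

% Toy illustration of reading a group-comparison bar plot (NOT the oMEDA
% algorithm of Ref. 37): positive bars mean group 1 exceeds group 2 on that
% variable, negative bars the opposite, near-zero bars mean similar values.
X = randn(100, 20); X(1:30, 5) = X(1:30, 5) + 3;    % group 1 is high in variable 5
g1 = 1:30; g2 = 31:100;                             % the two groups to compare
d  = mean(X(g1, :), 1) - mean(X(g2, :), 1);         % per-variable discrepancy
bar(d); xlabel('Variable'); ylabel('Group 1 - Group 2');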
Fig. 25 illustrates the oMEDA plot that compares the two groups of individuals in the score plot. Since variables can be clusters of
composites, the multiplicity of the cluster is visualized using the area of circles in the base of the plot, so that the higher the multi-
plicity the larger the area of the circle. Blue color bars correspond to clusters of composites, and red color bars to individual
Fig. 21 GPCA model in the KDD data: (A) loadings of the first sparse PC, (B) loadings of the second sparse PC, (C) scores of the first sparse PC,
and (D) scores of the second sparse PC.

Fig. 22 Residual variance in terms of the number of PCs in PCA in the BRCA data.
Fig. 23 PCA Compressed Loading Plot of the first 2 PCs in the BRCA data.

Fig. 24 PCA score plot of the first 2 PCs in the BRCA data: classes according to disease sub-type.

composites. Let us focus on the four topmost bars, those with an oMEDA score above the dashed line. These are expected to hold the
main discrepancy between the groups of individuals. The top-most cluster in the plot (bar number 8) has a multiplicity of 13,262
composites (i.e., contains this number of original composites). The second one (number 70) has a multiplicity of 106 composites.
The third one (number 94) is an individual composite (cg04396454) and the fourth one (number 93) represents only 5 compos-
ites. If we mark those clusters in the CLP we see that they are actually located in the extreme positions along the horizontal axis
(Fig. 26). Also, Fig. 27 shows the scores of the individuals for the selected clusters of composites, highlighting one group of indi-
viduals with circles. Clearly, the selection by oMEDA is useful to determine a subset of interesting variables. Finally, we retrieved the
data corresponding to the first cluster and made a normal PCA from it. The result is displayed in Fig. 28, where we can see that the
selection of composites in that cluster is related to the deviation of individuals: individuals in the group deviating towards the right
in the new score plot show a higher value for those composites.
Fig. 25 oMEDA plot of the cluster: red bars represent individual variables and blue bars clusters of variables.

Fig. 26 PCA Compressed Loading Plot of the first 2 PCs in the BRCA data.

4.18.7 Challenges and the Future

In this article, we have illustrated how extensions of chemometric tools to Big Data can be useful to derive insights from such data.
However, there are a number of challenges11 that need to be addressed in order to make this approach of practical use. The
following are those that the authors consider most relevant:

• Interactivity with Data. Chemometric tools are especially useful if we can interact with data. For instance, every time we find an
outlier or cluster of anomalous items, it would be interesting to investigate them in depth and then continue the analysis with
the rest of the data. While in the approach described in this article models can be recomputed easily by properly updating the
cross-product matrices, the clustering for the visualization needs to be recomputed from scratch. This is, unfortunately,
a computationally demanding operation. An alternative based on data approximation techniques is explored in Ref. 11, but its use
in practice is complex.
• Big Data in the two modes. The approach described in this article cannot handle Big Data sets in the two modes. However, to the
best of our knowledge, such Big Data sets are very rare.
• Use of High Performance Computing (HPC). We can use full parallelization to speed up computation. The computation of
cross-product matrices can still be exact with this approach, but the effect on the clustering performance for the visualization
needs to be determined.
Fig. 27 Scores of the individuals for the clusters of composites with the highest diagnosis value in Fig. 25: (A) cluster with the highest value,
(B) cluster with the second highest value, (C) single composite with the third highest value, and (D) cluster with the fourth highest value. Individuals
that belong to the group deviating towards the left in Fig. 24 (the score plot) are highlighted with circles.

Fig. 28 PCA model with 2 PCs obtained from the complete set of individuals and only the clusters of composites with the highest diagnosis value
in Fig. 25: (A) score plot and (B) loading plot.

Acknowledgment

This work is partly supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds through project TIN2017-83494-R and the
“Plan Propio de la Universidad de Granada,” grant number PPVS2018-06.

References

1. Internet Live Stat. https://www.internetlivestats.com (Accessed 13 October 2019).
2. DOMO, Data Never Sleeps 7 Infographic. (Accessed 13 October 2019).
3. White, T. Hadoop: The Definitive Guide, 1st ed.; O'Reilly Media, Inc., 2009.
4. Marr, B. How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read. (Accessed 13 October 2019).
5. Marx, V. Biology: The big challenges of big data. Nature 2013, 498, 255–260.
6. Gartner Glossary Definition of Big Data. https://www.gartner.com/it-glossary/big-data/.
7. Schroeck, M.; Shockley, R.; Smart, J.; Romero-Morales, D.; Tufano, P. Analytics: The Real-World Use of Big Data. In IBM Institute for Business ValuedExecutive Report, IBM
Institute for Business Value, 2012.
8. Data & AI Landscape, 2019. http://mattturck.com/wp-content/uploads/2019/07/2019_Matt_Turck_Big_Data_Landscape_Final_Fullsize.png (Accessed 13 October 2019).
9. Goldman, T. 2019 Big Data 2019: Top 5 Predictions on Trends, Technologies and Landscape. (Accessed 13 October 2019).
10. Qin, S. J. Process Data Analytics in the Era of Big Data. AIChE J. 2014, 60 (9), 3092–3100.
11. Camacho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.
12. Camacho, J.; Villegas, A. P.; Rodríguez-Gómez, R. A.; Jiménez-Mañas, E. Multivariate Exploratory Data Analysis (MEDA) Toolbox for Matlab. Chemom. Intell. Lab. Syst. 2015,
143, 49–57.
13. Martens, H. Quantitative Big Data: Where Chemometrics Can Contribute. J. Chemom. 2015, 29 (11), 563–581.
14. Offroy, M.; Duponchel, L. Topological Data Analysis: A Promising Big Data Exploration Tool in Biology, Analytical Chemistry and Physical Chemistry. Anal. Chim. Acta 2016, 910,
1–11.
15. Vitale, R.; Zhyrova, A.; Fortuna, J. F.; de Noord, O. E.; Ferrer, A.; Martens, H. On-the-Fly Processing of Continuous High-Dimensional Data Streams. Chemom. Intell. Lab. Syst.
2017, 161 (C), 118–129.
16. Rabani, E.; Toledo, S. Out-of-Core SVD and QR Decompositions. In Proceedings of the 10th SIAM Conference on Parallel Processing for Scientific Computing.
17. Lindgren, F.; Geladi, P.; Wold, S. The Kernel Algorithm for PLS. J. Chemom. 1993, 7, 45–59.
18. de Jong, S.; ter Braak, C. Comments on the PLS Kernel Algorithm. J. Chemom. 1994, 8, 169–174.
19. Dayal, B.; MacGregor, J. Improved PLS Algorithms. J. Chemom. 1997, 11, 73–85.
20. Halko, N.; Martinsson, P.-G.; Shkolnisky, Y.; Tygert, M. An Algorithm for the Principal Component Analysis of Large Data Sets, 2010. arXiv:1007.5510.
21. Ordonez, C.; Mohanam, N.; Garcia-Alvarado, C. PCA for Large Data Sets With Parallel Data Summarization. Distrib. Parallel Databases 2013, 32, 1–27.
22. Balsubramani, A.; Dasgupta, S.; Freund, Y. The Fast Convergence of Incremental PCA. In Advances in Neural Information Processing Systems; Burges, C., Bottou, L.,
Welling, M., Ghahramani, Z., Weinberger, K., Eds.; 2013; pp 3174–3182, 26.
23. Dayal, B. S.; Macgregor, J. F. Recursive Exponentially Weighted PLS and Its Applications to Adaptive Control and Prediction. J. Process Control 1997, 7 (3), 169–179.
24. Qin, S. Recursive PLS Algorithms for Adaptive Data Modeling. Comput. Chem. Eng. 1998, 22, 503–514.
25. Camacho, J.; Padilla, P.; Díaz-Verdejo, J. Least-Squares Approximation of a Space Distribution for a Given Covariance and Latent Sub-Space. Chemom. Intell. Lab. Syst. 2011,
105 (2), 171–180.
26. Jain, A. K.; Murty, M. N.; Flynn, P. J. Data Clustering: A Review. ACM Comput. Surv. 1999, 31 (3), 264–323.
27. Grabmeier, J.; Rudolph, A. Techniques of Cluster Algorithms in Data Mining. Data Min. Knowl. Disc. 2002, 6, 303–360.
28. Eilers, P. H. C.; Goeman, J. J. Enhancing Scatterplots With Smoothed Densities. Bioinformatics 2004, 20 (5), 623–628.
29. Hao, M. C.; Dayal, U.; Sharma, R. K.; Keim, D. A.; Janetzko, H. Visual Analytics of Large Multidimensional Data Using Variable Binned Scatter Plots. In VDA; Park, J.,
Hao, M. C., Wong, P. C., Chen, C., Eds.; vol. 7530; Proceedings of the SPIE, 2010.
30. Lewin-Koh, N. Hexagon Binning: An Overview; Technical Report, 2011.
31. Ellis, G.; Dix, A. A Taxonomy of Clutter Reduction for Information Visualization. IEEE Trans. Vis. Comput. Graph. 2007, 13, 1216–1223.
32. Yau, N. Data, Points: Visualization That Means Something, 1st ed.; Wiley, 2013.
33. Martinez, W. L. Exploratory Data Analysis With MATLAB (Computer Science and Data Analysis), Chapman & Hall/CRC, 2004.
34. Jain, A. K. Data Clustering: 50 Years Beyond k-Means. Pattern Recogn. Lett. 2010, 31 (8), 651–666. Award winning papers from the 19th International Conference on Pattern
Recognition (ICPR) 19th International Conference in Pattern Recognition (ICPR).
35. Bradley, P. S.; Fayyad, U. M.; Reina, C. Scaling Clustering Algorithms to Large Databases. Knowledge Discovery and Data Mining; 1998; pp 9–15.
36. Camacho, J. Missing-Data Theory in the Context of Exploratory Data Analysis. Chemom. Intell. Lab. Syst. 2010, 103, 8–18.
37. Camacho, J. Observation-Based Missing Data Methods for Exploratory Data Analysis to Unveil the Connection Between Observations and Variables in Latent Subspace Models.
J. Chemom. 2011, 25 (11), 592–600.
38. Camacho, J.; Picó, J.; Ferrer, A. Data Understanding With PCA: Structural and Variance Information Plots. Chemom. Intell. Lab. Syst. 2010, 100 (1), 48–56.
39. Witten, D.; Tibshirani, R.; Hastie, T. A Penalized Matrix Decomposition, With Applications to Sparse Principal Components and Canonical Correlation Analysis. Biostatistics
2009, 10, 515–534.
40. Cao, K.-A. L.; Rossouw, D.; Robert-Granié, C.; Besse, P. A Sparse PLS for Variable Selection When Integrating Omics Data. Stat. Appl. Genet. Mol. Biol. 2008, 7 (1), 35.
41. Smilde, A. K.; Jansen, J. J.; Hoefsloot, H. C.; Lamers, R.-J. A.; Van Der Greef, J.; Timmerman, M. E. ANOVA-Simultaneous Component Analysis (ASCA): A New Tool for
Analyzing Designed Metabolomics Data. Bioinformatics 2005, 21 (13), 3043–3048.
42. Jansen, J. J.; Hoefsloot, H. C.; van der Greef, J.; Timmerman, M. E.; Westerhuis, J. A.; Smilde, A. K. ASCA: Analysis of Multivariate Data Obtained From an Experimental
Design. J. Chemom. 2005, 19 (9), 469–481.
43. Camacho, J.; Rodríguez-Gómez, R. A.; Saccenti, E. Group-Wise Principal Component Analysis for Exploratory Data Analysis. J. Comput. Graph. Stat. 2017, 26 (3), 501–512.
44. Camacho, J.; Saccenti, E. Group-Wise Partial Least Square Regression. J. Chemom. 2018, 32 (3), e2964. e2964 cem.2964.
45. Saccenti, E.; Smilde, A.; Camacho, J. Group-Wise ANOVA Simultaneous Component Analysis for Designed Omics Experiments. Metabolomics 2018, 14 (6), 73.
46. KDD Cup 1999 Data Set, The UCI KDD Archive, Information and Computer Science, University of California, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999.
47. Kayacik, H.; Zincir-Heywood, A. N.; Heywood, M. I. Selecting Features for Intrusion Detection: A Feature Relevance Analysis on KDD 99 Intrusion Detection Datasets.
Proceedings of the Third Annual Conference on Privacy, Security and Trust. Citeseer; 2005; pp 3–8.
48. Camacho, J.; Maciá-Fernández, G.; Díaz-Verdejo, J.; García-Teodoro, P. Tackling the Big Data 4 vs for Anomaly Detection. ProceedingsdIEEE INFOCOM; 2014; pp 500–505.
no. 1.
49. Camacho, J.; Pérez-Villegas, A.; García-Teodoro, P.; Maciá-Fernández, G. PCA-Based Multivariate Statistical Network Monitoring for Anomaly Detection. Comput. Secur. June
2016, 59, 118–137.
50. Celli, F.; Cumbo, F.; Weitschek, E. Classification of Large DNA Methylation Datasets for Identifying Cancer Drivers. Big Data Res. 2018, 13, 21–28.
