4.18.1 Introduction
The technological advances of the last decades, especially the development of (i) new data transport technologies at the core of the Internet, (ii) new Internet access technologies (Wi-Fi, 4G/5G) and devices (smartphones, wearables, sensors) and (iii) new digital services (e.g., eHealth, multimedia), have led to the so-called Big Data era. Companies are finding new ways to optimize their operations by making the most of data: we can measure (almost) anything (almost) everywhere, take the measurements to a data center and analyze the data in order to improve our understanding of the underlying process, regardless of whether it is an industrial line or a sociological or biological phenomenon. In this regard, the development of Big Data is closely related to the advent of the Internet of Things (IoT): the connection of an increasing diversity of sensing devices to the Internet. The IoT creates an avatar of the physical world in the digital world, making everything reachable (and thus analyzable) from everywhere.
As of the third quarter of 2019, estimates of Internet traffic1,2 approach 100 terabytes (TB, 10^12 bytes) of data per second and 7 exabytes (EB, 10^18 bytes) per day. This explosion of data includes data generated by humans and machines3: from Instagram photos, YouTube videos and Facebook entries to Google searches and automated economic transactions. The growth of data production is exponential: 90% of the total volume of data was generated in the last 2 years.4
The availability of new sources of data has led to the emergence of a large number of Big Data applications.5 Big Data has been defined in different ways. The renowned consulting firm Gartner6 proposes the following definition: "Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation". An aspect common to most Big Data definitions, also present in this one, is the need for new processing tools to handle the so-called V's7:
• Variety: Big Data is diverse in nature. Different sources, including unstructured and structured information, need to be properly combined in order to make the most of the analysis. Structured information is composed of records with a fixed structure (e.g., an Excel spreadsheet), while unstructured information is the opposite (e.g., email contents).
• Volume: Exabytes, zettabytes, and even higher amounts of data are described in Big Data applications. This amount of data
needs to be handled simultaneously and requires parallel processing means.
• Velocity: In Big Data problems, a high rate of sampling is common. This further complicates the analysis and makes parallel
processing even more necessary.
Initiatives for Big Data analysis like the open-source Apache projects Hadoop (http://hadoop.apache.org/), Mahout (http://mahout.apache.org/) or Spark (http://spark.apache.org/), among others, and the companies supporting them, have fostered the emergence of a wide ecosystem of Big Data solutions8 ranging from data storage to processing and analysis. Within this landscape, machine learning, visual analytics and data analysis tools are principal resources. The promising results of early Big Data applications have led to a golden age for data-driven methods. However, after the typical period of inflated expectations, the integration of machine learning in successful mainstream use cases has become a major concern.9
How has this trend impacted chemometrics? For some reason, chemometric applications have not benefited from the Big Data approach, where Volume and Velocity are typically defined in terms of massive numbers of observations. In chemometrics, data can be massive, but typically in terms of variables, except for industrial process applications and the like. Still, approaches to Big Data like distributed (parallel) processing can be of interest for chemometrics, as we try to illustrate with the examples in this article.
There has been a reduced number of contributions related to Big Data within the chemometrics literature. In 2014, Qin10 discussed what the Big Data model can contribute to industrial process applications. The same year, Camacho11 introduced Compressed Score Plots (CSPs) to visualize score plots with an unlimited number of observations. With CSPs, we can extend the exploratory data analysis approach of chemometrics to Big Data problems. Some months later, in 2015, Camacho and coworkers presented the Multivariate Exploratory Data Analysis (MEDA) Toolbox for Matlab,12 with a Big Data processing module using Principal Component Analysis (PCA) and Partial Least Squares (PLS) models. The same year, Martens13 discussed the contribution of the chemometrics approach to Big Data. In 2016, Offroy and Duponchel14 discussed the application of Topological Data Analysis to Big Data in chemistry applications. In 2017, Vitale et al.15 presented an approach to compute multivariate models on-the-fly for exploratory analysis, very similar to the approach in the MEDA Toolbox.
In this article, we describe a methodological extension of both PCA and PLS modeling and the associated visualizations to the Big Data scenario, based on the original approaches of Refs. 11,12. The rest of the article is organized as follows. Sections "Modeling Massive Volumes of Multivariate Data With PCA and PLS" and "Visualizing the Big Mode" present the modeling and visualization approaches followed, respectively. In the "Software: The Multivariate Exploratory Data Analysis Toolbox" section we introduce the MEDA Toolbox, a free package of Matlab routines that can be used to analyze and visualize Big Data. Sections "Case Study I: Cybersecurity Data" and "Case Study II: DNA Methylation Data" illustrate these approaches on two case studies. The first case study relates to the application of chemometric tools to computer network security data: while this is not strictly chemometrics, it is a nice example of what chemometrics can contribute to the Big Data arena; this case study can be fully reproduced by the reader using the MEDA Toolbox. The second case study pertains to the analysis of a large data set containing DNA methylation measurements. Section "Challenges and the Future" offers perspectives and directions for future work and extensions of the approaches presented.
4.18.2 Modeling Massive Volumes of Multivariate Data With PCA and PLS
Most algorithms for PCA (or PLS) model fitting take the N × M data matrix X (and the N × O response matrix Y for PLS) as input. Due to limited computer resources, in particular limited memory, this approach is infeasible when N or M grow beyond a certain size, as in the case of Big Data sets. For instance, a 10^9 × 10^3 matrix, containing 10^12 data points, would require 8 × 10^12 bytes (8 TB) of RAM when working at double precision. That requirement cannot be met with common hardware. The challenge then is to compute the model out-of-core,16 that is, without loading and retaining the whole data set in the computer memory.
A viable solution to tackle this problem is to use the cross-product matrices. For instance, in the previous example, for X of dimensions 10^9 × 10^3, only 8 MB are needed to store X′X, which has dimensions 1000 × 1000, and this matrix can be computed iteratively, so that the complete X does not need to be stored in memory. Substituting X by its cross-product removes one of the dimensions, and this can be conveniently used to hide one single Big mode (observations or variables), making it possible to deploy chemometric tools, like PCA and PLS, on Big Data applications without the need for high-performance hardware. Thus, the loading vectors of PCA can be identified using the eigendecomposition (ED) of the cross-product matrix X′X for any size of N. Similarly, the loadings and weights in PLS regression can be identified from the matrices X′X and X′Y using the kernel algorithm.17–19 Conversely, if M is huge, we can compute the PCA scores from the ED of the cross-product matrix XX′, and PLS scores can be computed from XX′ and YY′.
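As a concrete illustration of this out-of-core strategy, the sketch below accumulates X′X batch by batch and obtains the PCA loadings from its eigendecomposition. The MEDA Toolbox itself is written in Matlab; this is an illustrative NumPy translation with made-up sizes, not the toolbox code.

```python
import numpy as np

# Hypothetical sizes for illustration (the article's example is 1e9 x 1e3).
N, M, B, A = 1000, 8, 100, 2        # observations, variables, batch size, PCs

rng = np.random.default_rng(0)
X = rng.normal(size=(N, M))         # stands in for a data set too big for memory

# Accumulate the M x M cross-product batch by batch, never holding all of X.
XtX = np.zeros((M, M))
for t in range(0, N, B):
    Xt = X[t:t + B]                 # in practice, batch t would be read from disk
    XtX += Xt.T @ Xt

# PCA loadings from the eigendecomposition of X'X (eigh returns ascending order).
evals, evecs = np.linalg.eigh(XtX)
P = evecs[:, ::-1][:, :A]           # M x A loading matrix

# Sanity check against an in-memory SVD: the loadings agree up to column signs.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P_full = Vt[:A].T
assert np.allclose(np.abs(P.T @ P_full), np.eye(A), atol=1e-8)
```

The kernel PLS algorithm17–19 proceeds analogously from the accumulated X′X and X′Y.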
The computation of cross-product matrices has been a common choice for the iterative computation of PCA20 and it is suitable for parallelization,21 a must in Big Data analysis. It is also a suitable approach for continuous model updating. An obvious alternative is data sampling, which results in an approximate model rather than an exact one.
In the following, we discuss the algorithmic details to fit a PCA/PLS model when the Big Data mode is the observations or the variables. Without loss of generality, we will assume the data is pre-processed by auto-scaling (i.e., mean-centered and scaled to unit variance), but other pre-processing methods may be derived following a similar procedure.
where $x_i^t$ represents the $i$-th observation in $\mathbf{X}_t$ and $\mathbf{M}_x^0$ is a vector of 0's. After all the $T$ batches are considered, the total mean, of dimension $1 \times M$, is computed as

$$\mathbf{m}_x = \frac{1}{N}\,\mathbf{M}_x^T \qquad (2)$$
Chemometrics Analysis of Big Data 439
Fig. 1 Illustration of the iterative approach for Big Data sets in the observations.
with:

$$N = \sum_{t=1}^{T} B_t \qquad (3)$$
If auto-scaling is desired, the standard deviation of the blocks needs to be computed. First, a $1 \times M$ estimate of the accumulated variance is obtained

$$\left(\mathbf{s}_x^t\right)^2 = \left(\mathbf{s}_x^{t-1}\right)^2 + \sum_{i=1}^{B_t}\left(x_i^t - \mathbf{m}_x\right)^2 \qquad (4)$$

for $(\mathbf{s}_x^0)^2$ a vector of 0's. From this estimate computed from the $T$ batches, the standard deviation, of dimension $1 \times M$, can be derived as

$$\mathbf{s}_x = \sqrt{\frac{1}{N-1}\left(\mathbf{s}_x^T\right)^2} \qquad (5)$$
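The recursions above amount to two passes over the batches: one accumulating the sum for the mean, and one accumulating squared deviations for the standard deviation. The following NumPy sketch (illustrative, with simulated batches in place of files on disk) mirrors that scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
batches = [rng.normal(size=(50, 4)) for _ in range(6)]  # T batches, each B_t x M

# Pass 1: accumulate the sum of observations and derive the total mean.
Mx = np.zeros(4)
N = 0
for Xt in batches:
    Mx += Xt.sum(axis=0)
    N += Xt.shape[0]
mx = Mx / N                                   # Eqs. (2)-(3)

# Pass 2: accumulate squared deviations from the mean, then the std.
sx2 = np.zeros(4)
for Xt in batches:
    sx2 += ((Xt - mx) ** 2).sum(axis=0)       # Eq. (4)
sx = np.sqrt(sx2 / (N - 1))                   # Eq. (5)

# Agrees with the statistics of the full data held in memory.
X = np.vstack(batches)
assert np.allclose(mx, X.mean(axis=0))
assert np.allclose(sx, X.std(axis=0, ddof=1))
```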
for $(\mathbf{X}'\mathbf{Y})_0$ equal to the suitable matrix of 0's. Finally, an exact PCA model of the data can be fitted with the ED of $(\mathbf{X}'\mathbf{X})_T$, and an exact PLS model with the kernel algorithm from $(\mathbf{X}'\mathbf{X})_T$ and $(\mathbf{X}'\mathbf{Y})_T$.
$$\mathbf{m}_x^t = \frac{1}{N}\sum_{i=1}^{N} x_i^t \qquad (9)$$

where $x_i^t$ is the $i$-th row in batch $\mathbf{X}_t$. The corresponding $1 \times B_t$ vector of standard deviations $\mathbf{s}_x^t$ is given by
Fig. 2 Illustration of the iterative approach for Big Data sets in the variables.
$$\mathbf{s}_x^t = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i^t - \mathbf{m}_x^t\right)^2} \qquad (10)$$

and the auto-scaled version $\tilde{x}_i^t$ of $x_i^t$ is given by

$$\tilde{x}_i^t = \left(x_i^t - \mathbf{m}_x^t\right) \oslash \mathbf{s}_x^t \qquad (11)$$

where $\oslash$ denotes the element-wise division.
$$\left(\mathbf{Y}\mathbf{Y}'\right)_t = \left(\mathbf{Y}\mathbf{Y}'\right)_{t-1} + \tilde{\mathbf{Y}}_t \cdot \tilde{\mathbf{Y}}_t^T \qquad (13)$$
for $\mathbf{M}_x^0$ a vector of 0's. Then, the actual mean is computed as:

$$\mathbf{m}_x^t = \frac{1}{N_t}\,\mathbf{M}_x^t \qquad (15)$$
with $N_t$ also computed using an EWMA, starting from $N_0 = 0$:
$$N_t = \lambda \cdot N_{t-1} + B_t \qquad (16)$$

Following Eq. (16), if data batches are of the same size (i.e., $B_t = B$), $N_t$ converges to $B/(1-\lambda)$.
If data are to be auto-scaled, the standard deviation of the blocks needs to be computed. First, an EWMA estimate of the $1 \times M$ vector of accumulated variance is calculated as:

$$\left(\mathbf{s}_x^t\right)^2 = \lambda \cdot \left(\mathbf{s}_x^{t-1}\right)^2 + \sum_{i=1}^{B_t}\left(x_i^t - \mathbf{m}_x^t\right)^2 \qquad (17)$$

with $(\mathbf{s}_x^0)^2$ a vector of 0's. From this estimate, the standard deviation is calculated:

$$\mathbf{s}_x^t = \sqrt{\frac{1}{N_t - 1}\left(\mathbf{s}_x^t\right)^2} \qquad (18)$$
The auto-scaled version $\tilde{x}_i^t$ of the $i$-th observation of $\mathbf{X}_t$ is given by

$$\tilde{x}_i^t = \left(x_i^t - \mathbf{m}_x^t\right) \oslash \mathbf{s}_x^t \qquad (19)$$

The $\tilde{x}_i^t$ are arranged in a $B_t \times M$ matrix $\tilde{\mathbf{X}}_t$, and the cross-product matrix is computed after preprocessing from:

$$\left(\mathbf{X}'\mathbf{X}\right)_t = \lambda \cdot \left(\mathbf{X}'\mathbf{X}\right)_{t-1} + \tilde{\mathbf{X}}_t^T \cdot \tilde{\mathbf{X}}_t \qquad (20)$$
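A minimal sketch of the EWMA updates of Eqs. (16) and (20) for a stream of equally sized batches (NumPy for illustration; the pre-processing of each batch with the EWMA mean and standard deviation is omitted here):

```python
import numpy as np

lam, M, B = 0.9, 4, 50           # forgetting factor lambda, variables, batch size
rng = np.random.default_rng(2)

Nt = 0.0
XtX = np.zeros((M, M))
for t in range(200):             # a stream of batches
    Xt = rng.normal(size=(B, M))
    Nt = lam * Nt + B            # Eq. (16): effective number of observations
    XtX = lam * XtX + Xt.T @ Xt  # Eq. (20): EWMA update of the cross-product

# With constant batch size, Nt converges to B / (1 - lambda).
assert abs(Nt - B / (1 - lam)) < 1e-6
```

The EWMA cross-product weights recent batches more heavily, so the model fitted from it tracks changes in the process.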
4.18.3 Visualizing the Big Mode
In the previous section we discussed how to compute PCA/PLS models from Big Data sets using cross-product matrices. The trick is to hide the Big Data mode (columns or rows) within the cross-product. A limitation is that we cannot obtain the corresponding factors in the Big Data mode from the cross-product: either the scores for Big Data in the observations, or the loadings for Big Data in the variables. One solution is to perform another iteration through the T batches of data to compute these factors. However, with this solution the number of factors (scores or loadings) per component remains Big, and in practice they cannot be visualized.
We illustrate this limitation using a data set from an industrial continuous process collected by Perceptive Engineering LTD (http://www.perceptiveapc.com/).25 The data set was collected during a period of more than 4 days of continuous operation of a fluidized bed reactor fed with four reactants. The sampling rate is one observation every 20 s. The data consists of 18,887 observations on 36 process variables, including feed flows, temperatures, pressures, vent flow and steam flow; the observations cover 22 different operational points of the process.
Fig. 4 shows the score plot of the first two principal components. The plot contains 18,887 points, which, strictly speaking, cannot be considered Big Data. However, this number is already too large for proper visualization. There are clouds of dots for each operational point, some of them overlapping and hiding part of the others: this makes the plot poorly interpretable. In a true Big Data scenario, with millions or billions of dots, the computer cannot even render the plot. A solution to this visualization problem is based on the use of Compressed Score Plots,11 which will be illustrated in the following section.
Fig. 4 Score plot for the first two PCs of the PCA model of the Perceptive data set.
Fig. 5 Compressed score plot of the Perceptive data set and PCA. Multiplicity is shown in the size of the markers. Camacho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.
Fig. 6 Compressed score plot of the Perceptive data set and PCA. Multiplicity is shown in a third dimension. Camacho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.
Fig. 7 Recovering the 52 original observations corresponding to a cluster in the compressed score plot of the Perceptive data set and PCA. Camacho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.
For comparison purposes, Fig. 8 shows two views of a bivariate histogram based on a regular grid. A histogram is much faster to compute than a clustered plot, but its fidelity is also reduced in comparison. Furthermore, a principal limitation is how to represent class information, in particular when there is a high number of classes. Neither the marker form (Fig. 8A) nor a third dimension (Fig. 8B) is an adequate choice for that purpose. The most popular forms of bivariate plots are density plots and hexagonal binning plots. Examples of the corresponding plots for the PCA subspace of the Perceptive data set, without class information, are shown in Fig. 9. Again, the main shortcomings are the limited fidelity and the difficulty to include the class information.
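For reference, a regular-grid bivariate histogram of a score plot can be computed in one line; the NumPy sketch below (with simulated scores) illustrates the approach, and also why class information is awkward: each class would need its own histogram.

```python
import numpy as np

# Scores of the first two PCs for many observations (simulated here).
rng = np.random.default_rng(3)
t1 = rng.normal(0, 10, 100_000)
t2 = rng.normal(0, 3, 100_000)

# A regular-grid bivariate histogram: cheap to compute, but fidelity is limited
# by the fixed grid, and per-class counts would need one histogram per class.
H, xedges, yedges = np.histogram2d(t1, t2, bins=30)

assert H.shape == (30, 30)
assert H.sum() == 100_000     # every observation falls in exactly one bin
```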
Fig. 8 Bivariate histogram of the Perceptive data set and PCA. Multiplicity is shown in the size of the markers. Classes are shown in the form of the markers and in a third dimension: (A) X-axis vs. Y-axis and (B) rotated view to inspect the classes. Camacho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.
Fig. 9 Classless density plot (top) and hexagonal binning plot computed with the EDA toolbox (bottom) of the Perceptive data set and PCA. Camacho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.
Clustering approaches for Big Data can be grouped into three main categories:
• Distributed computing.
• Incremental clustering.
• Sampling-based methods.
The clustering algorithm defined in Ref. 11, which we use here, is a variant of the algorithm of Bradley et al.35 It belongs to the incremental clustering category and only requires one scan of the data set. Moreover, it can be easily extended to distributed computing (parallelization), or combined with sampling-based or summarization methods.
To define the grouping criterion, a measure of the similarity between observations/variables is employed, which is usually distance-based. Common distance metrics are the Euclidean and the Mahalanobis distances, and the metric should be consistent with the subspace of the plot, in order to minimize the distortion due to the clustering. In general, the quadratic distance can be expressed as:

$$d_K\left(x_i, x_j\right) = \sqrt{\left(x_i - x_j\right)\,\mathbf{R}\,\mathbf{K}^{-1}\,\mathbf{R}'\,\left(x_i - x_j\right)'} \qquad (21)$$
where $x_i$ and $x_j$ can either refer to observations or variables, depending on the data mode that is Big. $\mathbf{R}$ contains the parameters of the model fitted following the previous section; these can be either loadings or scores, again depending on the data mode that is Big: observations or variables, respectively. The matrix $\mathbf{K}$ is an $A \times A$ identity matrix if the Euclidean distance is considered, or the suitable covariance matrix in the case of the Mahalanobis distance. The choice between the Euclidean and Mahalanobis distances depends on the procedure used to depict the score/loading plots. When both dimensions in the plot are displayed with similar size regardless of the magnitude in the margins (i.e., of the variance of the components), the Mahalanobis distance is a more suitable choice. In Fig. 5, where we illustrated the clustering approach with the Perceptive data set, the clustering was performed with the Mahalanobis distance in the PCA subspace of the first 2 PCs. We also account for the display aspect ratio.
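Eq. (21) reduces to a distance between the projections of the two elements onto the model subspace. The NumPy sketch below (with simulated data, not the toolbox code) implements it for both choices of K:

```python
import numpy as np

def dK(xi, xj, R, K):
    """Quadratic distance of Eq. (21) in the A-dimensional model subspace:
    K = identity yields the Euclidean distance on the projections,
    K = covariance of the scores yields the Mahalanobis distance."""
    d = (xi - xj) @ R                 # project the difference onto the model
    return float(np.sqrt(d @ np.linalg.inv(K) @ d))

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 6))
_, _, Vt = np.linalg.svd(X, full_matrices=False)
R = Vt[:2].T                          # loadings of a 2-component PCA model
T = X @ R                             # scores

d_euc = dK(X[0], X[1], R, np.eye(2))
d_mah = dK(X[0], X[1], R, np.cov(T, rowvar=False))

# The Euclidean variant equals the distance between the two score vectors.
assert np.isclose(d_euc, np.linalg.norm(T[0] - T[1]))
assert d_mah > 0
```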
The features of the clustering are summarized as follows:
• Incremental clustering of the data set, previously arranged into batches of data and pre-processed.
• Mahalanobis or Euclidean distance in the model subspace, following Eq. (21).
• When the data set contains different classes (in the Big Data mode), the clustering is applied to each one separately.
The clustering algorithm is presented in Algorithm 1. In coherence with the model-building approach in the previous section, the input data set X is divided into T batches of Bt observations or variables, the data is pre-processed and the model is fitted. Then, the distance is selected. From the batches of data, a set of centroids is iteratively computed using the merge() routine. The computation of centroids is optimized for a given definition of K. The number of original elements (observations or variables in the Big Data mode) represented by each centroid, i.e., its multiplicity, is stored in m. The multiplicity of individual elements is 1. Therefore, each time a new batch of data Xt is joined to the previously computed clusters, a vector of Bt ones, 1Bt, is included in the vector of multiplicities m. An illustration of the merge() procedure is shown in Fig. 10.
The merge() routine in the clustering algorithm is presented in Algorithm 2. In the routine, the pair of elements/centroids with minimum distance in C is iteratively replaced by its centroid, and the multiplicities are conveniently recomputed. For elements/centroids in different classes, the distance is set to infinity. This replacement operation is repeated until only Lend elements remain in C.
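A simplified, single-class sketch of the merge() step follows (NumPy, Euclidean distance, brute-force pair search; the toolbox implementation is optimized and additionally handles classes and the model-subspace distance of Eq. 21):

```python
import numpy as np

def merge(C, m, L_end):
    """Repeatedly replace the closest pair of elements/centroids by their
    multiplicity-weighted centroid until only L_end centroids remain."""
    C, m = [np.asarray(c) for c in C], list(m)
    while len(C) > L_end:
        # find the closest pair (brute force, for illustration only)
        best, pair = np.inf, None
        for i in range(len(C)):
            for j in range(i + 1, len(C)):
                d = np.linalg.norm(C[i] - C[j])
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        new = (m[i] * C[i] + m[j] * C[j]) / (m[i] + m[j])
        new_m = m[i] + m[j]
        del C[j], m[j]                # delete the larger index first (j > i)
        del C[i], m[i]
        C.append(new)
        m.append(new_m)
    return C, m

rng = np.random.default_rng(5)
C, m = merge(list(rng.normal(size=(40, 2))), [1] * 40, L_end=10)
assert len(C) == 10 and sum(m) == 40  # multiplicity is conserved
```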
When all the batches of data have been processed by the clustering algorithm, a reduced data set of centroids is provided, along with the associated multiplicities. The number of remaining centroids, Lend, is user-defined and should be chosen so that the visualization of the plot is adequate. Typically, a total of 100–300 points can be adequately visualized in a compressed plot.
When the input is a Big Data stream, an EWMA approach for the compressed plots can also be used. The clustering algorithm is straightforwardly extended by applying the EWMA law in the updating of the multiplicities: for each batch of data, merge() is launched after an EWMA update of the multiplicities.
Fig. 10 Illustration of the merge procedure: the two closest elements are combined in a cluster of multiplicity 2. Camacho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.
4.18.4 Software: The Multivariate Exploratory Data Analysis Toolbox
The Multivariate Exploratory Data Analysis (MEDA) Toolbox12 is a set of multivariate analysis tools for data exploration written in
Matlab. In the MEDA Toolbox, traditional exploratory plots based on PCA or PLS(-DA), such as score, loading and residual plots,
are combined with methods like MEDA,36 oMEDA,37 SVI plots,38 sparse PCA39 and sparse PLS,40 ASCA41,42 or the Group-wise
models,43–45 a recent class of methods to calibrate sparse component models. Moreover, other useful tools such as cross-validation and double cross-validation algorithms, Multivariate Statistical Process Control (MSPC) charts and data simulation/
approximation algorithms (ADICOV, SimuleMV) are included in the toolbox. Finally, several of the aforementioned exploratory
tools are extended for use with Big Data with an unlimited number of observations. In this article, for the first time, we also extend some of the functionality to an unlimited number of variables, which will be incorporated in version 1.3 of the toolbox. A
view of the MEDA Toolbox is shown in Fig. 13.
There are two ways to work with the MEDA Toolbox: using the GUI (for novice users) and using the commands (for expert users). The GUI is self-explanatory. The commands provide more functionality. Each command includes help information, displayed by typing "help <command>" in the Matlab command line. This help information includes examples of use. Also, several real data examples are included in the "Examples" directory within the toolbox.
The toolbox is freely available on GitHub at https://github.com/josecamachop/MEDA-Toolbox. You may download the latest version or track code changes using the 'git' command, a version-control system for tracking changes in source code whose use is straightforward with the GitHub desktop application. Tracking changes is recommended since the MEDA Toolbox is in
continuous evolution. Installation instructions and tutorial information are provided with the Toolbox. The MEDA Toolbox is
free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version 3. Contributions
to the Toolbox are always welcome.
Fig. 11 Illustration of three consecutive CSPs from PCA using the EWMA law. Camacho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Chemom. Intell. Lab. Syst. 2014, 135, 110–125.
Fig. 12 Exponentially weighted moving average compressed score plot of the Perceptive data set and PCA. Multiplicity is shown in the size of the markers and λ = 0.9.
Fig. 13 Illustration of the MEDA Toolbox for Matlab: the main GUIs at the left and some examples of visualization at the right: a score plot and
a diagnosis plot.
4.18.5 Case Study I: Cybersecurity Data
The data set considered in this first case study was generated by the 1998 DARPA Intrusion Detection Evaluation Program, prepared
and managed by MIT Lincoln Labs.46,47 The objective of this program was to survey and evaluate research in networking intrusion
detection to improve the security of communication networks (e.g., the Internet). For that purpose, a large data set with network traffic simulated in a military network environment, including a wide variety of intrusions, was provided. While this data set is not related to chemometrics, the data is highly multivariate and Big in the observations mode, providing a good illustration of our approach. Besides, there is recent interest in the application of chemometric tools in the area of cybersecurity.48,49
The data set includes 4,844,253 observations. The observations belong to 22 different classes, one class for normal traffic and the remaining for different types of network attacks. For illustrative purposes, the analysis is restricted to two types of attacks, smurf and neptune, plus normal traffic. These three classes represent 99.3% of the total traffic in the data set. For each connection, 42 variables are computed, including numerical and categorical variables. To handle categorical variables, one dummy variable per category is included in the data set. The resulting data set contains 4,844,253 observations of 122 variables. This was split in 489 batches of data of 10,000 × 122 each, except for the last batch of 4,253 × 122.
The data is included as an example in the MEDA Toolbox, so reproducing this case study is straightforward. A glimpse of the code needed to run the example is shown in Fig. 14. This code can be found under the example folder of the MEDA Toolbox, in Networkmetrics/KDD/run.m. The first part of the code sets the main choices of the analysis, namely: type of model (PCA/PLS),
Fig. 14 Glimpse of the code for the section “Case Study I: Cybersecurity Data”. The code can be found under the example folder of the MEDA
Toolbox, in Networkmetrics/KDD/run.m. Camacho, J. Visualizing Big Data With Compressed Score Plots: Approach and Research Challenges. Che-
mom. Intell. Lab. Syst. 2014, 135, 110–125.
type of data (iterative for Big Data in the observations, EWMA for Big Data streams), number of latent variables, preprocessing
method and number of clusters in the compressed plots. After this, the main code is the part for Model Building. We can see
that only a few lines of code are needed. The routine 'update_iterative' is used for Big Data in the observations, and 'update_ewma' for Big Data streams. The rest of the code is used to visualize the models and data.
Let us go through the code in detail, so that the interested reader can reproduce the example. We will start with ’update_iterative’,
for which 'Lmodel.update' (line 4 in the code in Fig. 14) should be set to 2. For command line help information type:

>> help update_iterative
As for the other parameters in the Lmodel, we leave them as in Fig. 14. The code will fit a PLS model of the data. We also add a line in the first part of the script with:

>> Lmodel.path = 'out/';

in order to specify the output directory. (The directory needs to exist, so we need to create it before the analysis is started.)
The first argument of 'update_iterative' is 'short_list'. This includes a subset of the data, so that the computational time of the example is reduced. We will change it to 'list', so that the complete data is considered. The second argument is the path for the input files, which we leave as ' '. The third argument is the Large Model ('Lmodel') we have initialized in the first part of the code. The fourth argument is the updating step in each file, and we set it to 1% (0.01). We will modify the fifth argument to 1 in order to create a file system with the cluster information (note this will require around 7 GB of free storage). This is necessary to inspect data in detail, e.g., to investigate outliers. Finally, the last argument controls the amount of debugging info. We leave it at 1.
Therefore, the call should look as follows:
>> Lmodel = update_iterative(list,' ',Lmodel,step,1,1);
This computation takes 90 min on a regular computer (Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz, 16 GB of RAM and Windows 10) when the fifth argument is set to 0, and more than 5 h if the new file system is created. The most computationally intensive parts are the file-system creation and the clustering. Future improvements of this approach include parallelization.48 The resulting Lmodel structure is depicted in Fig. 15. The parameters are:
the bigger the circle, the more observations in the cluster, and by labels starting with 'MEDA' followed by '<#batch>o<#observation>c<#class>', where <#batch> is the index of the batch of the first observation of the cluster, <#observation> is the index of that observation and <#class> is the class. For example, the largest green circle has the label 'MEDA52o5940c11', meaning that it was originally started by observation 5940 in batch 52, and that it belongs to class 11. We can also identify individual observations, like '1756' or '7776'.
To get more detail on cluster 'MEDA52o5940c11', we can go to the corresponding files in the file system, as illustrated in Fig. 17. The first file, in the upper left corner, contains pointers to a set of secondary files, which store the original observations that make up the cluster. Each file contains three numbers in its first line: the type of content (0: raw observations, 1: file list), the number of elements stored and the class. Since the file with pointers contains more than 8 K of them, and each secondary file contains 100 observations, the cluster in the figure represents more than 800 K observations.
of the secondary file, we can see that observations ’5940’ and ’5959’ belong to the cluster. With this structure, we retain the original
information but organized in an optimal way, following the clustering performed, so that we can easily retrieve the observations/
clusters with interesting behavior for further inspection, e.g., using figures of cluster scores like Fig. 18. Individual observations that
were not added to any cluster during the iterative modeling phase are directly stored in ’Lmodel.centr’ (see Fig. 15).
In Fig. 19 we show the MEDA plot36 of the PLS-DA Lmodel. With MEDA we can inspect the relationships among variables. The code to obtain this plot is:

>> [map,ind,ord] = meda_Lpls(Lmodel,[],111);

where the second argument is selected by default ('[]' means 'by default' in the MEDA Toolbox) and the third argument specifies the plotting options: reorder variables and plot only the most relevant variables. Please refer to the command line help for more information.
Fig. 16 PLS Compressed Score Plot of the first 2 LVs in the KDD data.
Fig. 18 Score plot for cluster "MEDA52o5940c11" (multiplicity 853,560).
Fig. 19 MEDA plot of the first 2 LVs in the KDD data.
The plot shows only one fourth of the 122 variables, and two groups of variables are highlighted. We can use this information to derive sparse models.43,44 This grouping is also useful to interpret the loading plot. The output of MEDA provides the complete MEDA map (with all 122 variables), the variables selected and the new ordering.
The loading plot of the first 2 LVs is shown in Fig. 20, with the groups found in MEDA annotated. Combining this with the score
plot in Fig. 16, we can see that the green group of features marks the difference between the green class and the red class of scores.
Similarly, the blue group of features marks the difference between the blue class of scores and the red class.
Let us show how we can also use the MEDA output to derive group-sparse models. First, we apply the variable reordering and selection obtained in MEDA to both the map and the Lmodel:
map2 = map(ind(ord), ind(ord));
Lmodel2 = select_vars(Lmodel, ind(ord));
Then we apply the GIA algorithm43 over the map, which identifies the groups of variables, and then GPCA:
[bel,states] = gia(map,0.5);
[P,T,bel,E] = Lgpca(Lmodel,states);
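The GIA step essentially groups variables whose entries in the MEDA map exceed a threshold (0.5 in the call above). A loose Python analogue of that grouping step, not the published algorithm, is to extract the connected components of the thresholded map:

```python
import numpy as np

def group_by_threshold(map_, gamma=0.5):
    """Greedy grouping: variables end up in the same group when they are
    linked by map entries above gamma. A loose analogue of the GIA
    grouping step, not the published algorithm."""
    n = map_.shape[0]
    unassigned = set(range(n))
    groups = []
    while unassigned:
        group = {min(unassigned)}
        frontier = set(group)
        while frontier:
            v = frontier.pop()
            linked = {u for u in unassigned - group if map_[v, u] > gamma}
            group |= linked
            frontier |= linked
        groups.append(sorted(group))
        unassigned -= group
    return groups

m = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
print(group_by_threshold(m))  # prints [[0, 1], [2]]
```

Each resulting group can then play the role of one of the 'states' passed to the group-sparse decomposition.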
Fig. 20 PLS loading plot of the first 2 LVs in the KDD data, with the groups found in MEDA annotated.
The result is shown in Fig. 21, from which we can draw the same conclusions as before, but more clearly. The first sparse PC achieves a good separation of the blue class from the rest, and the corresponding loading contains only 6 of the 122 variables related to that separation. The second sparse PC does the same for the green class, selecting 10 variables. The benefit of sparse models is that they are easier to interpret than the combination of loading/score plots. In any case, both methodologies are available for Big Data in the MEDA Toolbox.
The second case study is a DNA methylation dataset extracted from the Cancer Genome Atlas for Breast Invasive Carcinoma (BRCA).
Data was originally collected with the Illumina Infinium Human DNA Methylation 450 platform (HumanMethylation450),
including the status of 450K CpG sites.50 The data set contains 897 observations of 485,812 variables, which can be considered
a Big Data set in the variables.
Observations and variables with more than 30% of missing data were discarded (first observations, then variables). The remaining missing elements were imputed with unconditional mean replacement, and the data were auto-scaled. The data were split into 49 batches, of 897 × 10,000 elements each, except for the last batch, of 897 × 5,812. The residual variance in terms of the number of PCs is shown in Fig. 22. We can see that the variance is distributed across the PCs, as is often found in massive data.
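The preprocessing just described (discard by missing-data ratio, unconditional mean imputation, auto-scaling) can be sketched as follows; the function name and the NumPy implementation are ours, not part of the MEDA Toolbox:

```python
import numpy as np

def preprocess(X, max_missing=0.3):
    """Drop observations, then variables, whose fraction of NaNs exceeds
    max_missing; impute the rest with the column mean (unconditional mean
    replacement); auto-scale. Illustrative sketch of the steps above."""
    X = X[np.isnan(X).mean(axis=1) <= max_missing]      # observations first
    X = X[:, np.isnan(X).mean(axis=0) <= max_missing]   # then variables
    mu = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), mu, X)                    # mean imputation
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1) # auto-scaling
```

After this step, each retained variable has zero mean and unit variance, which is the scaling assumed by the PCA model that follows.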
The Compressed Loading Plot (CLP) of the first 2 PCs is shown in Fig. 23. The ≈500 K variables are homogeneously distributed across the subspace with a fish shape. The 2 PCs only account for 19% of the variance. The corresponding score plot is shown in Fig. 24. We see a group of individuals deviating from the rest towards the left side of the plot. We would like to shed some light on this deviation. The plot shows the specific disease sub-type of the individuals using different colors. A clear dominance of one disease sub-type, 'Ductal and Lobular Neoplasms', is apparent. We can see that the separation between the group of individuals and the rest cannot be attributed to disease sub-type, since both groups share the same dominant sub-type. A similar conclusion is reached if we color the individuals by gender or ethnicity (not shown). Thus, the separation cannot be attributed to these factors either.
We can use contribution plots to identify the variables (composite elements) related to the deviation between groups of individuals. For that we use oMEDA,37 a bar plot over the variables built to compare two groups of observations. Each bar represents the contribution of the variable to the difference between both groups. A positive bar implies that the first group of observations presents a higher value in the corresponding variable than the second group; a negative bar reflects the opposite. A bar close to zero for a variable means that both groups of observations have a similar value in that variable. The MEDA Toolbox includes the oMEDA routine and an extension to Big Data, which is the one we use here.
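A minimal sketch of what such a contribution bar conveys, using the per-variable difference of group means as a stand-in for the actual oMEDA statistic (which additionally weights the comparison with the latent model), would be:

```python
import numpy as np

def contribution_bars(X, idx_a, idx_b):
    """Bar values comparing two groups of observations, variable by
    variable: positive when group A is higher, negative when lower, near
    zero when both groups agree. A simplified stand-in for the oMEDA
    statistic, not the published formula."""
    return X[idx_a].mean(axis=0) - X[idx_b].mean(axis=0)

# toy check: only variable 0 differs between the two groups
X = np.array([[3., 0., 1.],
              [3., 0., 1.],
              [0., 0., 1.],
              [0., 0., 1.]])
d = contribution_bars(X, [0, 1], [2, 3])  # -> [3., 0., 0.]
```

The sign convention matches the description above: a positive bar flags variables in which the first group is elevated.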
Fig. 25 illustrates the oMEDA plot that compares the two groups of individuals in the score plot. Since variables can be clusters of composites, the multiplicity of each cluster is visualized through the area of the circles at the base of the plot: the higher the multiplicity, the larger the circle. Blue bars correspond to clusters of composites, and red bars to individual
Fig. 21 GPCA model in the KDD data: (A) loadings of the first sparse PC, (B) loadings of the second sparse PC, (C) scores of the first sparse PC, and (D) scores of the second sparse PC.
Fig. 22 Residual variance in terms of the number of PCs in PCA in the BRCA data.
Fig. 23 PCA Compressed Loading Plot of the first 2 PCs in the BRCA data.
Fig. 24 PCA Score Plot of the first 2 PCs in the BRCA data: classes according to disease sub-type.
composites. Let us focus on the four topmost bars, those with an oMEDA score above the dashed line. These are expected to hold the main discrepancy between the groups of individuals. The topmost cluster in the plot (bar number 8) has a multiplicity of 13,262 composites (i.e., it contains that number of original composites). The second one (number 70) has a multiplicity of 106 composites. The third one (number 94) is an individual composite (cg04396454), and the fourth one (number 93) represents only 5 composites. If we mark those clusters in the CLP, we see that they are actually located at the extreme positions along the horizontal axis (Fig. 26). Also, Fig. 27 shows the scores of the individuals for the selected clusters of composites, highlighting one group of individuals with circles. Clearly, the selection by oMEDA is useful to determine a subset of interesting variables. Finally, we retrieved the data corresponding to the first cluster and computed a regular PCA from it. The result is displayed in Fig. 28, where we can see that the selection of composites in that cluster is related to the deviation of individuals: individuals in the group deviating towards the right in the new score plot show a higher value for those composites.
Fig. 25 oMEDA plot of the cluster: red bars represent individual variables and blue bars clusters of variables.
Fig. 26 PCA Compressed Loading Plot of the first 2 PCs in the BRCA data, with the clusters selected by oMEDA marked.
In this article, we have illustrated how extensions of chemometric tools to Big Data can be useful to derive insights from such data. However, there are a number of challenges11 that need to be addressed in order to make this approach of practical use. The following are those that the authors see as most relevant:
• Interactivity with Data. Chemometric tools are especially useful if we can interact with data. For instance, every time we find an
outlier or cluster of anomalous items, it would be interesting to investigate them in depth and then continue the analysis with
the rest of the data. While in the approach described in this article models can be recomputed easily by properly updating the
cross-product matrices, the clustering for the visualization needs to be recomputed from scratch. This is, unfortunately, a computationally demanding operation. An alternative based on data approximation techniques is explored in Ref. 11, but its use in practice is complex.
• Big Data in the two modes. The approach described in this article cannot handle data sets that are Big in both modes (observations and variables) at once. However, to the best of our knowledge, such Big Data sets are very rare.
• Use of High Performance Computing (HPC). We can use full parallelization to speed up computation. The computation of
cross-product matrices can still be exact with this approach, but the effect on the clustering performance for the visualization
needs to be determined.
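The recomputation-by-update remark in the first point can be made concrete: the cross-product matrix accumulates additively over batches, so removing (or adding) a batch is a cheap update rather than a new pass over all raw data. A minimal NumPy sketch, with toy batch sizes of our choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
batches = [rng.normal(size=(100, 5)) for _ in range(4)]

# stream the batches: the cross-product matrix X'X accumulates additively
XX = np.zeros((5, 5))
for Xb in batches:
    XX += Xb.T @ Xb

# discarding an outlying batch only requires subtracting its contribution,
# with no second pass over the raw data
XX_wo = XX - batches[2].T @ batches[2]
ref = sum(Xb.T @ Xb for i, Xb in enumerate(batches) if i != 2)
print(np.allclose(XX_wo, ref))  # prints True
```

Since the PCA/PLS models in this framework are computed from such cross-product matrices, this is why model updates after removing outliers are inexpensive, while the visualization clustering is not.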
Fig. 27 Scores of the individuals for the clusters of composites with the highest diagnosis value in Fig. 25: (A) cluster with the highest value, (B) cluster with the second highest value, (C) single composite with the third highest value, and (D) cluster with the fourth highest value. Individuals that belong to the group deviating towards the left in Fig. 24 (the score plot) are highlighted with circles.
Acknowledgment
This work is partly supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds through project TIN2017-83494-R and the
“Plan Propio de la Universidad de Granada,” grant number PPVS2018-06.
References