
International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), 11-12 July, 2019

Identification of Metabolomic Biomarker using Multiple Statistical Techniques and Recursive Feature Elimination
Tahsin Masrur∗, Md. Al Mehedi Hasan†
Department of Computer Science and Engineering
Rajshahi University of Engineering and Technology
Rajshahi, Bangladesh
Email: ∗ tahsin86m@gmail.com, † mehedi ru@yahoo.com

Abstract—The mortality rate of diseases like lung cancer can be decreased significantly by increasing the chance of early diagnosis. Identifying differentially expressed (DE) metabolites may contribute remarkably to this goal, and also to drug design. In the past, several kinds of approaches have been attempted to discover biomarkers for diseases. Nonetheless, discovering compact-sized biomarkers while maintaining satisfactory classification performance is still a challenge. Therefore, as a further contribution to this area, we have selected biomarkers from our identified DE metabolites in plasma and serum blood samples of lung cancer patients. Student’s t-test, the Kruskal-Wallis test and the Mann-Whitney-Wilcoxon test were applied to identify the DE metabolites. Cluster heatmap plots and fold change values were used to differentiate between up- and down-regulated metabolites. Finally, the RFE method was used to rank the metabolites and select biomarkers from them. To assess the performance of our DE metabolites and biomarkers, an SVM classifier was utilized. We found 28 DE metabolites in the plasma dataset and 13 in the serum dataset (p-value<0.05). In the end, 8 metabolites were selected from the plasma sample and 5 from the serum sample as the metabolomic biomarkers. The relevant files and code of our work can be found at https://github.com/Zeronfinity/LungCancerBiomarkerRFE.

Index Terms—Differentially Expressed Metabolites, Biomarkers, T-Test, Kruskal-Wallis, Mann-Whitney-Wilcoxon, Fold Change, Cluster Heatmap, SVM, RFE

I. INTRODUCTION
Among all cancer types, lung cancer claims the largest percentage of deaths within the male population and the second largest within the female population worldwide [1]. At first diagnosis, around 75% of patients are found to have stage III/IV cancer [2]. In 2014, stage IV patients were reported to have only a 15–19% chance of surviving 1 year, whereas stage I patients had a survival chance of 81–85% [3]. Thus early diagnosis is important, and metabolomics can contribute a lot to it. Metabolomics technology measures and quantifies metabolites, the intermediate substances in cellular metabolism, in biofluids, tissues, cells, etc. [4]. For early diagnosis, our focus is on the metabolites expressed in different amounts between healthy subjects and lung cancer patients, i.e. differentially expressed (DE) metabolites.
To identify DE metabolites, parametric tests like Student’s t-test have been applied before to several kinds of cancer, e.g. lung cancer [4], prostate cancer [5], pancreatic cancer [6], etc. However, metabolomics datasets are often affected by outliers and might not follow the normal distribution that parametric tests assume [7]. For that reason, we also used 2 non-parametric statistical tests, the Kruskal-Wallis test and the Mann-Whitney-Wilcoxon test, in a previous work [8]. If a metabolite was differentially expressed in any of these three tests, we considered it a DE metabolite. Then, for practical purposes, we identified a small number of DE metabolites to select as biomarkers [9].
For our analysis, plasma and serum samples of both healthy subjects and lung cancer patients were used. We found 28 DE metabolites in plasma and 13 in serum. As shown in our previous work, a Support Vector Machine (SVM) classifier with the DE metabolites results in 87.5% and 83.33% accuracy for plasma and serum respectively, and biomarkers were selected using ROC curve analyses [8]. In this paper, we have used Recursive Feature Elimination (RFE) to rank the DE metabolites and selected smaller biomarker sets (8 plasma and 5 serum metabolites) with AUC (Area Under Curve) values of 0.879 and 0.870, with confidence intervals of 0.672-0.930 and 0.656-0.927, respectively.

II. MATERIALS AND METHODS
All the analysis in this paper was done in the R language. The t.test, kruskal.test and wilcox.test functions were used for the statistical tests. The e1071 [10], gplots, viridis, caret and pROC packages were used to implement the SVM, heatmap plots, RFE and ROC curves.

A. Dataset Description
The GC-TOF-MS technique was used to produce the dataset. There were 82 subjects, 41 of whom had lung cancer. Fred Hutch and University of California procured these samples, with consent from the subjects and under IRB approval, following protocols. The dataset was published by Oliver Fiehn under the study ID ST000392 [11]. In the end, we had a dataset of 82 subjects and 158 metabolites. On this dataset, we applied binary logarithm scaling and auto-scaling.
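The two preprocessing steps named in the dataset description can be sketched as below. This is a minimal illustration, not the authors' R code: it assumes auto-scaling means centring each metabolite to mean 0 and dividing by its sample standard deviation, and the metabolite names and intensities are hypothetical.

```python
import math
from statistics import mean, stdev

def log2_autoscale(columns):
    """Binary-log transform each metabolite, then centre to mean 0 and scale to unit SD."""
    scaled = {}
    for name, values in columns.items():
        logged = [math.log2(v) for v in values]  # assumes strictly positive intensities
        mu, sd = mean(logged), stdev(logged)      # sample SD (n - 1), assumed non-zero
        scaled[name] = [(v - mu) / sd for v in logged]
    return scaled

# Hypothetical intensities for two metabolites across four subjects.
toy = {
    "metabolite_a": [1.0, 2.0, 4.0, 8.0],
    "metabolite_b": [16.0, 16.0, 64.0, 64.0],
}
scaled = log2_autoscale(toy)
```

After this step every metabolite column has mean 0 and standard deviation 1, so metabolites measured on very different intensity scales contribute comparably to distance-based methods such as the cluster heatmap and the RBF-kernel SVM.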

978-1-7281-3060-6/19/$31.00 ©2019 IEEE
B. Student’s T-Test, Kruskal-Wallis and Mann-Whitney-Wilcoxon Test
These statistical tests are used to find whether two distributions are statistically different. In the t-test, if two random samples $x_{11}, \ldots, x_{1n_1}$ and $x_{21}, \ldots, x_{2n_2}$ follow normal distributions with means $\mu_1$ and $\mu_2$ respectively, the hypothesis to test is $H_0: \mu_1 = \mu_2$ vs $H_1: \mu_1 \neq \mu_2$. The test statistics are
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, \]
\[ s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \]
where $\bar{x}_1$ and $\bar{x}_2$ are the means of samples 1 and 2 respectively and $s_1^2$, $s_2^2$ are the respective variances. The left test statistic corresponds to equal variances and the right one to unequal variances. Finally, the p-value is calculated from the derived t value, with $n_1 + n_2 - 2$ degrees of freedom [4].
In the non-parametric Kruskal-Wallis test, the test statistic is
\[ H = \frac{\frac{12}{N(N+1)} \sum_{j=1}^{C} \frac{R_j^2}{n_j} - 3(N+1)}{1 - \frac{\sum T}{N^3 - N}}. \]
Here, $N = \sum n_j$, $R_j$ is the sum of the ranks of the j-th group, $n_j$ is the number of samples in the j-th group and $C$ is the number of groups. For each group of ties, $T = s^3 - s$, where $s$ is the number of tied samples. The hypothesis $H_0$ is that each of the $C$ groups is from the same population [8].
In the Mann-Whitney-Wilcoxon test, if $x_i$ is an element of the 1st group and $y_j$ is from the 2nd, the hypotheses are $H_0: p(x_i > y_j) = 1/2$ and $H_1: p(x_i > y_j) \neq 1/2$. The Mann-Whitney U statistics for the two groups are
\[ U_x = n_x n_y + \frac{n_x(n_x + 1)}{2} - R_x; \qquad U_y = n_x n_y + \frac{n_y(n_y + 1)}{2} - R_y. \]
Here, $n_x$ and $n_y$ are the numbers of samples in the 1st and 2nd group respectively, while $R_x$ and $R_y$ are the sums of the ranks in the two groups [8].

C. Up-regulated and Down-regulated Metabolites
To differentiate the up- and down-regulated metabolites, both fold change (FC) values and cluster heatmap plots were used. A cluster heatmap reorganizes the rows/columns of a matrix based on similarity [12]. We observed that the metabolites were separated into two individual clusters in the same way the FC values dictated [8].

D. Support Vector Machine Classifier
SVM, a supervised machine learning classifier, was used to obtain the classification performance of various sets of features, metabolites in this case. The discriminant function of SVM is [13], [14]
\[ f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b, \]
where $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ is the radial basis function (RBF) kernel [10].

Fig. 1: Heatmap plot for serum sample. Upper 16 metabolites (red-ish colored) are up-regulated, lower 12 (blue-ish colored) are down-regulated.

E. Recursive Feature Elimination and ROC curves
RFE is a backward selection method built on the concept of eliminating the least important features over numerous stages [15]. In the caret library, resampling methods are factored in to account for the variability caused by feature selection [16]. For our work, we used Random Forests and 10-fold cross-validation as the helper functions. To gauge the performance of our selected biomarkers, ROC curve analysis was used to calculate AUC (area under curve) values with a bootstrap technique of 10000 iterations [17].

III. EXPERIMENTAL ANALYSIS
We considered a metabolite DE if it had a p-value<0.05, adjusted with the Benjamini-Hochberg procedure [18], in any of the three statistical tests. From the plasma sample, we found 28 DE metabolites, among which 24 came from Student’s t-test, 25 from Kruskal-Wallis and 26 from Mann-Whitney-Wilcoxon. In the serum sample, the t-test, Kruskal-Wallis and Mann-Whitney-Wilcoxon identified 12, 11 and 11 DE metabolites respectively, 13 in total. These DE metabolites were then categorized as either up-regulated or down-regulated using FC values, summarized in Table I. Cluster heatmap plots also support these findings. In Figure 1, it is observed that all 7 up-regulated (according to FC values) DE metabolites of the serum sample are in the same cluster and the remaining 6 down-regulated DE metabolites belong to another cluster. The metabolites are color coded in Table I, where the red metabolites are the up-regulated ones and the blue ones are down-regulated. For the plasma sample, the same scenario occurs when a cluster heatmap plot is drawn: 16 metabolites are found to be up-regulated and 12 down-regulated. These up- and down-regulated metabolites are also listed with color coding in the plasma section of Table I.
To assess the performance of our chosen sets of metabolites, multiple sets of metabolites were used to build classification models, and the results are shown in Table II. Separate tuning was done for each model. Individual training and test sets were created for each dataset. The independent test set was not used during parameter tuning. To find an optimal pair of C and γ for the training set, grid search was applied after splitting the training set further to create a cross-validation set.
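The DE selection rule used in this section (Benjamini-Hochberg adjusted p-value below 0.05 in any of the three tests) can be sketched as follows. This is an illustrative reimplementation rather than the authors' R code; the function names `bh_adjust` and `de_in_any_test` and the raw p-values are hypothetical, while the first two rows of the second example are the adjusted plasma values for glutamine and threonine from Table I.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Walk from the largest p-value down, keeping a running minimum.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

def de_in_any_test(adjusted_by_test, alpha=0.05):
    """Indices of metabolites whose adjusted p-value is < alpha in at least one test."""
    n = len(next(iter(adjusted_by_test.values())))
    return [i for i in range(n)
            if any(adj[i] < alpha for adj in adjusted_by_test.values())]

# Hypothetical raw p-values for four metabolites from a single test.
adjusted_example = bh_adjust([0.01, 0.04, 0.03, 0.20])

# Already-adjusted values: glutamine, threonine (Table I, plasma), one hypothetical row.
adjusted = {
    "t_test":   [0.1206, 0.1664, 0.30],
    "kruskal":  [0.0162, 0.0506, 0.25],
    "wilcoxon": [0.0152, 0.0484, 0.28],
}
selected = de_in_any_test(adjusted)
```

Taking the union of the three tests keeps metabolites such as threonine, which passes only the Mann-Whitney-Wilcoxon test, in the DE set.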
TABLE I: In this table, the DE metabolites are listed with their respective adjusted p-values. The red colored ones are up-regulated and the blue colored ones are down-regulated, based on their FC values. All the values are rounded to 4 digits.

                            BH Adjusted p-Value of
Metabolites                 Student’s T-Test   Kruskal-Wallis Test   Mann-Whitney-Wilcoxon Test   Binary logarithm of FC Value
Plasma Sample
3-phosphoglycerate          0.0001             0.0001                0.0001                        0.8494
5-hydroxynorvaline NIST     0.0027             0.0034                0.0032                       -0.5959
5-methoxytryptamine         6.76E-06           0.0001                0.0001                        1.6127
adenosine-5-monophosphate   1.10E-09           1.85E-07              1.91E-07                      1.7089
alpha-ketoglutarate         0.0296             0.0415                0.0421                        0.3908
asparagine                  0.0098             0.0184                0.0187                       -0.3259
aspartic acid               0.0001             0.0002                0.0002                        0.5576
benzoic acid                0.0135             0.0299                0.0303                       -0.2047
citrulline                  0.0070             0.0043                0.0044                       -0.4460
glutamine                   0.1206             0.0162                0.0152                       -0.6734
hypoxanthine                0.0265             0.0163                0.0166                        0.5208
inosine                     0.0520             0.0385                0.0390                       -0.9355
lactamide                   0.0594             0.0478                0.0484                        1.1348
lactic acid                 0.0002             0.0004                0.0003                        0.8123
malic acid                  0.0115             0.0163                0.0166                        0.3597
maltose                     0.0019             0.0030                0.0031                        1.3170
maltotriose                 0.0375             0.0557                0.0565                        0.4854
methionine sulfoxide        0.0043             0.0062                0.0063                       -0.4996
nornicotine                 0.0032             0.0037                0.0035                       -0.6005
phenol                      0.0016             0.0001                0.0001                        0.6685
phosphoethanolamine         0.0233             0.0159                0.0152                        0.6260
pyrophosphate               9.48E-07           8.25E-06              8.47E-06                      1.3826
pyruvic acid                0.0070             0.0037                0.0032                        1.0788
quinic acid                 0.0154             0.0220                0.0223                       -0.8705
taurine                     6.07E-05           3.68E-05              3.77E-05                      1.2491
threonine                   0.1664             0.0506                0.0484                       -0.1965
tryptophan                  0.0420             0.0626                0.0606                       -0.2356
uric acid                   0.0220             0.0291                0.0268                       -0.3787
Serum Sample
5-hydroxynorvaline NIST     0.0424             0.0433                0.0411                       -0.5056
aspartic acid               6.50E-05           0.0001                0.0001                        1.4069
cholesterol                 0.0429             0.0451                0.0411                       -0.2357
deoxypentitol               0.1279             0.04454               0.0411                        0.6217
glutamic acid               0.0481             0.0725                0.0683                        0.6119
hypoxanthine                0.0120             0.0002                0.0001                        0.6109
inosine                     0.0286             0.0156                0.0136                       -0.8138
lactic acid                 0.0429             0.0433                0.0411                        0.4116
N-methylalanine             0.0429             0.0451                0.0411                       -0.3894
nornicotine                 0.0429             0.0548                0.0557                       -0.5073
phenol                      0.0429             0.0149                0.0136                        0.4224
quinic acid                 0.0429             0.0445                0.0411                       -0.8869
taurine                     0.0429             0.0168                0.0136                        0.5852
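The up/down labels in Table I follow from the sign of the binary logarithm of the fold change. The sketch below shows that rule under a stated assumption: the paper does not say how FC was computed, so the ratio of group means (lung cancer over healthy) used in `log2_fold_change` is hypothetical. The `serum_log2_fc` list is copied from the serum rows of Table I.

```python
import math
from statistics import mean

def log2_fold_change(case_values, control_values):
    """Binary log of the ratio of group means (assumed definition of FC)."""
    return math.log2(mean(case_values) / mean(control_values))

def regulation(log2_fc):
    """Positive log2 FC means up-regulated, negative means down-regulated."""
    if log2_fc > 0:
        return "up-regulated"
    if log2_fc < 0:
        return "down-regulated"
    return "unchanged"

# Serum log2 FC values in the row order of Table I.
serum_log2_fc = [-0.5056, 1.4069, -0.2357, 0.6217, 0.6119, 0.6109, -0.8138,
                 0.4116, -0.3894, -0.5073, 0.4224, -0.8869, 0.5852]
labels = [regulation(v) for v in serum_log2_fc]
n_up = labels.count("up-regulated")
n_down = labels.count("down-regulated")
```

Applied to the serum rows this gives 7 up-regulated and 6 down-regulated metabolites, matching the counts reported in the text.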

TABLE II: The classification accuracies of the different types of feature selections are observed in this table.

Set of Features                               Cost (C)   Gamma (γ)    Classification Accuracy
Plasma Sample
Full set of 158 metabolites                   1          0.0064       83.33%
DE in Student’s t-test                        1          0.01         83.33%
DE in Kruskal-Wallis                          1          0.005920768  83.33%
DE in Mann-Whitney-Wilcoxon                   2.828427   0.005524272  75.00%
DE in each of the three tests (intersection)  1          0.0123       83.33%
DE in any of the three tests (union)          0.5        0.0078125    87.50%
Serum Sample
Full set of 158 metabolites                   16         0.000976563  66.67%
DE in Student’s t-test                        3.363586   0.00464534   79.17%
DE in Kruskal-Wallis                          2          0.006569503  70.83%
DE in Mann-Whitney-Wilcoxon                   4          0.02209709   75.00%
DE in each of the three tests (intersection)  1024       0.01104854   83.33%
DE in any of the three tests (union)          0.2102241  0.02209709   83.33%
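Section II-E evaluates the selected biomarkers by AUC with a bootstrap confidence interval. A minimal sketch of that calculation is given below. It is not the pROC implementation the authors used: the stratified resampling, the percentile interval, the default of 2000 iterations (the paper reports 10000) and the toy scores are all assumptions made for illustration.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, iterations=2000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling within each class separately."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    stats = []
    for _ in range(iterations):
        p = [rng.choice(pos) for _ in pos]  # resample positives with replacement
        n = [rng.choice(neg) for _ in neg]  # resample negatives with replacement
        stats.append(auc(p + n, [1] * len(p) + [0] * len(n)))
    stats.sort()
    lo = stats[int((1 - level) / 2 * iterations)]
    hi = stats[int((1 + level) / 2 * iterations) - 1]
    return lo, hi

# Hypothetical classifier scores (1 = lung cancer, 0 = healthy).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
point_estimate = auc(scores, labels)
lo, hi = bootstrap_auc_ci(scores, labels)
```

With only a handful of samples per class the interval is wide, which is consistent with the broad intervals reported for the plasma and serum biomarker panels.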
It is usually quite hard to develop one single assay to quantify a moderately large number of metabolites reproducibly in clinical settings. Thus, a shorter list of 1-10 biomarkers is more feasible for a clinical laboratory, and usually mathematically more robust too [9]. We therefore used a wrapper method on our DE metabolites, as the computational cost issue is mitigated by our comparatively small number of DE metabolites. We chose recursive feature elimination with random forest for that purpose. From the plasma blood sample, we identified 8 metabolites as biomarkers, and from the serum blood sample we obtained 5 metabolites as biomarkers, with AUC values of 0.879 and 0.870 and confidence intervals (CI) of 0.672-0.930 and 0.656-0.927 respectively, shown in Figures 2 and 3. The plasma biomarkers are adenosine-5-monophosphate, taurine, pyrophosphate, 5-hydroxynorvaline NIST, aspartic acid, 3-phosphoglycerate, phenol and methionine sulfoxide, ranked according to importance. Red represents up-regulated metabolites and blue down-regulated ones. The serum biomarkers are aspartic acid, hypoxanthine, cholesterol, phenol and 5-hydroxynorvaline NIST.

Fig. 2: Plot of ROC curve and AUC value with the selected biomarker of plasma sample

Fig. 3: Plot of ROC curve and AUC value with the selected biomarker of serum sample

IV. CONCLUSION
In this paper, detailed analysis was done on plasma and serum datasets of lung cancer, with the intent to identify differentially expressed metabolites and metabolomic biomarkers. We identified 28 DE metabolites from the plasma blood sample and 13 DE metabolites from the serum sample using three statistical tests. Among these metabolites, the up-regulated and down-regulated ones were differentiated. In the end, we used RFE on the DE metabolites to rank them and choose the most suitable subset among them as biomarkers. With our procedures, we obtained 8 metabolites as plasma biomarkers and 5 metabolites as serum biomarkers for lung cancer. We believe that our analyses may help metabolomics data science researchers in obtaining a deeper understanding of DE metabolites and their effects and usefulness for lung cancer.

REFERENCES
[1] L. A. Torre, R. L. Siegel, and A. Jemal, Lung Cancer Statistics. Springer, Cham, 2016, pp. 1–19.
[2] S. Walters, C. Maringe, M. P. Coleman, M. D. Peake, J. Butler, N. Young, S. Bergström, L. Hanna, E. Jakobsen, K. Kölbeck et al., “Lung cancer survival and stage at diagnosis in Australia, Canada, Denmark, Norway, Sweden and the UK: a population-based study, 2004–2007,” Thorax, vol. 68, no. 6, pp. 551–564, 2013.
[3] J. Broggio and N. Bannister, “Cancer survival by stage at diagnosis for England,” Jun. 2016.
[4] N. Kumar, M. Shahjaman, M. N. H. Mollah, S. S. Islam, and M. A. Hoque, “Serum and plasma metabolomic biomarkers for lung cancer,” Bioinformation, vol. 13, no. 6, p. 202, Jun. 2017.
[5] A. Sreekumar et al., “Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression,” Nature, Feb. 2009, [PMID: 19212411].
[6] S. Nishiumi et al., “Serum metabolomics as a novel diagnostic approach for pancreatic cancer,” Metabolomics, 2010, [DOI: 10.1007/s11306-010-0224-9].
[7] L. Blanchet and A. Smolinska, “Statistical analysis in proteomics,” 2016.
[8] T. Masrur, M. A. M. Hasan, and M. N. I. Mondal, “Metabolomic biomarker identification for lung cancer by combining multiple statistical approaches,” in 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE). IEEE, 2019, pp. 1–6.
[9] J. Xia, D. I. Broadhurst, M. Wilson, and D. S. Wishart, “Translational biomarker discovery in clinical metabolomics: an introductory tutorial,” Metabolomics, vol. 9, no. 2, pp. 280–299, 2013.
[10] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, F. Leisch, C.-C. Chang, C.-C. Lin, and M. D. Meyer, “Package ‘e1071’,” 2018.
[11] S. Miyamoto et al., “Systemic metabolomic changes in blood samples of lung cancer patients identified by gas chromatography time-of-flight mass spectrometry,” Metabolites, vol. 5, no. 2, pp. 192–210, Apr. 2015.
[12] K. Pollard, S. Engle, S. Whalen, A. Joshi, and K. Pollard, “Unboxing cluster heatmaps,” 2017.
[13] V. Vapnik, The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[14] M. A. M. Hasan, S. Ahmad, and M. K. I. Molla, “Protein subcellular localization prediction using multiple kernel learning based support vector machine,” Molecular BioSystems, vol. 13, no. 4, pp. 785–795, 2017.
[15] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, no. 1-3, pp. 389–422, 2002.
[16] M. Kuhn, “Variable selection using the caret package,” URL http://cran.cermin.lipi.go.id/web/packages/caret/vignettes/caretSelection.pdf, 2012.
[17] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J.-C. Sanchez, and M. Müller, “pROC: an open-source package for R and S+ to analyze and compare ROC curves,” BMC Bioinformatics, vol. 12, p. 77, 2011.
[18] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 289–300, 1995.
