
RESEARCH ARTICLE
Lu et al., Journal of Medical Microbiology 2023;72:001699

DOI 10.1099/jmm.0.001699
Using gut microbiota as a diagnostic tool for colorectal cancer: machine learning techniques reveal promising results
Fang Lu1,2†, Ting Lei2,3†, Jie Zhou1,2†, Hao Liang1,2,4, Ping Cui2,4, Taiping Zuo2,3, Li Ye1,2,*, Hui Chen2,3,* and Jiegang Huang1,2,*
Abstract
Introduction. Increasing evidence suggests a correlation between gut microbiota and colorectal cancer (CRC).
Hypothesis/Gap Statement. However, few studies have used gut microbiota as a diagnostic biomarker for CRC.
Aim. The objective of this study was to explore whether a machine learning (ML) model based on gut microbiota could be used
to diagnose CRC and identify key biomarkers in the model.
Methodology. We sequenced the 16S rRNA gene from faecal samples of 38 participants, including 17 healthy subjects and 21
CRC patients. Eight supervised ML algorithms were used to diagnose CRC based on faecal microbiota operational taxonomic
units (OTUs), and the models were evaluated in terms of identification, calibration and clinical practicality for optimal modelling
parameters. Finally, the key gut microbiota was identified using the random forest (RF) algorithm.
Results. We found that CRC was associated with the dysregulation of gut microbiota. Through a comprehensive evaluation of
supervised ML algorithms, we found that different algorithms had significantly different prediction performance using faecal
microbiomes. Different data screening methods played an important role in optimization of the prediction models. We found
that naïve Bayes algorithms [NB, accuracy=0.917, area under the curve (AUC)=0.926], RF (accuracy=0.750, AUC=0.926) and
logistic regression (LR, accuracy=0.750, AUC=0.889) had high predictive potential for CRC. Furthermore, important features
in the model, namely s__metagenome_g__Lachnospiraceae_ND3007_group (AUC=0.814), s__Escherichia_coli_g__Escherichia-Shigella
(AUC=0.784) and s__unclassified_g__Prevotella (AUC=0.750), could each be used as diagnostic biomarkers of CRC.
Conclusions. Our results suggested an association between gut microbiota dysregulation and CRC, and demonstrated the feasibility
of the gut microbiota to diagnose cancer. The bacteria s__metagenome_g__Lachnospiraceae_ND3007_group,
s__Escherichia_coli_g__Escherichia-Shigella and s__unclassified_g__Prevotella were key biomarkers for CRC.
Received 19 December 2022; Accepted 06 April 2023; Published 07 June 2023

Author affiliations: 1School of Public Health, Guangxi Medical University, Nanning, 530021, Guangxi, PR China; 2Guangxi Key Laboratory of AIDS
Prevention and Treatment & Guangxi Universities Key Laboratory of Prevention and Control of Highly Prevalent Disease, Nanning, 530021, Guangxi,
PR China; 3Geriatrics Digestion Department of Internal Medicine, The First Affiliated Hospital of Guangxi Medical University, Nanning, PR China; 4Life
Science Institute, Guangxi Medical University, Nanning, 530021, Guangxi, PR China.
*Correspondence: Li Ye, yeli@gxmu.edu.cn; Hui Chen, chenhuiyfy@gxmu.edu.cn; Jiegang Huang, jieganghuang@gxmu.edu.cn
Keywords: colorectal cancer; gut microbiome; 16S rRNA gene sequencing; diagnosis; machine learning; biomarker.
Abbreviations: AUC, area under the receiver operating characteristic curve; CRC, colorectal cancer; DCA, decision curve analysis; DT, decision tree;
FOBT, faecal occult blood test; HC, healthy control; KNN, k-nearest neighbours; LASSO, least absolute shrinkage and selection operator; LEfSe, linear
discriminant analysis effect size; LR, logistic regression; ML, machine learning; NB, naïve Bayes algorithms; NN, neural network; OTU, operational
taxonomic unit; RF, random forest; ROC, receiver operating characteristic curve; SVM, support vector machines; XGB, extreme gradient boosting.
Supplementary materials (Table S1 and File S1) are available with the online version of this article. The datasets presented in this study can
be found in online repositories. The names of the repository/repositories and accession number(s) are: BioSample database, BioProject ID:
PRJNA910989 (http://www.ncbi.nlm.nih.gov/bioproject/910989) and PRJNA933359 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA933359).
†These authors contributed equally to this work and share the first authorship.
001699 © 2023 The Authors

INTRODUCTION
The high incidence and mortality of colorectal cancer (CRC) make it one of the most concerning diseases in the world. According to
the Global Cancer Data 2020 report, there were 19.3 million new cancer cases worldwide, and the overall incidence of CRC rose from
fourth place in 2018 to third place [1]. Because effective drugs for CRC are still being developed and the only effective measures are
early detection and surgical removal of CRC, many countries recommend universal screening and prevention programmes. Currently,
one of the most widely used non-invasive screening procedures is the faecal occult blood test (FOBT), which can indicate the presence
of advanced adenomas and carcinomas in the colon by detecting blood in the stool [2]. However, because the FOBT has limited
sensitivity and specificity for CRC and does not reliably detect precancerous lesions, there is a need to develop a new non-invasive,
simple and effective CRC screening test [3].
The gut microbiota is a collection of microorganisms living in the gastrointestinal tract and is a potential source of biomarkers for
detecting colonic lesions. In human studies, patients with CRC have an abnormal gut microbiome structure when compared with
healthy individuals [4, 5]. Experiments in animal models have also shown that such alterations have the potential to accelerate
tumorigenesis [6]. Thus, the detection of these pathogenic bacteria in gut microbiota could be a promising method for CRC screening.
Although some members of the gut microbiota have been shown to contribute to the onset and progression of CRC through various
mechanisms, they are not present in all cases [4, 7, 8]. It is unclear how many cases of CRC can be attributed to these pathogens, and
whether changes in microbial abundance could provide the basis for an accurate CRC screening test.
Machine learning (ML), a major branch of artificial intelligence (AI), can be used to increase our understanding of changes in existing
data structures and to make predictions about new data. It has been used in a wide variety of studies, such as DNA methylation
associated with genetic diseases [9], the diagnosis of Alzheimer’s disease using imaging data [10], the prediction of gastrointestinal
disease development using continuous variable fitting techniques [11], and the automatic detection of gastrointestinal lesions by
computer vision in endoscopes [12]. ML algorithms and new computational models offer the opportunity to generate computational
drug networks to diagnose the efficacy of approved drugs relative to relevant oncogenic targets, as well as to select patients with better
responses or better disease biomarkers [10]. In the field of digital pathology, the emergence of AI and ML tools makes it possible to
mine new morphological phenotypes and improve patient management for a variety of cancer types [13]. It enables computer programs
to automatically analyse large amounts of data and determine which information is most relevant.
At present, several studies have identified and elucidated the pathogenicity of certain intestinal microorganisms. For example,
enterotoxigenic Bacteroides fragilis is the typical pathogen that causes CRC by upregulating inflammatory factors, releasing reactive oxygen
species, inducing intestinal inflammation, and promoting the formation of polyps and tumours [14, 15]. In the study by Yachida et
al. [16], principal component analysis (PCA) was used to select Bacteroides and Prevotella, two types of bacteria with the greatest
variation in abundance, from the faeces of CRC patients and a healthy control (HC) group. Both bacteria are major contributors to
the gut flora of CRC patients. Guo et al. [17] reported that a highly accurate CRC diagnostic model was developed by combining the
results of quantitative PCR (qPCR) of the abundance of three gut bacteria, Fusobacterium nucleatum, Faecalibacterium prausnitzii and
Bifidobacterium spp. In another study, a metagenomic classifier showed that Fusobacterium, Porphyromonas and Peptostreptococcus
were all enriched in CRC patients [18]. Moreover, some bacteria whose abundance differs significantly between CRC patients and normal
controls are recognized by many different algorithms and used as key parameters for prediction. For example, in the Chinese
population, Methanosphaera_stadtmanae_DSM_3091 was identified and used as a key parameter by the filtered classifier, sequential
minimal optimization (SMO), logistic and naïve Bayes models. Another dominant bacterium, Blautia_uncultured_Firmicutes_bacterium,
was used as a key parameter by the random tree, J48 and PART algorithms [19].
Although there is increasing recognition of the potential of the faecal microbiome in the detection of CRC, the choice of classification
models is diverse. Due to the nature of the algorithms themselves, each algorithm has its default parameters, so it is unclear which
modelling algorithm is more suitable for CRC diagnostic screening studies. In this study, we systematically evaluated the performance
of the supervised classifiers to diagnose CRC based on gut microbiota. We recruited 38 participants and sequenced the hypervariable
regions of the 16S rRNA gene from the faeces of each individual, used different supervised ML algorithms to test their performance
in the diagnosis of CRC based on gut microbiota, and identified several potential bacteria associated with the dysbiosis of CRC.
METHODS
Participants and sample collection
Patients and healthy volunteers were recruited from the First Affiliated Hospital of Guangxi Medical University (Guangxi, China)
between August 2020 and February 2021. The inclusion criteria for the CRC group were as follows: (1) tumour site was clear
and biopsy-confirmed; (2) no radiation or chemotherapy before sampling; (3) no antibiotics or probiotics within 1 month;
and (4) complete case data were available. Age- and gender-matched healthy volunteers were recruited as the HC
group at the Guangxi Medical University Center for Physical Examination. The inclusion criteria for HCs were as follows: (1)
no gastrointestinal-related diseases; and (2) no antibiotics or probiotics within 1 month. The exclusion criteria for both groups
were as follows: (1) have diseases related to intestinal flora, such as inflammatory bowel disease, diabetes, peptic ulcers, etc.; (2)
pregnant or lactating women; and (3) a family history of bowel disease, such as familial adenomatous polyposis. All the volunteers
understood and signed informed consent before inclusion in the group. This study was approved by the Ethics Committee of
Guangxi Medical University.
Faecal samples were collected from HCs and patients after CRC surgery, placed in sterile boxes on ice and transported immediately
to the laboratory. Each sample was evenly divided into sterile tubes and immediately frozen at −80 °C.
DNA extraction and sequencing

DNA extraction from faecal samples was performed using the FastDNA Spin Kit for Soil (MP Biomedicals), according to the
manufacturer’s instructions. DNA integrity was assessed using 1 % agarose gel electrophoresis, and purity and concentration were
assessed using a NanoDrop2000 UV spectrophotometer (Thermo Fisher Scientific). The V3–V4 hypervariable regions of the
bacterial 16S rRNA gene were amplified in a thermocycler PCR system (ABI GeneAmp 9700) using the following primer pairs:
forward 338-ACTCCTACGGGAGGCAGCAG and reverse 806-GGACTACHVGGGTWTCTAAT.
Purified amplicons were pooled in equimolar concentrations and sequenced on an Illumina MiSeq platform (Illumina) in PE300 mode,
according to standard protocols provided by Majorbio Bio-Pharm Technology. Raw FASTQ files were demultiplexed, quality filtered
via Trimmomatic and merged by FLASH, according to the following criteria: (i) reads were truncated at any site that received
an average quality score <20 over a 50 bp sliding window; (ii) primers were matched allowing up to two nucleotide mismatches,
and reads containing ambiguous bases were removed; and (iii) sequences with overlaps longer than 10 bp were merged according to
their overlapping sequence. The 16S rRNA gene sequence information in this study has been submitted to the NCBI BioProject, with
accession numbers PRJNA910989 and PRJNA933359.
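The sliding-window criterion in (i) can be illustrated with a minimal Python sketch. It is not Trimmomatic's implementation: the function name `truncate_by_sliding_window` and the example read are hypothetical, and only the window size (50 bp) and the quality threshold (20) are taken from the text.

```python
def truncate_by_sliding_window(qualities, window=50, min_mean=20):
    """Return the number of leading bases to keep.

    Scans windows of `window` bases from the 5' end and truncates the read at
    the start of the first window whose mean Phred quality is below `min_mean`.
    """
    n = len(qualities)
    if n < window:
        # Read shorter than one window: judge it as a single window.
        return n if n and sum(qualities) / n >= min_mean else 0
    window_sum = sum(qualities[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window by one base in O(1).
            window_sum += qualities[start + window - 1] - qualities[start - 1]
        if window_sum / window < min_mean:
            return start
    return n


# Hypothetical read: 120 high-quality bases followed by a low-quality tail.
phred = [35] * 120 + [10] * 80
keep = truncate_by_sliding_window(phred)
print(keep)
```

In this made-up read the first window whose mean falls below 20 starts at base 101, so the first 101 bases would be retained.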
Data analysis
Trimmomatic software was used for quality control of the original sequences, and FLASH software was used for merging. We used the
UPARSE 7.1 software (http://drive5.com/uparse/) to perform OTU clustering on the sequences, only clustering sequences with at least
97 % identical nucleotides into OTUs. Chimeras were removed during clustering using the UCHIME software (Table S1, available
in the online Supplementary Material). Each sequence was annotated for species classification using the RDP classifier (http://rdp.
cme.msu.edu/) and compared with the Silva 16S rRNA database (Release 138, http://www.arb-silva.de). The comparison threshold
was set to 70 %.
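The idea of grouping sequences at a 97 % identity threshold can be sketched as a greedy, centroid-based clustering. The toy code below is an illustration only: it assumes pre-aligned, equal-length sequences and a simple position-by-position identity, whereas UPARSE uses its own alignment, abundance handling and chimera filtering. All function names and sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs_by_abundance, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold.

    `seqs_by_abundance` must be ordered from most to least abundant, so that
    abundant sequences become centroids first.
    """
    centroids = []
    assignment = []
    for seq in seqs_by_abundance:
        for idx, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                assignment.append(idx)
                break
        else:
            # No centroid was close enough: this sequence founds a new cluster.
            centroids.append(seq)
            assignment.append(len(centroids) - 1)
    return centroids, assignment


def mutate(seq, n):
    """Toy helper: change the first n bases so that each differs from the original."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[ch] for ch in seq[:n]) + seq[n:]


# Hypothetical 100 bp sequences (not real 16S reads).
base = "ACGT" * 25
near = mutate(base, 2)  # 98 % identical to base
far = mutate(base, 5)   # 95 % identical to base
centroids, assignment = greedy_cluster([base, near, far])
print(assignment)
```

Here `near` joins the cluster founded by `base`, while `far` falls below the threshold and founds a second cluster.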
Supervised ML modelling
All ML analyses were performed using R (v.4.2.2). Linear discriminant analysis effect size (LEfSe, LDA>2, P<0.05) based on the
microeco package was used to identify the OTUs that differed between the CRC and HC groups, and then the ML model was built. Eight
different supervised ML algorithms in the caret package were trained on the microbiota OTU table: k-nearest neighbours (KNN),
decision tree (DT), naïve Bayes (NB), neural network (NN), random forest (RF), support vector machines (SVM), logistic regression
(LR) and extreme gradient boosting (XGB). Data were assigned into training (70 %) and testing (30 %) datasets after the whole dataset
was shuffled. The training performance of the different ML models was evaluated by 10-fold cross-validation, and the process was repeated
ten times to obtain the optimal modelling parameters. To further improve model performance, we used the glmnet package-based
regression of the least absolute shrinkage and selection operator (LASSO) to further screen differential OTUs identified by LEfSe
before modelling. All models were evaluated from three perspectives: identification [accuracy, area under the receiver operating
characteristic curve (AUC), sensitivity, specificity], calibration (Brier score) and clinical practicability [decision curve analysis (DCA)].
The RF algorithm was used to identify the top ten most important OTUs in the model. Further, the resulting AUCs were plotted using
GraphPad Prism 9.0 to identify potential microbial biomarkers for CRC. Model parameters and code are available in Supplementary
File 1.
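The resampling scheme described above can be sketched in a few lines of Python. This is only an illustration of the index bookkeeping, not the authors' R/caret code (which is in the Supplementary File): it assumes the scheme is a shuffled 70/30 split followed by 10-fold cross-validation repeated ten times on the training portion, and the function names and seeds are hypothetical.

```python
import random


def shuffle_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into training and test indices."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    cut = int(n_samples * train_fraction)  # floor: 38 samples give 26 training samples
    return idx[:cut], idx[cut:]


def repeated_kfold(indices, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, valid_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [shuffled[f::k] for f in range(k)]
        for f in range(k):
            valid = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield r, f, train, valid


train_idx, test_idx = shuffle_split(38)
resamples = list(repeated_kfold(train_idx))
print(len(train_idx), len(test_idx), len(resamples))
```

With 38 samples this yields 26 training and 12 test samples and 100 train/validation resamples. A model would be fitted on each `train` part and scored on the matching `valid` part, and the tuning parameters with the best average score would be retained.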
Statistical analysis
Clinical data were analysed using SPSS Statistics v26.0. Quantitative data were presented as mean and standard deviation (sd), and
the differences in age and body mass index (BMI) between the CRC and HC group were compared using t-tests. Qualitative data were
presented as percentages, and differences in gender and race between different groups were compared using Chi-square tests. All data
with two-sided P<0.05 were considered statistically significant.
RESULTS
A total of 38 subjects were included in this study: 17 healthy individuals in the control group and 21 individuals in the CRC group.
There were no significant differences in gender, age, BMI or race between the two groups (Table 1).
CRC was related to the dysregulation of various gut microbiota

Significant differences in gut microbiota were observed between the CRC and HC subjects (Fig. 1a, b). A total of 55 taxonomic features
(LDA>3.2) were found to be enriched in the CRC or HC group (Fig. 1a). Differential enrichment in several major bacterial taxa in the
CRC and HC groups and their phylogenetic relationships were revealed using the cladogram (Fig. 1b). At the level of bacterial genus,
Escherichia-Shigella, Ralstonia, Peptostreptococcus, Gemella and Fusobacterium were more abundant in the CRC group. In contrast,
Haemophilus, Ruminococcus, Faecalibacterium, Coprococcus, Lachnospiraceae_ND3007_group, Lachnospiraceae_UCG-001, Lachnospira
and Fusicatenibacter were more abundant in the HC group (Fig. 1b).
Table 1. Basic information about the study population
Characteristic           CRC (N = 21)     HC (N = 17)      P-value
Age, year (mean, sd)     60.43 (10.03)    55.94 (17.04)    0.147
BMI (mean, sd)           21.69 (2.98)     22.53 (2.79)     0.381
Gender (n, %)                                              0.275
  Men                    19 (90.48 %)     14 (82.35 %)
  Women                  2 (9.52 %)       3 (17.65 %)
Race                                                       0.342
  Han                    15 (71.43 %)     13 (76.47 %)
  Other                  6 (28.57 %)      4 (23.53 %)
Clinical stage
  Ⅰ                      4                –
  Ⅱ                      3                –
  Ⅲ                      8                –
  Ⅳ                      6                –
BMI, body mass index; CRC, colorectal cancer; HC, healthy control.
ML models for diagnosis and screening based on the gut microbiota

Supervised ML models trained with LEfSe
Based on the LEfSe (LDA>2, P<0.05), 67 different OTUs between the CRC and HC groups were identified to construct the
ML model. Fig. 2 shows the performance measures of the eight different ML algorithms evaluated on the test dataset for the
CRC versus HC classification, including KNN, DT, NB, NN, RF, SVM, LR and XGB. Among them, based on the criterion
of AUC>0.7, KNN, DT, RF, SVM, LR and XGB performed better than the other models. DT had the highest accuracy and
sensitivity of the supervised ML models, together with a specificity of 1.000 (Table 2; Fig. 2a). Regarding the Brier score,
SVM had the lowest value (0.178), indicating the best calibration. The decision curve analyses of the simulated data sets under
the different models are shown in Fig. 2b. When the threshold probability was ≥0.7, the clinical net benefits of DT, LR, RF,
XGB and SVM were higher than the All curve and None curve. In summary, the DT, RF and SVM models constructed by LEfSe
screening had better performance and were more conducive to predicting and identifying subjects with CRC.
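As a worked check of how the identification and calibration measures relate, the sketch below computes accuracy, sensitivity and specificity from confusion-matrix counts, and the Brier score from predicted probabilities. The counts are an inference, not something the text states: the sensitivities in Table 2 are multiples of 1/9 and the specificities multiples of 1/3, which is consistent with a test set of 9 CRC and 3 HC samples. The function names and the probabilities passed to `brier_score` are hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)


# Assumed counts: 9 CRC and 3 HC test samples, with 6 CRC and 3 HC classified correctly.
dt = classification_metrics(tp=6, fn=3, tn=3, fp=0)
rounded = {name: round(value, 3) for name, value in dt.items()}
print(rounded)

# Made-up probabilities for three samples, only to show the Brier score calculation.
example_brier = round(brier_score([0.9, 0.2, 0.6], [1, 0, 1]), 3)
print(example_brier)
```

Under these assumed counts the rounded values (0.75, 0.667 and 1.0) agree with the DT row of the LEfSe models in Table 2. A lower Brier score indicates predicted probabilities that lie closer to the observed outcomes.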
Supervised ML models trained with LEfSe and LASSO regression model

To further improve model performance, the differential OTUs identified by LEfSe were screened further using LASSO
regression. Eighteen biomarkers (OTU620, OTU171, OTU459, OTU462, OTU732, OTU844,
OTU745, OTU1111, OTU796, OTU692, OTU1203, OTU852, OTU1110, OTU714, OTU897, OTU1090, OTU618 and
OTU1062) were identified as optimal for the diagnosis of CRC and could potentially serve as non-invasive
tools for the early diagnosis of CRC. Interestingly, after LEfSe and LASSO screening, the AUC of NB
improved from 0.593 to 0.926, and its accuracy, sensitivity and calibration also improved markedly, but
its specificity decreased to 0.667. Similarly, the AUC of RF and LR also improved, and their accuracy and sensitivity
increased. However, the performance measures of DT decreased markedly
(Table 2; Fig. 3a). When the risk threshold was >0.5, the clinical net benefit of the NB, SVM, RF
and LR models was higher than that of the All curve and the None curve, and the range of the clinical net benefit threshold
was large. However, it should be noted that the net clinical benefit of the four models decreased with increasing threshold
(Fig. 3b). In conclusion, OTUs obtained through LASSO screening had better performance in the construction of the NB,
RF, and LR models, which were more conducive to the prediction and identification of CRC subjects.
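For readers unfamiliar with decision curve analysis, the net benefit at a threshold probability p_t is commonly defined as TP/n − (FP/n) × p_t/(1 − p_t), where TP and FP are counted after classifying every subject with predicted probability ≥ p_t as positive. The "All" curve treats every subject as positive and the "None" curve has a net benefit of zero. The Python sketch below illustrates this standard definition with made-up probabilities; it is not the authors' code, and all names and numbers are hypothetical.

```python
def net_benefit(probabilities, outcomes, threshold):
    """Net benefit of a model at a threshold probability (0 < threshold < 1)."""
    n = len(outcomes)
    tp = sum(1 for p, y in zip(probabilities, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probabilities, outcomes) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """'All' reference curve: every subject is classified as positive."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical predicted probabilities and true labels (1 = CRC, 0 = HC).
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 1, 0, 0, 0]

model_nb = net_benefit(probs, labels, threshold=0.5)
all_nb = net_benefit_treat_all(labels, threshold=0.5)
none_nb = 0.0  # 'None' reference curve: nobody is classified as positive
print(round(model_nb, 3), all_nb, none_nb)
```

In this toy example the model's net benefit at a threshold of 0.5 (about 0.333) exceeds both reference strategies (0.0), which is the pattern referred to when a model curve lies above the All and None curves.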
Fig. 1. Analysis of differences in gut microbial abundance between colorectal cancer patients and the healthy control group. (a) Linear discriminant
analysis effect size bar graph showing different bacterial taxa. (b) Cladogram showing the phylogenetic relationships of different bacterial taxa.
Fig. 2. OTU screening by LEfSe analysis. (a) ROC curves showing the test performance of eight different ML algorithm models trained using LEfSe
analysis. (b) Decision curve analysis of CRC diagnosis by eight ML algorithms based on LEfSe analysis. OTU, operational taxonomic unit; LEfSe, linear
discriminant analysis effect size; ROC, receiver operating characteristic; KNN, k-nearest neighbours; DT,
decision tree; NB, naïve Bayes; NN, neural network; RF, random forest; SVM, support vector machines; LR, logistic regression; XGB, extreme gradient
boosting.
Table 2. Summary of performance of algorithms
Feature            Algorithm   Accuracy   AUC     Sensitivity   Specificity   Brier score
LEfSe              KNN         0.333      0.815   0.111         1.000         0.316
                   DT          0.750      0.833   0.667         1.000         0.270
                   NB          0.250      0.593   0.000         1.000         0.417
                   NN          0.417      0.556   0.333         0.667         0.340
                   RF          0.583      0.889   0.444         1.000         0.205
                   SVM         0.583      0.926   0.444         1.000         0.178
                   LR          0.250      0.815   0.222         0.333         0.250
                   XGB         0.500      0.889   0.333         1.000         0.203
LEfSe and LASSO    KNN         0.333      0.907   0.111         1.000         0.267
                   DT          0.417      0.500   0.333         0.667         0.449
                   NB          0.917      0.926   1.000         0.667         0.083
                   NN          0.667      0.778   0.778         0.333         0.273
                   RF          0.750      0.926   0.778         0.667         0.159
                   SVM         0.417      1.000   0.222         1.000         0.121
                   LR          0.750      0.889   0.778         0.667         0.250
                   XGB         0.667      0.593   0.556         1.000         0.302
AUC, area under the receiver operating characteristic curve; DT, decision tree; KNN, k-nearest neighbours; LASSO, the least absolute shrinkage
and selection operator; LEfSe, linear discriminant analysis effect size; LR, logistic regression; NB, naïve Bayes; NN, neural network;
RF, random forest; SMO, sequential minimal optimization; SVM, support vector machines; XGB, extreme gradient boosting.
Identifying the top ten most important OTUs in the RF model

We selected the top ten optimal OTU biomarkers based on their importance scores using the RF model. Based on the LEfSe
results, the bacterial species in the RF model were, in order of importance, OTU732 (s__metagenome_g__Lachnospiraceae_ND3007_group),
OTU163 (s__uncultured_organism_g__Gemella), OTU243 (s__Ralstonia_pickettii), OTU1149
(s__uncultured_organism_g__Fusicatenibacter), OTU620 (s__Escherichia_coli_g__Escherichia-Shigella), OTU171
(s__unclassified_g__Prevotella), OTU1067 (s__unclassified_f__Lachnospiraceae), OTU742 (s__unclassified_g__Lachnospira),
OTU137 (s__Prevotella_intermedia) and OTU616 (s__unclassified_g__Burkholderia-Caballeronia-Paraburkholderia)
(Fig. 4a). After further LASSO regression analysis of the LEfSe results, the bacteria in the RF model, ranked by importance,
were OTU732 (s__metagenome_g__Lachnospiraceae_ND3007_group), OTU171 (s__unclassified_g__Prevotella),
OTU620 (s__Escherichia_coli_g__Escherichia-Shigella), OTU459 (s__unclassified_g__Ruminococcus_torques_group),
OTU462 (s__Haemophilus_parainfluenzae), OTU1111 (s__uncultured_bacterium_g__Family_XIII_AD3011_group),
OTU796 (s__unclassified_g__Bacteroides), OTU844 (s__unclassified_g__norank_f__norank_o__Clostridia_UCG-014),
OTU1062 (s__Dialister_invisus_DSM_15470) and OTU852 (s__unclassified_g__Eubacterium). The differential microbiota
common to both rankings were s__metagenome_g__Lachnospiraceae_ND3007_group, s__Escherichia_coli_g__Escherichia-Shigella
and s__unclassified_g__Prevotella (Fig. 4b). To further investigate whether these gut microbiota can be used as
key microbiota for identification and diagnosis, we compared their ROC curves. The results showed that
s__metagenome_g__Lachnospiraceae_ND3007_group, s__Escherichia_coli_g__Escherichia-Shigella and
s__unclassified_g__Prevotella each had the potential to diagnose CRC (AUC>0.7, P<0.05) (Fig. 4c). This result was
consistent with Fig. 1a: the s__metagenome_g__Lachnospiraceae_ND3007_group was mainly enriched in
healthy controls, while s__Escherichia_coli_g__Escherichia-Shigella and s__unclassified_g__Prevotella were mainly enriched
in CRC. These validations thus provide evidence that gut microbiota biomarkers could be used as effective non-invasive
clinical indicators for distinguishing CRC patients from HCs.
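The AUC of a single taxon can be read as the probability that a randomly chosen CRC sample has a higher abundance of that taxon than a randomly chosen HC sample, with ties counted as one half. The sketch below computes it this way. It is an illustration with made-up relative abundances, not the GraphPad Prism analysis used in the study, and the function name is hypothetical.

```python
def auc_from_scores(case_scores, control_scores):
    """AUC as the probability that a random case scores higher than a random control.

    Ties count as one half (Mann-Whitney U statistic divided by n_cases * n_controls).
    """
    wins = 0.0
    for c in case_scores:
        for h in control_scores:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical relative abundances of a taxon enriched in CRC.
crc = [0.30, 0.22, 0.15, 0.05]
hc = [0.10, 0.04, 0.02]

auc = auc_from_scores(crc, hc)
print(round(auc, 3))
```

For a taxon that is depleted rather than enriched in CRC, the same calculation gives a value below 0.5, and the discriminative ability is then 1 − AUC, which corresponds to swapping the roles of the two groups.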
DISCUSSION
The colon contains the highest density of metabolically active microbiota, and it is becoming increasingly clear that changes
in the composition of the gut microbiota are associated with CRC, and that the microbiome may play a crucial role in the
aetiology of CRC [20, 21]. Many studies have reported an increase or decrease in the abundance of certain gut bacteria in
patients with CRC [22]. Therefore, changes in the gut microbiota may be useful in screening for CRC. In this study, we have
demonstrated that the gut microbiome could be used as a non-invasive diagnostic tool for CRC. Through a comprehensive
evaluation of supervised ML algorithms, we found that different algorithms showed significant differences in diagnostic
performance when using gut microbiota, and the RF model had better overall performance.
Fig. 3. OTU screening by LEfSe and LASSO regression analysis can moderately improve the diagnostic ability of CRC. (a) ROC curves showing the test
performance of eight different ML algorithm models trained by LEfSe and LASSO regression analysis. (b) Decision curve analysis of CRC diagnosis
using eight ML algorithms based on LEfSe and LASSO regression analysis.
Fig. 4. Identified important gut microbiota in the RF model. (a) The top ten OTUs selected by LEfSe analysis. (b) The top ten OTUs selected by LEfSe and
LASSO regression analysis. (c) ROC curve analysis of gut microbiota in CRC.
In microbiome research, the lack of justification in selecting a modelling approach has often been due to the implicit assumption that
the more complex the models, the better. This has led to a tendency to use non-linear models such as RF and deep NN over simpler
models such as LR or other linear models. Although in some cases complex models may capture important non-linear relationships
and therefore yield better predictions, they can also result in black boxes that lack interpretability [23]. Many studies have also proposed
the application of deep learning methods in clinical practice [24, 25]. Deep learning methods hold promise, but microbiome datasets
often suffer from having many features and small sample sizes, which result in overfitting.
In terms of predictive performance, the RF model showed moderate superiority in both LEfSe and LASSO classifiers, with
an AUC of 0.858 and 0.926, respectively. Consistent with our result, Feng et al. [26] also reported the good predictive
performance of the RF algorithm for CRC (AUC=0.96). However, they did not report sensitivity or specificity. In our
study, we found, through the LASSO regression analysis based on LEfSe results, that the RF model not only had a satisfactory
AUC value, but also had high sensitivity, accuracy and specificity. Another important finding was that the bacteria
s__metagenome_g__Lachnospiraceae_ND3007_group, s__Escherichia_coli_g__Escherichia-Shigella and s__unclassified_g__Prevotella
could be used as key microbiota for CRC identification and diagnosis in the RF model. LEfSe analysis showed
that s__Escherichia_coli_g__Escherichia-Shigella and s__unclassified_g__Prevotella were highly abundant in CRC, while
the s__metagenome_g__Lachnospiraceae_ND3007_group was the dominant bacteria in HCs, which was consistent with
other studies [27–30].
Escherichia–Shigella are the most representative bacteria and can be considered a potential biomarker for malignancy
progression [31]. The role of these bacteria in CRC progression remains to be evaluated, being considered both as a driver
and as a passenger [32, 33]. We noted that the microbiota of CRC patients contained higher levels of bacteria that have
traditionally been considered oral pathogens, such as Prevotella. Periodontal pathogens have been shown to promote the
progression of oral cancer, so these taxa may influence the progression of CRC through a similar mechanism. Because the
structure of the oral microbiota was related to the structure of the gut, changes in the oral microbiota may be a potential
proxy for ongoing or future changes in the gut microbiota [34, 35]. In addition, Prevotella was highly enriched in proximal
colon cancer that appeared to be linked to elevated IL17-producing cells in the mucosa of CRC patients [5]. As highlighted
above, the abundance of pathogenic bacterial populations is significantly correlated with the occurrence of CRC. However,
the depletion of potentially protective bacteria observed in this study may play a similar role in the pathology of CRC, as previous
studies found that OTUs associated with the family Lachnospiraceae were significantly depleted in CRC individuals. The family
Lachnospiraceae produces short-chain fatty acids (SCFAs). SCFAs are important microbial metabolites that provide nutrients
for the colon and that maintain homeostasis, and have been shown to have significant antitumour properties. The deficit in
the number of butyrate-producing bacteria can have detrimental consequences in the progression of the disease [36–38].
Similarly, Baxter et al. [34] observed a reduction in potentially butyric acid-producing Lachnospiraceae in patients with CRC.
These findings indicate that ML approaches can improve the accuracy and convenience of diagnosing CRC, and gut
microbial markers show promising potential as non-invasive tools for the diagnosis of CRC. However, the deployment of
microbial-based models for clinical diagnosis or prediction is a more challenging and unique undertaking [39]. Similarly,
our study also has the following limitations. First, we lacked preoperative faecal samples from patients with CRC. We
acknowledge that surgery has a certain impact on gut microbiota changes in CRC, and collecting postoperative samples may
introduce potential confounding factors. While some studies [40, 41] have shown changes in gut microbiota before and after
CRC surgery, another study [42] found no difference in gut microbiota before and after surgery, and there is currently no
consensus. In addition, chemotherapy or radiation therapy may be the main cause of postoperative changes in gut microbiota
in CRC patients [43, 44]. To minimize the impact of confounding factors, the present study only included postoperative
CRC patients who did not receive chemotherapy or radiation therapy. However, the reliability of these findings is limited
due to the lack of preoperative samples. We hope that future studies can strictly control potential confounding factors,
evaluate changes in gut microbiota before and after surgery more clearly, further expand the sample size, and improve the
reliability of the research. Second, our sample size is small, and it remains to be verified whether the best modelling
parameters of this study can be applied to the identification and diagnosis of a wider range of samples. We also need a separate
validation cohort to test the performance of the diagnostic model. Meanwhile, we should also consider using multi-omics
technologies to further explore the predictive ability of the microbiota in CRC in future studies. Third, our study did not
include adenoma samples, which are important for studying the development of CRC. Finally, due to the limitations of
detection technology, short research time and lack of depth, many studies still have a limited understanding of the potential
pathogenic mechanisms of the pathogens, and the theoretical basis for clinical application needs further verification through
experimentation. On the other hand, our study indicated that pathogenic bacteria have great potential as antagonistic and
predictive targets in the diagnosis and treatment of CRC. While ML approaches are still evolving, they have already shown
promising capabilities, both for early diagnosis and for identifying underlying pathogenesis that can guide mechanistic
studies. Meanwhile, microbiome-based biomarkers have irreplaceable advantages and need to be continuously explored to
accelerate the application of precision medicine.
CONCLUSION
Taken together, our results suggest that CRC is associated with dysbiosis of the gut microbiota, and that the dysregulated
microbiota could be used for disease diagnosis. In the model, s__metagenome_g__Lachnospiraceae_ND3007_group,
s__Escherichia_coli_g__Escherichia-Shigella and s__unclassified_g__Prevotella were identified as key microbiota for CRC diagnosis.
These findings, once confirmed in larger cohorts of patients, may represent an important step towards the development
of more effective diagnostic strategies.
Funding information
This work was supported by the National Natural Science Foundation of China (NSFC, 82060366, 82273694, 82160385) and the Guangxi Natural
Science Foundation (2018GXNSFAA050099).
Author contributions
Methodology, J.Z.; Validation, P.C.; Investigation, T.L.; Resources, H.L.; Data Curation, T.Z.; Writing – Original Draft Preparation, F.L.; Writing – Review and
Editing, L.Y.; Supervision, J.H.; Project Administration, H.C.
Conflicts of interest
The authors declare no conflicts of interest in this work.
Ethical statement
All procedures performed in studies involving human participants followed the ethical standards of the Medical Ethics Committee of Guangxi Medical
University (no. 20200095).
References

1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2021;71:209–249.
2. Collins JF, Lieberman DA, Durbin TE, Weiss DG. Accuracy of screening for fecal occult blood on a single stool sample obtained by digital rectal examination: A comparison with recommended sampling practice. Ann Intern Med 2005;142:81–85.
3. Japanese Society for Cancer of the Colon and Rectum. Japanese classification of colorectal, appendiceal, and anal carcinoma: The 3d English Edition [Secondary Publication]. J Anus Rectum Colon 2019;3:175–195.
4. Ahn J, Sinha R, Pei Z, Dominianni C, Wu J, et al. Human gut microbiome and risk for colorectal cancer. J Natl Cancer Inst 2013;105:1907–1911.
5. Sobhani I, Tap J, Roudot-Thoraval F, Roperch JP, Letulle S, et al. Microbial dysbiosis in colorectal cancer (CRC) patients. PLoS One 2011;6:e16393.
6. Zackular JP, Baxter NT, Iverson KD, Sadler WD, Petrosino JF, et al. The gut microbiome modulates colon tumorigenesis. mBio 2013;4:e00692–13.
7. Kostic AD, Chun E, Robertson L, Glickman JN, Gallini CA, et al. Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates the tumor-immune microenvironment. Cell Host Microbe 2013;14:207–215.
8. Long X, Wong CC, Tong L, Chu ESH, Ho Szeto C, et al. Peptostreptococcus anaerobius promotes colorectal carcinogenesis and modulates tumour immunity. Nat Microbiol 2019;4:2319–2330.
9. Ghorbani M, Taylor SJE, Pook MA, Payne A. Comparative (computational) analysis of the DNA methylation status of trinucleotide repeat expansion diseases. J Nucleic Acids 2013;2013:689798.
10. Lebedev AV, Westman E, Van Westen GJP, Kramberger MG, Lundervold A, et al. Random forest ensembles for detection and prediction of Alzheimer’s disease with a good between-cohort robustness. Neuroimage Clin 2014;6:115–125.
11. Ruffle JK, Farmer AD, Aziz Q. Artificial intelligence-assisted gastroenterology - promises and pitfalls. Am J Gastroenterol 2019;114:422–428.
12. Lui TKL, Guo CG, Leung WK. Accuracy of artificial intelligence on histology prediction and detection of colorectal polyps: A systematic review and meta-analysis. Gastrointest Endosc 2020;92:11–22.e6.
13. Bera K, Schalper KA, Rimm DL, Velcheti V, Madabhushi A. Artificial intelligence in digital pathology - new tools for diagnosis and precision oncology. Nat Rev Clin Oncol 2019;16:703–715.
14. Cao Y, Wang Z, Yan Y, Ji L, He J, et al. Enterotoxigenic Bacteroides fragilis promotes intestinal inflammation and malignancy by inhibiting exosome-packaged miR-149-3p. Gastroenterology 2021;161:1552–1566.e12.
15. Chung L, Thiele Orberg E, Geis AL, Chan JL, Fu K, et al. Bacteroides fragilis toxin coordinates a pro-carcinogenic inflammatory cascade via targeting of colonic epithelial cells. Cell Host Microbe 2018;23:203–214.e5.
16. Yachida S, Mizutani S, Shiroma H, Shiba S, Nakajima T, et al. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer. Nat Med 2019;25:968–976.
17. Guo S, Li L, Xu B, Li M, Zeng Q, et al. A simple and novel fecal biomarker for colorectal cancer: Ratio of Fusobacterium nucleatum to probiotics populations, based on their antagonistic effect. Clin Chem 2018;64:1327–1337.
18. Zeller G, Tap J, Voigt AY, Sunagawa S, Kultima JR, et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol Syst Biol 2014;10:766.
19. Ai L, Tian H, Chen Z, Chen H, Xu J, et al. Systematic evaluation of supervised classifiers for fecal microbiota-based prediction of colorectal cancer. Oncotarget 2017;8:9546–9556.
20. Wong SH, Yu J. Gut microbiota in colorectal cancer: mechanisms of action and clinical applications. Nat Rev Gastroenterol Hepatol 2019;16:690–704.
21. Sears CL, Garrett WS. Microbes, microbiota, and colon cancer. Cell Host Microbe 2014;15:317–328.
22. Wirbel J, Pyl PT, Kartal E, Zych K, Kashani A, et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat Med 2019;25:679–689.
23. Topçuoğlu BD, Lesniak NA, Ruffin MT IV, Wiens J, Schloss PD, et al. A framework for effective application of machine learning to microbiome-based classification problems. mBio 2020;11.
24. Kim M, Oh I, Ahn J. An improved method for prediction of cancer prognosis by network learning. Genes 2018;9:478.
25. Kong Y, Yu T, Wren J. A graph-embedded deep feedforward network for disease outcome classification and feature selection using gene expression data. Bioinformatics 2018;34:3727–3737.
26. Feng Q, Liang S, Jia H, Stadlmayr A, Tang L, et al. Gut microbiome development along the colorectal adenoma–carcinoma sequence. Nat Commun 2015;6:6528.
27. Zhang C, Hu A, Li J, Zhang F, Zhong P, et al. Combined non-invasive prediction and new biomarkers of oral and fecal microbiota in patients with gastric and colorectal cancer. Front Cell Infect Microbiol 2022;12:830684.
28. Wang T, Cai G, Qiu Y, Fei N, Zhang M, et al. Structural segregation of gut microbiota between colorectal cancer patients and healthy volunteers. ISME J 2012;6:320–329.
29. Tortora SC, Bodiwala VM, Quinn A, Martello LA, Vignesh S. Microbiome and colorectal carcinogenesis: Linked mechanisms and racial differences. World J Gastrointest Oncol 2022;14:375–395.
30. Li Z, Li Z, Zhu L, Dai N, Sun G, et al. Effects of xylo-oligosaccharide on the gut microbiota of patients with ulcerative colitis in clinical remission. Front Nutr 2021;8:778542.
31. Mori G, Rampelli S, Orena BS, Rengucci C, De Maio G, et al. Shifts of faecal microbiota during sporadic colorectal carcinogenesis. Sci Rep 2018;8:10329.
32. Tjalsma H, Boleij A, Marchesi JR, Dutilh BE. A bacterial driver–passenger model for colorectal cancer: Beyond the usual suspects. Nat Rev Microbiol 2012;10:575–582.
33. Tomkovich S, Yang Y, Winglee K, Gauthier J, Mühlbauer M, et al. Locoregional effects of microbiota in a preclinical model of colon carcinogenesis. Cancer Res 2017;77:2620–2632.
34. Baxter NT, Ruffin MT IV, Rogers MAM, Schloss PD. Microbiota-based model improves the sensitivity of fecal immunochemical test for detecting colonic lesions. Genome Med 2016;8:37.
35. Ding T, Schloss PD. Dynamics and associations of microbial community types across the human body. Nature 2014;509:357–360.
36. Zackular JP, Rogers MAM, Ruffin MT IV, Schloss PD. The human gut microbiome as a screening tool for colorectal cancer. Cancer Prev Res (Phila) 2014;7:1112–1121.
37. Pryde SE, Duncan SH, Hold GL, Stewart CS, Flint HJ. The microbiology of butyrate formation in the human colon. FEMS Microbiol Lett 2002;217:133–139.
38. D’Argenio G, Cosenza V, Delle Cave M, Iovino P, Delle Valle N, et al. Butyrate enemas in experimental colitis and protection against large bowel cancer in a rat model. Gastroenterology 1996;110:1727–1734.
39. Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, et al. Do no harm: a roadmap for responsible machine learning for health care. Nat Med 2019;25:1337–1340.
40. Jin Y, Liu Y, Zhao L, Zhao F, Feng J, et al. Gut microbiota in patients after surgical treatment for colorectal cancer. Environ Microbiol 2019;21:772–783.
41. Sze MA, Baxter NT, Ruffin MT IV, Rogers MAM, Schloss PD. Normalization of the microbiota in patients after treatment for colonic lesions. Microbiome 2017;5:150.
42. Huang R, He K, Duan X, Xiao J, Wang H, et al. Changes of intestinal microflora in colorectal cancer patients after surgical resection and chemotherapy. Comput Math Methods Med 2022;2022:1940846.
43. Yixia Y, Sripetchwandee J, Chattipakorn N, Chattipakorn SC. The alterations of microbiota and pathological conditions in the gut of patients with colorectal cancer undergoing chemotherapy. Anaerobe 2021;68:102361.
44. Shuwen H, Xi Y, Yuefen P, Jiamin X, Quan Q, et al. Effects of postoperative adjuvant chemotherapy and palliative chemotherapy on the gut microbiome in colorectal cancer. Microb Pathog 2020;149:104343.

Received 19 December 2022; Accepted 06 April 2023; Published 07 June 2023

001699 © 2023 The Authors


Table 1. Basic information about the study population

                        CRC (N = 21)     HC (N = 17)      P-value
Age, year (mean, SD)    60.43 (10.03)    55.94 (17.04)    0.147
BMI (mean, SD)          21.69 (2.98)     22.53 (2.79)     0.381
Gender (n, %)                                             0.275
Men                     19 (90.48 %)     14 (82.35 %)
Women                   2 (9.52 %)       3 (17.65 %)
Han                     15 (71.43 %)     13 (76.47 %)
Other                   6 (28.57 %)      4 (23.53 %)


Table 2. Summary of performance of algorithms

Feature            Algorithm   Accuracy   AUC     Sensitivity   Specificity   Brier score
LEfSe              KNN         0.333      0.815   0.111         1.000         0.316
                   DT          0.750      0.833   0.667         1.000         0.270
                   NB          0.250      0.593   0.000         1.000         0.417
                   NN          0.417      0.556   0.333         0.667         0.340
                   RF          0.583      0.889   0.444         1.000         0.205
                   SVM         0.583      0.926   0.444         1.000         0.178
                   LR          0.250      0.815   0.222         0.333         0.250
                   XGB         0.500      0.889   0.333         1.000         0.203
LEfSe and LASSO    KNN         0.333      0.907   0.111         1.000         0.267
                   DT          0.417      0.500   0.333         0.667         0.449
                   NB          0.917      0.926   1.000         0.667         0.083
                   NN          0.667      0.778   0.778         0.333         0.273
                   RF          0.750      0.926   0.778         0.667         0.159
                   SVM         0.417      1.000   0.222         1.000         0.121
                   LR          0.750      0.889   0.778         0.667         0.250
                   XGB         0.667      0.593   0.556         1.000         0.302
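
As an illustration of how the columns of Table 2 relate to one another, the following Python sketch computes accuracy, sensitivity, specificity and the Brier score from true labels and predicted probabilities. It is not the authors' code: the function name `binary_metrics`, the 0.5 decision threshold, the coding 1 = CRC and 0 = healthy control, and the 12-sample example data are all assumptions made for illustration. AUC is left out to keep the sketch short.

```python
import math
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off; the source does not state one
) -> Dict[str, float]:
    """Return accuracy, sensitivity, specificity and Brier score.

    y_true: 1 = CRC, 0 = healthy control (illustrative coding).
    y_prob: predicted probability of the positive (CRC) class.
    """
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and equally long")

    # Turn probabilities into hard class calls at the chosen threshold.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = len(y_true)

    return {
        "accuracy": (tp + tn) / n,
        # Sensitivity: fraction of true CRC samples called CRC.
        "sensitivity": tp / (tp + fn) if (tp + fn) else math.nan,
        # Specificity: fraction of true controls called control.
        "specificity": tn / (tn + fp) if (tn + fp) else math.nan,
        # Brier score: mean squared error of the predicted probabilities.
        "brier": sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / n,
    }


if __name__ == "__main__":
    # Hypothetical test set: 9 CRC and 3 controls, one control misclassified.
    labels = [1] * 9 + [0] * 3
    probs = [0.9] * 9 + [0.1, 0.2, 0.8]
    for name, value in binary_metrics(labels, probs).items():
        print(f"{name}: {value:.3f}")
```

With this hypothetical split of nine CRC and three control samples, one misclassified control gives accuracy 11/12 ≈ 0.917, sensitivity 1.000 and specificity 0.667, the same pattern as the NB row under LEfSe and LASSO. The actual composition of the test set is not stated in this section, so the split is a guess that happens to be consistent with the reported values.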
