Abstract
Introduction. Increasing evidence suggests a correlation between gut microbiota and colorectal cancer (CRC).
Hypothesis/Gap Statement. However, few studies have used gut microbiota as a diagnostic biomarker for CRC.
Aim. The objective of this study was to explore whether a machine learning (ML) model based on gut microbiota could be used
to diagnose CRC and identify key biomarkers in the model.
Methodology. We sequenced the 16S rRNA gene from faecal samples of 38 participants, including 17 healthy subjects and 21
CRC patients. Eight supervised ML algorithms were used to diagnose CRC based on faecal microbiota operational taxonomic
units (OTUs), and the models were evaluated in terms of identification, calibration and clinical practicality for optimal modelling
parameters. Finally, the key gut microbiota was identified using the random forest (RF) algorithm.
Results. We found that CRC was associated with the dysregulation of gut microbiota. Through a comprehensive evaluation of
supervised ML algorithms, we found that different algorithms had significantly different prediction performance using faecal
microbiomes. Different data screening methods played an important role in optimization of the prediction models. We found
that naïve Bayes algorithms [NB, accuracy=0.917, area under the curve (AUC)=0.926], RF (accuracy=0.750, AUC=0.926) and
logistic regression (LR, accuracy=0.750, AUC=0.889) had high predictive potential for CRC. Furthermore, important features
in the model, namely s__metagenome_g__Lachnospiraceae_ND3007_group (AUC=0.814), s__Escherichia_coli_g__Escherichia-
Shigella (AUC=0.784) and s__unclassified_g__Prevotella (AUC=0.750), could each be used as diagnostic biomarkers of CRC.
Conclusions. Our results suggested an association between gut microbiota dysregulation and CRC, and demonstrated the feasibility
of using the gut microbiota to diagnose CRC. The bacteria s__metagenome_g__Lachnospiraceae_ND3007_group, s__Escherichia_coli_g__Escherichia-Shigella
and s__unclassified_g__Prevotella were key biomarkers for CRC.
INTRODUCTION
The high incidence and mortality of colorectal cancer (CRC) make it one of the most concerning diseases in the world. According to
the Global Cancer Data 2020 report, there were 19.3 million new cancer cases worldwide, and the overall incidence of CRC rose from
fourth place in 2018 to third place [1]. Because effective drugs for CRC are still being developed and the only effective measures are
early detection and surgical removal of CRC, many countries recommend universal screening and prevention programmes. Currently,
one of the most widely used non-invasive screening procedures is the faecal occult blood test (FOBT), which can indicate the presence
of advanced adenomas and carcinomas in the colon by detecting blood in the stool [2]. However, because the FOBT has limited
sensitivity and specificity for CRC and does not reliably detect precancerous lesions, there is a need to develop a new non-invasive,
simple and effective CRC screening test [3].
The gut microbiota is a collection of microorganisms living in the gastrointestinal tract and is a potential source of biomarkers for
detecting colonic lesions. In human studies, patients with CRC have an abnormal gut microbiome structure when compared with
healthy individuals [4, 5]. Experiments in animal models have also shown that such alterations have the potential to accelerate tumorigenesis
[6]. Thus, the detection of these pathogenic bacteria in gut microbiota could be a promising method for CRC screening.
Although some members of the gut microbiota have been shown to contribute to the onset and progression of CRC through various
mechanisms, they are not present in all cases [4, 7, 8]. It is unclear how many cases of CRC can be attributed to these pathogens, and
whether changes in microbial abundance could provide the basis for an accurate CRC screening test.
Machine learning (ML), a major branch of artificial intelligence (AI), can be used to increase our understanding of changes in existing
data structures and to make predictions about new data. It has been used in a wide variety of studies, such as DNA methylation
associated with genetic diseases [9], the diagnosis of Alzheimer’s disease using imaging data [10], the prediction of gastrointestinal
disease development using continuous variable fitting techniques [11], and the automatic detection of gastrointestinal lesions by
computer vision in endoscopes [12]. ML algorithms and new computational models offer the opportunity to generate computational
drug networks to assess the efficacy of approved drugs relative to relevant oncogenic targets, as well as to select patients with better
responses or better disease biomarkers [10]. In the field of digital pathology, the emergence of AI and ML tools makes it possible to
mine new morphological phenotypes and improve patient management for a variety of cancer types [13]. ML enables computer programs
to automatically analyse large amounts of data and determine which information is most relevant.
At present, several studies have identified and elucidated the pathogenicity of certain intestinal microorganisms. For example, enterotoxigenic
Bacteroides fragilis is a typical pathogen that causes CRC by upregulating inflammatory factors, releasing reactive oxygen
species, inducing intestinal inflammation, and promoting the formation of polyps and tumours [14, 15]. In the study by Yachida et
al. [16], principal component analysis (PCA) was used to select Bacteroides and Prevotella, two types of bacteria with the greatest
variation in abundance, from the faeces of CRC patients and a healthy control (HC) group. Both bacteria are major contributors to
the gut flora of CRC patients. Guo et al. [17] reported a highly accurate CRC diagnostic model that combined the
quantitative PCR (qPCR) abundance of three gut bacteria: Fusobacterium nucleatum, Faecalibacterium prausnitzii and
Bifidobacterium spp. In another study, a metagenomic classifier showed that Fusobacterium, Porphyromonas and Peptostreptococcus
were all enriched in CRC patients [18]. Moreover, some bacteria whose abundance differs significantly between CRC patients and normal
controls are recognized by many different algorithms and used as key parameters for prediction. For example, in the Chinese
population, Methanosphaera_stadtmanae_DSM_3091 was identified and used by filtered classifier, sequential minimal optimization
(SMO), logistic and naïve Bayes models as key parameters. Another dominant bacterium, Blautia_uncultured_Firmicutes_bacterium,
was taken by the random tree, J48 and PART algorithms as key parameters [19].
Although there is increasing recognition of the potential of the faecal microbiome in the detection of CRC, the choice of classification
models is diverse. Due to the nature of the algorithms themselves, each algorithm has its default parameters, so it is unclear which
modelling algorithm is more suitable for CRC diagnostic screening studies. In this study, we systematically evaluated the performance
of the supervised classifiers to diagnose CRC based on gut microbiota. We recruited 38 participants and sequenced the hypervariable
regions of the 16S rRNA gene from the faeces of each individual, used different supervised ML algorithms to test their performance
in the diagnosis of CRC based on gut microbiota, and identified several potential bacteria associated with the dysbiosis of CRC.
METHODS
Participants and sample collection
Patients and healthy volunteers were recruited from the First Affiliated Hospital of Guangxi Medical University (Guangxi, China)
between August 2020 and February 2021. The inclusion criteria for the CRC group were as follows: (1) tumour site was clear
and biopsy-confirmed; (2) no radiation or chemotherapy before sampling; (3) no antibiotics or probiotics within 1 month;
and (4) complete case data were available. Healthy volunteers matched to the patients for age and gender were recruited as the HC
group at the Guangxi Medical University Center for Physical Examination. The inclusion criteria for HCs were as follows: (1)
no gastrointestinal-related diseases; and (2) no antibiotics or probiotics within 1 month. The exclusion criteria for both groups
were as follows: (1) have diseases related to intestinal flora, such as inflammatory bowel disease, diabetes, peptic ulcers, etc.; (2)
pregnant or lactating women; and (3) a family history of bowel disease, such as familial adenomatous polyposis. All the volunteers
understood and signed informed consent before inclusion in the group. This study was approved by the Ethics Committee of
Guangxi Medical University.
Faecal samples were collected from HCs and from patients after CRC surgery, placed in sterile boxes on ice and transported immediately
to the laboratory. Each sample was evenly divided into sterile tubes and immediately frozen at −80 °C.
Lu et al., Journal of Medical Microbiology 2023;72:001699
Data analysis
Trimmomatic software was used for quality control of the original sequence, and FLASH software was used for splicing. We used the
UPARSE 7.1 software (http://drive5.com/uparse/) to perform OTU clustering on the sequences, only clustering sequences with at least
97 % identical nucleotides into OTUs. Chimeras were removed during clustering using the UCHIME software (Table S1, available
in the online Supplementary Material). Each sequence was annotated for species classification using the RDP classifier
(http://rdp.cme.msu.edu/) and compared with the Silva 16S rRNA database (Release 138, http://www.arb-silva.de). The comparison threshold
was set to 70 %.
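The 97 % identity cut-off used for OTU clustering can be illustrated with a minimal Python sketch. This is not UPARSE, which uses its own greedy clustering and chimera filtering; the sketch only shows what the threshold means for two reads that are assumed to be already aligned to the same length. The function names and the toy sequences are hypothetical and are not taken from the study.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of matching positions between two pre-aligned, equal-length reads."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


def same_otu(seq_a: str, seq_b: str, threshold: float = 97.0) -> bool:
    """True when two reads meet the identity threshold (97 % in the text above)."""
    return percent_identity(seq_a, seq_b) >= threshold


# Toy 100-nt reads, made up purely for illustration (not study data).
reference = "A" * 100
close_read = "C" * 2 + "A" * 98    # 2 mismatches: 98 % identity
distant_read = "C" * 4 + "A" * 96  # 4 mismatches: 96 % identity

print(same_otu(reference, close_read))    # meets the 97 % threshold
print(same_otu(reference, distant_read))  # falls below the 97 % threshold
```

A real pipeline would compute identity from an alignment that allows gaps, so this position-by-position comparison should be read only as a statement of the threshold.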
Supervised ML modelling
All ML analyses were performed using R (v.4.2.2). Linear discriminant analysis effect size (LEfSe, LDA>2, P<0.05) based on the
microeco package was used to identify OTUs that differed between the CRC and HC groups, and these OTUs were then used to build the ML models. Eight
different supervised ML algorithms in the caret package were used to train the microbiota OTU table: k-nearest neighbours (KNN),
decision tree (DT), naïve Bayes (NB), neural network (NN), random forest (RF), support vector machines (SVM), logistic regression
(LR) and extreme gradient boosting (XGB). Data were assigned into training (70%) and testing (30%) datasets after the whole dataset
was shuffled. The training performance of each ML model was evaluated by tenfold cross-validation, and the process was repeated
ten times to obtain the optimal modelling parameters. To further improve model performance, we used least absolute shrinkage and
selection operator (LASSO) regression from the glmnet package to further screen differential OTUs identified by LEfSe
before modelling. All models were evaluated from three perspectives: identification [accuracy, area under the receiver operating
characteristic curve (AUC), sensitivity, specificity], calibration (Brier score) and clinical practicability [decision curve analysis (DCA)].
The RF algorithm was used to identify the top ten most important OTUs in the model. Further, the resulting AUCs were plotted using
GraphPad Prism 9.0 to identify potential microbial biomarkers for CRC. Model parameters and code are available in Supplementary
File 1.
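The three evaluation perspectives named above (identification, calibration and clinical practicability) reduce to a few short formulas. The following Python sketch is an illustration under stated assumptions, not the authors' code, which was written in R with the caret package and is provided in Supplementary File 1. The function names, the 0.5 classification cut-off and the toy labels and probabilities are all hypothetical. The net benefit formula is the standard one used in decision curve analysis, TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels where 1 = CRC and 0 = HC."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def identification_metrics(y_true, y_prob, cutoff=0.5):
    """Accuracy, sensitivity and specificity at an assumed probability cut-off."""
    y_pred = [1 if p >= cutoff else 0 for p in y_prob]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true), tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_prob):
    """AUC as the chance that a random case outranks a random control (ties count 0.5)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Calibration: mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Decision curve analysis: net benefit at one threshold probability."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp, _, fp, _ = confusion_counts(y_true, y_pred)
    n = len(y_true)
    return tp / n - fp / n * threshold / (1.0 - threshold)


# Toy test-set labels and predicted probabilities (made up, not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

print(identification_metrics(y_true, y_prob))
print(auc(y_true, y_prob), brier_score(y_true, y_prob), net_benefit(y_true, y_prob, 0.5))
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing the result with the "treat all" and "treat none" strategies. The same `auc` function applied to the abundance of a single OTU gives the kind of per-biomarker AUC reported for the key taxa.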
Statistical analysis
Clinical data were analysed using SPSS Statistics v26.0. Quantitative data were presented as mean and standard deviation (sd), and
the differences in age and body mass index (BMI) between the CRC and HC group were compared using t-tests. Qualitative data were
presented as percentages, and differences in gender and race between different groups were compared using Chi-square tests. All data
with two-sided P<0.05 were considered statistically significant.
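For a categorical variable such as gender, the Chi-square comparison between two groups can be sketched in a few lines of Python. This is a hedged illustration rather than the SPSS procedure used in the study: it computes the Pearson statistic without continuity correction for a 2×2 table and uses the fact that, for one degree of freedom, the upper-tail probability equals erfc(√(χ²/2)). The function name and the counts are hypothetical and are not study data.

```python
import math
from typing import Sequence, Tuple


def chi_square_2x2(table: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Pearson Chi-square statistic and two-sided P value for a 2x2 table (1 d.f.)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 d.f. the Chi-square variable is a squared standard normal,
    # so P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: rows are groups, columns are categories (not study data).
toy_table = [[30, 20], [25, 25]]
chi2, p_value = chi_square_2x2(toy_table)
print(chi2, p_value, p_value < 0.05)
```

With small expected counts, as can occur in a cohort of 38 participants, a continuity correction or an exact test may be more appropriate than the uncorrected statistic shown here.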
RESULTS
A total of 38 subjects were included in this study: 17 healthy individuals in the control group and 21 individuals in the CRC group.
There were no significant differences in gender, age, BMI or race between the two groups (Table 1).
Table 1 (fragment; only the following rows are recoverable)
Race: P=0.342
Clinical stage (CRC, n; HC, –): Ⅰ, 4; Ⅱ, 3; Ⅲ, 8; Ⅳ, 6
BMI, body mass index; CRC, colorectal cancer; HC, healthy control.
Fig. 1. Analysis of differences in gut microbial abundance between colorectal cancer patients and the healthy control group. (a) Linear discriminant
analysis effect size bar graph showing different bacterial taxa. (b) Cladogram showing the phylogenetic relationships of different bacterial taxa.
Fig. 2. OTU screening by LEfSe analysis. (a) ROC curves showing the test performance of eight different ML algorithm models trained using OTUs screened by LEfSe
analysis. (b) Decision curve analysis of CRC diagnosis by eight ML algorithms based on LEfSe analysis. OTU, operational taxonomic unit; LEfSe, linear
discriminant analysis effect size; ROC, receiver operating characteristic; KNN, k-nearest neighbours; DT,
decision tree; NB, naïve Bayes; NN, neural network; RF, random forest; SVM, support vector machines; LR, logistic regression; XGB, extreme gradient
boosting.
AUC, area under the receiver operating characteristic curve; DT, decision tree; KNN, k-nearest neighbours; LASSO, the least absolute shrinkage
and selection operator; LEfSe, linear discriminant analysis effect size; LR, logistic regression; NB, naïve Bayes; NN, neural network;
RF, random forest; SMO, sequential minimal optimization; SVM, support vector machines; XGB, extreme gradient boosting.
DISCUSSION
The colon contains the highest density of metabolically active microbiota, and it is becoming increasingly clear that changes
in the composition of the gut microbiota are associated with CRC, and that the microbiome may play a crucial role in the
aetiology of CRC [20, 21]. Many studies have reported an increase or decrease in the abundance of certain gut bacteria in
patients with CRC [22]. Therefore, changes in the gut microbiota may be useful in screening for CRC. In this study, we have
demonstrated that the gut microbiome could be used as a non-invasive diagnostic tool for CRC. Through a comprehensive
evaluation of supervised ML algorithms, we found that different algorithms showed significant differences in diagnostic
performance when using gut microbiota, and the RF model had better overall performance.
Fig. 3. OTU screening by LEfSe and LASSO regression analysis can moderately improve the diagnostic ability of CRC. (a) ROC curves showing the test
performance of eight different ML algorithm models trained by LEfSe and LASSO regression analysis. (b) Decision curve analysis of CRC diagnosis
using eight ML algorithms based on LEfSe and LASSO regression analysis.
Fig. 4. Identified important gut microbiota in the RF model. (a) The top ten OTUs selected by LEfSe analysis. (b) The top ten OTUs selected by LEfSe and
LASSO regression analysis. (c) ROC curve analysis of gut microbiota in CRC.
In microbiome research, the lack of justification in selecting a modelling approach has often been due to the implicit assumption that
the more complex the models, the better. This has led to a tendency to use non-linear models such as RF and deep NN over simpler
models such as LR or other linear models. Although in some cases complex models may capture important non-linear relationships
and therefore yield better predictions, they can also result in black boxes that lack interpretability [23]. Many studies have also proposed
the application of deep learning methods in clinical practice [24, 25]. Deep learning methods hold promise, but microbiome datasets
often suffer from having many features and small sample sizes, which result in overfitting.
In terms of predictive performance, the RF model showed moderate superiority with both LEfSe screening and LEfSe plus LASSO screening, with
AUCs of 0.858 and 0.926, respectively. Consistent with our result, Feng et al. [26] also reported the good predictive
performance of the RF algorithm for CRC (AUC=0.96). However, they did not report sensitivity or specificity. In our
study, we found, through the LASSO regression analysis based on LEfSe results, that the RF model not only had a satisfactory
AUC value, but also had high sensitivity, accuracy and specificity. Another important finding was that the bacteria
s__metagenome_g__Lachnospiraceae_ND3007_group, s__Escherichia_coli_g__Escherichia-Shigella and s__unclassified_g__Prevotella
could be used as key microbiota for CRC identification and diagnosis in the RF model. LEfSe analysis showed
that s__Escherichia_coli_g__Escherichia-Shigella and s__unclassified_g__Prevotella were highly abundant in CRC, while
the s__metagenome_g__Lachnospiraceae_ND3007_group was the dominant bacteria in HCs, which was consistent with
other studies [27–30].
Escherichia–Shigella are the most representative bacteria and can be considered a potential biomarker for malignancy
progression [31]. The role of these bacteria in CRC progression remains to be evaluated, being considered both as a driver
and as a passenger [32, 33]. We noted that the microbiota of CRC patients contained higher levels of bacteria that have
traditionally been considered oral pathogens, such as Prevotella. Periodontal pathogens have been shown to promote the
progression of oral cancer, so these taxa may influence the progression of CRC through a similar mechanism. Because the
structure of the oral microbiota was related to the structure of the gut, changes in the oral microbiota may be a potential
proxy for ongoing or future changes in the gut microbiota [34, 35]. In addition, Prevotella was highly enriched in proximal
colon cancer that appeared to be linked to elevated IL17-producing cells in the mucosa of CRC patients [5]. As highlighted
above, the abundance of pathogenic bacterial populations is significantly correlated with the occurrence of CRC. However,
the depletion of potentially protective bacteria observed in this study may play a similarly important role in the pathology of CRC, as previous studies
found that OTUs associated with the family Lachnospiraceae were significantly depleted in individuals with CRC. The family
Lachnospiraceae produces short-chain fatty acids (SCFAs). SCFAs are important microbial metabolites that provide nutrients
for the colon and that maintain homeostasis, and have been shown to have significant antitumour properties. The deficit in
the number of butyrate-producing bacteria can have detrimental consequences in the progression of the disease [36–38].
Similarly, Baxter et al. [34] observed a reduction in potentially butyric acid-producing Lachnospiraceae in patients with CRC.
These findings indicate that ML approaches can improve the accuracy and convenience of diagnosing CRC, and gut
microbial markers show promising potential as non-invasive tools for the diagnosis of CRC. However, the deployment of
microbial-based models for clinical diagnosis or prediction is a more challenging and unique undertaking [39]. Similarly,
our study also has the following limitations. First, we lacked preoperative faecal samples from patients with CRC. We
acknowledge that surgery has a certain impact on gut microbiota changes in CRC, and collecting postoperative samples may
introduce potential confounding factors. While some studies [40, 41] have shown changes in gut microbiota before and after
CRC surgery, another study [42] found no difference in gut microbiota before and after surgery, and there is currently no
consensus. In addition, chemotherapy or radiation therapy may be the main cause of postoperative changes in gut microbiota
in CRC patients [43, 44]. To minimize the impact of confounding factors, the present study only included postoperative
CRC patients who did not receive chemotherapy or radiation therapy. However, the reliability of these findings is limited
due to the lack of preoperative samples. We hope that future studies can strictly control potential confounding factors,
evaluate changes in gut microbiota before and after surgery more clearly, further expand the sample size, and improve the
reliability of the research. Second, our sample size is too small, and it remains to be verified whether the best modelling
parameters of this study can be applied to the identification and diagnosis of a wide range of samples. We also need a separate
validation cohort to test the performance of the diagnostic model. Meanwhile, we should also consider using multi-omics
technologies to further explore the predictive ability of the microbiota in CRC in future studies. Third, our study did not
include adenoma samples, which are important for studying the development process of CRC. Finally, due to the limitation of
detection technology, short research time and lack of depth, many studies still have a limited understanding of the potential
pathogenic mechanisms of the pathogens, and the theoretical basis for clinical application needs further verification through
experimentation. On the other hand, our study indicated that pathogenic bacteria have great potential as antagonistic and
predictive targets in the diagnosis and treatment of CRC. While ML approaches are still evolving, they have already shown
promising capabilities, both for early diagnosis and for identifying underlying pathogenesis that can guide mechanistic
studies. Meanwhile, microbiome-based biomarkers have irreplaceable advantages and need to be continuously explored to
accelerate the application of precision medicine.
CONCLUSION
Taken together, our results suggest that CRC is associated with dysbiosis of the microbiota, and the dysregulated microbiota
could be used for disease diagnosis. In the model, s__metagenome_g__Lachnospiraceae_ND3007_group, s__Escherichia_coli_g__Escherichia-Shigella
and s__unclassified_g__Prevotella could be identified as key microbiota for CRC diagnosis.
These findings, once confirmed on larger cohorts of patients, may represent an important step towards the development
of more effective diagnostic strategies.
Funding information
This work was supported by the National Natural Science Foundation of China (NSFC, 82060366, 82273694, 82160385) and the Guangxi Natural
Science Foundation (2018GXNSFAA050099).
Author contributions
Methodology, J.Z.; Validation, P.C.; Investigation, T.L.; Resources, H.L.; Data Curation, T.Z.; Writing – Original Draft Preparation, F.L.; Writing – Review and
Editing, L.Y.; Supervision, J.H.; Project Administration, H.C.
Conflicts of interest
The authors declare no conflicts of interest in this work.
Ethical statement
All procedures performed in studies involving human participants followed the ethical standards of the medical ethics committee of Guangxi Medical
University (no. 20200095).
37. Pryde SE, Duncan SH, Hold GL, Stewart CS, Flint HJ. The microbiology of butyrate formation in the human colon. FEMS Microbiol Lett 2002;217:133–139.
38. D’Argenio G, Cosenza V, Delle Cave M, Iovino P, Delle Valle N, et al. Butyrate enemas in experimental colitis and protection against large bowel cancer in a rat model. Gastroenterology 1996;110:1727–1734.
39. Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, et al. Do no harm: a roadmap for responsible machine learning for health care. Nat Med 2019;25:1337–1340.
40. Jin Y, Liu Y, Zhao L, Zhao F, Feng J, et al. Gut microbiota in patients after surgical treatment for colorectal cancer. Environ Microbiol 2019;21:772–783.
41. Sze MA, Baxter NT, Ruffin MT 4th, Rogers MAM, Schloss PD. Normalization of the microbiota in patients after treatment for colonic lesions. Microbiome 2017;5:150.
42. Huang R, He K, Duan X, Xiao J, Wang H, et al. Changes of intestinal microflora in colorectal cancer patients after surgical resection and chemotherapy. Comput Math Methods Med 2022;2022:1940846.
43. Yixia Y, Sripetchwandee J, Chattipakorn N, Chattipakorn SC. The alterations of microbiota and pathological conditions in the gut of patients with colorectal cancer undergoing chemotherapy. Anaerobe 2021;68:102361.
44. Shuwen H, Xi Y, Yuefen P, Jiamin X, Quan Q, et al. Effects of postoperative adjuvant chemotherapy and palliative chemotherapy on the gut microbiome in colorectal cancer. Microb Pathog 2020;149:104343.