International Journal of Computational Intelligence Systems (2022) 15:72
https://doi.org/10.1007/s44196-022-00110-8
RESEARCH ARTICLE
Analysis and Prediction of Gestational Diabetes Mellitus by the Ensemble Learning Method
Xiaojia Wang1 · Yurong Wang1 · Shanshan Zhang2,3 · Lushi Yao1 · Sheng Xu1
Received: 17 November 2021 / Accepted: 18 July 2022

© The Author(s) 2022
Abstract
Gestational diabetes mellitus (GDM) is the most common disease in pregnancy and can cause a series of maternal and infant
complications. A new study shows that GDM affects one in six deliveries. Identifying and screening for risk factors for GDM
can effectively help intervene and improve the condition of women and their children. Therefore, the aim of this paper is to
determine the risk factors for GDM and to use the ensemble learning method to judge whether pregnant women suffer from
GDM more accurately. First, this study applies six commonly used machine learning algorithms to the GDM data
from the Tianchi competition, selects the risk factors according to the ranking of each model, and uses the Shapley additive
explanations (SHAP) method to determine the importance of the selected risk factors. Second, the combined weighting method was
used to analyze and evaluate the risk factors for gestational diabetes and to determine a group of important factors. Lastly, a
new integrated light gradient-boosting machine-extreme gradient boosting-gradient boosting tree (LightGBM-Xgboost-GB)
learning method is proposed to determine whether pregnant women have gestational diabetes mellitus. We used the gray
correlation degree to calculate the weight and used a genetic algorithm for optimization. In terms of prediction accuracy and
comprehensive effects, the final model is better than the commonly used machine learning models. The ensemble learning
model is comprehensive and flexible and can be used to determine whether pregnant women suffer from GDM. In addition
to disease prediction, the model can also be extended to many other areas of research.
Keywords Gestational diabetes mellitus (GDM) · Machine learning model · Ensemble learning method · The identification
of risk factors for GDM
Abbreviations
GDM Gestational diabetes mellitus
MCH Maternal and child health hospital
T2DM Type II diabetes mellitus
IDF International Diabetes Federation
HIP Hyperglycemia in pregnancy
DT Decision tree
CART Classification and regression trees
Xgboost Extreme gradient boosting
RF Random forest
GB Gradient boosting tree
LightGBM Light gradient boosting machine
MLP Multilayer perceptron
ANN Artificial neural network
GRA Gray relational analysis
GA Genetic algorithm
BMI Body mass index
HbA1c Hemoglobin A1c
SNP Single nucleotide polymorphism
DNA Deoxyribonucleic acid
RBP4 Retinol binding protein 4
wbc White blood cell
ALT Alanine aminotransferase
AST Aspartate aminotransferase
Cr Creatinine
BUN Blood urea nitrogen
CHO Cholesterol
TG Triglyceride
HDLC High-density lipoprotein cholesterol
LDLC Low-density lipoprotein cholesterol
ApoA1 Apolipoprotein A1
ApoB Apolipoprotein B
Lpa Lipoprotein a
hsCRP High-sensitivity C-reactive protein
AUC Area under the receiver operating characteristic curve
TP True positive: an example that is actually positive and is classified as positive
FP False positive: an example that is actually negative but is classified as positive
FN False negative: an example that is actually positive but is classified as negative
TN True negative: an example that is actually negative and is classified as negative
VAR00007 A physical examination index after desensitization treatment; from reading the literature, we suspect that VAR00007 is an insulin resistance index

* Sheng Xu
xusheng@hfut.edu.cn
1 Institute of Artificial Intelligence and Data Science, School of Management, Hefei University of Technology, No. 193, Tunxi Road, Mailbox 270, Hefei 230009, Anhui, China
2 Department of Clinical Teaching, The First Affiliated Hospital of Anhui University of Chinese Medicine, Hefei 230031, Anhui, China
3 The National Chinese Medicine Clinical Research Base-Key Disease of Diabetes Mellitus Study, Hefei 23003, Anhui, China

1 Introduction

This research was motivated by a maternal and child health care institution that provides prenatal care to pregnant women. The agency found that with the opening up of the second-child policy and an increasing number of older pregnant women, coupled with substantial improvements in standards of living and continuous improvement of the dietary structure, the incidence of gestational diabetes is increasing every year. Therefore, the medical team at this institution must detect and determine the risk of gestational diabetes in pregnant women as soon as possible and intervene early to change the pregnancy outcome. For the sake of confidentiality, the institution is referred to using the fictitious name Maternal and Child Health Hospital (MCH).
According to research surveys, in 2019, the total number of births in China was 14.65 million, and the annual birth rate was 1.048% [1]. China’s fertility rate is significantly lower than it was in the past. It is currently not only well below the global average of 2.45% but also below the level of 1.67% in developed countries [2]. With the opening up of the second-child policy and an increasing number of older pregnant women, coupled with substantial improvements in standards of living and continuous improvement of the dietary structure, the incidence of gestational diabetes is increasing year by year.
Gestational diabetes mellitus (GDM), which is defined as carbohydrate intolerance of a variable degree with onset or recognition during pregnancy, has recently been identified as a potential risk factor for type II diabetes mellitus (T2DM) [3–5]. Compared with women without GDM, women diagnosed with GDM are 10 times more likely to develop type 2 diabetes, 2.5 times more likely to have ischemic heart disease, and twice as likely to have hypertension during their later years [6, 7]. According to International Diabetes Federation (IDF) statistics, in 2019, one in six births was affected by GDM [8, 9]. A total of 15.8% (20.4 million) of live births were by women with hyperglycemia in pregnancy (HIP). China now has 116.4 million people with diabetes (excluding Hong Kong, Macao and Taiwan), the highest number in the world [8]. The spread of GDM has become very serious, and GDM harms not only pregnant women but also newborns.
GDM can cause severe complications such as excessive amniotic fluid, ketoacidosis, preeclampsia, spontaneous abortion, stillbirth, and secondary infection in pregnant women, as well as fetal malformation, hypoglycemia, hyperbilirubinemia, and hypocalcemia [3]. Women who develop gestational diabetes are at higher risk of developing type 2 diabetes later in life. Children exposed in utero are at higher risk of macrosomia at birth, or of being significantly larger than average, and of having obesity and type 2 diabetes in childhood and adulthood. The impact of diabetes on pregnant women can be greatly reduced if the risk can be predicted during the first trimester and timely interventions are made [10]. It is very important to identify the risk factors for GDM in order to predict accurately the risk of pregnant women developing gestational diabetes; moreover, timely control measures can effectively improve maternal and infant outcomes. Therefore, the motivation of this study is to support MCH in identifying GDM risk factors among pregnant women seeking prenatal care and to predict accurately whether the mother will develop gestational diabetes.
In research on the pathogenic factors associated with gestational diabetes, researchers found that the pathogenesis of gestational diabetes is complicated and caused by many factors. The classic explanation of the pathogenic mechanism is that pregnant women have an increased need for glucose during pregnancy, but insulin resistance is present while insulin secretion is relatively insufficient. Rissanen et al. [11], in a study of insulin secretion levels in GDM and patients with type 2 diabetes, found that the deletion of individual gene loci can trigger GDM and type 2 diabetes. Bao et al. [12] proposed that body mass index (BMI) values are closely related to pregnant women suffering from gestational diabetes and found that GDM patients have a significant increase in the incidence of T2DM in the future. Minooee et al. [13] found that factors such as prepregnancy BMI, age, family history of GDM, weight gain during pregnancy and other factors were risk factors
for the occurrence of GDM. Research by Li and Hu et al. found that advanced age, overweight prepregnancy BMI, history of diabetes mellitus in first-degree relatives and a positive rate of thyroid peroxidase antibody are associated with an increased risk of GDM [14]. Kuzmicki et al. [15] found that low-grade inflammation can affect the pathogenesis of GDM. A study by Rezvan et al. [16] showed that the low plasma adhesin concentration in the patient's body is related to the blood glucose regulation ability of hemoglobin A1c (HbA1c). Shaat et al. [17] showed that polymorphisms of certain alleles affect the disease. There are many factors that affect diabetes, and it is necessary to rank the importance of these factors because it is possible to determine whether a pregnant woman has gestational diabetes according to the values of the top-ranked factors. Early diagnosis of pregnant women with disease factors, improved GDM screening, and early intervention and management can improve pregnancy outcomes and improve the health status of newborns.
Many studies have been performed on disease prediction, such as diagnosis, prediction, classification, and therapy. Recent research shows that various machine learning algorithms have been used for disease identification and prediction. They have brought remarkable efficiency and improvements to both conventional and machine learning approaches [18, 19]. For example, Wu used a tree model and neural network model algorithm for modeling and data mining analysis of maternal physical examination indicators [10, 20]. Saloni used the idea of voting integration to predict diabetes mellitus [21]. Wang used the basic idea of stacking ensemble learning, adopted three basic models, namely, random forests, Xgboost and CatBoost, as the base learners, and constructed a novel integrated model for predicting and analyzing gestational diabetes mellitus [22].
Our main contribution is to use a new integrated LightGBM-Xgboost-GB ensemble learning method for maternal and child health care providers to analyze the risk factors for GDM. This method uses a group of the most important influencing factors selected through learning algorithms to identify GDM risk factors accurately and to predict the possibility of gestational diabetes precisely. In response to the research problem, the proposed model improves the existing methodology. This model provides a comprehensive, practical and flexible method for use by medical staff during early prenatal care and disease prediction. Therefore, our model should be very useful for medical staff to assess pregnant women with gestational diabetes.
The paper is organized as follows. In Sect. 2, we will review some basic knowledge. Section 3 lays the theoretical foundation for the proposed method. Section 4 describes a new integrated LightGBM-Xgboost-GB method. In Sect. 5, we first select and analyze the influencing factors, then analyze the acquired data by applying the model, and summarize the calculation results to verify the model. We then provide the study conclusions in Sect. 6.

2 Preliminary

2.1 Machine Learning Methods

Disease prediction models are closely linked to epidemiological research and clinical practice. In establishing disease prediction models, the risk factors for disease onset are linked to the incidence of disease. Some prediction model studies divide the population into low-risk, medium-risk and high-risk groups according to the probability of disease to provide a basis and direction for personalized diagnosis and treatment and comprehensive intervention in clinical practice. At present, big data management is not only widely used in the public domain but has also given rise to many studies on the association between big data and disease prediction, which use deep learning, transfer learning and other algorithms to establish the association between pathogenic factors and disease probability. With improvements in the parallel computing ability of big data and the in-depth study of machine learning models, research using machine learning models to improve the accuracy of disease prediction is gradually deepening. With the growth of data dimensions, making full use of multisource and multidimensional data has become a trend in disease prediction.
In a study on gestational diabetes prediction models at home and abroad, Anna Patrick Nomb [20] and other researchers used the multivariate logistic model in 2018 to model the probability of GDM risk and concluded that a family history of stillbirth and type 2 diabetes increased the prevalence of GDM. Wu et al. [10] established a prediction model for gestational diabetes mellitus in 2017 by using a decision tree algorithm. Using the classification and regression trees (CART) algorithm, they obtained a set of rule characteristics for the GDM high-risk group. In 2018, Zhang [23] used the extreme gradient boosting (Xgboost) algorithm to build an accurate prediction model for the pathogenic factors of type 2 diabetes mellitus. Wang [24] and other researchers studied the application of the deep learning model for predicting the risk of type 2 diabetes in 2017. With the development of machine learning and big data, using machine learning algorithms to design diagnosis-aided decision-making systems provides a prerequisite to meeting this objective. Therefore, in recent years, many researchers have used machine learning algorithms to build disease prediction reasoning models to assist doctors in diagnosis.
A decision tree (DT) is a basic classifier for which the two steps are learning and classification [25]. As a machine
learning method, decision trees create an effective model for data classification and regression [26, 27]. Random forest (RF) is an algorithm developed by Breiman and Cutler in 2001. It constructs multiple decision trees during training and outputs the class chosen by most of the trees. It has improved performance over single decision trees, and it is much more efficient than traditional machine learning techniques, especially when the dataset is large [28, 29]. The gradient boosting tree (GB) is a combination of regression and classification tree models. GB improves predictive power by progressively improving estimations. Additionally, GB employs a nonlinear regression procedure that helps improve the accuracy of the trees [30, 31]. Xgboost implements a gradient boosting framework based on decision trees, which was proposed by Chen in 2016 [32]. The Xgboost library is designed to be highly efficient, flexible and portable [33]. The light gradient boosting machine (LightGBM) algorithm is an efficient distributed gradient boosting tree algorithm [34, 35]. Due to its fast speed, low memory consumption and relatively high accuracy, it is widely used in classification and regression problems [36]. A multilayer perceptron (MLP) is a feedforward artificial neural network (ANN). MLP is one of the most popular networks; it possesses a powerful ability to solve nonlinear problems and is highly efficient at calculation [37–39].
These models have become commonly used machine learning algorithms for predicting gestational diabetes mellitus in recent years. In addition, the selection of important factors in the follow-up research and the base models of the integrated model are drawn from this group of common models.

2.2 Model Fusion Based on Ensemble Learning

In terms of gestational diabetes prediction, research on gestational diabetes relied on single-model prediction until 2019. Later, Yu [40] applied an ensemble learning algorithm in the diagnosis and prediction of gestational diabetes mellitus for the first time by using a multi-model fusion method based on the cascading classifier method. This paper focuses on the use of three learning algorithms to predict the prevalence of gestational diabetes mellitus. These three models are capable of handling small sample sizes, high proportions of missing values, and categorical features. On this basis, aiming at the optimization problem of model evaluation index and threshold selection, we are the first to propose using the gray correlation method to calculate the weight values that integrate the three learning models, optimize the prediction model and improve the success rate of diagnosis, which provides a solution for future research on disease prediction models with poor data quality.
The ensemble learning algorithm is good at addressing defective data by building multiple classifiers to prevent overfitting and improve the performance of the prediction model. In addition to its stability and generalization ability, the ensemble learning model performs better than traditional models, and the final prediction accuracy is also relatively high. At present, some integrated learning models are applied in other disease predictions, and the effect is remarkable. Therefore, it is possible to use an integrated learning algorithm to study the relationship between various indicators and the prediction model of the gestational diabetes mellitus incidence rate.

3 Theoretical Foundation of the Main Model

3.1 Ensemble Learning Methods

In traditional machine learning algorithms, since the modeling process of each single prediction model and the method of data preprocessing are different, the prediction results of each single prediction model are different, and the prediction accuracy is also different [28]. A single learner may not achieve very good results, but if the results of multiple weak learners are combined, the performance of the model may be improved to some extent. Therefore, ensemble learning methods combine a series of different individual learners through a specific strategy to achieve better learning results. Ensemble learning methods are divided into stacking, blending and voting, which are powerful prediction techniques since they can increase the diversity of algorithms and reduce generalization errors to improve the accuracy of the results.
Ensemble learning methods have two basic elements: one is that the correlation between single models should be as small as possible, and the other is that the performance of the single models should not differ too much. In practice, it is often the case that single models with low mutual correlation and good performance can significantly improve the final prediction result. In the research for this article, the overall model was constructed using the idea of the blending ensemble method, and the equation is as follows:
$$ G(x) = \sum_{i=1}^{n} w_i g_i(x) \tag{1} $$
where $n$ is the number of learners and $w_i$ is the weight of the individual learner $g_i(x)$; it is usually required that $w_i \ge 0$ and $\sum_{i=1}^{n} w_i = 1$. When $w_i = 1/n$, the weighted average becomes a simple average. In fact, the weighted average method can be regarded as the basic starting point of ensemble learning research. For a given base learner, different ensemble learning methods can be regarded as different ways to determine the weight of the base learner in the weighted average method.
3.2 Traditional Gray Relational Analysis

Gray system theory is a relational analysis method for system analysis. By comparing the shapes of several curves, it is concluded that the more similar the shapes are, the closer the relation they have [41]. Gray relational analysis (GRA) is a very important part of gray system theory; it uses the similarity and nearness of the sequences to determine the relations among the factors of the system. The original gray relational analysis model was constructed by Deng Julong [42], and Liu Sifeng developed the classic absolute degree of gray incidence [43, 44]. Gray relational analysis is useful for managing poor, incomplete, and uncertain information, so it has been widely used in decision-making and evaluation problems.
Because each model contributes to the construction of the ensemble model to a different extent, it is better to estimate a relative weight for each model. In model fusion, theoretically, the higher the prediction accuracy of a single model is, the greater its contribution to the integrated model should be. If a simple average is adopted, the contribution of different models to the integrated model is not taken into account. The higher the relative weight is, the greater the contribution of the model to the ensemble model. In fact, GRA is a very useful technique for preference analysis; it effectively measures the relationships between one reference sequence and the other comparative sequences. We were therefore motivated to use GRA to estimate the relative weight for each model.
The weighting steps based on gray relational analysis are as follows:
(a) Determine the analysis sequence. We use the true values of the data as the reference sequence (also known as the mother sequence), $Y = \{Y(k) \mid k = 1, 2, \ldots, n\}$, and the predicted values of the models as the comparison sequences (also known as the subsequences), $X_i = \{X_i(k) \mid k = 1, 2, \ldots, n\}$, $i = 1, 2, \ldots, m$, where $k$ indexes the samples and $i$ indexes the models.
(b) Nondimensionalization. Since the reference sequence and the comparison sequences are 0/1 label data, no dimensionless processing is required.
(c) Calculate the correlation coefficient. The formula is as follows:
$$ \xi_i(k) = \frac{\min_i \min_k |x_0(k) - x_i(k)| + \rho \max_i \max_k |x_0(k) - x_i(k)|}{|x_0(k) - x_i(k)| + \rho \max_i \max_k |x_0(k) - x_i(k)|} \tag{2} $$
where $x_0$ denotes the reference sequence and $x_i$ the $i$-th comparison sequence.
(d) $\rho \in (0, \infty)$ is called the resolution coefficient. The smaller $\rho$ is, the greater the resolution. Generally, the value interval of $\rho$ is (0, 1), and the specific value depends on the situation. When $\rho \le 0.5463$, the resolution is best; usually $\rho = 0.5$.
(e) Calculate the degree of association. The correlation coefficients at all points are concentrated into one value; that is, their average is calculated as the quantitative expression of the correlation degree between the comparison series and the reference series. The correlation degree formula is
$$ r_i = \frac{1}{N} \sum_{k=1}^{N} \xi_i(k), \quad k = 1, 2, \ldots, N \tag{3} $$
(f) The correlation between the comparison series and the reference series can be regarded as the contribution of a single prediction model to the prediction accuracy. The proportion of a single model's correlation degree in the sum of all models' correlation degrees is then the contribution rate of that model to the integrated model, that is, its weight in the integrated model. The weight coefficient of the integrated model can be expressed as:
$$ w_i = \frac{r_i}{\sum_{i=1}^{n} r_i} \tag{4} $$
Then, the expression of the ensemble model is changed to:
$$ G(x) = \sum_{i=1}^{n} \frac{r_i}{\sum_{i=1}^{n} r_i} g_i(x) \tag{5} $$
where $\frac{r_i}{\sum_{i=1}^{n} r_i} \ge 0$ and $\sum_{i=1}^{n} \frac{r_i}{\sum_{i=1}^{n} r_i} = 1$.

3.3 Genetic Algorithm

The gray correlation analysis obtains the relative weight of each model, which improves the overall performance of the integrated model and can significantly overcome the defects of a single model, namely poor generalization performance and convergence to local minima caused by the large hypothesis space of the learning task. However, the prediction performance of the integrated model still has room for improvement.
The genetic algorithm (GA) was developed using the evolutionary theory of biology and genetics. It is a type of self-organizing and adaptive artificial intelligence technology that simulates the evolution process and mechanism of natural organisms to solve problems. The GA is a random search algorithm that is well suited to optimization over discrete data [45]. Goldberg summarized a unified and most basic genetic algorithm that uses a selection operator, crossover operator and mutation operator [45]. The GA has three unique characteristics: adaptability and versatility, implicit parallelism, and scalability.
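
Returning to the weights that the genetic algorithm starts from, steps (a) to (f) of Sect. 3.2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `gra_weights`, the equal-weight fallback when every model matches the labels exactly, and the toy labels and predictions are all hypothetical.

```python
from typing import List, Sequence


def gra_weights(reference: Sequence[float],
                predictions: Sequence[Sequence[float]],
                rho: float = 0.5) -> List[float]:
    """Model weights from gray relational degrees, following Eqs. (2)-(4)."""
    # Absolute differences |x_0(k) - x_i(k)| for every model i and sample k.
    deltas = [[abs(y - p) for y, p in zip(reference, preds)]
              for preds in predictions]
    d_min = min(min(row) for row in deltas)  # min over i and k
    d_max = max(max(row) for row in deltas)  # max over i and k
    if d_max == 0:
        # Assumed fallback: all models reproduce the reference exactly,
        # so Eq. (2) is undefined and equal weights are returned.
        return [1.0 / len(predictions)] * len(predictions)
    degrees = []
    for row in deltas:
        # Eq. (2): correlation coefficient at each sample point.
        xi = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        # Eq. (3): the correlation degree is the mean coefficient.
        degrees.append(sum(xi) / len(xi))
    total = sum(degrees)
    # Eq. (4): normalise the degrees so that the weights sum to 1.
    return [r / total for r in degrees]


# Hypothetical 0/1 labels and predictions of three models on five samples.
y_true = [1, 0, 1, 1, 0]
model_preds = [
    [1, 0, 1, 1, 0],  # no errors
    [1, 0, 0, 1, 0],  # one error
    [0, 1, 1, 1, 0],  # two errors
]
weights = gra_weights(y_true, model_preds)
```

With 0/1 labels and $\rho = 0.5$, each correct prediction contributes a coefficient of 1 and each error a coefficient of 1/3, so a model with fewer errors receives a larger weight, which matches the intent of step (f).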
Our goal is to minimize the relative error between the predicted value and the real value of a single prediction model and the relative error between the predicted value and the real value of an integrated model. That is, the average relative error between the overall predicted value and the real value is the minimum. Therefore, we propose an ensemble learning model for target optimization based on genetic calculations:
$$ \min H\left[g_j^i(x), G^i(x)\right] = \frac{1}{N} \sum_{i=1}^{N} \left( \left| \frac{G^i(x) - y_i}{y_i} \right| + \sum_{j=1}^{n} \left| \frac{g_j^i(x) - y_i}{y_i} \right| \right) \tag{6} $$
subject to
$$ G^i(x) = \sum_{j=1}^{n} \frac{r_j}{\sum_{j=1}^{n} r_j} g_j^i(x) \tag{7} $$
$$ \frac{r_j}{\sum_{j=1}^{n} r_j} > 0 \tag{8} $$
$$ \sum_{j=1}^{n} \frac{r_j}{\sum_{j=1}^{n} r_j} = 1 \tag{9} $$
where $i = 1, 2, \ldots, N$ indexes the samples, $j = 1, 2, \ldots, n$ indexes the models, $g_j^i(x)$ is the predicted value of a single model at the sample point, $G^i(x)$ is the predicted value of the ensemble model at the sample point, and $y_i$ is the true value of the sample point.

4 Model Development for GDM Prediction

4.1 Problem Description of GDM

Maternal and child health care is related not only to the health of mothers and babies but also to the health status of the birth population and to the prosperity of the country and the future of the nation. MCH medical staff are committed to basic medical services that are closely related to the health of women and children. Gestational diabetes is a common phenomenon in female pregnancy, but it has very negative effects on both women and children, and it must be detected and controlled early. The pathogenesis of gestational diabetes is complicated and caused by a variety of factors. Among them, a classic explanation of the pathogenic mechanism is that pregnant women have an increased need for glucose during pregnancy, but insulin resistance is present while insulin secretion is relatively insufficient. Therefore, the medical staff of MCH urgently need to know the risk factors for gestational diabetes and to focus on detection and control to avoid adverse pregnancy outcomes.
The identification of key factors has important clinical significance in GDM risk assessment. Most previous studies have used prospective controlled cases to study risk factors for gestational diabetes. However, case studies may have some particularities, and there may be no association between cases. In this study, we solved this dilemma. We studied a set of the most commonly used machine learning algorithms to predict gestational diabetes. After comparing their prediction accuracy, we used various classifiers and Shapley additive explanations (SHAP) to identify a set of important risk factors to obtain a more objective and reasonable set of routine features for the GDM population. We analyzed and explained the impact of this group of important factors on the development of gestational diabetes. Then, we used ensemble learning methods in machine learning to build ensemble models and to train predictions. The result of the ensemble learning model is better than that of each individual machine learning model.

4.2 The Ensemble Learning Model

The ensemble learning model may bring benefits in three respects. First, from a statistical perspective, by considering several models and averaging their predictions, we can reduce the risk of choosing a very poor model; second, the integration of multiple models brings the final result closer to the true value; finally, from the perspective of representation, the hypothesis space can be expanded, and it is possible to learn a better approximation.
Ensemble learning methods have two basic elements: one is that the correlation between single models should be as small as possible, and the other is that the performance of the single models should not differ too much. In practice, it is often the case that single models with low mutual correlation and good performance can significantly improve the final prediction result. The GB model is a combination of regression and classification tree models. Xgboost implements a gradient boosting framework based on decision trees. The LightGBM algorithm is an efficient distributed gradient boosting tree algorithm. In addition, these three methods are based on the improved tree model. Therefore, the three classes of algorithms satisfy the diversity, correlation, and performance requirements.
Therefore, we chose GB, Xgboost, and LightGBM as the base learners and propose a new integrated LightGBM-Xgboost-GB method, which uses gray correlation to calculate weights and genetic algorithms to optimize them. Therefore, the $n$ of the final formula for the model is 3.
The specific steps are as follows:
Step 1. Data preprocessing: the original data are preprocessed and divided into a training set and a testing set.
Step 2. First, each single model is trained with the training set, and the trained model is then used to predict the sample set. Second, the gray correlation coefficient between the predicted value of the j−th
prediction model and the observed value of the i−th predicted sample point is calculated. Lastly, the gray correlation degree between the predicted value of the model and the observed value is calculated, and the weight coefficient is calculated.
Step 3. Determine the fitness function. Calculate the fitness of each individual and determine whether it meets the optimization criteria. Calculate the average relative error between the predicted value and the true value, so the fitness function is:
$$ \mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left( \left| \frac{G^i(x) - y_i}{y_i} \right| + \sum_{j=1}^{3} \left| \frac{g_j^i(x) - y_i}{y_i} \right| \right) \tag{10} $$
If it is satisfied, output the best individual and the optimal solution represented by it, and end the algorithm; if not, go to the next step.
Step 4. Initialize the parameters of the genetic algorithm, such as the population size, crossover probability, and mutation probability. Parameter optimization: first, decode the chromosomes in the population; then, calculate the fitness value of each generation of the population and perform the survival of the fittest. Lastly, determine whether the maximum number of generations has been reached, and if so, output the optimal parameter; otherwise, according to the genetic strategy, use the selection, crossover and mutation operations to obtain the offspring.
Step 5. Result judgment: if the objective function is satisfied, then the optimization is finished. Otherwise, return to Step 3.
Step 6. Input the test sample to obtain the best prediction result. The detailed process is shown in Fig. 1.

5 Applying the LightGBM-Xgboost-GB Model to Objective Data

5.1 Data Collection

5.1.1 Research Data

The data came from the Tianchi precision medicine competition on artificial intelligence-assisted genetic risk prediction of diabetes, which was held by Aliyun United Qingwutong
0 indicating no disease and 1 indicating disease, both of which require a prediction of gestational diabetes (https://tianchi.aliyun.com/).
Among the 83 original features, only 2 features have no missing values, and the remaining 81 features are missing to varying degrees. The general method for addressing missing values is to delete the features that contain them, but this cannot be applied to this sample. The reason is that more than 97% of the features in this sample have missing values, which indicates that the missing values here have a specific meaning and should be regarded as information rather than noise. Therefore, we fill them in with the mean, since the mean does not affect the overall situation and also solves the problem of missing values.

5.1.2 Basic Characteristics

The data contain a total of 85 fields and 1,200 records. “Id” is the patient code, and “label” is the classification label. Among the 83 variables, there were 55 genetic variables, all of which were discrete variables, and 28 conventional variables. Of the 28 conventional variables, there are 3 discrete variables and 25 continuous variables. The discrete genetic variables are single nucleotide polymorphism (SNP) gene site information. SNPs are widespread in the human genome, occurring on average once every 500–1000 base pairs. Amniocentesis during pregnancy extracts the baby's deoxyribonucleic acid (DNA), and the corresponding SNP gene sites are tested. Among the continuous physiological indexes, TG represents triglyceride, which is an index to measure hyperlipidemia; RBP4 represents retinol binding protein-4. As an adipocytokine, RBP4 participates in obesity and insulin resistance and affects glucose and lipid metabolism in pregnant women; BMI is often used to study whether the patient is obese or not. In China, 24 and 28 are regarded as the thresholds of overweight and obesity; HsCRP represents high-sensitivity C-reactive protein. It is a plasma protein produced by the liver and is primarily used as a marker of inflammation. White blood cells (WBCs) are a common indicator of inflammation during routine blood examination. Apolipoproteins a1 (ApoA1) and b (ApoB) are apolipoproteins that are of great value in the diagnosis of diabetes mellitus complicated with lipid metabolism disorders. HDLC is high-density lipoprotein cholesterol, and LDLC
Health Technology Co. LTD. The first line of the data is the is low-density lipoprotein cholesterol. Alanine aminotrans-
field name, and each line represents an individual. Some ferase (ALT) and aspartate aminotransferase (AST) are two
field names have been identified. The data contain a total of types of aminotransferase. They are essential catalysts for
85 fields and 1200 pieces of data. The data do not include human metabolism. They are the two clinical indicators of
the time of physical examination and relevant geographic liver function testing and play a specific role in the occur-
information. Some field contents are missing for some indi- rence and development of gestational diabetes mellitus. Cr
viduals, among which the first column is the individual ID is creatinine, and CHO is total cholesterol, which are routine
number. The last column of the data is a label column, with physical examination indices.
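The mean-filling step described in Section 5.1.1 can be illustrated with a minimal sketch. This is not the authors' code: the function name `fill_missing_with_mean`, the encoding of missing entries as `None`, and the toy values are all assumptions made for illustration.

```python
from statistics import mean


def fill_missing_with_mean(rows):
    """Replace None entries in each column with that column's mean.

    `rows` is a list of equal-length lists holding numbers or None.
    A column with no observed value is left unchanged (an assumption;
    the paper does not discuss this case).
    """
    n_cols = len(rows[0])
    col_means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        col_means.append(mean(observed) if observed else None)
    return [
        [col_means[c] if r[c] is None else r[c] for c in range(n_cols)]
        for r in rows
    ]


# Toy data (hypothetical values): two features, some entries missing.
data = [[1.0, None], [3.0, 4.0], [None, 8.0]]
filled = fill_missing_with_mean(data)
```

Each missing entry takes the mean of the observed values in its own column, so the column mean itself is unchanged by the filling.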
Fig. 1 Calculation steps of the ensemble model
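The fitness function of Step 3 (Eq. 10) can be sketched as below. This is an illustrative reading of the equation, not the authors' implementation: the function name `mape_fitness` and the toy numbers are hypothetical, and the sketch assumes nonzero observed values, because the relative error is undefined when an observed value is zero (the paper does not say how such cases are handled).

```python
def mape_fitness(y_true, ensemble_pred, base_preds):
    """Average relative error of the ensemble plus that of each base model.

    y_true        : observed values y_i (assumed nonzero)
    ensemble_pred : ensemble predictions, one per sample
    base_preds    : list of lists, one list of predictions per base model
    """
    if any(y == 0 for y in y_true):
        raise ValueError("relative error is undefined when y_i == 0")
    total = 0.0
    for i, y in enumerate(y_true):
        # Relative error of the ensemble prediction for sample i.
        total += abs((ensemble_pred[i] - y) / y)
        # Relative errors of the individual models for sample i.
        total += sum(abs((g[i] - y) / y) for g in base_preds)
    return total / len(y_true)


# Hypothetical toy values for illustration only.
y = [1.0, 2.0]
G = [1.1, 1.8]
g = [[1.0, 2.0], [1.2, 2.0], [0.9, 2.2]]
fitness = mape_fitness(y, G, g)
```

A genetic algorithm would evaluate this quantity for each candidate weight vector and keep the candidates with the smallest value.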
We analyzed and compared the basic characteristics of 25 general variables, and the results are shown in (Table 1). Continuous variables are expressed as the mean ± standard deviation (SD), and the characteristic differences between the GDM group and the non-GDM group were tested by using the T test. Compared with patients who did not have gestational diabetes, gestational diabetes patients had a higher age, gestational age, prepregnancy weight, prepregnancy BMI, diastolic blood pressure, systolic blood pressure, VAR00007, wbc, ALT, CHO, TG, ApoB, and hsCRP. By contrast, a higher RBP4, BUN, Cr, HDLC, and LDLC were more common in those without gestational diabetes.

5.2 Collection of Important Variables

5.2.1 Comparison of the Individual Learning Model

We used the accuracy, area under the receiver operating characteristic curve (AUC), precision, recall and f1-score to evaluate the performance of the individual models.
The accuracy is the percentage of samples whose category is predicted correctly.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (11)$$

The accuracy reflects the ability of the model to predict categories correctly, including two situations: a positive example is predicted as a positive example, and a negative
Table 1 General characteristics of the regular variables
Variable | Total (n = 1200) | Non-GDM (n1 = 634) | GDM (n2 = 566) | P-value
RBP4 | 22.23 ± 3.23 | 22.45 ± 2.39 | 21.99 ± 2.96 | 0.015136
Age | 31.79 ± 3.89 | 31.13 ± 3.76 | 32.53 ± 3.90 | < 0.01
Number of pregnancies | 1.98 ± 1.00 | 1.91 ± 0.95 | 2.05 ± 2.06 | 0.011668
Number of births | 1.05 ± 0.22 | 1.04 ± 0.20 | 1.06 ± 1.25 | 0.062012
Height | 162.26 ± 4.11 | 162.48 ± 4.01 | 162.02 ± 1.20 | 0.055647
Prepregnancy weight | 57.13 ± 7.83 | 56.06 ± 7.00 | 58.32 ± 5.52 | < 0.01
Prepregnancy BMI | 21.67 ± 2.87 | 21.20 ± 2.40 | 22.18 ± 2.06 | < 0.01
Systolic blood pressure | 112.99 ± 11.15 | 111.64 ± 10.69 | 114.50 ± 1.48 | < 0.01
Diastolic blood pressure | 68.13 ± 7.82 | 67.88 ± 7.73 | 69.68 ± 6.82 | < 0.01
Blood pressure of delivery | 73.12 ± 4.31 | 73.10 ± 4.58 | 73.15 ± 7.98 | 0.841332
Sugar sieve gestational age | 25.46 ± 2.30 | 25.45 ± 2.17 | 25.46 ± 2.44 | 0.928622
VAR00007 | 1.54 ± 0.10 | 1.50 ± 0.07 | 1.58 ± 1.12 | < 0.01
wbc | 9.36 ± 1.99 | 9.12 ± 1.88 | 9.62 ± 9.09 | < 0.01
ALT | 25.39 ± 26.05 | 23.71 ± 20.57 | 27.27 ± 2.97 | 0.018109
AST | 37.97 ± 8.23 | 38.07 ± 7.97 | 37.86 ± 3.52 | 0.668603
Cr | 61.97 ± 5.89 | 62.04 ± 6.01 | 61.90 ± 6.76 | 0.671191
BUN | 2.84 ± 0.72 | 2.85 ± 0.70 | 2.83 ± 2.73 | 0.60111
CHO | 6.11 ± 1.54 | 6.09 ± 0.97 | 6.12 ± 6.00 | 0.750929
TG | 2.54 ± 0.99 | 2.38 ± 0.90 | 2.72 ± 2.04 | < 0.01
HDLC | 2.11 ± 0.98 | 2.12 ± 0.50 | 2.08 ± 2.33 | 0.464645
LDLC | 3.41 ± 2.19 | 3.49 ± 2.90 | 3.30 ± 3.88 | 0.13517
ApoA1 | 3.61 ± 5.71 | 3.64 ± 5.15 | 3.58 ± 3.29 | 0.870965
ApoB | 1.9 ± 6.55 | 1.81 ± 2.53 | 2.00 ± 2.16 | 0.610646
Lpa | 206.83 ± 196.87 | 206.80 ± 186.82 | 206.87 ± 2.71 | 0.995141
hsCRP | 4.16 ± 20.55 | 3.14 ± 2.36 | 5.30 ± 5.80 | 0.069232
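A group comparison such as the Age row of Table 1 can be checked from the summary statistics alone. The sketch below is only an illustration: the paper does not state which variant of the T test was used, so Welch's unequal-variance form is an assumption, the function name `welch_t_from_summary` is hypothetical, and the two-sided p-value uses a normal approximation that is reasonable only because both groups are large.

```python
import math


def welch_t_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch t statistic and an approximate two-sided p-value.

    m, s, n are the mean, standard deviation and size of each group.
    The p-value uses the standard normal distribution in place of the
    t distribution (an approximation for large samples).
    """
    standard_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    t = (m2 - m1) / standard_error
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


# Age row of Table 1: non-GDM 31.13 ± 3.76 (n = 634), GDM 32.53 ± 3.90 (n = 566).
t_age, p_age = welch_t_from_summary(31.13, 3.76, 634, 32.53, 3.90, 566)
```

Under these assumptions the statistic is a little above 6, which agrees with the reported P-value of < 0.01 for age.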
example is predicted as a negative example. When we pay the same attention to category 1 and category 0 (the categories are symmetric), the accuracy is a good evaluation index.
The AUC is, as the name implies, the area under the receiver operating characteristic curve. Typically, AUC values range from 0.5 to 1.0, and a larger AUC represents a better performance.
The precision rate is the proportion of actual positive examples among the samples predicted to be positive.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (12)$$

The precision rate reflects the prediction ability of the model on the positive case, which focuses on the positive case. If we pay attention to the prediction accuracy of the positive case, the precision rate is a good indicator.
The recall rate is the proportion of actual positive samples that are predicted to be positive.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (13)$$

The recall rate reflects the coverage of the model in the correct prediction of positive examples, which is a little like “no fish is allowed to escape the net”. If we focus on the comprehensiveness of the prediction of positive samples, the recall rate is a good indicator. The recall rate is not affected by the sample proportion imbalance because it only focuses on the prediction of the positive sample.
The F1-score is a model evaluation index that takes into account both the precision and recall rate, which is defined as

$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (14)$$

When there is no special requirement for the precision or recall rate, it is necessary to consider both the precision rate and recall rate when evaluating a model, and the F1-score can be considered.
Here, the data sample can be binarily classified into positive and negative categories. Then, there are 4 combinations of the predicted results and the real tags: TP, FP, FN, and TN, as shown in (Fig. 2).
Table 2 shows the comparison results for individual machine learning algorithms. As indicated by the output
results, the Xgboost model has high scores in terms of accuracy, AUC value, recall rate, precision and f1-score, making it the best prediction model. When we pay the same attention to category 1 and category 0 (the category is symmetric), the accuracy is a good evaluation index. When there is no special requirement for the precision or recall rate, it is necessary to consider both the precision rate and recall rate when evaluating a model, and the f1-score can be considered. Thus, we use the accuracy and the F1-score as the main evaluation indicators of the model, as shown in (Fig. 3), for a comparison of the accuracy and the F1-score of each model.

Fig. 2 The confusion matrix
Fig. 3 Comparison of the accuracy and the F1-score of each model

5.2.2 Risk Factor Selection of GDM

Table 3 shows the top 20 variables in the ranking of importance for each algorithm variable. Variable VAR00007 appears many times at the top of the list, in addition to other factors, such as SNP34, TG, and SNP37. Pregestational weight and pregestational BMI were listed as important influencing factors of GDM in most models, which may reflect the relationship between obesity and GDM. Systolic blood pressure and diastolic blood pressure were also listed as important factors in GDM in most models, indicating that hypertension has a specific influence on the development of GDM.
We use the principle underlying the Shapley value method: the rank of each factor in each model is taken as the contribution value of that factor in that model, and each of the six models is given a weighting factor of one-sixth. The important factors from each model were re-ranked, and the final ranking results of importance were obtained, as shown in (Table 3). The risk factors for the GDM risk score include age, genetic factors, prepregnancy body mass index, number of pregnancies, blood pressure and other factors, which are also listed as important variables in our study.
To study the importance of the risk factor variables in gestational diabetes, we projected twenty important variables that were selected by the machine learning algorithm onto a four-quadrant matrix. Figure 4 shows the four quadrants of important variables based on Shapley’s ranking of the importance of variables (higher rankings are at the bottom) and comprehensive weighting (smaller weights are on the left). The ranking of the Shapley method on the importance of variables is shown in (Table 4). The comprehensive weight is shown in (Table 5) (the comprehensive weight obtained by the expert scoring method, where 0 is unimportant, 5 is generally important and 10 is very important). The horizontal line dividing the upper
Table 2 Comparison results of individual machine learning algorithms
Model | Set | Accuracy | AUC | Recall | Precision | f1-score
GB | Train | 0.8789 | 0.8761 | 0.8259 | 0.9093 | 0.8656
GB | Test | 0.7000 | 0.6909 | 0.5390 | 0.7525 | 0.6281
LightGBM | Train | 0.8533 | 0.8493 | 0.7765 | 0.8992 | 0.8333
LightGBM | Test | 0.7200 | 0.7102 | 0.5461 | 0.7938 | 0.6471
MLP | Train | 0.5426 | 0.5537 | 0.7466 | 0.5101 | 0.6061
MLP | Test | 0.5250 | 0.5351 | 0.7368 | 0.5000 | 0.5957
RF | Train | 0.7700 | 0.7613 | 0.6047 | 0.8682 | 0.7129
RF | Test | 0.6867 | 0.6759 | 0.4965 | 0.7527 | 0.5983
Xgboost | Train | 0.8889 | 0.8866 | 0.8467 | 0.9112 | 0.8778
Xgboost | Test | 0.7533 | 0.7516 | 0.7183 | 0.7500 | 0.7338
DT | Train | 0.8057 | 0.8023 | 0.7435 | 0.8272 | 0.7831
DT | Test | 0.6367 | 0.6295 | 0.5106 | 0.6429 | 0.5692
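The metrics reported in Table 2 follow Eqs. (11)–(14) and can be computed directly from the four confusion-matrix counts. The sketch below is illustrative: the function name `classification_metrics` is hypothetical, and the counts are hypothetical values chosen so that the results land close to the Xgboost test row; they are not the paper's actual confusion matrix. The sketch assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a 150-sample test set (illustration only).
metrics = classification_metrics(tp=51, fp=17, tn=62, fn=20)
```

Because the F1-score is the harmonic mean of precision and recall, it is pulled toward the smaller of the two, which is why a model with high precision but low recall (for example the LightGBM test row) still receives a modest F1-score.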
Table 3 The top 20 ranked variables for each algorithm
Rank | Machine-learning algorithms
Rank | GB | LightGBM | MLP | RF | Xgboost | DT
1 | VAR00007 | VAR00007 | VAR00007 | VAR00007 | VAR00007 | VAR00007
2 | SNP34 | hsCRP | SNP34 | SNP34 | Age | SNP34
3 | TG | SNP37 | Age | TG | wbc | wbc
4 | SNP37 | TG | BMI before pregnancy | Age | TG | TG
5 | Age | Age | TG | BMI before pregnancy | SNP37 | ApoB
6 | wbc | wbc | Weight before pregnancy | SNP37 | hsCRP | HDLC
7 | hsCRP | SNP34 | BMI classification | hsCRP | SNP34 | AST
8 | Cr | Cr | Systolic blood pressure | Systolic blood pressure | AST | wbc
9 | AST | Lpa | wbc | Diastolic blood pressure | Cr | SNP42
10 | HDLC | BUN | Diastolic blood pressure | Weight before pregnancy | BMI before pregnancy | Height
11 | SNP20 | BMI before pregnancy | Number of pregnancy | LDLC | HDLC | CHO
12 | Systolic blood pressure | ALT | ALT | Cr | ALT | Number of pregnancy
13 | CHO | ApoB | SNP46 | RBP4 | Height | hsCRP
14 | SNP38 | CHO | SNP32 | CHO | BUN | Cr
15 | BMI before pregnancy | Diastolic blood pressure | SNP42 | ALT | ApoB | BMI before pregnancy
16 | SNP42 | Weight before pregnancy | SNP13 | ApoA1 | Systolic blood pressure | Age
17 | SNP48 | ApoA1 | Number of births | Number of births | Diastolic blood pressure | SNP23
18 | Sugar sieve gestational age | RBP4 | SNP40 | Number of pregnancy | SNP23 | ApoA1
19 | SNP23 | LDLC | hsCRP | HDLC | RBP4 | ALT
20 | SNP27 | AST | SNP17 | ApoB | CHO | RBP4
and lower quadrants is based on the median value of Shapley's ranking, while the vertical line is the median value of the weight. The lower right quadrant is the quadrant with the highest rankings and highest weights. The factors in this quadrant are very important for the development of gestational diabetes. By contrast, the upper left quadrant contains factors with low rankings and low weights. Comparatively speaking, these factors are less related to the development of gestational diabetes. The remaining two quadrants correspond to abnormal points with high ranking and low weight or low ranking and high weight.

5.2.3 Attractive Variables (First Quadrant)

We found that ApoA1, ApoB, and the number of pregnancies were present in the first quadrant with high weight and low ranking. Recent studies have shown that ApoA1 and ApoB are apolipoproteins. Apolipoproteins can play a role in stabilizing the structure of lipoproteins by binding and transporting lipids, regulating the activities of key lipoprotein enzymes and participating in the key link of lipoprotein metabolism. The combined detection of apolipoprotein ApoA1 and ApoB is of great value in diagnosing diabetes mellitus complicated with lipid metabolism disorders and helps guide the treatment and prognosis of diabetes. Studies have also shown that pregnant women with a higher prepregnancy BMI have a greater risk, and the interaction caused by the parity of pregnant women cannot be ignored. Body mass index (BMI) and parity are the main risk factors for pregnant women to develop gestational diabetes [46]. Second, overweight or obesity is a risk factor for gestational diabetes [47]. In addition, parity can increase the impact
Fig. 4 Quadrant distribution of 20 variables
Table 4 Variable ranking based on the Shapley value
Model GB LightGBM MLP RF Xgboost DT Mean rank
Feature importance rank
VAR00007 1 1 1 1 1 1 1.00
SNP34 2 7 2 2 7 2 3.67
TG 3 4 5 3 4 4 3.83
SNP37 4 3 6 5 3 4.20
Age 5 5 3 4 2 16 5.83
wbc 6 6 9 3 8 6.40
hsCRP 7 2 19 7 6 13 9.00
BMI before pregnancy 15 11 4 5 10 15 10.00
Cr 8 8 12 9 14 10.20
Weight before pregnancy 16 6 10 10.67
Systolic blood pressure 12 8 8 16 11.00
HDLC 10 19 11 6 11.50
AST 9 20 17 8 7 12.20
Diastolic blood pressure 15 10 9 17 12.75
ApoB 13 20 15 5 13.25
Number of pregnancy 11 18 12 13.67
ALT 12 12 15 12 19 14.00
CHO 13 14 14 20 11 14.40
ApoA1 17 16 18 17.00
RBP4 18 13 19 20 17.50
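The mean ranks in Table 4 can be reproduced with a short sketch. The listed values suggest that the mean is taken over the models in which a variable appears in the top 20 (for example, Cr has five listed ranks and a mean of 10.20); that reading is an inference from the table, not a statement in the paper, and the function name `mean_rank` is hypothetical.

```python
def mean_rank(ranks):
    """Equal-weight average of the ranks a variable received across models."""
    return sum(ranks) / len(ranks)


# Ranks copied from Table 4 (only the values listed for each variable).
ranks_by_variable = {
    "Cr": [8, 8, 12, 9, 14],
    "TG": [3, 4, 5, 3, 4, 4],
    "VAR00007": [1, 1, 1, 1, 1, 1],
    "SNP34": [2, 7, 2, 2, 7, 2],
}

# Re-rank the variables by their mean rank (smaller is more important).
ordered = sorted(ranks_by_variable, key=lambda v: mean_rank(ranks_by_variable[v]))
```

For a variable ranked by all six models, the equal-weight average is the same as giving each model's rank a weighting factor of one-sixth.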
Table 5 Variable weight based on expert scoring
Variable | Expert1 | Expert2 | Expert3 | Expert4 | Mean | Weight
VAR00007 | 10 | 5 | 10 | 10 | 8.75 | 0.0875
SNP34 | 10 | 10 | 5 | 5 | 7.5 | 0.075
TG | 5 | 10 | 5 | 10 | 7.5 | 0.075
SNP37 | 10 | 5 | 10 | 5 | 7.5 | 0.075
Age | 10 | 15 | 5 | 10 | 10 | 0.1
wbc | 5 | 5 | 5 | 5 | 5 | 0.05
hsCRP | 5 | 5 | 5 | 5 | 5 | 0.05
Prepregnancy BMI | 5 | 10 | 5 | 10 | 7.5 | 0.075
Cr | 0 | 5 | 5 | 0 | 2.5 | 0.025
Prepregnancy Weight | 5 | 0 | 5 | 10 | 5 | 0.05
Systolic blood pressure | 0 | 5 | 5 | 5 | 3.75 | 0.0375
HDLC | 0 | 10 | 5 | 0 | 3.75 | 0.0375
AST | 5 | 0 | 0 | 0 | 1.25 | 0.0125
Diastolic blood pressure | 0 | 5 | 5 | 5 | 3.75 | 0.0375
ApoB | 0 | 0 | 5 | 0 | 1.25 | 0.0125
Number of pregnancies | 5 | 5 | 5 | 10 | 6.25 | 0.0625
ALT | 5 | 0 | 0 | 0 | 1.25 | 0.0125
CHO | 5 | 0 | 0 | 0 | 1.25 | 0.0125
ApoA1 | 5 | 0 | 5 | 0 | 2.5 | 0.025
RBP4 | 10 | 5 | 10 | 10 | 8.75 | 0.0875
Total | 100 | 100 | 100 | 100 | 100 | 1
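The Weight column of Table 5 is consistent with dividing each variable's mean expert score by the column total of 100. The sketch below illustrates that arithmetic; the function name `expert_weight` is hypothetical, and treating the total as a fixed 100 is an assumption read off the Total row.

```python
def expert_weight(expert_scores, total=100.0):
    """Mean of the expert scores, normalized by the total of all mean scores."""
    mean_score = sum(expert_scores) / len(expert_scores)
    return mean_score / total


# Expert scores copied from Table 5.
scores = {
    "VAR00007": [10, 5, 10, 10],
    "Age": [10, 15, 5, 10],
    "Cr": [0, 5, 5, 0],
}
weights = {name: expert_weight(values) for name, values in scores.items()}
```

With this normalization the weights of all 20 variables sum to 1, matching the Total row.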
of overweight and obesity on gestational diabetes. Pregnant women with more recent last pregnancies will have relatively weak physical function and a relatively low metabolic rate, thus impairing glucose tolerance. The incidence of diabetes and other diseases is relatively high. The predictability of these three factors on machine learning algorithms may not be as good as other factors, but their impact on gestational diabetes has received more attention from scholars in recent years.

5.2.4 Least Preferred Variables (Second Quadrant)

Women who are obese are not choosing a good time to become pregnant. Cholesterol is a common clinical marker of lipid metabolism that can reflect abnormal lipid metabolism. If the body’s lipid metabolism is disordered, it will lead to abnormal glucose metabolism. The reason is that the serum cholesterol level may cause the progression of insulin resistance. RBP4 is a new type of adipocytokine and a type of secreted retinol binding protein that transports retinol through the liver and blood circulation. Studies have shown that improving serum RBP4 levels is essential for improving insulin sensitivity and maintaining blood sugar stability. HDLC can control glucose homeostasis through insulin secretion and direct glucose uptake by AMP-activated protein kinase in the muscles and possibly enhance insulin sensitivity. Studies have shown that the higher the HDLC is, the lower the risk of gestational diabetes. Through a literature search, we found that ALT, AST, diastolic blood pressure and systolic blood pressure also play a role in the development of gestational diabetes (for details about the literature, please refer to Appendix Tables 6, 7).

5.2.5 Potential Variables (Third Quadrant)

There was only one factor in the third quadrant, Cr (creatinine). Previous studies have shown that Cr is susceptible to interference from external stimuli. As long as the kidney has a strong compensatory function, the Cr level can be maintained in the normal range. Therefore, this indicator can only be used as an auxiliary indicator and has little significance for early diagnosis. However, machine learning algorithm prediction research has shown that Cr has important predictive value in predicting gestational diabetes [48]. In addition, recent studies have suggested that the combined detection of four indicators (serum albumin, beta-2-microglobulin, non-esterified fatty acids and Cr) has a significant effect on GDM in addition to high diagnostic efficiency, especially for the early diagnosis of disease, dynamic monitoring of index factor expression differences and the sustainable monitoring of disease progression.

5.2.6 Important Variables (Fourth Quadrant)

Factors such as the prepregnancy BMI, TG, age, VAR00007, hsCRP, wbc, SNP34 and SNP37 belong to the fourth
Fig. 5 a Monotonicity of Prepregnancy, TG, Age and VAR00007. b Monotonicity of hsCRP, wbc, SNP34 and SNP37
quadrant, which is a group of factors with higher weight and higher ranking. From the data, we found that the higher the value of the factors, the greater the risk of disease, as shown in (Fig. 5). Through the study of previous research literature, we have indeed found that these factors play an important role in the development of gestational diabetes (for details about the literature, please refer to Appendix Tables 6, 7). In other words, obese pregnant women, older pregnant women and pregnant women with lower physical fitness have a higher risk of gestational diabetes.
Therefore, the variables VAR00007, SNP34, TG, age, SNP37, wbc, hsCRP, and prepregnancy BMI were the factors that had the greatest impact on pregnant women suffering from gestational diabetes. In addition, Cr is a latent variable that requires more attention. Although scholars have recently believed that ApoA1, ApoB and the number of pregnancies also have an important impact on gestational diabetes, their importance is far less than that of VAR00007, SNP34, TG and other factors. When SNP34 and SNP37 were 2, the incidence of GDM was relatively high, and the higher the VAR00007 was, the more prone to gestational diabetes the woman was; through reading the literature, we suspect that VAR00007 is an insulin resistance index.

Fig. 6 Comparison results on the index accuracy and F1-score
Fig. 7 Radar map showing a comparative analysis of different methods

5.3 Analysis of Predicted Results

5.3.1 Result of the Ensemble LightGBM‑Xgboost‑GB Model

We calculated the gray correlation coefficient between the predicted value of the j-th prediction model and the observed value of the i-th predicted sample point, thereby
calculating the gray correlation degree between the predicted value of the model and the observed value.
The correlations between the predicted value and the true value of the LightGBM, Xgboost, and GB models calculated by gray correlation analysis are r1 = 0.631407, r2 = 0.631814, and r3 = 0.631407, respectively. Using the normalized value of the gray correlation degree as the coefficient value of the combined model, w1 = 0.33262, w2 = 0.33476, and w3 = 0.33262 are obtained. Then, a genetic algorithm is used to optimize the integrated model coefficient parameters, and the final weight values are w1 = 0.2130, w2 = 0.5320 and w3 = 0.2550.
After the combined model coefficients are obtained, the test sample is input to obtain the best prediction result. Table 8 compares the prediction correctness of the ensemble learning model with each individual learning algorithm. The comparison results are shown in (Fig. 6).
Compared with the LightGBM, Xgboost and GB algorithms, the proposed ensemble blending learning method has the best performance.

5.3.2 Comparative Analysis Using Other Ensemble Methods

Ensemble learning methods are divided into stacking, blending and voting, which are powerful prediction techniques since they can increase the diversity of algorithms and reduce generalization errors to improve the accuracy of the results. We refer to other methods about the ensemble model of pregnancy diabetes prediction [21, 22]. Unlike our ensemble blending method, other articles use the voting method and the stacking method. Therefore, to show that the method for the ensemble blending model is better, we ensemble LightGBM, Xgboost and GB models with the voting method and stacking method and then predict the same dataset again. The final prediction results are shown in (Table 9 and Fig. 7).

5.4 Conclusion

From the results of Table 9 and Fig. 7, we can conclude that the result of the ensemble model obtained by the mixed (blending) mode is better than the other two methods. Therefore, we believe that the new integrated model has a good effect in predicting gestational diabetes mellitus. Although the prediction accuracy of the new integrated model is only 2.56 percentage points higher than that of a single best model, the prediction accuracy of the new integrated model is of great benefit to the prediction of gestational diabetes mellitus. It is possible to improve the accuracy of the diagnosis of the disease to improve maternal and infant outcomes as early as possible.
Gestational diabetes mellitus is a common disease during pregnancy. Therefore, it is necessary to identify this disease as soon as possible. This research primarily verifies the accuracy of the commonly used machine learning algorithms used in several gestational diabetes mellitus cases in the past and obtains the best ensemble prediction method for predicting diabetes mellitus. The results also show that our proposed ensemble method is better.

6 Discussion

This paper proposes a new integrated LightGBM-Xgboost-GB method to help MCH detect and determine the risk of GDM in pregnant women and to take measures to control pregnancy outcomes. We analyzed and compared several risk prediction models for characterizing the risk of developing GDM. To predict gestational diabetes, we paid more attention to determining whether a pregnant woman has gestational diabetes according to the examination results. To our knowledge, this is the first study to assess the importance of variables and to characterize the risk of developing GDM using different machine learning methods. Our results were consistent with previous findings. From the research conclusions of many scholars, the age, BMI, WBC, hsCRP, RBP4, weight, calpain-10 gene, and TG were important risk factors. Our results also revealed their prominent presence among the top 10 key factors for GDM.
The identification of key factors has important clinical significance in GDM risk assessment. We used Shapley's idea to rank the importance of the feature variables of each model comprehensively to obtain a more objective and reasonable set of routine features for the GDM population. In the past ten years, the detection of various indicators has matured very rapidly. Therefore, when assessing the risk of gestational diabetes, MCH medical staff need to pay more attention to these important factors.
Most importantly, we propose a new integrated model that uses gray correlation to calculate weights and a genetic algorithm to optimize them. Using an ensemble model composed of the Xgboost, GB and LightGBM algorithms, the results are encouraging. In terms of prediction accuracy and comprehensive effects, the final model is better than the commonly used machine learning models. Our main contribution is reflected in two parts. First, some scholars have used the traditional regression tree analysis method to study the prediction model of gestational diabetes mellitus, but few studies have integrated multiple models to predict gestational diabetes mellitus. Therefore, this research focuses on the integration of LightGBM, Xgboost and GB algorithms.
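The integration of the three models with the final weights reported in Section 5.3.1 can be sketched as a weighted combination of their outputs. This is only an illustration under stated assumptions: it assumes each base model returns a probability of GDM, that w1, w2 and w3 correspond to LightGBM, Xgboost and GB in the order the paper lists them, and that a 0.5 decision threshold is used; the function name `blend` and the example probabilities are hypothetical.

```python
# Final weights from Section 5.3.1 (w1, w2, w3).
WEIGHTS = {"LightGBM": 0.2130, "Xgboost": 0.5320, "GB": 0.2550}


def blend(probabilities, weights=WEIGHTS, threshold=0.5):
    """Weighted sum of base-model probabilities and the resulting label.

    `probabilities` maps each model name to its predicted probability.
    The threshold of 0.5 is an assumption, not taken from the paper.
    """
    score = sum(weights[name] * probabilities[name] for name in weights)
    return score, int(score >= threshold)


# Hypothetical base-model outputs for one pregnant woman.
score, label = blend({"LightGBM": 0.40, "Xgboost": 0.70, "GB": 0.55})
```

Because the weights sum to 1, the blended score stays on the same scale as the individual probabilities, and the Xgboost output has the largest influence on the result.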
These three models have strong processing abilities for small samples, many missing values and classification features. It provides a solution for disease prediction models with poor data quality. Second, some scholars have established a model explanation for gestational diabetes mellitus based on physiological indicators but have not accounted for genetic factors. Some studies have shown that genetic factors are also the cause of gestational diabetes mellitus. In practice, we hope to improve the diagnostic accuracy and work efficiency of doctors and reduce the negative impact of missed diagnosis and misdiagnosis. Second, this research idea provides a research case in the direction of disease prediction, enriches the application of artificial intelligence in the medical field, and provides ideas for disease diagnosis and prediction research.

Table 8 Comparison results of models
Model | Accuracy | AUC | Recall | Precision | F1-score
LightGBM | 0.7200 | 0.7102 | 0.5461 | 0.7938 | 0.6471
Xgboost | 0.7533 | 0.7516 | 0.7183 | 0.7500 | 0.7338
GB | 0.7000 | 0.6909 | 0.5390 | 0.7525 | 0.6281
Ensemble methods | 0.7792 | 0.7765 | 0.7500 | 0.7570 | 0.7534

However, several limitations should be mentioned. First, the examination time and region of the patients in these data were unknown, and the analysis results may be different by time and region. Second, we must perform future research with external validation and other machine learning methods to assess the important characteristic variables. In addition, it is difficult to explain the inherent complexity of variable interactions and their impacts on outcomes because of the “black box” nature of machine learning methods.
In conclusion, this study is based on machine learning methods that were used to predict the early occurrence of gestational diabetes, which will help MCH staff diagnose the risk of diabetes in pregnant women in a timely manner and provide a basis for preventive intervention and treatment of GDM. By using a series of machine learning models, we used Shapley's idea to rank the importance of the feature variables of each model comprehensively to obtain a rule feature set of GDM high-risk groups with higher objectivity and rationality. We analyzed and explained the impact of this group of important factors on the development of gestational diabetes. Our results demonstrate the excellent ability of machine learning to identify risk factors and to predict the results of multidimensional data, which enables us to have a deeper understanding of disease risk factors without causation. This paper verifies the feasibility and effectiveness of the multimodal combination prediction model for the accurate diagnosis of gestational diabetes through theoretical research and experimental analysis. It has made practical contributions to the study of intelligent recognition models for gestational diabetes, but there are still deficiencies in the research or some issues that are not well considered. In today's medical industry, the digitalization of hospitals is gradually increasing. The popularity of electronic medical
records and the digitization of medical equipment have further increased the amount and scope of medical data. With their rapid accumulation and scale growth, the era of medical big data has arrived. In this case, the artificial intelligence method represented by machine learning studied in this article can provide support for scientific medical decision-making and promote the healthy development of China's medical service industry.

Table 9 Comparison of prediction results for different ensemble methods
Model | Accuracy | AUC | Recall | Precision | F1-score
Blending | 0.7792 | 0.7765 | 0.7500 | 0.7570 | 0.7195
Stacking | 0.7472 | 0.7765 | 0.6316 | 0.6316 | 0.6792
Voting | 0.7306 | 0.7297 | 0.7087 | 0.7195 | 0.6982

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s44196-022-00110-8) contains supplementary material, which is available to authorized users.

Acknowledgements This work was supported by a grant from the Key Disease of Diabetes Mellitus Study Center at the National Chinese Medicine Clinical Research Base, the National Natural Science Foundation of China grant Nos. U2001201 and 61876055 and National Steering Committee for Graduate Education of Chinese Medicine and Traditional Chinese Medicine grant No. 20190723-FJ-B39.

Author contributions XW, YW, SZ, LY, and SX: contributed to the conception of the study; YW: performed the simulation experiment; XW: contributed significantly to the model; SX, YW, and SZ: performed the data analyses and wrote the manuscript; and LY and SX: helped perform the analysis with constructive discussions.

Data availability The data came from the Tianchi precision medicine competition, artificial intelligence-assisted genetic risk prediction of diabetes, which was held by Aliyun United Qingwutong Health Technology Co. LTD. (https://tianchi.aliyun.com/).

Declarations

Conflict of Interest The authors declare that they have no conflict of interest.

Ethical Approval and Consent to Participate The data used in this paper is a public data set without ethical experiments.

Consent for Publication The authors involved in this paper agree to publish this paper.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
1. National Bureau of Statistics. Birth Rate[DB/OL]. http://www.stats.gov.cn/2021. Accessed 2021
2. OECD. Global fertility in developed countries[DB/OL]. https:// 21. Kumari, S., Kumar, D., Mittal, M.: An ensemble approach for
www.oecd.org/.2021. Accessed 2021 classification and prediction of diabetes mellitus using soft voting
3. Stewart, Z.A.: Gestational diabetes[J]. Obstet. Gynaecol. Reprod. classifier[J]. Int. J. Cognitive Comput. Eng. 2(1), 40–46 (2021)
Med. 30(3), 79–83 (2020) 22. Wang X.: Application of integrated learning in gestational dia-
4. Wang, X., Chen, M., Xia, W., Zhu, K., et al.: Improving the betes mellitus prediction[D]. Master thesis. Chongqing Normal
risk management of Type 2 diabetes mellitus in China from the University, pp. 14–25 (2020)
perspective of social relationships[J]. Expert. Syst. 37(2), 1–18 23. Zhang, H., He, G., Wang, J.: Research on type 2 diabetes mellitus
(2020) precise prediction models based on XGBoost algorithm[J]. China.
5. Wang, X., Gong, W., Zhu, K., et al.: Sequential prediction of Exp. Diagn. 22(3), 408–412 (2018)
glycosylated hemoglobin based on long short-term memory with 24. Wang, X., Wang, X., Li, L.: Application of deep learning model
self-attention mechanism[J]. Int. J. Comput. Intell. Syst. 13(1), in predicting the risk of type 2 diabetes mellitus[J]. Elect. J. Clin.
1578–1589 (2020) Med. Liter. 4(84), 16460–16461 (2017)
6. Vounzoulaki, E., Khunti, K., Abner, S.C., et al.: Progression to 25. Lan, T., Hu, H., Jiang, C., et al.: A comparative study of decision
type 2 diabetes in women with a known history of gestational tree, random forest, and convolutional neural network for spread-F
diabetes: Systematic review and meta-analysis[J]. Br. Med. J. identification[J]. Adv. Space Res. 65(8), 2052–2061 (2020)
369(1361), 1–11 (2020) 26. Begenova, S., Avdeenko, T.: Building of fuzzy decision trees
7. Zheng, W.: Case control study of gestational diabetes mellitus using ID3 algorithm[J]. J. Phys: Conf. Ser. 1015(2), 22002–22009
influential factors and maternal and fetal outcomes[D]. Master (2018)
thesis. China Medical University, pp. 1–10 (2009) 27. Qiao, W., Tian, W., Tian, Y., et al.: The forecasting of PM2.5 using
8. Care, D., Suppl, S.: Classification and diagnosis of diabetes: a hybrid model based on wavelet transform and an improved deep
Standards of medical care in diabetesd-2019[J]. Diabetes. Care. learning algorithm[J]. IEEE Access 7(7), 142814–142825 (2019)
42(1), 13–28 (2019) 28. Lu, Y., Fu, X., Chen, F., et al.: Prediction of fetal weight at var-
9. Cheruku, R., Edla, D.R., Kuppili, V.: Diabetes classification using ying gestational age in the absence of ultrasound examination
radial basis function network by combining cluster validity index using ensemble learning[J]. Artif. Intell. Med. 102(101748), 1–10
and BAT optimization with novel fitness function[J]. Int. J. Com- (2020)
put. Intell. Syst. 10(1), 247–265 (2017) 29. Li, X.: Using, “ random forest ” for classification and regression[J].
10. Wu, B., Huang, H., Yao, Q., et al.: The application of big data and Chin. J. Appl. Entomol 50(4), 1190–1197 (2013)
artificial intelligence methods in prediction of GDM[J]. Chin J. 30. Lombardo, L., Cama, M., Conoscenti, C., et al.: Binary logistic
Health. Inform. Manag. 114(6), 832–837 (2017) regression versus stochastic gradient boosted decision trees in
11. Rissanen, J., Markkanen, A., et al.: Sulfonylurea receptor 1 gene assessing landslide susceptibility for multiple-occurring landslide
variants are associated with gestational diabetes and type 2 dia- events: application to the 2009 storm event in Messina (Sicily,
betes but not with altered secretion of insulin[J]. Diabetes. Care. southern Italy)[J]. Nat. Hazards 79(3), 1621–1648 (2015)
23(1), 70–73 (2000) 31. Ye, J, Chow, J-H, Chen, J.: Stochastic Gradient Boosted Distrib-
12. Bao, W., Yeung, E., Tobias, D.K., et al.: Long-term risk of type uted Decision Trees[C]. Proceedings of the 18th ACM Conference
2 diabetes mellitus in relation to BMI and weight change among on Information and Knowledge Management, 2061–2064 (2009)
women with a history of gestational diabetes mellitus: a prospec- 32. Chen, T, Guestrin, C.: XGBoost: A scalable tree boosting
tive cohort study[J]. Diabetologia 58(6), 1212–1219 (2015) system[C]. International Conference on Knowledge Discovery
13. Minooee, S., Ramezani Tehrani, F., et al.: Diabetes incidence and Data Mining, 785–794 (2016)
and influencing factors in women with and without gestational 33. Yue, L., Yi, Z., Pan, J., et al.: Identify M Subdwarfs from M-type
diabetes mellitus: A 15 year population-based follow-up cohort Spectra using XGBoost[J]. Optik 225(2), 165535.1-165535.6
study[J]. Diabetes Res. Clin. Pract. 128(1), 24–31 (2017) (2021)
14. Li, F., Hu, Y., Zeng, J., et al.: Analysis of risk factors related to 34. Sharma, V., Mir, R.N.: An enhanced time efficient technique for
gestational diabetes mellitus[J]. Taiwan. J. Obstet. Gynecol. 59(5), image watermarking using ant colony optimization and light gra-
718–722 (2020) dient boosting algorithm[J]. J. King Saud Univ – Comput. Inf. Sci.
15. Kuzmicki, M., Telejko, B., Szamatowicz, J., et al.: High resistin 34(3), 615–626 (2019)
and interleukin-6 levels are associated with gestational diabetes 35. Ke, G, Meng, Q, Finley, T.: LightGBM: A Highly Efficient Gra-
mellitus[J]. Gynecol. Endocrinol 25(4), 258–263 (2009) dient Boosting Decision Tree[C]. Adv Neural Inf Process Syst,
16. Rezvan, N., Hosseinzadeh Attar, M.J., Masoudkabir, F., et al.: 3146–3154 (2017)
Serum visfatin concentrations in gestational diabetes mellitus and 36. Del Ser, J., Rokach, L., Herrera, F., et al.: A practical tutorial
normal pregnancy[J]. Arch. Gynecol. Obstet. 285(5), 1257–1262 on bagging and boosting based ensembles for machine learning:
(2011) Algorithms, software tools, performance study, practical perspec-
17. Shaat, N., Karlsson, E., Lernmark, A., et al.: Common variants in tives and opportunities[J]. Inf. Fusion. 64(1), 205–237 (2020)
MODY genes increase the risk of gestational diabetes mellitus[J]. 37. Zeng, X., Yeung, D.S.: Hidden neuron pruning of multilayer per-
Diabetologia 49(7), 1545–1551 (2006) ceptrons using a quantified sensitivity measure[J]. Neurocomput-
18. Kumar, D., Jain, N., Khurana, A., et al.: Automatic detection of ing 69(4), 825–837 (2006)
white blood cancer from bone marrow microscopic images using 38. Shadkani, S., Abbaspour, A., Samadianfard, S., et al.: Compara-
convolutional neural networks[J]. IEEE Access 8(1), 142521– tive study of multilayer perceptron-stochastic gradient descent and
142531 (2020) gradient boosted trees for predicting daily suspended sediment
19. Mittal, M., Arora, M., Pandey, T., Goyal, L.M.: Image segmenta- load: The case study of the Mississippi River, U.S.[J]. Int. J. Sedi-
tion using deep learning techniques in medical images[M]. Algor. ment. Res. 36(4), 512–523 (2021)
Intell. Syst. (2019). https://d oi.o rg/1 0.1 007/9 78-9 81-1 5-1 100-4_3 39. Wang, X., Wang, J., Zhang, K., et al.: Convergence and objective
20. Nombo, A.P., Mwanri, A.W., et al.: Gestational diabetes mellitus functions of noise-injected multilayer perceptrons with hidden
risk score: a practical tool to predict gestational diabetes mellitus multipliers[J]. Neurocomputing 452(7), 796–812 (2020)
risk in Tanzania[J]. Diabetes Res. Clin. Pract. 145(8), 130–137 40. Jianyu Y.: Research on Predictive Model of Gestational Diabetes
(2018) Based on Integrated Learning Algorithm[D]. Master thesis. Har-
bin Institute of Technology, pp. 37–46 (2019)
13
41. Yang, M., Deng, M.H., et al.: (2010) Research on index weight 46. Huo, Z., Li, H., Du, W.: The effect of pre-pregnancy BMI and
based on improved grey relational analysis[J]. Int. Conf. Mach. parity on gestational diabetes mellitus among pregnant women[J].
Learning Cybern. 4(1), 1967–1970 (2010) J. Clin. Pathol. Res. 36(2), 161–167 (2016)
42. Deng, J.: Grey information space[J]. J. Grey. Syst. 1(2), 103–117 47. Paula Bertoli, J.P., Schulz, M.A., et al.: Obesity in patients with
(1989) gestational diabetes: Impact on newborn outcomes[J]. Obes. Med.
43. Fang, Z., Liu, S., Forrest, J.: A new definition for the degree of 20(1), 100296.1-100296.5 (2020)
grey incidence[J]. Sci. Inq. 7(2), 111–124 (2006) 48. Mishra, S., Shetty, A., Rao, C.R., et al.: Risk factors for gestational
44. Jana, C., Pal, M.: A dynamical hybrid method to design decision diabetes mellitus: A prospective case-control study from coastal
making process based on GRA approach for multiple attributes Karnataka[J]. Clin. Epidemiol. Glob. Health. 8(4), 1082–1088
problem[J]. Eng. Appl. Artif. Intell. 100(82), 104203.1-104203.10 (2020)
(2021)
45. Dong, X., Zhang, H., et al.: Hybrid genetic algorithm with varia- Publisher's Note Springer Nature remains neutral with regard to
ble neighborhood search for multi-scale multiple bottleneck trave- jurisdictional claims in published maps and institutional affiliations.
ling salesmen problem[J]. Futur. Gener. Comput. Syst. 114(3),
229–242 (2021)
13
