
Computers in Biology and Medicine 166 (2023) 107498
Identification of cell-type-specific genes in multimodal single-cell data using deep neural network algorithm
Weiye Qian, Zhiyuan Yang *
School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou, PR China
ARTICLE INFO

Keywords: Bioinformatics; Multimodal single-cell study; Deep learning

ABSTRACT

The emergence of single-cell RNA sequencing (scRNA-seq) technology makes it possible to measure DNA, RNA, and protein in a single cell. Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq) is a powerful multimodal single-cell research innovation, allowing researchers to capture RNA and surface protein expression on the same cells. Currently, identification of cell-type-specific genes in CITE-seq data is still challenging. In this study, we obtained a set of CITE-seq datasets from the Kaggle database, which included the sequencing dataset of seven cell types during bone marrow stem cell differentiation. We used Student's t-test to analyze these transcription RNAs and picked out 133 significantly differentially expressed genes (DEGs) among all cell types. Functional enrichment revealed that these DEGs were strongly associated with blood-related diseases, providing important insights into the cellular heterogeneity within bone marrow stem cells. The relation between RNA and protein levels was modeled with a deep neural network (DNN), which achieved a high prediction score of 0.867. Based on their coefficients in the DNN model, three genes (LGALS1, CENPV, TRIM24) were identified as cell-type-specific genes in erythrocyte progenitor. Our work provides a novel perspective regarding the differentiation of stem cells in the bone marrow and valuable insights for further research in this field.
* Corresponding author.
E-mail address: yangzhiyuan@hdu.edu.cn (Z. Yang).
https://doi.org/10.1016/j.compbiomed.2023.107498
Received 25 March 2023; Received in revised form 15 August 2023; Accepted 15 September 2023
Available online 16 September 2023
0010-4825/© 2023 Elsevier Ltd. All rights reserved.

1. Introduction

Single-cell RNA sequencing (scRNA-seq) is a cutting-edge technology for cell state discovery and characterization, since it can detect RNA expression in individual cells. Multimodal data refers to data composed of different types of multi-omics information, which are semantically related and provide complementary information to each other [1]. Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq), developed by Stoeckius et al. [2], represents the latest innovation of scRNA-seq technology for multimodal study. With CITE-seq technology, transcription RNA and surface protein can be quantified simultaneously in a single cell [3]. The presence of unique variation factors between data sourced from different modes must be considered in analyzing CITE-seq data. Large-scale multimodal CITE-seq datasets sourced from different samples at different times often contain batch effects, making the result undesirable [4]. Scientists have also reported that a key challenge in multimodal single-cell analysis is the design of appropriate methods to reconstruct the relation between transcription RNAs and proteins [5]. Currently, because appropriate algorithms for tackling batch effects are lacking, cell-type-specific analysis of multimodal CITE-seq data remains challenging [6]. With this in mind, multimodal machine learning could serve as the solution to this problem.

Multimodal machine learning leverages diverse biological data sources to extract informative features and integrate them, thereby enhancing the performance of learning tasks. In the context of single-cell analysis, where understanding the intricate relationships between different biological molecules is crucial, the fusion of information from various sources becomes particularly relevant. Multimodal machine learning methods can be classified into three categories: multimodal fusion, cross-modal learning, and shared representation learning [7]. In this study, we employed the second category of methods because only one modality of data, namely RNA gene expression data, was available during the testing phase. Some machine learning methods, such as LightGBM and TabNet, have shown good performance in many biology studies. LightGBM employs various data structures and memory optimization techniques to handle large-scale data efficiently without requiring data partitioning. TabNet incorporates a specific feature
interaction mechanism that effectively captures the interactions between features in tabular data. TabNet also exhibits strong representation learning and end-to-end training statistical characteristics. Based on their previous performance, LightGBM and TabNet algorithms were applied to analyze the CITE-seq data in this study.

In addition, cell-type-specific genes can be identified by understanding the relations between transcription RNA and their corresponding proteins. Relation models for RNAs and proteins for existing samples have often been constructed using a deep learning algorithm. Deep learning is mostly built on very large datasets of unlabeled and labeled data. In both industry and academia, the application of deep learning algorithms in bioinformatics has led to new data features being mined [8]. The deep neural network (DNN) algorithm is one of the deep learning algorithms. DNN has developed rapidly in recent years and has demonstrated advanced performance in numerous fields. Thus, the DNN model was also applied to the identification of cell-type-specific genes in CITE-seq data. Using machine learning and DNN models, we could determine the changes in genetic dynamics in this multimodal single-cell data.

In this study, we obtained a set of multimodal single-cell data generated by CITE-seq technology. This dataset included the sequenced information from bone marrow stem cells collected from diverse samples obtained at various time points. These stem cells consisted of seven cell types: Hematopoietic Stem Cell (HSC), Mast Cell Progenitor (MasP), Megakaryocyte Progenitor (MkP), Neutrophil Progenitor (NeuP), Monocyte Progenitor (MoP), Erythrocyte Progenitor (EryP), and B-Cell Progenitors (BP). HSCs are one of the most important stem cells in blood, with the ability to renew and differentiate [9]. These cells are finely controlled by many signals in the bone marrow, and give rise to all types of mature blood cells [10]. During differentiation, the cells in the bone marrow change cell states dynamically.

We screened out 133 differentially expressed genes (DEGs) by conducting Student's t-test comparisons among all cell types. The DNN algorithm was applied to analyze the interaction between transcription RNA and proteins in bone marrow stem cells. This model showed an average prediction score of 0.867 and demonstrated good performance for predicting protein levels from transcription RNAs. The highest-performing parameters allowed more accurate selection of cell-type-specific genes.

2. Materials and Methods

2.1. Datasets and materials

High-quality data are the basic component for high-quality analysis. The multimodal single-cell datasets of this study were obtained from Kaggle (http://www.kaggle.com/competitions/open-problems-multimodal), a platform that hosts many datasets for machine learning analysis [11]. These datasets included 70,988 cells measured by CITE-seq technology. The samples of these datasets were isolated from bone marrow stem cells of blood in four healthy donors at three different time points. The flowchart of our work is shown in Fig. 1.

Fig. 1. Flowchart of the experiments reported in this work.

2.2. Analysis of gene expression at different time points

To analyze the differences among samples, we divided our CITE-seq dataset into three groups according to their time points. The mean and standard deviation (SD) of transcription RNA (gene-transcribed RNA) at each time point were calculated. Because the RNAs are transcribed from their genes, the corresponding gene names of the transcribed RNAs were used to facilitate the following analysis in this study. The proportion of gene expression level in each cell type was also calculated.

2.3. Identification of DEGs among different cell types

Due to the presence of possible noise in the dataset, feature selection is a necessary process before training a machine learning model [12]. Differential expression analysis is frequently used to identify specific genes in disease [13]. Based on the tags provided in the CITE-seq dataset, these 70,988 cells were classified into seven cell types: Mast Cell Progenitor (MasP), Megakaryocyte Progenitor (MkP), Neutrophil
Progenitor (NeuP), Monocyte Progenitor (MoP), Erythrocyte Progenitor (EryP), Hematopoietic Stem Cell (HSC), and B-Cell Progenitors (BP). We then used Student's t-test to find genes that were significantly differentially expressed in different cell types, with a p-value cutoff of ≤ 0.05. Genes that were expressed at different levels between each pair of different cell types were identified and denoted as DEGs. In this process, 133 DEGs were identified among these seven cell types.

2.4. Test the performance of DEGs in cell-type classification

To analyze the classification performance of these 133 DEGs, we applied machine learning algorithms to test their effectiveness in our datasets. Three machine learning algorithms (Support Vector Machine, Random Forest, Logistic Regression) were used to classify the seven cell types in our CITE-seq dataset. Seven indexes (Sensitivity, Specificity, Recall, Accuracy, Matthews Correlation Coefficient, Area Under Curve, F1-score) were applied to show the effectiveness of these models. The prediction score in the whole dataset was also calculated.

2.5. Functional enrichment of significant DEGs

Information regarding gene function is central in precision medicine and drug discovery. The functions and biological processes of the 133 DEGs were analyzed by Metascape, which is a powerful tool for enrichment analysis [14]. In addition, the disease enrichment of these 133 DEGs was performed by DisGeNET, a database collecting the largest genes and variants involved in human diseases [15].

2.6. Normalization analysis of surface proteins

In the CITE-seq technology, the transcription RNA and surface proteins were measured simultaneously, thus analysis of the surface protein is needed. RNA-bound proteins are important mediators of gene expression. Recent advances in protein technology have resulted in the identification of many proteins that interact with RNA [16]. For example, a recent study reported a relation between RNA structure and its ability to interact with proteins [17]. In this study, 140 surface proteins were captured in CITE-seq datasets. Data normalization was used as a pre-processing step in this analysis of surface proteins so that equal contributions could be obtained for each feature. In a recent paper, several types of data normalization were compared, and the authors suggest that the best-performing method was simple Z-scores [18]. Z-scores were therefore used to normalize the data and reduce the noise in the data. The transformed value of surface protein was then used to test their relation in RNA-protein interactions.

2.7. Build RNA-protein relation model by machine learning algorithms

We used two machine learning algorithms (LightGBM and TabNet) to analyze the relation between protein and RNA values in our CITE-seq datasets. LightGBM is a distributed gradient boosting framework based on the decision tree algorithm in Microsoft's machine learning toolkit. LightGBM can be trained more than 20 times faster than XGBoost and traditional GBDT [19]. Moreover, LightGBM has also shown excellent prediction results when used for the identification and classification of miRNA targets of breast cancer [20]. TabNet, created by Google Cloud AI in 2020, is an algorithm specifically designed for tabular data. This algorithm was also found to perform well in representational learning and end-to-end training statistics for neural networks. All key characteristics of tree models, including high performance and interpretability, are well represented in this algorithm [21]. In addition, previous research has indicated that these algorithms show very good performance in the prediction of phosphorylation sites in soybean proteins [22]. The details of hyperparameters in the LightGBM and TabNet models are shown in Table 1.

Here is a simple formula for our machine learning model:

$$v_p = f(v_r)$$

where $v_p$ is the normalized value of surface protein, $v_r$ is the normalized value of transcription RNA, and $f$ is the machine learning algorithm.

Table 1
The details of applied hyperparameters in three models.

Model      Hyperparameter                          Range
LightGBM   Learning rate                           0.1, 0.2, 0.3, 0.4, 0.5
LightGBM   Maximum depth                           5–15
LightGBM   Number of leaves                        16–256
LightGBM   Minimum child samples                   100–500
TabNet     Decision dimension                      16–512
TabNet     Attention dimension                     16–512
TabNet     Number of steps                         1–10
TabNet     Attention coefficient scaling factor    0.5–2.0
DNN        Number of layers                        2, 3, 4, 5
DNN        Number of units in dense layers         64, 128, 256, 512, 1024
DNN        Activation functions in dense layers    relu, swish, tanh, sigmoid
DNN        Rate of dropout layers                  0.1, 0.2, 0.3, 0.4, 0.5

2.8. Build RNA-protein relation model by DNN

Complex bioinformatics datasets are typically high-dimensional; thus, the deep learning algorithm was further applied to analyze the RNA-protein relation model in the CITE-seq datasets. The deep neural network (DNN) is one of the powerful deep learning algorithms in feature selection and it performs well in analyzing high-dimensional datasets, such as predicting protein solubility [23]. Our DNN model is composed of an input layer, an output layer, and multiple hidden layers. In our DNN model, the input and output datasets were the same as in the machine learning model above. We applied a 10-fold cross-validation strategy to assess the performance of the DNN model. In this strategy, one group was randomly selected as the test dataset, while the other nine groups were used as the training dataset.

The choice of parameters has a significant impact on a deep learning model, and the hyperparameters needed to be optimized in the DNN model [24]. In this model, four hyperparameters (the number of layers, number of units in dense layers, activation functions in dense layers, rate of dropout layers) were tuned. The details of the hyperparameters in the DNN are listed in Table 1. Hyperparameters were optimized by the 'Adam optimizer' to find the minimum loss with the best accuracy in the training datasets. Early stopping was used to select the parameters that performed best for this algorithm. The output parameters of the last dropout layer were recorded. L2 weight regularization was used on the dense layers to improve the generalization of the model, and dropout layers were added after each dense layer to prevent over-fitting.

Choosing an appropriate similarity measure is critical for models to predict accurate surface protein levels from RNA expression data. We used the Pearson Correlation Coefficient (PCC), which is a widely used measure of linear correlation. Importantly, predictions based on correlation coefficients have been shown to perform well in previous studies [25]. The correlation coefficient focuses on the accuracy of the prediction of the RNA-protein relation.

2.9. Test DNN model in third-party independent dataset

It is important to test the performance of the model in an independent dataset. We downloaded a CITE-seq dataset from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo) with accession number GSE236284. This dataset was composed of expression profiling by CITE-seq in nine different cell lines
in Ewing sarcoma. By using the same method described in 'Materials and Methods 2.8', we selected the DEGs and performed the DNN model in the GSE236284 dataset. The PCC value was applied to test the performance.

2.10. Screening of cell-type-specific genes in model

Cell-type-specific genes are a key factor in studying the differentiation mechanism of stem cells. Based on the above machine learning and deep learning algorithms, we built the RNA-protein relation in seven different cell types of bone marrow stem cell differentiation. In this algorithm model, the weighting coefficients of each RNA (gene) were obtained. We documented the highest-ranking genes specific to different cell types, designating them as cell-type-specific genes within this investigation. The chromosome locations of these genes were further discussed.

3. Results and discussion

3.1. Overall RNA data description

In this study, we analyzed CITE-seq data with 70,988 cells at three different time points (i.e., T1, T2, and T3). The cells in this dataset were classified into seven cell types. The mean and Standard Deviation (SD) of the expression value for all cells at each time point are shown in Table 2. At T1, two cell types (MasP and NeuP) showed a high expression level, with a mean of 1.07, whereas the B-Cell Progenitor showed a relatively low level, with a mean of 0.86. Most of the SD was much larger than the mean in each cell type, indicating that the distribution of RNA is highly diverse (Fig. 2).

Table 2
Mean and standard deviation of RNA expression data for various cells at each time point. SD: standard deviation.

Cell type                        T1 mean  T1 SD  T2 mean  T2 SD  T3 mean  T3 SD
BP (B-Cell Progenitor)           0.86     1.85   0.78     1.8    0.89     1.87
EryP (Erythrocyte Progenitor)    1.04     1.93   1.04     1.92   0.98     1.90
HSC (Hematopoietic Stem Cell)    0.96     1.90   0.92     1.86   0.92     1.87
MasP (Mast Cell Progenitor)      1.07     1.94   1.10     1.94   0.99     1.92
MkP (Megakaryocyte Progenitor)   1.01     1.92   1.02     1.91   0.99     1.92
MoP (Monocyte Progenitor)        0.95     1.9    0.97     1.89   0.87     1.87
NeuP (Neutrophil Progenitor)     1.07     1.93   1.03     1.91   0.93     1.87

3.2. Analysis of trends across different time points

The expression trend of each cell type at different time points was drawn for better visualization (Fig. 3). The results suggest that HSCs change to other cell types throughout the experiment. At T1, 50% of cells were HSCs, but only about 30% remained at T3. With respect to the relative abundances of individual cell types over time, we found that the number of MasP cells increased slowly between T1 and T2 but more quickly between T2 and T3. In contrast, MkP numbers decreased rapidly between T1 and T2 and rose slowly between T2 and T3. The numbers of both EryP and NeuP rose consistently over all three time points, while the numbers of BP and MoP showed nearly no changes among the time points.

The percentages of all cells at each time point were also drawn, and specific details regarding relative cell numbers for all types at each time point are shown in Fig. 4. The numbers of MoP and BP were very low at all time points, occupying approximately 1% of all expressed RNAs. The number of HSCs decreased continuously: 49% at T1, 43% at T2, and finally 35% at T3. In contrast, the numbers of MasP, NeuP, and EryP cells rose continuously. At T1, they accounted for 9%, 15%, and 16% of all cells, respectively, while at T2, they accounted for 10%, 18%, and 20%, respectively. Finally, at T3, they accounted for 16%, 19%, and 23%, respectively. Curiously, the number of MkP cells first declined and then slightly increased, showing 10%, 6%, and 7% over the three time points. These results suggest that the seven cell types develop differently during the differentiation of bone marrow stem cells.

3.3. Analysis of genes among the three different time points

A volcano plot is a scatter plot that can be used to quickly visualize large data sets, facilitating rapid data interpretation. Volcano plots are frequently employed to assess potential differences between sub-samples of data before undertaking a comprehensive analysis. The volcano plot of significant genes among time points is shown in Fig. 5. The number of genes with a fold change ≥ 2 varied among the three time points. The genes downregulated from T1 to T2 accounted for the largest proportion of DEGs in our study. The up-regulated genes increased markedly across the time points from T1 to T3. In addition, a comparison of 'T2 vs T3' shows a large number of significant genes, of which most were upregulated.

3.4. Identification of DEGs among different cell types

To identify significantly differentially expressed genes (DEGs) among cell types, we calculated p-values for pairwise comparisons of the seven cell types by Student's t-test. Genes with p-values less than or equal to 0.05 were considered to be DEGs. Based on the significant p-values, we divided the DEGs into three groups and visualized them in Fig. 6. Results showed that all the p-values of 'HSC vs MasP' were very low, signifying substantial dissimilarity in the gene expression profiles between these two cell types. The p-values of 'BP vs EryP' were relatively high, indicating that BP and EryP were more similar in their gene expression patterns. We also found that the number of genes with p-values = 0 was relatively high, and the number of genes close to 0.05 was relatively low. These results suggest that each cell type has a specific gene expression pattern for maintaining stable differentiation.

3.5. Test the performance of DEGs in cell-type classification

The 133 DEGs were applied to test their performance in classification of the seven cell types. Values for sensitivity, specificity, recall, accuracy, MCC, AUC and F1-score were calculated to estimate the performance of the machine learning algorithms and are shown in Table 3. The detection accuracies of random forest, logistic regression and support vector machine were 0.905, 0.880, and 0.882, respectively. The random forest algorithm obtained an F1-score of 90.3% with significance p-value ≤ 0.05. In classifying these seven cell types, the SVM algorithm obtained the highest AUC (0.962), whereas random forest scored highest on the remaining indexes (Table 3).

3.6. Function enrichment of significant DEGs

Genetic information analysis is central to precision medicine and drug discovery. To investigate the genetic information, function enrichment analysis of the 133 DEGs was performed with Metascape and DisGeNET (Fig. 7). The results showed that the 133 significant DEGs were strongly associated with blood-related diseases, including acute myocardial infarction and acute myeloid leukemia (AML). AML is a type of blood cancer that starts in the bone marrow. It usually begins in cells that turn into white blood cells, but it can start in other blood-forming cells as well. This disease arises when abnormal white blood cells in the bone marrow grow and divide uncontrollably. AML is highly associated with cells in the peripheral blood and bone marrow [26]. We suggest that these DEGs may affect progression of the blood-related disease, which needs further study in the future.
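Enrichment tools of this kind commonly score a term by asking how surprising the overlap is between a gene list and the term's gene set. The sketch below assumes a one-sided hypergeometric test, which is a common choice but is not confirmed here as the exact procedure behind the Metascape or DisGeNET results. The function name `enrichment_p_value` and all counts in the usage lines are hypothetical and are not taken from the study.

```python
from fractions import Fraction
from math import comb


def enrichment_p_value(background: int, in_term: int, selected: int, overlap: int) -> float:
    """One-sided hypergeometric p-value P(X >= overlap).

    background: number of genes in the background set (N)
    in_term:    number of background genes annotated to the term (K)
    selected:   size of the gene list being tested, e.g. 133 DEGs (n)
    overlap:    number of listed genes annotated to the term (k)
    """
    if not (0 <= in_term <= background and 0 <= selected <= background):
        raise ValueError("in_term and selected must lie between 0 and background")
    upper = min(in_term, selected)
    total = comb(background, selected)
    # Sum the exact counts of draws with `overlap` or more annotated genes.
    tail = sum(
        comb(in_term, i) * comb(background - in_term, selected - i)
        for i in range(max(overlap, 0), upper + 1)
    )
    return float(Fraction(tail, total))


# Hypothetical counts, for illustration only.
p = enrichment_p_value(background=20000, in_term=300, selected=133, overlap=10)
print(f"enrichment p-value: {p:.3g}")
```

Exact integer arithmetic is used so that the tail probability does not lose precision for large gene sets. In practice, p-values from many terms would also need a multiple-testing correction.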
Fig. 2. Basic information about CITE-seq expression data. (A) Mean of gene expression values for each of the seven cell types at each time point. (B) Standard
Deviation (SD) of gene expression values for each of the seven cell types at each time point.
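The summary behind Fig. 2 and Table 2 amounts to grouping expression values by cell type and time point, then taking the mean and SD of each group. The following is a minimal sketch, assuming each measurement is available as a (cell type, time point, value) record. The records are made-up numbers, the helper name `summarize` is illustrative, and the use of the sample SD is an assumption, since the text does not state which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(records):
    """Return {(cell_type, time_point): (mean, sd)} for (cell_type, time_point, value) records."""
    groups = defaultdict(list)
    for cell_type, time_point, value in records:
        groups[(cell_type, time_point)].append(value)
    # statistics.stdev is the sample SD and needs at least two values per group.
    return {key: (mean(vals), stdev(vals)) for key, vals in groups.items()}


# Made-up expression values, for illustration only.
records = [
    ("HSC", "T1", 0.0), ("HSC", "T1", 2.0), ("HSC", "T1", 4.0),
    ("BP", "T1", 1.0), ("BP", "T1", 3.0),
]
summary = summarize(records)
for (cell_type, time_point), (m, sd) in sorted(summary.items()):
    print(f"{cell_type} {time_point}: mean={m:.2f} SD={sd:.2f}")
```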
Fig. 3. Relative abundances of each type of cell over time.

3.7. Normalization analysis of surface proteins

Cell-specific genes can be accurately identified by studying the relation between surface protein and RNA levels. In this study, we examined the values of leukocyte differentiation antigens in 140 proteins. The distribution intervals for different expression levels were found to be inconsistent. This presented an issue, since it may be associated with reduced correlational strength of the predictive model. Normalization was used to fix this issue, aiming to centre the distributions at approximately 0. Remarkably, this normalization process did not reduce the accuracy of the model predictions but reduced the correlation loss. The top 10 proteins by expression level (i.e., CD31, CD44, CD244, CD71, CD49d, CD29, CD11a, CD33, CD47, and CD162) were then selected and their distributions were plotted (Fig. 8). Of these proteins, some were normally distributed, but some showed non-normal distributions.
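The normalization step can be illustrated with a per-protein Z-score, the method named in Section 2.6: each value is shifted by the protein's mean and divided by its standard deviation, so every protein ends up centred at 0 with unit spread. This is a sketch under assumptions: the function name `z_score` and the input values are hypothetical, and the population SD is used because the text does not specify which estimator was applied.

```python
from statistics import fmean, pstdev


def z_score(values):
    """Standardize one protein's values to mean 0 and standard deviation 1."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical raw levels of one surface protein across five cells.
raw = [2.0, 4.0, 4.0, 6.0, 9.0]
normalized = z_score(raw)
print([round(z, 3) for z in normalized])
```

Applying the same transformation to every protein puts features with very different raw ranges on a common scale before they are passed to a model.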
Fig. 4. Percentage of all cells at each time point.

Fig. 5. Volcano plots of gene expression for all time points. Red points indicate up-regulated genes, while blue points indicate down-regulated genes.

3.8. Build RNA-protein relation model

To evaluate the correlation between the value of significant DEGs
and surface proteins, we divided the CITE-seq dataset into new sub-datasets according to their cell types. Because HSCs gradually differentiate into the other six cell types in bone marrow and blood, these six cell types (MasP, MkP, NeuP, MoP, EryP, BP) were included in the analysis. Each new dataset was in turn divided into three parts by time point. Three algorithms (DNN, LightGBM, and TabNet) were used to predict and evaluate each of these parts separately. The Pearson Correlation Coefficients (PCCs) for each of the three prediction models are shown in Table 4. The highest and lowest PCCs for DNN were 0.909 and 0.783, respectively, with most coefficients at approximately 0.87. Similarly, the highest and lowest PCCs for LightGBM were 0.907 and 0.686, with most coefficients also at approximately 0.87. However, the highest and lowest PCCs for the TabNet algorithm were 0.902 and 0.428, with a greater proportion of coefficients below 0.80 than the other two algorithms. The means of the PCCs predicted by the three algorithms (DNN, LightGBM, and TabNet) for each cell type at each time point were 0.867, 0.841, and 0.775, respectively. Taken together, these results suggest that the correlation between DEGs and surface protein levels was high.

3.9. Test DNN model in third-party independent dataset

The GSE236284 dataset was downloaded from the NCBI database and was applied to test our DNN model in CITE-seq datasets. This dataset contains the expression profiles of nine cell lines (A673, SKNMC, A4573, TC32, CHLA9, CHLA10, TC71, PDX305, RDES) in Ewing sarcoma. The DNN model was applied to build the RNA-protein relation and the results are shown in Fig. 9. Although different epochs could give different PCC values, all the PCC values were larger than 0.82. The highest PCC value was 0.847, while the lowest PCC value was 0.822. These results show that our DNN model is effective in building the RNA-protein relation model in CITE-seq datasets.

3.10. Screening of cell-type-specific genes

The results above suggest that the DNN model shows better predictive results than LightGBM or TabNet. Therefore, the DNN model was then used to screen for cell-type-specific genes. The weighting coefficients for each of the significant DEGs were obtained using the DNN
algorithm. Cell-type-specific genes for each group were screened based on weighting coefficient. Their chromosomal positions and corresponding protein names were also obtained, and all details of the top cell-type-specific genes for each cell type are shown in Table 5.

Fig. 6. The important differentially expressed genes identified among cell types. The red box indicates the top 1/3 p-values; the yellow box indicates the middle (1/3~2/3) p-values; the green box indicates the bottom 1/3 p-values.

Table 3
The performance of three algorithms in the cell-type classification. MCC: Matthews correlation coefficient; AUC: Area Under Curve; RF: Random Forest; LR: Logistic Regression; SVM: Support Vector Machine.

Index        RF     LR     SVM
Sensitivity  0.867  0.858  0.862
Specificity  0.923  0.873  0.878
Recall       0.865  0.858  0.863
Accuracy     0.905  0.880  0.882
MCC          0.883  0.853  0.854
AUC          0.955  0.954  0.962
F1-score     0.903  0.877  0.879

Fig. 7. Enrichment analysis of differentially expressed genes.

For example, the gene LMNA showed a relatively high weighting score (0.0123) in the MasP cell type, while TRIM24 showed a relatively low weighting score in EryP cells. LMNA was found to be specific to two cell types (i.e., MasP and MkP) and the gene CST7 was specific to three cell types (i.e., MasP, NeuP, MoP). Interestingly, two genes (CST7 and ITPA) were present on chromosome 20, while two other genes (LMNA and GCSAML) were present on chromosome 1. It was reported that CST7 played a significant role in human myeloid malignancies. The expression of CST7 was positively correlated with the percentage of neutrophils in
Fig. 8. The distributions of the ten most highly expressed proteins.
Table 4
Pearson correlation coefficients at a given time point as predicted by the two other time points.

Cell type  DNN T1  DNN T2  DNN T3  LightGBM T1  LightGBM T2  LightGBM T3  TabNet T1  TabNet T2  TabNet T3
BP         0.782   0.795   0.878   0.768        0.73         0.807        0.621      0.607      0.428
EryP       0.86    0.869   0.873   0.858        0.869        0.867        0.84       0.855      0.856
MasP       0.875   0.896   0.883   0.872        0.895        0.879        0.867      0.886      0.873
MkP        0.875   0.883   0.886   0.874        0.868        0.884        0.863      0.859      0.87
MoP        0.86    0.884   0.815   0.687        0.788        0.803        0.515      0.625      0.73
NeuP       0.89    0.909   0.896   0.888        0.907        0.894        0.88       0.902      0.888
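The mean values quoted for the DNN and LightGBM columns of Table 4 can be checked directly from its entries. The sketch below first shows how a Pearson correlation coefficient is computed for a single prediction (the toy vectors are hypothetical), then averages the two columns. The helper name `pearson` and the dictionary layout are illustrative and are not taken from the authors' code.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy example (hypothetical values): predicted vs. measured protein levels.
print(round(pearson([0.1, 0.4, 0.8, 1.2], [0.2, 0.5, 0.7, 1.3]), 3))

# PCCs copied from Table 4 (rows BP, EryP, MasP, MkP, MoP, NeuP; columns T1, T2, T3).
table4 = {
    "DNN": [0.782, 0.795, 0.878, 0.86, 0.869, 0.873, 0.875, 0.896, 0.883,
            0.875, 0.883, 0.886, 0.86, 0.884, 0.815, 0.89, 0.909, 0.896],
    "LightGBM": [0.768, 0.73, 0.807, 0.858, 0.869, 0.867, 0.872, 0.895, 0.879,
                 0.874, 0.868, 0.884, 0.687, 0.788, 0.803, 0.888, 0.907, 0.894],
}
mean_pcc = {model: round(fmean(values), 3) for model, values in table4.items()}
print(mean_pcc)  # {'DNN': 0.867, 'LightGBM': 0.841}
```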
in-depth research, in guiding the study of various types of tumor stem

cells. Integrating data from different sources is a key challenge in
computational genomics. Several single-cell technologies have emerged,
including scRNA-seq and CITE-seq. Single-cell sequencing technology is
a cutting-edge technology for cell state discovery and characterization,
as it can simultaneously describe multiple data types from the same
single cell. Cellular Indexing of Transcriptomes and Epitopes by
Sequencing (CITE-seq) is a multimodal single-cell phenotyping method
developed by Stoeckius et al. [2]. Only a portion can be used as syn
chronous measurements, due to factors such as reduced gene size,
reduced throughput, and increased noise. CITE-seq could be used to
integrate cellular protein and transcriptome measurements into an
efficient, single-cell readout.
Fig. 9. The DNN performance in the third-party dataset. The GSE236284 dataset
was applied to test the performance of our DNN model.

the peripheral blood of acute myeloid leukemia (AML) patients [27]. In
human hematopoietic stem cells, LMNA is involved in regulating the
senescence process and differentiation trajectory of stem cells.
Downregulation of LMNA could accelerate the senescence of stem cells and
induce alterations in their differentiation capacity [28]. Taken together,
these results suggest that cell type is related to the chromosome location of
cell-type-specific genes.

4. Discussions

Hematopoietic stem cells (HSCs) are cells in the blood system with the
ability to self-renew for a long time and the potential to differentiate
into various types of mature blood cells. It is a type of stem cell with

In this project, the CITE-seq sequencing method was used to quantify RNA and
cell surface proteins from 70,988 cells. These cells belong to seven
different types and were obtained at three different time points. We first
screened out 133 significantly differentially expressed genes (DEGs) among
the three time points via Student's t-test. Functional enrichment analysis of
these 133 DEGs showed that they were strongly associated with blood-related
diseases, such as acute myeloid leukemia. Similar to various other stem cell
types, leukemic stem cells depend on the nurturing bone marrow
microenvironment for their sustenance and survival. Notably, bone marrow stem
cells have been observed to fortify antioxidant defense mechanisms, which
might play a pivotal role in bolstering the bioenergetics of acute myeloid
leukemia [29].

Machine learning and deep learning networks can detect the presence of
features, but the geometric relationships and spatial hierarchical
information among features are often overlooked. To capture and embed the
crucial spatial information within features, a new hierarchical structure
called the Caps-Score layer can be introduced into neural networks [30].
Deep-learning models have rarely been applied to study the cell
differentiation of blood stem cells. Thus, we used machine learning and deep
neural network (DNN) approaches to build the
quantitative model of sequenced RNA and proteins. Moreover, Keras Tuner was
used to enable continuous optimization of the parameters of the DNN
algorithm. The highest-performing parameters allowed more accurate selection
of cell-specific genes. To validate the predictive effectiveness of these
three models, we applied them to another independent testing dataset. For the
testing dataset, the average Pearson correlation coefficients of the three
algorithms (DNN, LightGBM, and TabNet) were found to be 0.867, 0.841, and
0.775, respectively. DNN was found to show the best results for the training
dataset of CITE-seq. Overall, these results indicated that these models
showed good performance in predicting protein values from transcribed RNA
values.

Based on the weighting coefficients of the quantitative model of sequenced
RNA and proteins, the cell-type-specific genes were screened. These genes
could be used as effective biomarkers for prediction of cell differentiation
in bone marrow stem cells. For example, three genes (LGALS1, CENPV, TRIM24)
were identified as cell-type-specific genes in erythrocyte progenitor.
Erythrocyte progenitor cells are formed from a common myeloid progenitor
cell, which may mature into a red blood cell, platelet and some types of
white blood cells. LGALS1 has garnered attention as a promising therapeutic
target against chemotherapy resistance in related ailments, and the use of
LGALS1 inhibitors has shown potential to augment the efficacy of chemotherapy
in primary cell lines derived from patients [31]. CENPV is a centromere
protein-coding gene, which plays crucial roles in cell proliferation and cell
fission [32]. CENPV belongs to a small gene family localized on the genome
and encodes the major red cell membrane glycophorins. TRIM24 assumes a
critical role in establishing a connection between insulin signaling and
processing bodies (P-bodies). Moreover, TRIM24 has been implicated in
conferring resistance to antiandrogen therapy and influencing the overall
prognosis of cancer cases [33]. We suggest these cell-type-specific genes
could serve as novel clues for predicting related diseases.

Table 5
Cell-type-specific genes for each cell type. Only three genes are shown for
each cell type.

| No. | Cell type | Gene name | Protein name | Locus | Score |
|-----|-----------|-----------|--------------|-------|-------|
| 1 | MasP | LMNA | lamin isoform A | Chromosome 1 | 0.0123 |
| 2 | MasP | CST7 | cystatin-F precursor | Chromosome 20 | 0.006 |
| 3 | MasP | BUD23 | probable 18S rRNA | Chromosome 7 | 0.0053 |
| 4 | MkP | GCSAML | germinal center-associated signaling and motility-like protein isoform a | Chromosome 1 | 0.0122 |
| 5 | MkP | LMNA | lamin isoform A | Chromosome 1 | 0.011 |
| 6 | MkP | ITGA2B | integrin alpha-IIb isoform X1 | Chromosome 17 | 0.011 |
| 7 | NeuP | CST7 | cystatin-F precursor | Chromosome 20 | 0.007 |
| 8 | NeuP | ITPA | inosine triphosphate pyrophosphatase isoform X1 | Chromosome 20 | 0.006 |
| 9 | NeuP | MPO | DNA-directed primase/polymerase protein isoform X4 | Chromosome 4 | 0.006 |
| 10 | MoP | GATA2 | endothelial transcription factor GATA-2 isoform 2 | Chromosome 3 | 0.026 |
| 11 | MoP | PRSS57 | serine protease 57 isoform 2 precursor | Chromosome 19 | 0.019 |
| 12 | MoP | CST7 | cystatin-F precursor | Chromosome 20 | 0.018 |
| 13 | EryP | LGALS1 | galectin-1 | Chromosome 22 | 0.005 |
| 14 | EryP | CENPV | centromere protein V | Chromosome 17 | 0.004 |
| 15 | EryP | TRIM24 | transcription intermediary factor 1-alpha isoform a | Chromosome 7 | 0.004 |
| 16 | BP | BUD23 | probable 18S rRNA | Chromosome 7 | 0.027 |
| 17 | BP | PRDX4 | peroxiredoxin-4 precursor | Chromosome X | 0.020 |
| 18 | BP | CENPV | centromere protein V | Chromosome 17 | 0.014 |

In the future, new methods, such as Generative Adversarial Network (GAN) and
Convolutional Neural Network (CNN) algorithms, will be used in multimodal
single-cell data. Although GANs are useful in mapping distributions, they
have some key drawbacks. Due to adversarial losses, they are difficult to
train, which may lead to instability in the prediction model [34]. This
instability means that the model can quickly deteriorate from effective to
ineffective during training iterations. CNN methods have been widely applied
to unstructured multimodal data. However, CNNs show some limitations in
handling the spatial hierarchical relationships between complex features. To
preserve the spatial hierarchical relationships between features in each
modality, the modified Guided Co-occurrence Block (mGCoB) structure was
established to improve performance [35]. We will also study the molecular
mechanisms of cell-type-specific genes in bone marrow stem cells to explain
their differences in the differentiation process. To achieve this goal, an
expansion of CITE-seq multimodal data is essential to validate the efficacy
of our approach. However, many current datasets are not suitable for deep
learning methods because they lack time-series information.

5. Conclusions

Stem cell differentiation is crucial in body development. Identifying
cell-type-specific genes in stem cell differentiation is an urgent need. In
this study, we obtained the CITE-seq datasets of bone marrow stem cells and
analyzed the relation between RNAs (gene-transcribed RNA) and surface
proteins. We first screened the transcribed RNAs by Student's t-test and
obtained 133 DEGs. The relation between transcribed RNA and surface proteins
was then modeled by three different models (DNN, LightGBM, and TabNet). DNN
was found to show the best results for the training dataset of CITE-seq. The
average Pearson correlation coefficients of the three algorithms (DNN,
LightGBM, and TabNet) were found to be 0.867, 0.841, and 0.775, respectively.
Finally, DNN was implemented to identify cell-specific genes for each cell
type based on their coefficients. Based on their coefficients in the DNN
model, three genes (LGALS1, CENPV, TRIM24) were identified as
cell-type-specific genes in erythrocyte progenitor. Our findings can help
researchers understand how gene regulation affects differentiation in
maturation time among blood and immune cells.

Data availability statement

Our source code can be found in http://github.com/pyajagod/DnnRpPre.git.
The supplementary materials can be found in
https://www.dropbox.com/sh/j455306l2j1k29s/AABdxRIq2TYO0kwmJ9Xe4NPja.

Declaration of competing interest

We declare that we have no financial and personal relationships with other
people or organizations that can inappropriately influence our work.

Acknowledgements

This research was funded by National Natural Science Foundation of China
(grant number 61903107).
Abbreviations

BP    B-Cell Progenitor
EryP  Erythrocyte Progenitor
HSC   Hematopoietic Stem Cell
MasP  Mast Cell Progenitor
MkP   Megakaryocyte Progenitor
MoP   Monocyte Progenitor
NeuP  Neutrophil Progenitor
DNN   Deep Neural Network
DEG   Differentially Expressed Gene

References

[1] Parvin Razzaghi, Karim Abbasi, Mahmoud Shirazi, et al., Modality adaptation in multimodal data, Expert Syst. Appl. 179 (11) (2021), 115126.
[2] M. Stoeckius, C. Hafemeister, W. Stephenson, et al., Simultaneous epitope and transcriptome measurement in single cells, Nat. Methods 14 (9) (2017) 865–868.
[3] Hani Jieun Kim, Yingxin Lin, Thomas A. Geddes, et al., CiteFuse enables multi-modal analysis of CITE-seq data, Bioinformatics 36 (14) (2020) 4137–4143.
[4] Laleh Haghverdi, Aaron T.L. Lun, Michael D. Morgan, et al., Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors, Nat. Biotechnol. 36 (5) (2018) 421–427.
[5] Ricard Argelaguet, Anna S.E. Cuomo, Oliver Stegle, et al., Computational principles and challenges in single-cell data integration, Nat. Biotechnol. 39 (10) (2021) 1202–1215.
[6] Yiwen Wang, Kim-Anh Lê Cao, Managing batch effects in microbiome data, Briefings Bioinf. 21 (6) (2020) 1954–1970.
[7] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, et al., Multimodal deep learning, in: Proceedings of the 28th International Conference on Machine Learning, Omnipress, Bellevue, Washington, USA, 2011, pp. 689–696.
[8] Seonwoo Min, Byunghan Lee, Sungroh Yoon, Deep learning in bioinformatics, Briefings Bioinf. 18 (5) (2017) 851–869.
[9] Hui Cheng, Zhaofeng Zheng, Tao Cheng, New paradigms on hematopoietic stem cell differentiation, Protein & Cell 11 (1) (2020) 34–44.
[10] Roberto Tamma, Domenico Ribatti, Bone niches, hematopoietic stem cells, and vessel formation, Int. J. Mol. Sci. 18 (1) (2017) 151.
[11] Malte Luecken, Daniel Burkhardt, Andrew Benz, Peter Holderrieth, Jonathan Bloom, Christopher Lance, Open Problems - Multimodal Single-Cell Integration, 2022. Available from: https://kaggle.com/competitions/open-problems-multimodal.
[12] Md Rabiul Auwul, Md Rezanur Rahman, Esra Gov, et al., Bioinformatics and machine learning approach identifies potential drug targets and pathways in COVID-19, Briefings Bioinf. 22 (5) (2021) bbab120.
[13] Z. Yang, H. Wang, Z. Zhao, et al., Gene-microRNA network analysis identified seven Hub genes in association with progression and prognosis in non-small cell lung cancer, Genes 13 (8) (2022) 1480.
[14] Y. Zhou, B. Zhou, L. Pache, et al., Metascape provides a biologist-oriented resource for the analysis of systems-level datasets, Nat. Commun. 10 (1) (2019) 1523.
[15] Janet Piñero, Àlex Bravo, Núria Queralt-Rosinach, et al., DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants, Nucleic Acids Res. 45 (D1) (2017) D833–D839.
[16] Waleed S. Albihlal, André P. Gerber, Unconventional RNA-binding proteins: an uncharted zone in RNA biology, FEBS Lett. 592 (17) (2018) 2917–2931.
[17] Natalia Sanchez De Groot, Alexandros Armaos, Ricardo Graña-Montes, et al., RNA structure drives interaction with proteins, Nat. Commun. 10 (1) (2019) 3246.
[18] Dalwinder Singh, Birmohan Singh, Investigating the impact of data normalization on classification performance, Appl. Soft Comput. 97 (2020), 105524.
[19] Weizhang Liang, Suizhi Luo, Guoyan Zhao, et al., Predicting hard rock pillar stability using GBDT, XGBoost, and LightGBM algorithms, Mathematics 8 (5) (2020) 765.
[20] Dehua Wang, Yang Zhang, Yi Zhao, LightGBM: an effective miRNA classification method in breast cancer patients, in: ICCBB 2017: Proceedings of the 2017 International Conference on Computational Biology and Bioinformatics, 2017.
[21] Sercan Arik, Tomas Pfister, TabNet: attentive interpretable tabular learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[22] Elham Khalili, Shahin Ramazi, Faezeh Ghanati, et al., Predicting protein phosphorylation sites in soybean using interpretable deep tabular learning network, Briefings Bioinf. 23 (2) (2022) bbac015.
[23] Xi Han, Liheng Zhang, Kang Zhou, et al., ProGAN: protein solubility generative adversarial nets for data augmentation in DNN framework, Comput. Chem. Eng. 131 (2019), 106533.
[24] Genta Aoki, Yasubumi Sakakibara, Convolutional neural networks for classification of alignments of non-coding RNA sequences, Bioinformatics 34 (13) (2018) i237–i244.
[25] Alexander Ly, Maarten Marsman, Eric-Jan Wagenmakers, Analytic posteriors for Pearson's correlation coefficient, Stat. Neerl. 72 (1) (2018) 4–13.
[26] Mary-Elizabeth Percival, Catherine Lai, Elihu Estey, et al., Bone marrow evaluation for diagnosis and monitoring of acute myeloid leukemia, Blood Rev. 31 (4) (2017) 185–192.
[27] Hiroshi Yuita, Isaac F. López-Moyado, Hyeongmin Jeong, et al., Inducible disruption of Tet genes results in myeloid malignancy, readthrough transcription, and a heterochromatin-to-euchromatin switch, Proc. Natl. Acad. Sci. USA 120 (6) (2023), e2214824120.
[28] Hsuan-Ting Huang, Dean Wade, Daniel Bilbao, et al., Age-acquired downregulation of Lmna leads to epigenetic deregulation and altered HSPC function, Blood 138 (2021) 3280.
[29] D. Forte, M. Garcia-Fernandez, A. Sanchez-Aguilera, et al., Bone marrow mesenchymal stem cells support acute myeloid leukemia bioenergetics and enhance antioxidant defense and escape from chemotherapy, Cell Metabol. 32 (5) (2020) 829–843.
[30] Karim Abbasi, Parvin Razzaghi, Incorporating part-whole hierarchies into fully convolutional network for scene parsing, Expert Syst. Appl. 160 (2020), 113662.
[31] K.N. Li, Y.X. Du, Y. Cai, et al., Single-cell analysis reveals the chemotherapy-induced cellular reprogramming and novel therapeutic targets in relapsed/refractory acute myeloid leukemia, Leukemia 37 (2) (2022) 308–325.
[32] Y. Zeng, C. Liu, Y.D. Gong, et al., Single-cell RNA sequencing resolves spatiotemporal development of pre-thymic lymphoid progenitors and thymus organogenesis in human embryos, Immunity 51 (5) (2019) 930–948.
[33] S. Hatakeyama, TRIM family proteins: roles in autophagy, immunity, and carcinogenesis, Trends Biochem. Sci. 42 (4) (2017) 297–311.
[34] M. Amodio, S.E. Youlten, A. Venkat, et al., Single-cell multi-modal GAN reveals spatial patterns in single-cell data from triple-negative breast cancer, Patterns (N Y) 3 (9) (2022), 100577.
[35] Parvin Razzaghi, Karim Abbasi, Pegah Bayat, Learning spatial hierarchies of high-level features in deep neural network, J. Vis. Commun. Image Represent. 70 (2020), 102817.