
BBA - Molecular Basis of Disease 1864 (2018) 2241–2246
Predicting and analyzing early wake-up associated gene expressions by integrating GWAS and eQTL studies☆

JiaRui Li a, Tao Huang b,⁎
a College of Life Science, Shanghai University, Shanghai 200444, People's Republic of China
b Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, People's Republic of China
ARTICLE INFO

Keywords: Circadian rhythm; Early wake-up; Dagging; Maximum-relevance-minimum-redundancy

ABSTRACT

Circadian rhythms are endogenous 24-hour rhythmic oscillations affecting human behaviors, such as sleep, blood pressure and other biological processes, the disturbance of which leads to circadian rhythm sleep disorders (CRSDs). In this study, based on the data from genome-wide association studies (GWASs) and expression quantitative trait loci (eQTLs), we tried to identify novel gene expression patterns in brain tissues that were associated with early wake-up. First, the maximum-relevance-minimum-redundancy (mRMR) method was adopted to analyze the involved gene expression patterns, yielding a feature list. Second, the incremental feature selection (IFS) method and the Dagging algorithm were applied to extract the important gene expression patterns, namely those that yielded the best performance for Dagging. As a result, 4374 gene expression patterns were obtained, and they were further used to build an optimal classifier that achieved a Matthews's correlation coefficient of 0.933. Furthermore, the most important 49 gene expression patterns were extensively analyzed. Four genes were found to be related to circadian rhythm, as reported in previous studies. As a first attempt at identifying the target genes whose expression levels are associated with sleep-wake rhythms through integrating GWAS and eQTL results, this study can motivate more investigations in this regard.

This article is part of a Special Issue entitled: Accelerating Precision Medicine through Genetic and Genomic Big Data Analysis edited by Yudong Cai & Tao Huang.
1. Introduction

Circadian rhythms, controlled by endogenous circadian clocks, are rhythmic oscillations in our behavior and physiological processes with a period close to 24 h, and they exist in diverse organisms on the Earth, ranging from bacteria and fungi to plants and animals [1]. Circadian rhythms are generated by the suprachiasmatic nucleus (SCN), located in the anterior hypothalamus [2], and are synchronized with the earth's rotation by daily adjustments in the timing of the SCN, following exposure to stimuli that signal the time of day. The SCN generates rhythmic cues that entrain the circadian clocks of peripheral organs and cells in the body by orchestrating hormonal, body temperature, neural, feeding, metabolic, and locomotor activity rhythms [3]. The sleep-wake rhythm is one of the most important and observable circadian rhythms [4], and a disturbance of the sleep-wake cycle causes circadian rhythm sleep disorders (CRSDs) [5], which include advanced sleep phase disorder (ASPD), delayed sleep phase disorder (DSPD), free-running disorder (FRD), and irregular sleep-wake rhythm (ISWD); of these, only ASPD has the phenotype of early wake-up [6].

As one of the intrinsic CRSDs, ASPD leads to the early wake-up phenotype, and patients with ASPD have chronic or recurrent difficulty staying awake until the desired or socially acceptable bedtime, together with an earlier than desired wake-up time [5]. The estimated prevalence of ASPD is 1% in the general population, which is likely an underestimate since many individuals successfully adapt their social and work schedules to the advanced sleep phase [7]. ASPD is believed to result from the dysfunction of the circadian clock or its afferent and efferent pathways [5]. A previous study demonstrated that increased retinal sensitivity to light was one of the reasons leading to ASPD [8]. Based on this knowledge, early evening light therapy is the most commonly used treatment for ASPD, which is effective for some patients but not uniformly positive [9]. This has prompted more efforts to identify the genetic and molecular basis of ASPD to shed light on novel diagnoses and therapies. Promisingly, studies in the past few decades have established that the circadian rhythm is determined by a core set of clock genes [10], and two causative gene mutations of ASPD were identified through target gene studies [11].

However, an abundant number of transcripts were reported to
⁎ Corresponding author at: Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, People's Republic of China.
E-mail address: huangtao@sibs.ac.cn (T. Huang).
http://dx.doi.org/10.1016/j.bbadis.2017.10.036
Received 4 September 2017; Received in revised form 19 October 2017; Accepted 30 October 2017
Available online 03 November 2017
0925-4439/ © 2017 Elsevier B.V. All rights reserved.
fluctuate in their expression level with the circadian rhythm in both the hypothalamus and peripheral organs [12]. Meanwhile, different genes are associated with different CRSDs [13], suggesting the individual roles of the circadian genes in CRSDs and the complexity of the disease. Because of technological advances, especially the emergence of microarray studies and next-generation sequencing, genome-wide association studies (GWASs) have become more affordable and have been widely performed to study complex traits, to identify many disease-associated loci and to provide insights into the allelic architecture of complex traits [14]. To dissect the genetic basis of the circadian rhythm, such a study was recently applied to a large cohort with self-reported morningness, which identified an abundant number of genetic polymorphisms associated with morningness [15], including seven that are near well-known circadian genes. However, as with other GWASs, this study identified many morningness-associated genetic polymorphisms that were located in non-coding regions, apart from the genes with established functions, and that could not be correlated with biological processes. Thus, it is still unclear how these polymorphic loci contribute to sleep-wake timing variations. A systematic identification of expression quantitative trait loci (eQTLs) combined with GWAS was demonstrated as one of the approaches for unveiling biological mechanisms, through which the causal genetic factors determine the phenotype [16,17].

Fig. 1. The IFS curve for the prediction performance of Dagging on different feature sets, with X-values representing the number of features used and Y-values indicating the MCC value. The maximum MCC value is marked with a red dot on the curve.

In this study, we identified characteristic tissue-gene expression patterns through the combination of morningness-associated genetic polymorphisms from a previous GWAS [15] with the recently published global gene expression profiles and eQTLs in the Genotype-Tissue Expression (GTEx) project [18]. Some computational methods, including the maximum-relevance-minimum-redundancy (mRMR) [19] method, the incremental feature selection (IFS) method and the Dagging algorithm [20], were employed to analyze the tissue-gene expression patterns and extract the important ones. According to the computational results, 4374 patterns were obtained and deemed to be related to morningness. These patterns were also used to construct a classifier with a Matthews's correlation coefficient of 0.933. Furthermore, the most important 49 patterns among the 4374 patterns were selected for a detailed analysis to identify their relevance to early wake-up. Four genes were found to be related to circadian rhythm in previous studies. This study provides insight into the biological basis for morningness-associated polymorphic loci. We hope that the new findings will motivate further functional studies on this biological circumstance in humans.

2. Materials and methods

2.1. Dataset

To investigate the circadian chronotype, we downloaded the 10,000 SNPs associated with morningness that were reported in Hu et al.'s study [15]. Among these 10,000 SNPs, 2159 were reported as eQTLs, regulating gene expressions in at least one GTEx [21] brain tissue. Thus, they were used as positive SNPs. Furthermore, we randomly selected 2159 SNPs from the other 490,242 SNPs, which regulated the gene expressions in at least one GTEx [21] brain tissue, as negative SNPs. The obtained positive and negative SNPs comprised one dataset, denoted as $S_d$ in this study.

2.2. Feature construction

For each SNP in $S_d$, the eQTL features, representing the gene expression patterns regulated by the corresponding eQTLs, were constructed by encoding the SNPs with $-\log_{10}$(eQTL p value) using all eQTL results, downloaded from GTEx V6p (http://gtexportal.org/home/datasets), in the following ten brain tissues: (1) Brain Anterior cingulate cortex BA24; (2) Brain Caudate basal ganglia; (3) Brain Cerebellar Hemisphere; (4) Brain Cerebellum; (5) Brain Cortex; (6) Brain Frontal Cortex BA9; (7) Brain Hippocampus; (8) Brain Hypothalamus; (9) Brain Nucleus accumbens basal ganglia; and (10) Brain Putamen basal ganglia. A total of 22,832 eQTL features were used in this study, i.e., each SNP in $S_d$ was represented by 22,832 eQTL features.

2.3. mRMR and IFS methods

To analyze the 22,832 eQTL features mentioned in Section 2.2, a popular feature selection method, the mRMR method [19], which has been widely applied to analyze various complicated biological problems [22–33], was adopted in this study. Through this method, all features can be ranked in two lists, named the MaxRel feature list and the mRMR feature list. Two criteria were employed in this method to produce the lists: Max-Relevance and Min-Redundancy. The former indicates that a feature with maximum relevance to the target has priority to be selected, while the latter suggests that a feature with minimum redundancy to the already selected features will be selected preferentially. The 22,832 features were first ranked based on the Max-Relevance criterion, which yielded the MaxRel feature list. Then, to reduce computational complexity, only the 5658 features with scores larger than zero were further investigated using both criteria. The evaluation of relevance and redundancy was based on the mutual information (MI) between two variables x and y, which can be computed by

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \tag{1}$$

where p(x) and p(y) are the marginal probability densities of x and y, respectively, and p(x, y) is their joint probability density.

To produce the mRMR feature list, a loop procedure was executed in the mRMR method. Here, we let $\Omega$ be the set consisting of all features, $\Omega_s$ be the set containing the features that have been selected (initially, it is an empty set) and $\Omega_t$ be the set containing the remaining features. In each round, a feature was selected from $\Omega_t$ and moved to $\Omega_s$. For each feature f in $\Omega_t$, the D value, defined as I(f, c), was calculated, where c is the target variable. On the other hand, the R value, defined as $\frac{1}{|\Omega_s|} \sum_{f' \in \Omega_s} I(f, f')$ (it is set to zero if $\Omega_s$ is an empty set), was further computed. As mentioned above, the Max-Relevance and Min-Redundancy criteria were both considered in the mRMR method. To account for both, the D − R value for each feature in $\Omega_t$ was computed. The feature with the maximum D − R value was selected and moved from $\Omega_t$ to $\Omega_s$. When all features were in $\Omega_s$, the loop stopped. According to the selection order of each feature, the
Table 1
The selected top 49 features in mRMR feature list.
Rank Tissue Ensembl gene ID Gene name Score
1 Caudate_basal_ganglia ENSG00000115524 SF3B1 0.09149
2 Frontal_Cortex_BA9 ENSG00000132911 NMUR2 0.06147
3 Cerebellar_Hemisphere ENSG00000116786 PLEKHM2 0.04576
4 Cerebellar_Hemisphere ENSG00000255098 RP11-481A20.11 0.04614
5 Cerebellum ENSG00000203817 FAM72C 0.02229
6 Hypothalamus ENSG00000226747 AC007966.1 0.01919
7 Cerebellar_Hemisphere ENSG00000145888 GLRA1 0.02411
8 Cerebellum ENSG00000247828 TMEM161B-AS1 0.01839
9 Cerebellum ENSG00000116786 PLEKHM2 0.01345
10 Cerebellum ENSG00000173295 FAM86B3P 0.01305
11 Cerebellum ENSG00000119711 ALDH6A1 0.01311
12 Nucleus_accumbens_basal_ganglia ENSG00000184905 TCEAL2 0.01295
13 Cerebellum ENSG00000227888 FAM66A 0.01330
14 Cerebellar_Hemisphere ENSG00000255020 AF131216.5 0.01211
15 Caudate_basal_ganglia ENSG00000242353 RPL12P30 0.01156
16 Cerebellar_Hemisphere ENSG00000132436 FIGNL1 0.01084
17 Cerebellum ENSG00000226747 AC007966.1 0.01063
18 Cerebellar_Hemisphere ENSG00000254507 RP11-481A20.10 0.01132
19 Cortex ENSG00000247828 TMEM161B-AS1 0.01127
20 Cortex ENSG00000132911 NMUR2 0.00980
21 Cerebellar_Hemisphere ENSG00000255310 AF131215.2 0.00963
22 Hippocampus ENSG00000255020 AF131216.5 0.00945
23 Cerebellar_Hemisphere ENSG00000173295 FAM86B3P 0.00937
24 Cerebellar_Hemisphere ENSG00000227888 FAM66A 0.00877
25 Cortex ENSG00000184905 TCEAL2 0.00826
26 Hypothalamus ENSG00000247828 TMEM161B-AS1 0.00800
27 Cerebellar_Hemisphere ENSG00000226747 AC007966.1 0.00846
28 Putamen_basal_ganglia ENSG00000132436 FIGNL1 0.00755
29 Nucleus_accumbens_basal_ganglia ENSG00000254532 RP11-624D11.2 0.00735
30 Frontal_Cortex_BA9 ENSG00000255556 RP11-351I21.6 0.00746
31 Frontal_Cortex_BA9 ENSG00000173295 FAM86B3P 0.00660
32 Caudate_basal_ganglia ENSG00000255556 RP11-351I21.6 0.00654
33 Cortex ENSG00000255020 AF131216.5 0.00632
34 Anterior_cingulate_cortex_BA24 ENSG00000247828 TMEM161B-AS1 0.00662
35 Putamen_basal_ganglia ENSG00000226747 AC007966.1 0.00688
36 Nucleus_accumbens_basal_ganglia ENSG00000076003 MCM6 0.00582
37 Frontal_Cortex_BA9 ENSG00000184905 TCEAL2 0.00576
38 Cerebellar_Hemisphere ENSG00000254423 RP11-351I21.7 0.00582
39 Cerebellum ENSG00000132436 FIGNL1 0.00562
40 Caudate_basal_ganglia ENSG00000179344 HLA-DQB1 0.00516
41 Caudate_basal_ganglia ENSG00000254532 RP11-624D11.2 0.00520
42 Cerebellar_Hemisphere ENSG00000247828 TMEM161B-AS1 0.00526
43 Frontal_Cortex_BA9 ENSG00000255020 AF131216.5 0.00535
44 Cortex ENSG00000226747 AC007966.1 0.00552
45 Cerebellum ENSG00000255556 RP11-351I21.6 0.00523
46 Cerebellum ENSG00000114735 HEMK1 0.00494
47 Cerebellum ENSG00000253893 FAM85B 0.00462
48 Cerebellum ENSG00000255020 AF131216.5 0.00428
49 Cerebellum ENSG00000149485 FADS1 0.00427
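
Table 1 is the head of the mRMR feature list. As a rough illustration of the ranking procedure from Section 2.3 that produces such a list, the following minimal sketch greedily selects the feature with the largest D − R value in each round. It is not the authors' implementation: the function names `mutual_information` and `mrmr_rank` are hypothetical, and the sketch assumes that feature values have already been discretized, so that the integral in Eq. (1) is replaced by a sum over observed value pairs.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (xv, yv), c_xy in count_xy.items():
        # p(x, y) / (p(x) p(y)) simplifies to c_xy * n / (c_x * c_y)
        mi += (c_xy / n) * math.log(c_xy * n / (count_x[xv] * count_y[yv]))
    return mi


def mrmr_rank(features, target):
    """Return feature indices in greedy mRMR selection order (maximum D - R)."""
    remaining = list(range(len(features)))  # plays the role of Omega_t
    selected = []                           # plays the role of Omega_s
    # D value of each feature: relevance to the target variable c
    relevance = [mutual_information(f, target) for f in features]
    while remaining:
        best, best_score = None, -math.inf
        for j in remaining:
            if selected:
                # R value: mean MI with the already selected features
                redundancy = sum(
                    mutual_information(features[j], features[k]) for k in selected
                ) / len(selected)
            else:
                redundancy = 0.0  # R is set to zero while Omega_s is empty
            score = relevance[j] - redundancy
            if score > best_score:  # strict ">" keeps the lowest index on ties
                best, best_score = j, score
        remaining.remove(best)
        selected.append(best)
    return selected


# Toy data (illustrative only): feature 1 duplicates feature 0.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
target = [2 * ai + bi for ai, bi in zip(a, b)]
print(mrmr_rank([a, list(a), b], target))
```

In this toy case the duplicated feature is pushed to the end of the list, because its redundancy with the already selected copy cancels its relevance. That is the behavior the Min-Redundancy criterion is meant to produce.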
mRMR feature list can be built in a way that the first selected feature takes the first place in the list, followed by the second selected feature, the third selected feature and so forth. For formulation, the obtained mRMR feature list is denoted as

$$F = [f_1, f_2, \ldots, f_N] \tag{2}$$

As mentioned above, the mRMR method only yields the mRMR feature list. Identifying which features should be selected is still a problem. Thus, the IFS method was employed in this study. Features with high ranks in the mRMR feature list are more important than those with low ranks, and the combination of some top features provides a key contribution to the classification of the positive and negative SNPs. In view of this, we constructed a series of feature sets from the mRMR feature list F, denoted as $FS_1, FS_2, \ldots, FS_N$, where $FS_i = \{f_1, f_2, \ldots, f_i\}$, i.e., it contained the top i features in F. For each constructed feature set, all SNPs in $S_d$ were represented by the features in this set, and a classification algorithm was executed on these SNPs with its performance evaluated by one of the cross-validation methods [34,35]. After all of the feature sets were tested, the feature set yielding the best performance was identified. This feature set was termed the optimal feature set, and the features in this set were called the optimal features, which capture the essential differences between the positive and negative SNPs. In addition, an optimal classifier was built, which used the optimal features to represent the SNPs and the classification algorithm as the prediction engine.

2.4. Classification algorithm

In the IFS method, a classification algorithm is always employed to evaluate the constructed feature sets. In this study, we adopted Dagging [20], which is a type of meta-algorithm. For a given training dataset with N samples, the Dagging algorithm constructs M (M is a free parameter) datasets from the original training dataset. Each dataset contains n (nM < N) samples, and no two datasets have common samples. For each dataset, one basic classifier is trained on it to build a classification model. Thus, M datasets induce M classification models. Given a query sample, each classification model yields its predicted class, and the class receiving the most votes is
deemed as the predicted class of the Dagging.

Weka [36] is a software suite that contains a collection of several popular machine learning algorithms and data processing tools. It includes a classifier named “Dagging” that implements the Dagging algorithm mentioned above. For convenience, it was directly adopted in this study and executed using its default parameters.

Fig. 2. The top 49 features, including 25 genes expressed in ten brain tissues, identified to be related to early wake-up. The gene in red was reported to be associated with narcolepsy. Genes in cyan have clues suggesting a correlation with circadian rhythm. The 25 genes were sorted based on the highest mRMR score in the brain tissues.
Tissue labels: Anterior_cingulate_cortex_BA24, Caudate_basal_ganglia, Cerebellar_Hemisphere, Cerebellum, Cortex, Frontal_Cortex_BA9, Hippocampus, Hypothalamus, Nucleus_accumbens_basal_ganglia, Putamen_basal_ganglia.
ENSEMBL Gene ID Gene Name Score
ENSG00000115524 SF3B1 0.091
ENSG00000132911 NMUR2 0.061
ENSG00000255098 RP11-481A20.11 0.046
ENSG00000116786 PLEKHM2 0.046
ENSG00000145888 GLRA1 0.024
ENSG00000203817 FAM72C 0.022
ENSG00000226747 AC007966.1 0.019
ENSG00000247828 TMEM161B-AS1 0.018
ENSG00000227888 FAM66A 0.013
ENSG00000119711 ALDH6A1 0.013
ENSG00000173295 FAM86B3P 0.013
ENSG00000184905 TCEAL2 0.013
ENSG00000255020 AF131216.5 0.012
ENSG00000242353 RPL12P30 0.012
ENSG00000254507 RP11-481A20.10 0.011
ENSG00000132436 FIGNL1 0.011
ENSG00000255310 AF131215.2 0.010
ENSG00000255556 RP11-351I21.6 0.007
ENSG00000254532 RP11-624D11.2 0.007
ENSG00000076003 MCM6 0.006
ENSG00000254423 RP11-351I21.7 0.006
ENSG00000179344 HLA-DQB1 0.005
ENSG00000114735 HEMK1 0.005
ENSG00000253893 FAM85B 0.005
ENSG00000149485 FADS1 0.004

2.5. Accuracy measurements

As mentioned in Section 2.3, each constructed feature set was tested by a classification algorithm that was evaluated by a cross-validation method. In this study, we selected the ten-fold cross-validation [34]. In this method, the original dataset is equally and randomly divided into ten parts. Samples in each part are singled out in turn as testing samples, which are tested by the model trained on the samples in the remaining nine parts. Compared to another cross-validation method, like the jackknife test [35], this method takes less time and always yields similar results.

In the dataset $S_d$, only two types of SNPs were involved. Thus, it is a binary classification problem. The predicted results of this type of problem can always be counted as four measurements, including sensitivity (SN), specificity (SP), accuracy (ACC) [37–40], and the Matthews's correlation coefficient (MCC) [41], which are computed by the following equations:

$$SN = \frac{TP}{TP + FN} \tag{3}$$

$$SP = \frac{TN}{TN + FP} \tag{4}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{6}$$

where TP, TN, FP, and FN refer to the number of positive samples that are predicted correctly, the number of negative samples that are predicted correctly, the number of negative samples that are predicted incorrectly, and the number of positive samples that are predicted incorrectly, respectively.

Among the above four measurements, MCC is deemed a balanced measurement that can give a fair evaluation even if the sizes of the classes differ greatly. MCC was first proposed by Matthews in 1975 [41]. Its value ranges from −1 to 1. In detail, 1 means a perfect classification, 0 indicates that the predicted results are no better than random predictions, and −1 represents a total misclassification. In this study, it was used as the key measurement, i.e., the performance of the Dagging on the different feature sets is mainly measured by MCC. The other three measurements were provided as references.

3. Results

As described in Section 2.2, 22,832 eQTL features were constructed based on the published eQTL results in ten brain tissues and ranked based on their relevance in the MaxRel feature list. A total of 5658 features had scores higher than zero, suggesting their relevance to morningness. These features were further analyzed through the mRMR method, yielding the mRMR feature list that is provided in Supplementary material S1.

To determine which features can be optimally combined for predicting the positive and negative SNPs, the IFS method was employed as mentioned in Section 2.3. From the mRMR feature list provided in Supplementary material S1, 5658 feature sets, say $FS_1, FS_2, \ldots, FS_{5658}$, were constructed. Each feature set was tested by executing the Dagging on the dataset $S_d$, in which each SNP was represented by the features in the set, with its performance evaluated by a ten-fold cross-validation. The predicted results were counted as the measurements calculated by Eqs. (3)–(6). After all of the feature sets were tested, a series of SNs, SPs,

Fig. 3. Tissue distribution of the 4373 optimal features. Categories: Brain_Putamen_basal_ganglia, Brain_Nucleus_accumbens_basal_ganglia, Brain_Hypothalamus, Brain_Hippocampus, Brain_Frontal_Cortex_BA9, Brain_Cortex, Brain_Cerebellum, Brain_Cerebellar_Hemisphere, Brain_Caudate_basal_ganglia, Brain_Anterior_cingulate_cortex_BA24; axis ticks from 0 to 1200.
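
The four measurements of Section 2.5 and the IFS selection step used in the Results can be illustrated with a short sketch. It is only an illustration under stated assumptions: `binary_metrics` and `best_feature_count` are hypothetical helper names, the counts in the usage lines are made-up numbers rather than results from this study, both classes are assumed to be present (so the SN and SP denominators are non-zero), and reporting an MCC of 0 when the MCC denominator vanishes is a convention chosen here.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute SN, SP, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                       # sensitivity
    sp = tn / (tn + fp)                       # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0 when the denominator is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


def best_feature_count(mcc_by_size):
    """Return the feature-set size with the highest MCC (first one on ties)."""
    return max(mcc_by_size, key=mcc_by_size.get)


# Made-up counts for illustration only.
print(binary_metrics(tp=90, tn=95, fp=5, fn=10))
# Made-up IFS results: number of top features -> cross-validated MCC.
print(best_feature_count({1: 0.20, 2: 0.80, 3: 0.75}))
```

In an IFS run, one such set of measurements would be recorded for every feature set, and the size returned by `best_feature_count` would correspond to the peak of the IFS curve.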
ACCs and MCCs were obtained, which are available in Supplementary material S2. As described in Section 2.5, the MCC was selected as the major measurement. Thus, we tried to find the feature set yielding the maximum MCC. For easy observation, a curve, namely an IFS curve, was plotted and is shown in Fig. 1, in which the MCC was set to the Y-axis, and the number of features used was set to the X-axis. It was observed that the IFS curve first follows a sharp increasing trend and then becomes stable. The maximum MCC was 0.933 when the number of features was 4374, meaning that the top 4374 features in the mRMR feature list yield the best performance for Dagging, which indicates that these features have a strong association with sleep-wake time variations. The obtained 4374 features were called optimal features and comprised the optimal feature set. In addition, an optimal classifier was built, which used the optimal 4374 features to represent SNPs and Dagging as the prediction engine. The SN, SP and ACC yielded by this classifier were 0.943, 0.990, and 0.966, respectively, suggesting it is a good classifier.

4. Discussion

In total, 4374 optimal eQTL features were extracted in Section 3. These features are deemed to be highly related to sleep-wake time variations. However, it is impractical to analyze them one by one. As mentioned above, the IFS curve follows a sharp increasing trend in the beginning, which means that some of the top features are more important than others. By carefully checking the IFS curve, we found that the MCC first exceeds 0.850 when 49 features are used (SN, SP, ACC and MCC are 0.852, 0.994, 0.923, and 0.855, respectively), meaning that the top 49 features in the mRMR feature list, which are listed in Table 1, are more important. Thus, in this study, we only analyzed them to identify their functional roles in the sleep-wake cycle as follows. Four genes were found to be related to circadian rhythm by some previous studies.

The 49 tissue-gene expression patterns of the 49 features included 25 genes expressed in ten tissues of the human brain. We hypothesized that those 25 key genes were related to circadian rhythm through their expression in brain tissues. To validate this hypothesis, we investigated these genes in biological functions, pathways and processes.

We first did the functional annotation of the 25 genes, including the GO terms, the KEGG pathway, InterPro, etc., through the online database and tool DAVID [42]. The result showed that none of these 25 genes were related to circadian rhythm or sleep disorder. However, since the information in the database is usually incomplete, we did a further literature review of the 25 genes. Promisingly, we found one gene that was reported to be directly associated with a sleeping disorder, narcolepsy, while another three had clues suggesting potential roles in circadian rhythm (Fig. 2).

Narcolepsy is a sleep disorder of the regulation of sleep and wakefulness, resulting in a variety of symptoms, such as excessive daytime sleepiness (EDS), cataplexy, hypnagogic hallucinations, sleep paralysis and disturbed nocturnal sleep [43]. Narcolepsy is tightly associated with human leukocyte antigen (HLA) or, in other words, a specific HLA allele, i.e., HLA-DQB1*06:02 [44]. A functional HLA-DQ molecule originates from the binding of an α chain (DQA1) with a β chain (DQB1). Worldwide, approximately 85–95% of the narcolepsy with cataplexy patients carry a specific haplotype DQB1*06:02-DQA1*01:02, compared to 12–38% of the general population [45]. Another haplotype, DRB1*1501-DQB1*0602, is suggested as almost necessary but not sufficient for developing narcolepsy [46,47]. Further studies suggest that the dosage of HLA also affects the development of narcolepsy [48]. In our study, we predicted 25 key genes, including HLA-DQB1. This result suggests an important role of the expression of this gene as well as HLA in sleeping control.

There are another three genes without direct experimental evidence but with some clues indicating their potential roles in the circadian clock. The first gene, FADS1, is one member of the fatty acid desaturase (FADS) gene cluster (11q12-13.1), which mediates long-chain polyunsaturated fatty acids (LCPUFAs) [49]. Since a low proportion of LCPUFAs may cause sleep problems [50], the expression of this gene is likely to affect the sleep-wake cycle. The second gene, NMUR2, shows a circadian expression in rat [51] and encodes a receptor for neuromedin S (NMS) [52], which is also reported to be expressed in the suprachiasmatic nucleus (SCN, which is believed to control the circadian rhythm [13]) and might play a role in the circadian rhythm [53]. The last gene, GLRA1, is reported to have a mutation leading to hyperekplexia, which has the symptom of periodic limb movements in sleep. These findings indicated the potential influence of all three genes in the sleep-wake cycle.

In summary, these four genes are functionally related to the circadian rhythm and sleeping regulation via direct or indirect evidence, while there is a lack of evidence for the other 21 to support their relationship with the circadian rhythm. This result indicated the significance of our method in identifying the morningness-related gene expressions affected by previously reported SNPs. Interestingly, we found that most of these 25 genes showed significant expression changes in the cerebellar hemisphere and cerebellum, but not the hypothalamus, in which the SCN controls the circadian rhythm [13]. The distribution of the 4373 optimal features also shows the same enrichment (Fig. 3). This phenomenon could be attributed to three reasons. First, the significant expression changes could be cues generated by the SCN or the behaviors of the peripheral organs and cells. Second, circadian gene expression might be observed not only in the SCN but also in peripheral tissues, such as the liver [54]. Third, the tissue samples were from post-mortem donors, which might not contain all of the expression changes related to sleep/wake cycle control.

5. Conclusions

In this study, we tried to extract morningness-associated gene expression patterns. Based on the results, four genes were functionally related to circadian rhythm and sleeping regulation by previous studies, while the other 21 genes could also be associated with early wake-up. However, there is currently a lack of evidence for this. We believe that this study, as a pioneer investigation on interpreting the mechanisms of morningness-associated SNPs in affecting the sleep-wake cycle by identifying the downstream gene expression patterns, will shed light on further research involving circadian rhythm.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.bbadis.2017.10.036.

Transparency document

The Transparency document associated with this article can be found in the online version.

Acknowledgements

This study was supported by the National Natural Science Foundation of China (31371335, 31701151), the Natural Science Foundation of Shanghai (17ZR1412500), the Shanghai Sailing Program, The Youth Innovation Promotion Association of Chinese Academy of Sciences (CAS) (2016245), Training and Assistance Plan of Shanghai Young College Teacher.

References

[1] J.C. Dunlap, Molecular bases for circadian clocks, Cell 96 (1999) 271–290.
[2] U. Schibler, P. Sassone-Corsi, A web of circadian pacemakers, Cell 111 (2002) 919–922.
[3] C. Dibner, U. Schibler, U. Albrecht, The mammalian circadian timing system: organization and coordination of central and peripheral clocks, Annu. Rev. Physiol. 72 (2010) 517–549.
[4] K.G. Baron, K.J. Reid, Circadian misalignment and health, Int. Rev. Psychiatry 26 (2014) 139–154.
[5] P.C. Zee, H. Attarian, A. Videnovic, Circadian rhythm abnormalities, Continuum 19 (2013) 132–147.
[6] R.L. Sack, D. Auckley, R.R. Auger, M.A. Carskadon, K.P. Wright Jr., M.V. Vitiello, I.V. Zhdanova, American Academy of Sleep Medicine, Circadian rhythm sleep disorders: part II, advanced sleep phase disorder, delayed sleep phase disorder, free-running disorder, and irregular sleep-wake rhythm. An American Academy of Sleep Medicine review, Sleep 30 (2007) 1484–1501.
[7] K. Ando, D.F. Kripke, S. Ancoli-Israel, Delayed and advanced sleep phase symptoms, Isr. J. Psychiatry Relat. Sci. 39 (2002) 11–18.
[8] K.J. Reid, P.C. Zee, Circadian rhythm sleep disorders, Handb. Clin. Neurol. 99 (2011) 963–977.
[9] T.I. Morgenthaler, T. Lee-Chiong, C. Alessi, L. Friedman, R.N. Aurora, B. Boehlecke, T. Brown, A.L. Chesson Jr., V. Kapur, R. Maganti, J. Owens, J. Pancer, T.J. Swick, R. Zak, Standards of Practice Committee of the American Academy of Sleep Medicine, Practice parameters for the clinical evaluation and treatment of circadian rhythm sleep disorders. An American Academy of Sleep Medicine report, Sleep 30 (2007) 1445–1459.
[10] J.S. Takahashi, H.K. Hong, C.H. Ko, E.L. McDearmon, The genetics of mammalian circadian order and disorder: implications for physiology and disease, Nat. Rev. Genet. 9 (2008) 764–775.
[11] Y. Xu, Q.S. Padiath, R.E. Shapiro, C.R. Jones, S.C. Wu, N. Saigoh, K. Saigoh, L.J. Ptacek, Y.H. Fu, Functional consequences of a CKIdelta mutation causing familial advanced sleep phase syndrome, Nature 434 (2005) 640–644.
[12] R. Zhang, N.F. Lahens, H.I. Ballance, M.E. Hughes, J.B. Hogenesch, A circadian gene expression atlas in mammals: implications for biology and medicine, Proc. Natl. Acad. Sci. U. S. A. 111 (2014) 16219–16224.
[13] J.M. Parish, Genetic and immunologic aspects of sleep and sleep disorders, Chest 143 (2013) 1489–1499.
[14] P.M. Visscher, M.A. Brown, M.I. McCarthy, J. Yang, Five years of GWAS discovery, Am. J. Hum. Genet. 90 (2012) 7–24.
[15] Y. Hu, A. Shmygelska, D. Tran, N. Eriksson, J.Y. Tung, D.A. Hinds, GWAS of 89,283 individuals identifies genetic variants associated with self-reporting of being a morning person, Nat. Commun. 7 (2016) 10448.
[16] W. Cookson, L. Liang, G. Abecasis, M. Moffatt, M. Lathrop, Mapping complex disease traits with global gene expression, Nat. Rev. Genet. 10 (2009) 184–194.
[17] P. Li, M. Guo, C. Wang, X. Liu, Q. Zou, An overview of SNP interactions in genome-wide association studies, Brief. Funct. Genomics 14 (2015) 143–155.
[18] GTEx Consortium, Human genomics. The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans, Science 348 (2015) 648–660.
[19] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 1226–1238.
[20] K.M. Ting, I.H. Witten, Stacking bagged and dagged models, in: Fourteenth International Conference on Machine Learning, San Francisco, CA, 1997.
[21] Human genomics. The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans, Science 348 (2015) 648–660.
[22] L. Chen, Y.-H. Zhang, G. Lu, T. Huang, Y.-D. Cai, Analysis of cancer-related lncRNAs using gene ontology and KEGG pathways, Artif. Intell. Med. 76 (2017) 27–36.
[23] L. Liu, L. Chen, Y.-H. Zhang, L. Wei, S. Cheng, X.-Y. Kong, M. Zheng, T. Huang, Y.-D. Cai, Analysis and prediction of drug-drug interaction by minimum redundancy maximum relevance and incremental feature selection, J. Biomol. Struct. Dyn. 35 (2017) 312–329.
[24] H. Mohabatkar, M. Mohammad Beigi, K. Abdolahi, S. Mohsenzadeh, Prediction of allergenic proteins by means of the concept of Chou's pseudo amino acid composition and a machine learning approach, Med. Chem. 9 (2013) 133–137.
[25] L. Chen, C. Chu, K. Feng, Predicting the types of metabolic pathway of compounds using molecular fragments and sequential minimal optimization, Comb. Chem. High Throughput Screen. 19 (2016) 136–143.
[26] Q. Ni, L. Chen, A feature and algorithm selection method for improving the prediction of protein structural classes, Comb. Chem. High Throughput Screen. (2017), http://dx.doi.org/10.2174/1386207320666170314103147 (E-pub ahead of print).
[27] Z. Li, X. Zhou, Z. Dai, X. Zou, Classification of G-protein coupled receptors based on support vector machine with maximum relevance minimum redundancy and genetic algorithm, BMC Bioinf. 11 (2010) 325.
[28] L. Chen, Y.H. Zhang, M. Zheng, T. Huang, Y.D. Cai, Identification of compound-protein interactions through the analysis of gene ontology, KEGG enrichment for proteins and molecular fragments of compounds, Mol Gen Genet 291 (2016) 2065–2079.
[29] Y. Zhang, C. Ding, T. Li, Gene selection algorithm by combining reliefF and mRMR, BMC Genomics 9 (2008) S27.
[30] L. Chen, Y.-H. Zhang, T. Huang, Y.-D. Cai, Gene expression profiling gut microbiota in different races of humans, Sci. Rep. 6 (2016) 23075.
[31] Z. Cai, D. Xu, Q. Zhang, J. Zhang, S.-M. Ngai, J. Shao, Classification of lung cancer using ensemble-based feature selection and machine learning methods, Mol. BioSyst. 11 (2015) 791–800.
[32] L. He, Z. Cao, Y. Wang, W. Du, Y. Liang, An ensemble feature selection method based on mRMR for paired microarray data, J. Comput. Inf. Syst. 10 (2014) 4875–4882.
[33] L. Chen, C. Chu, T. Huang, X. Kong, Y.-D. Cai, Prediction and analysis of cell-penetrating peptides using pseudo-amino acid composition and random forest models, Amino Acids 47 (2015) 1485–1493.
[34] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, International Joint Conference on Artificial Intelligence, vol. 14, Lawrence Erlbaum Associates Ltd, 1995, pp. 1137–1145.
[35] L. Chen, W.M. Zeng, Y.D. Cai, K.Y. Feng, K.C. Chou, Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities, PLoS One 7 (2012) e35254.
[36] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn, Morgan Kaufmann, San Francisco, 2005.
[37] J. Lu, S. Wang, Y.D. Cai, Q. Zhang, Analysis and prediction of nitrated tyrosine sites with mRMR method and support vector machine algorithm, Curr. Bioinforma. 11 (2016), http://dx.doi.org/10.2174/1574893611666160608075753 (E-pub ahead of print).
[38] B.Q. Li, Y.D. Cai, K.Y. Feng, G.J. Zhao, Prediction of protein cleavage site with feature selection by random forest, PLoS One 7 (2012) e45854.
[39] Y. Cai, J. He, L. Lu, Predicting sumoylation site by feature selection method, J. Biomol. Struct. Dyn. 28 (2011) 797–804.
[40] L. Chen, K.Y. Feng, Y.D. Cai, K.C. Chou, H.P. Li, Predicting the network of substrate-enzyme-product triads by combining compound similarity and functional domain composition, BMC Bioinf. 11 (2010) 293.
[41] B.W. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochim. Biophys. Acta 405 (1975) 442–451.
[42] D.W. Huang, B.T. Sherman, R.A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources, Nat. Protoc. 4 (2009) 44–57.
[43] Y. Dauvilliers, I. Arnulf, E. Mignot, Narcolepsy with cataplexy, Lancet 369 (2007) 499–511.
[44] E. Thorsby, Invited anniversary review: HLA associated diseases, Hum. Immunol. 53 (1997) 1–11.
[45] E. Mignot, R. Hayduk, J. Black, F.C. Grumet, C. Guilleminault, HLA DQB1*0602 is associated with cataplexy in 509 narcoleptic patients, Sleep 20 (1997) 1012–1020.
[46] H. Hor, Z. Kutalik, Y. Dauvilliers, A. Valsesia, G.J. Lammers, C.E. Donjacour, A. Iranzo, J. Santamaria, R. Peraita Adrados, J.L. Vicario, S. Overeem, I. Arnulf, I. Theodorou, P. Jennum, S. Knudsen, C. Bassetti, J. Mathis, M. Lecendreux, G. Mayer, P. Geisler, A. Beneto, B. Petit, C. Pfister, J.V. Burki, G. Didelot, M. Billiard, G. Ercilla, W. Verduijn, F.H. Claas, P. Vollenweider, G. Waeber, D.M. Waterworth, V. Mooser, R. Heinzer, J.S. Beckmann, S. Bergmann, M. Tafti, Genome-wide association study identifies new HLA class II haplotypes strongly protective against narcolepsy, Nat. Genet. 42 (2010) 786–789.
[47] E. Mignot, L. Lin, W. Rogers, Y. Honda, X. Qiu, X. Lin, M. Okun, H. Hohjoh, T. Miki, S. Hsu, M. Leffell, F. Grumet, M. Fernandez-Vina, M. Honda, N. Risch, Complex HLA-DR and -DQ interactions confer risk of narcolepsy-cataplexy in three ethnic groups, Am. J. Hum. Genet. 68 (2001) 686–699.
[48] A. van der Heide, W. Verduijn, G.W. Haasnoot, J.J. Drabbels, G.J. Lammers, F.H. Claas, HLA dosage effect in narcolepsy with cataplexy, Immunogenetics 67 (2015) 1–6.
[49] J.Y. Zhang, K.S. Kothapalli, J.T. Brenna, Desaturase and elongase-limiting endogenous long-chain polyunsaturated fatty acid biosynthesis, Curr. Opin. Clin. Nutr. Metab. Care 19 (2016) 103–110.
[50] J.R. Burgess, L. Stevens, W. Zhang, L. Peck, Long-chain polyunsaturated fatty acids in children with attention-deficit hyperactivity disorder, Am. J. Clin. Nutr. 71 (2000) 327S–330S.
[51] S. Aizawa, I. Sakata, M. Nagasaka, Y. Higaki, T. Sakai, Negative regulation of neuromedin U mRNA expression in the rat pars tuberalis by melatonin, PLoS One 8 (2013) e67118.
[52] P.J. Brighton, P.G. Szekeres, G.B. Willars, Neuromedin U and its receptors: structure, function, and physiological roles, Pharmacol. Rev. 56 (2004) 231–248.
[53] K. Mori, M. Miyazato, T. Ida, N. Murakami, R. Serino, Y. Ueta, M. Kojima, K. Kangawa, Identification of neuromedin S and its possible role in the mammalian circadian oscillator system, EMBO J. 24 (2005) 325–335.
[54] S. Luck, P.O. Westermark, Circadian mRNA expression: insights from modeling and transcriptomics, Cell. Mol. Life Sci. 73 (2016) 497–521.