
A Knowledge-Based Machine Learning Approach to Gene Prioritisation in Amyotrophic Lateral Sclerosis


Jiajing Hu1,2, Rosalba Lepore3, Richard Dobson1, Ammar Al-Chalabi2, Daniel Bean1, and Alfredo Iacoangeli1,2,#
1Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King’s College London,
London, UK; 2Department of Basic and Clinical Neuroscience, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry,
Psychology & Neuroscience, King’s College London, London, UK; 3BSC-CNS Barcelona Supercomputing Center, Barcelona, Spain;
#Correspondence should be addressed to alfredo.iacoangeli@kcl.ac.uk

Introduction
As a result of the advent of high-throughput technologies,
there has been rapid progress in our understanding of the
genetics underlying biological processes. Despite such
advances, the genetic landscape of human diseases has
been only partially uncovered.
Amyotrophic Lateral Sclerosis (ALS) represents this scenario well.
In ALS, several rare disruptive gene variants have been
associated with the disease and are responsible for about
15% of all cases. Although our knowledge of the genetic
landscape of this disease is improving, it remains limited.
Machine learning models trained on the available protein–
protein interaction and phenotype-genotype association
data can use our current knowledge of the disease genetics
for the prediction of novel candidate genes.
Our method is available via DGLinker at this web address:
https://dglinker.rosalind.kcl.ac.uk.

Figure 1. Method overview: The method takes as input a graph of known data
related to the prediction task (in this case gene-disease links, gene functions,
and others) and returns a list of predicted edges missing from that graph.
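
To make the idea of predicting edges missing from a knowledge graph concrete, the following is a minimal sketch and not the method implemented in DGLinker. It assumes a toy graph of made-up triples and swaps in a simple neighbourhood-overlap score: a candidate gene gains one point for each known disease gene it is directly linked to, and one point for each neighbour (for example a function annotation) it shares with a known disease gene. The node names, the relation labels and the function `predict_missing_edges` are all hypothetical.

```python
from collections import defaultdict

# Toy knowledge graph as (source, relation, target) triples.
# Every name here is invented for illustration; by convention in this
# sketch the source of each triple is a gene.
EDGES = [
    ("GENE_A", "associated_with", "DISEASE_X"),
    ("GENE_B", "associated_with", "DISEASE_X"),
    ("GENE_A", "interacts_with", "GENE_C"),
    ("GENE_B", "interacts_with", "GENE_C"),
    ("GENE_A", "has_function", "FUNC_1"),
    ("GENE_C", "has_function", "FUNC_1"),
    ("GENE_D", "has_function", "FUNC_2"),
    ("GENE_B", "interacts_with", "GENE_D"),
]


def predict_missing_edges(edges, disease, relation="associated_with"):
    """Rank genes not yet linked to `disease` by neighbourhood overlap
    with the genes that are already linked to it."""
    # Undirected adjacency map; relation types are ignored for scoring.
    nbrs = defaultdict(set)
    for source, _rel, target in edges:
        nbrs[source].add(target)
        nbrs[target].add(source)

    genes = {source for source, _rel, _target in edges}
    known = {s for s, rel, t in edges if rel == relation and t == disease}

    scores = {}
    for gene in genes - known:
        score = 0
        for k in known:
            # One point for a direct link to a known disease gene.
            score += int(k in nbrs[gene])
            # One point per shared neighbour, excluding the disease itself.
            score += len((nbrs[gene] & nbrs[k]) - {disease})
        if score > 0:
            scores[gene] = score

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for gene, score in predict_missing_edges(EDGES, "DISEASE_X"):
        print(gene, score)
```

On this toy graph the sketch ranks GENE_C above GENE_D, because GENE_C interacts with both known disease genes and shares a function annotation with one of them. A trained model would learn which relation types are informative rather than weighting them equally.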

Objectives
Our objective is to exploit the available biological
information from publicly available datasets and our
current knowledge of the disease genetics for the
prediction of novel candidate genes, and to investigate the
relevance of the predicted genes in ALS.

Method
We developed a knowledge-based machine learning
method for this purpose (Figure 1). We trained our model
on protein–protein interaction data from IntAct, gene
function annotation from Gene Ontology, and known
disease-gene associations from DisGeNet. These three
databases are the default setting in DGLinker. Using several
sets of known ALS genes from public databases and a
manual review as input, we generated a list of new
candidate genes for each input set [1].
We also developed DGLinker [2], a webserver that
implements and generalizes our machine learning method,
making it publicly and easily accessible for the prediction of
candidate genes associated with any target human disease.
DGLinker has a user-friendly interface that allows non-expert
users to exploit biomedical information from a wide
range of biological and phenotypic databases, and/or to
upload their own data, to generate a knowledge graph and
use machine learning to predict new disease-associated
genes. The webserver includes tools to explore and
interpret the results and generates publication-ready
figures.

Results
We predicted new ALS genes using sets of known ALS genes from
four sources: ClinVar, DisGeNet, OMIM and a manual literature review
[3]. In total, 651 genes were predicted. We investigated the relevance
of the predicted genes in ALS by using the available summary
statistics from the largest ALS genome-wide association study and
by performing functional and phenotype enrichment analysis. The
predictions were enriched for genes associated with biological
processes known to be affected by ALS pathogenesis, and with
other neurodegenerative diseases for which evidence of
phenotypic and genetic overlap with ALS exists. Using ALS genes
from ClinVar and our manual review as input, the predicted sets
were enriched for ALS-associated genes (ClinVar p = 0.038 and
manual review p = 0.060) when used for gene prioritisation in a
genome-wide association study. As our predictions were based on
data released in 2019, in March 2021 we retrospectively tested the
validity of our predictions by assessing whether any of the genes
discovered to be associated with ALS between the end of 2019
and 2021 had been predicted by our method. Five out of seven
discovered genes were present among our predictions (p = 0.012).
These were ATXN1, ATXN3, SCFD1, CAV1 and SPTLC1. Only ACSL5
and GLT8D1 were not present among the predicted genes.

Conclusions
The genetics of ALS has proven challenging, and even though
great progress has been made by generating and analysing large
multi-omics datasets, the causes of ALS in most patients (~85%)
remain unexplained. Using machine learning models to leverage
our current knowledge of ALS and other diseases could allow us
to accelerate our progress in the understanding of the genetic
causes of ALS and lead towards new avenues of treatment. Our
method is available via DGLinker at this web address:
https://dglinker.rosalind.kcl.ac.uk. The webserver is free
and open to all users without the need for registration.
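
The retrospective check reported in the Results (five of seven newly discovered genes found among 651 predictions) is an overlap test. One common way to assess such an overlap is a one-sided hypergeometric test, sketched below. This is an illustration under stated assumptions: the poster does not say which test or which background gene universe produced p = 0.012, so the background size used here (`ASSUMED_BACKGROUND`) is hypothetical and the sketch is not expected to reproduce the reported value.

```python
from math import comb


def hypergeom_tail(k, n_draws, n_success, n_total):
    """P(X >= k) when drawing `n_draws` items without replacement from a
    population of `n_total` items that contains `n_success` successes."""
    upper = min(n_draws, n_success)
    favourable = sum(
        comb(n_success, i) * comb(n_total - n_success, n_draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_draws)


# Figures taken from the Results section.
N_PREDICTED = 651   # genes predicted in total
N_NEW_GENES = 7     # ALS genes discovered between end of 2019 and 2021
N_OVERLAP = 5       # of those, how many were among the predictions

# ASSUMPTION: size of the gene universe; not stated in the poster.
ASSUMED_BACKGROUND = 20_000

if __name__ == "__main__":
    p = hypergeom_tail(N_OVERLAP, N_NEW_GENES, N_PREDICTED, ASSUMED_BACKGROUND)
    print(f"one-sided hypergeometric p-value: {p:.3g}")
```

The result depends strongly on the chosen background: restricting the universe to genes present in the knowledge graph, for example, would give a larger p-value than a genome-wide universe.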

Table 1. Temporal external validation of predictive performance using DisGeNet. All disease-gene associations in DisGeNet up to 2018 were used to predict associations added
by 2020. Significance is determined at threshold p < 0.05 after 5% false discovery rate correction for multiple comparisons. “Overall” = values for all models, “Significant” =
values for all models that were significant in validation, “Not significant” = values for all models that were not significant in validation. “IQR” = Interquartile range.

All values in the Overall, Significant and Not significant columns are Median (IQR). "KG" = knowledge graph.

| Criteria | Number of diseases | Significant (N, %) | Overall: validation genes in KG | Overall: predictions validated | Overall: training genes | Significant: validation genes in KG | Significant: predictions validated | Significant: training genes | Not significant: validation genes in KG | Not significant: predictions validated | Not significant: training genes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| At least 1 new gene by 2020; At least 1 new gene is in KG | 1024 | 170 (16.6%) | 1 (2) | 1 (1) | 6 (19.25) | 4 (5) | 2 (4) | 9 (20) | 1 (1) | 0 (1) | 6 (19.0) |
| At least 1 new gene by 2020; At least 1 new gene is in KG; Model made at least 1 prediction | 804 | 184 (22.9%) | 1 (2) | 1 (2) | 9 (31) | 3 (5) | 2 (3.25) | 10 (19.25) | 1 (1) | 1 (1) | 8 (32.5) |
| At least 1 new gene by 2020; At least 1 new gene is in KG; Model made at least 1 prediction; Training J >= 0.9 | 346 | 146 (42.2%) | 2 (2) | 1 (1) | 10 (15.75) | 3 (4.75) | 2 (2) | 7 (13) | 1 (1) | 0 (0) | 12 (20.5) |
| At least 1 new gene by 2020; At least 1 new gene is in KG; Model made at least 1 prediction; Validation has sufficient power | 512 | 200 (39.1%) | 2 (3) | 1 (2) | 13 (34) | 3 (5) | 2 (3) | 10.5 (20.25) | 2 (2) | 0 (1) | 16 (38.25) |
| At least 1 new gene by 2020; At least 1 new gene is in KG; Model made at least 1 prediction; Validation has sufficient power; Training J >= 0.9 | 323 | 146 (45.2%) | 2 (3) | 1 (1) | 9 (14) | 3 (4.75) | 2 (2) | 7 (13) | 1 (1) | 0 (0) | 11 (14) |
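
The Table 1 caption states that significance is determined after a 5% false discovery rate correction for multiple comparisons, without naming the procedure. A minimal sketch of one widely used choice, the Benjamini-Hochberg step-up procedure, is given below. Treating this as the procedure behind Table 1 is an assumption, and the function name `benjamini_hochberg` and the example p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, in the same order as `p_values`, that is
    True where the corresponding hypothesis is rejected at FDR `alpha`.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis whose rank is at most that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical per-disease validation p-values.
    example = [0.001, 0.04, 0.03, 0.5]
    print(benjamini_hochberg(example))
```

Because the threshold for each p-value depends on the number of tests, the same model can be significant under one row's criteria and not under another's, which is one possible reason the count of significant diseases does not shrink monotonically as criteria are added.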

Contact
Jiajing Hu
King’s College London
Email: jiajing.hu@kcl.ac.uk

References
[1] Bean, D. M., Al-Chalabi, A., Dobson, R. J., & Iacoangeli, A. (2020). A knowledge-based machine learning approach to gene prioritisation in amyotrophic lateral sclerosis. Genes.
[2] Hu, J., Lepore, R., Dobson, R., Al-Chalabi, A., Bean, D., & Iacoangeli, A. (2021). DGLinker: flexible knowledge-graph prediction of disease-gene associations. Nucleic Acids Research (accepted).
[3] Iacoangeli, A., et al. (2019). ALSgeneScanner: a pipeline for the analysis and interpretation of DNA sequencing data of ALS patients. Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration.
