
Identification of SNPs Relating to Autism Spectrum Disorders Using Whole Genome Genotyping Data


Martha T. Gizaw and Iosif Vaisman
School of Arts & Sciences, The College of William & Mary, Williamsburg, VA; School of Systems Biology, George Mason University, Manassas, VA
Abstract
Single nucleotide polymorphisms (SNPs) are genetic sequence variations that occur at a position where a single DNA nucleotide is switched to another. SNPs in genes that are linked to autism spectrum disorders (ASD) may serve as useful biomarkers for ASD diagnosis and help to clarify the exact genetic causes of ASD. The purpose of this project is to identify ASD-related SNPs based on SNP genotyping of genomic DNA in a large cohort of ASD patients and unaffected related individuals. The dataset retrieved from the Gene Expression Omnibus database (GSE6754) contains more than 6,000 samples from 1,400 families. The SNPs are ranked by the distance in three-dimensional genotype count space between all the affected and unaffected subjects in the cohort. The results demonstrate that the SNPs with the highest ranking distances are likely to be linked to ASD. High-ranking SNPs that currently have no known links to ASD can potentially become novel biomarkers. The involvement of the genes containing these SNPs in biological pathways that might be relevant to ASD is analyzed using a pathway database. The top-ranking SNPs can be used as attributes for machine learning models that identify ASD patients based on their genetic sequencing data. This project is expected to help pinpoint the exact genetic locations in autistic patients that could be used in future research to improve diagnostic and therapeutic interventions.

Keywords: autism, biomarkers, database, genetics, machine learning, SNP

Introduction
• Key Definitions: Machine learning is a field of artificial intelligence that allows a computer to learn from and analyze data, new and old, without human supervision. Applications in areas such as e-commerce and information technology gather enormous amounts of data and make predictions from them. In medicine, machine learning could scan single nucleotide polymorphisms (SNPs) for the prevalence of specific genes that are linked to autism spectrum disorders (ASD). ASD is neurodevelopmental, is diagnosable in the first two to three years of life, and is characterized by social, communication, and behavioral deficits. SNPs are genetic sequence variations that occur at a position where a single DNA nucleotide is switched to another.
• Background Research: Machine learning emerges from recent technologies that rely on frequently used computations to make decisions and adapt to new data. Companies that generate large volumes of data are usually interested in data mining, algorithm design, and cheaper data storage and processing. When machine learning is used in bioinformatics, scientists most often use neural networks, genetic algorithms, and fuzzy logic. One of the greatest problems this project contributes to is the classification of genes that are affected by an illness or disorder and are distinguishable from normal genes. Researchers cannot yet clearly tell how autism is structured genetically, but linkage scans and copy number variation (CNV) analysis in over 1,000 families with at least two affected individuals point to chromosome 11p12-p13 (under linkage analysis) and neurexins (under CNV) as possible autism risk loci. Linkage screening could therefore become a diagnostic tool for autism ancestry.
• Significance: In 2018, the CDC Autism and Developmental Disabilities Monitoring Network reported that about 1 in 59 U.S. children are diagnosed with ASD, and the number is rising. As machine learning transforms several industry sectors, a majority of executives think that artificial intelligence maximizes productivity in the job economy. Likewise, data science can solve real-world problems in natural science, international development, humanities, and many other disciplines that involve very large sets of information. As clinical autism research accumulates over the years, machine learning and computer science in general can change the usual routine of treating neurological abnormalities with medications, which could be replaced with therapeutic interventions.

Results
Table 1. A sample of the 10 highest-ranking SNPs, taken from the top 61 SNPs ranked by distance between ASD patients and unaffected individuals in three-dimensional genotype count space.

SNP Label       Distance (unbalanced)   Distance (balanced)   Genes      Evidence
SNP_A-1508349   1066.8                  111.8                 ZDHHC9     (Kou et al., 2012)
                                                              CXorf9     N/A
SNP_A-1507448   1017.0                  26.0                  ANPEP      (Yang et al., 2016)
SNP_A-1514896   1014.0                  29.4                  FCGR1A     (Glatt et al., 2012)
SNP_A-1510344   1011.0                  13.0                  SYTL5      (Chini et al., 2016)
SNP_A-1509438   1010.0                  2.4                   NRXN1      (Ching et al., 2010); (Kim et al., 2008); (Szatmari et al., 2007)
                                                              ASB3       N/A
SNP_A-1516193   1010.0                  2.8                   PABPC5     (“Gene Set”, n.d.)
                                                              CPXCR1     N/A
SNP_A-1513269   1009.0                  5.7                   N/A        N/A
SNP_A-1511799   1009.0                  4.6                   VGLL4      N/A
SNP_A-1513156   1008.1                  12.9                  FLJ37659   N/A
SNP_A-1510098   1008.1                  21.5                  N/A        N/A

Distances for each SNP tend to be longer when the affected (A) and unaffected (U) populations are unbalanced than when the A/U populations are balanced. There are no overlaps among the SNPs within the unbalanced and balanced cohorts. The longer the distance, the more likely it is that the SNP lies in a gene linked to autism. All genes for each SNP were derived from Gene Expression Omnibus database GPL2641. Evidence of a link to autism was gathered to determine whether or not each gene is an ASD biomarker. Out of the 61 high-ranking SNPs targeted in the unbalanced cohort, 45 are linked to ASD. Other genes within the last 51 high-ranking SNPs not listed in this table include PIK3C3, DLGAP1, RNF180, SPATA5L1, C13orf25, GLI3, C7orf25, CRP, FAM46D, TBX22, FANCL, ICOS, ALS2CR19, NAPE-PLD, SEL1L, FLRT2, RTTN, EIF3S3, TRPS1, ADARB2, SPRY2, DSCAM, EPM2A, UTRN, MAGEE2, CXorf26, RAG2, NGL-1, RAG1, TMOD1, NR5A2, PTPRC, RCHY1, SMARCC2, RNASEH1, FLT1, TPH1, ASCL1, CDH9, KHDRBS2, OPHN1, AR, SH3BGRL, ABCC12, NLGN4X, VCX3A, CA10, and KIF2B.

Figure 3. The top 5 SNP ranking distances in the unbalanced (pink) and balanced (orange) cohorts of affected and unaffected patients are displayed in a three-dimensional Cartesian space. The distances in orange are visibly shorter after the random removal of genotyping data for 1,014 unaffected subjects.

Conclusions
• Discussion: This project set out to discover how machine learning and mathematics can make the genetic makeup of ASD visible and measurable. Hypothetically, those disciplines would help predict novel biomarkers in an ASD patient, since they are useful for analyzing vast amounts of available data. The distance formula model demonstrates that when the genotype counts for ASD patients are much higher than those for unaffected relatives, the distance suggests that one or more genes are likely to be linked to autism. After the genes were located in biological pathway database GPL2641, they were supported with any available literature that emphasizes autism. Weka also contributed to the purpose of this project by measuring genotyping data within large populations to calculate areas under the ROC curve. Interestingly, those areas are generally larger in an unbalanced cohort of patients than in a balanced cohort. Therefore, the area under the ROC curve may depend on how many patients are included in the GSE6754 series matrix, regardless of autism status. In most cases, the Naïve Bayes classifier in Weka leads to larger areas under the ROC curve; this outcome suggests that the choice of machine learning classifier can affect the genetic analysis of a population that partially involves ASD-linked genes. The results in general have validated the hypothesis that machine learning and mathematics are able to reveal each gene that may serve as an ASD biomarker, whether or not it is supported by literature that provides evidence of a link to autism.
• Sources of Error
  • Including “No Call” genotype counts in the mathematical model: When the impact of “No Call” genotype counts on the calculation of SNP ranking distances was observed, the distances with “No Call” tended to be slightly longer than those without in some SNPs. In a majority of SNPs, both distances may be numerically the same, but to reduce the complexity of graphing in multivariable space, only AA, AB, and BB are considered. The removal of “No Call” may also reduce the overall time taken to build Weka models, improve the completeness of the data saved in a CSV file, and make results more accurate and visible.
  • Including patient ID numbers in Weka: More than 6,000 patients are identified with numbers that can go well up to 7,000. If ID numbers are not removed from the attributes list, the areas under the ROC curve would be altered in a certain manner, and the time taken to build a model with the Decision Tree, Random Forest, and Naïve Bayes classifiers might be slightly longer. In addition, statistical summaries and the confusion matrices would either involve very large numbers or indicate errors in the results.
  • Making a single SNP the nominal class: The affected/unaffected classification column in a transposed GSE6754 series matrix is the last attribute, which Weka accepts as the class by default. If a SNP is used as the nominal (main) class for statistical performance, the results would become misleading and be easily confused with the results obtained when the nominal class is the affected/unaffected patient identification. In fact, a SNP attribute only takes the genotype values AA, AB, BB, and “No Call”, which also apply to every other SNP attribute and carry none of the remaining data in the GSE6754 file.
• Future Research Goals: This project will extend to the top 100-300 SNP ranking differences that implicate ASD-linked genes, as well as to high-ranking SNPs in balanced populations of affected and unaffected individuals. In addition, it will establish connections to relevant research conducted in the field of mathematical and computational neuroscience. Ideally, the genetic makeup of ASD would be further analyzed if a full understanding of mirror neuron genetics is achieved and if neural simulators such as Nengo/Neural Engineering Object (a Python-based software package for simulating large-scale neuronal models) and Brian (a Python-based software package for spiking neural networks) are utilized. Quantitative data to be gathered in the near future may offer insights into how the autistic brain works and how it compares to neurotypical minds.
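
The ranking distance discussed above, and the effect of the “No Call” count noted under Sources of Error, can be illustrated with a minimal Python sketch. It assumes the distance is the Euclidean distance between the affected and unaffected genotype count vectors (AA, AB, BB), optionally extended with a fourth “No Call” coordinate. The function names `ranking_distance` and `rank_snps`, the SNP labels, and all counts are hypothetical and are not taken from GSE6754.

```python
import math

def ranking_distance(affected, unaffected, include_no_call=False):
    """Euclidean distance between two genotype count vectors.

    Each argument is a dict with keys "AA", "AB", "BB" and, optionally,
    "NC" for the "No Call" count. This is an assumed form of the model,
    not the authors' spreadsheet implementation.
    """
    keys = ["AA", "AB", "BB"] + (["NC"] if include_no_call else [])
    squared = sum((affected.get(k, 0) - unaffected.get(k, 0)) ** 2 for k in keys)
    return math.sqrt(squared)

def rank_snps(counts_by_snp, include_no_call=False):
    """Return SNP labels sorted from the longest to the shortest distance."""
    return sorted(
        counts_by_snp,
        key=lambda snp: ranking_distance(*counts_by_snp[snp], include_no_call),
        reverse=True,
    )

# Hypothetical (affected, unaffected) counts for two made-up SNP labels.
counts_by_snp = {
    "SNP_X": ({"AA": 900, "AB": 500, "BB": 100, "NC": 12},
              {"AA": 300, "AB": 200, "BB": 50, "NC": 2}),
    "SNP_Y": ({"AA": 400, "AB": 700, "BB": 400, "NC": 0},
              {"AA": 150, "AB": 250, "BB": 150, "NC": 0}),
}

d3 = ranking_distance(*counts_by_snp["SNP_X"])                        # AA, AB, BB only
d4 = ranking_distance(*counts_by_snp["SNP_X"], include_no_call=True)  # with "No Call"
ranking = rank_snps(counts_by_snp)
```

Because the “No Call” term contributes a non-negative square, the four-coordinate distance in this sketch can never be shorter than the three-coordinate one, which is consistent with the observation that distances with “No Call” are equal or slightly longer.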
Methodology
• Mathematical Modeling: The first step in identifying the SNPs with the highest chances of containing one or more ASD biomarkers is to use the GSE6754 series matrix in Excel to classify each individual as affected or unaffected and to count each genotype (AA, AB, and BB) for both the affected and unaffected populations. Since the algebraic distance formula is $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$, the mathematical model for calculating SNP ranking differences is $d = \sqrt{(AA_A - AA_U)^2 + (AB_A - AB_U)^2 + (BB_A - BB_U)^2}$, where the subscripts $A$ and $U$ denote the genotype counts in the affected and unaffected populations. Because these differences are meant to be visualized in a three-dimensional Cartesian space (see Figures 1 and 3), the “No Call” genotype becomes an extraneous variable and is removed so that it will not affect the accuracy of the results.
• Gene Location in Biological Pathways: For each of the 61 top-ranking SNPs discovered in GSE6754, the genes are analyzed in the biological pathways stored in GPL2641. They are then sorted into a table (see Table 1) and validated against scientific literature showing that a certain gene plays a role in autism. Genes without such evidence might become novel ASD biomarkers.
• Machine Learning Analysis: Developed by the University of Waikato in New Zealand, Weka 3.8 is a popular Java-based machine learning tool for data mining and knowledge discovery. It accepts Excel files saved as CSV files and performs basic and advanced statistical calculations, ranging from counting instances and attributes to determining confusion matrices and areas under the receiver operating characteristic (ROC) curve. CSV files serve as training sets for analysis with classifiers such as Neural Networks, Support Vector Machine, Random Forest, Naïve Bayes, and Decision Tree.

Figure 1. Two-dimensional representation of graphing a distance between an affected and an unaffected subject in genotype count space. The AB axis is not included. Not drawn to scale.

Figure 2. Locating the gene NRXN1 within a single biological pathway is an example of gene identification.

[Figure 4, panels A-F: bar charts of Number of High-Ranking SNPs (Top 10, Top 15, Top 20) vs. Area under ROC. A: Unbalanced, 10-Fold CV; B: Unbalanced, N-Fold CV; C: Unbalanced, Percent Split; D: Balanced, 10-Fold CV; E: Balanced, N-Fold CV; F: Balanced, Percent Split. Vertical axis ticks run from 0.56 to 0.62 in panels A-C and from 0.50 to 0.56 in panels D-F.]

Figure 4. The areas under the receiver operating characteristic (ROC) curve helped measure the whole-genome genotyping data with the top 10, 15, and 20 SNPs for thousands of patients. Subfigures A-C represent the data collected when the affected and unaffected populations are unbalanced, whereas subfigures D-F deal with the data when the populations are balanced. The machine learning methods used in Weka 3.8 are 10-fold cross-validation (CV) (A & D), N-fold CV (B & E), and percent split (C & F). The pink bars represent J48 (Decision Tree) classification, the orange bars Random Forest, and the blue bars Naïve Bayes.

Acknowledgements
My biggest thanks belong to Dr. Vaisman for his valuable support and feedback on a project that I have dreamed of for 21 years and that I have done as a first-time intern. I also give credit to the William & Mary (W&M) Cohen Career Center for funding my unpaid summer experience, as I have donated part of my stipend to the W&M For the Bold campaign.

References
Autism prevalence slightly higher in CDC’s ADDM Network | CDC Online Newsroom | CDC. (2018, June 29). Retrieved July 31, 2018, from https://www.cdc.gov/media/releases/2018/p0426-autism-prevalence.html
Ching, M. S. L., et al. (2010). Deletions of NRXN1 (neurexin-1) predispose to a wide spectrum of developmental disorders. American Journal of Medical Genetics. Part B, Neuropsychiatric Genetics: The Official Publication of the International Society of Psychiatric Genetics, 153B(4), 937–947.
Chini, V., Shalaby, K., Al-Sarraj, Y., Taha, R., Alshaban, F., Kambouris, M., & El-Shanti, H. (2016). X-linked Genes with Novel Rare Variants Identified by WGS in ASD Patients are Involved in Neurodevelopment. Qatar Foundation Annual Research Conference Proceedings, 2016(1), HBPP1350.
Frank, E., Hall, M. A., & Witten, I. H. (2016). The WEKA workbench. Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques.” Retrieved August 2, 2018, from https://www.cs.waikato.ac.nz/ml/weka/citing.html
Gene Set - autism spectrum disorder. (n.d.). Retrieved July 29, 2018, from http://amp.pharm.mssm.edu/Harmonizome/gene_set/autism+spectrum+disorder/DISEASES+Text-mining+Gene-Disease+Assocation+Evidence+Scores
Glatt, S. J., Tsuang, M. T., Winn, M., Chandler, S. D., Collins, M., Lopez, L., … Courchesne, E. (2012). Blood-Based Gene Expression Signatures of Autistic Infants and Toddlers. Journal of the American Academy of Child and Adolescent Psychiatry, 51(9), 934-44.e2.
Kim, H.-G., Kishikawa, S., Higgins, A. W., Seong, I.-S., Donovan, D. J., Shen, Y., … Gusella, J. F. (2008). Disruption of neurexin 1 associated with autism spectrum disorder. American Journal of Human Genetics, 82(1), 199–207.
Kou, Y., Betancur, C., Xu, H., Buxbaum, J. D., & Ma’ayan, A. (2012). Network- and Attribute-Based Classifiers Can Prioritize Genes and Pathways for Autism Spectrum Disorders and for Intellectual Disability. American Journal of Medical Genetics. Part C, Seminars in Medical Genetics, 160C(2), 130–142.
Szatmari, P., et al. (2007). Mapping autism risk loci using genetic linkage and chromosomal rearrangements. Nature Genetics, 39(3), 319–328.
Yang, L., Rudser, K., Golnik, A., Wey, A., Higgins, L. A., & Gourley, G. R. (2016). Urine Protein Biomarker Candidates for Autism. Journal of Proteomics & Bioinformatics.
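
As a supplementary illustration of what the area under the ROC curve reported in the Machine Learning Analysis and in Figure 4 measures, the following Python sketch computes it from classifier scores using the rank-based (Mann-Whitney) formulation. This is not Weka's implementation; the function name `roc_auc`, the labels, and the scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen affected subject (label 1)
    receives a higher score than a randomly chosen unaffected subject
    (label 0); tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one subject of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = affected, 0 = unaffected) and classifier scores.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)  # 5 of the 6 affected/unaffected pairs are ordered correctly
```

An area of 0.5 corresponds to a classifier that ranks affected and unaffected subjects no better than chance, which gives a reference point for the values between roughly 0.5 and 0.62 shown on the Figure 4 axes.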
