A Bayesian Network-Based Genetic Predictor for Alcohol Dependence

Or Sagy1, *; Gil Alterovitz, PhD1, 2, 3, 4


for Biomedical Informatics, Harvard Medical School, Boston, MA; 2Children's Hospital Informatics Program at Harvard-MIT Division of Health Science, Boston, MA; 3Partners Healthcare Center for Personalized Genetic Medicine, Boston, MA; 4Department of Electrical Engineering and Computer Science at MIT, Cambridge, MA * Corresponding Author:

Alcohol dependence (AD) is a severe condition that is difficult to cure and can lead to mortality. There are indications that this disease has a genetic basis. In this study we sought to build a model that predicts susceptibility to alcohol dependence from genomic data. We assessed 3,776 individuals to construct a model that uses genetic factors to predict alcohol dependence. We used a novel method based on finding pairs of SNPs whose edit distances across patients differ markedly between alcohol-dependent and non-alcohol-dependent patients. A Bayesian network was then built from the selected SNPs. Our Bayesian network-based framework provides a significant predictive model of alcohol dependence. The network, which includes only 139 features (all of them SNPs), has a predictive power of 72.5%, as measured by the area under the receiver operating characteristic curve (AUROC), a notable value compared to previous work in the field.


A Bayesian network is a directed acyclic graph that compactly represents the joint probability distribution of a set of variables. Bayesian networks have previously proven effective on this front: they have successfully described the complex interactions underpinning polygenic traits such as nicotine dependence and early stroke in the context of sickle cell anemia. In a Bayesian network, nodes represent variables and edges represent probabilistic dependencies between variables. Since Bayesian networks represent joint distributions, they can be used to predict the probability of observing a specific state of a target variable (in our case, a phenotype) given the states of all other variables (in our case, SNPs and demographic variables), and have consequently been used as classifiers. In this work, we employ the K2 algorithm, as implemented in WEKA, to infer the Bayesian network structure.
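To make the structure-learning step more concrete, the following is a minimal sketch of a textbook form of K2 (greedy parent selection under a fixed node ordering, scored with the Cooper-Herskovits metric). It is not WEKA's implementation, and the function names, the `max_parents` limit and the data layout (a list of dictionaries mapping variable names to integer states) are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import lgamma


def k2_log_score(data, node, parents, arity):
    """Log Cooper-Herskovits (K2) score of `node` given a candidate parent list.

    data  : list of dicts mapping variable name -> integer state
    arity : dict mapping variable name -> number of states
    """
    r = arity[node]
    # counts[parent_configuration][node_state] = number of matching rows
    counts = defaultdict(Counter)
    for row in data:
        counts[tuple(row[p] for p in parents)][row[node]] += 1
    score = 0.0
    for state_counts in counts.values():
        n_ij = sum(state_counts.values())
        # log[(r - 1)! / (N_ij + r - 1)!]
        score += lgamma(r) - lgamma(n_ij + r)
        # log of the product over node states of N_ijk!
        score += sum(lgamma(n + 1) for n in state_counts.values())
    return score


def k2_parents(data, order, arity, max_parents=2):
    """Greedily pick parents for each node from the nodes preceding it in `order`."""
    parents = {v: [] for v in order}
    for i, node in enumerate(order):
        best = k2_log_score(data, node, parents[node], arity)
        candidates = set(order[:i])
        while len(parents[node]) < max_parents and candidates:
            scored = [
                (k2_log_score(data, node, parents[node] + [c], arity), c)
                for c in sorted(candidates)
            ]
            top_score, top = max(scored)
            if top_score <= best:
                break  # no candidate improves the score; stop adding parents
            parents[node].append(top)
            candidates.remove(top)
            best = top_score
    return parents
```

Because only earlier nodes in the ordering may become parents, the resulting graph is acyclic by construction; the quality of the learned structure therefore depends on the ordering supplied.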


We generated a Bayesian network, using only 139 SNPs as features, which achieved an AUROC of 72.5% on our test data. This is a substantial result compared to previous studies attempting to predict alcohol dependence from SNPs; for example, a recent study by Yan et al. (2013) reported AUROCs of at most 56.5% for prediction using multiple SNPs selected through association analyses.

Use of genetic information is becoming feasible at a large scale, as the cost of genotyping an individual has been falling quickly. Our genetic classifier is considered a fair predictor in terms of AUROC, especially given the involvement of non-genetic factors in alcohol dependence. The model's predictive power supports the frequent assertion that alcohol dependence has a genetic component. It is also interesting to note that over 80% of the SNPs in the final model are on the X chromosome. This may be attributed to the fact that men had a minor allele copy number of either 0 or 2 for each of these SNPs (as they have a single copy of the X chromosome), but it may also hint at a biological phenomenon.
As would be expected, different divisions of the data into training and test sets yield different AUROCs, some of which could be lower. Further research is needed; we believe this novel method, which showed promising results and could be refined further, may prove useful for predicting alcohol dependence and various other conditions.

Alcohol dependence (AD) is very difficult to overcome once it begins, and thus there is much interest in preventing its onset altogether. Many GWA studies have struggled to pinpoint individual SNPs that explain a good portion of the variation in the phenotype. Rather than association, as in typical GWA analyses, what is needed is high predictive power, which we pursue here using a Bayesian network. A Bayesian network is a directed acyclic graph that compactly represents the joint probability distribution of a set of variables. Bayesian networks can be used to predict the probability of observing a specific state of a target variable (in our case, a phenotype) given the states of all other variables (in our case, SNPs), and have consequently been used as classifiers.
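As an illustration of how such a network can act as a classifier, the sketch below computes the posterior probability of the phenotype from the factorised joint distribution. The two-SNP structure (phenotype as the only parent of each SNP), the node names and every probability value are made up for the example and are not taken from the study.

```python
from math import prod

# Hypothetical network: AD -> snp_a and AD -> snp_b.
# All numbers below are illustrative, not estimates from the data.
P_AD = {0: 0.5, 1: 0.5}

# CPT[snp][ad_state][copies] = P(minor-allele copies | AD state)
CPT = {
    "snp_a": {0: {0: 0.6, 1: 0.3, 2: 0.1}, 1: {0: 0.4, 1: 0.4, 2: 0.2}},
    "snp_b": {0: {0: 0.7, 1: 0.2, 2: 0.1}, 1: {0: 0.5, 1: 0.3, 2: 0.2}},
}


def joint(ad, genotype):
    """P(AD = ad, SNPs = genotype) as the product of each node's conditional probability."""
    return P_AD[ad] * prod(CPT[snp][ad][copies] for snp, copies in genotype.items())


def posterior_ad(genotype):
    """P(AD = 1 | genotype), obtained by normalising the joint over both AD states."""
    p1 = joint(1, genotype)
    p0 = joint(0, genotype)
    return p1 / (p0 + p1)
```

For a patient carrying two minor-allele copies at both hypothetical SNPs, `posterior_ad({"snp_a": 2, "snp_b": 2})` gives 0.8 under these made-up tables; sweeping a decision threshold over such posteriors is what produces a ROC curve.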

The data were divided into a training set (encompassing 90% of the patients) and a test set (the remaining 10%). The SNPs in each set were divided by chromosome, with subsequent halves of chromosomes combined, so that 24 files encompassing all of the SNPs were obtained. (This division was necessary for reasons of computational complexity; optimally, the algorithm would be run on all the SNPs together.)
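A 90/10 division of this kind can be sketched as follows. The text does not say how patients were assigned to each set, so the seeded random shuffle, the function name and its parameters are assumptions made for illustration.

```python
import random


def split_patients(patient_ids, test_fraction=0.10, seed=0):
    """Shuffle patient identifiers and hold out `test_fraction` of them for testing.

    Returns (train_ids, test_ids). The seed is exposed because different
    divisions of the data can lead to different test results.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]
```

With 3,776 patients this holds out 378 for testing and keeps 3,398 for training; changing `seed` gives a different division.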

Developing an effective model for predicting alcohol dependence based on genetic data, namely SNPs

For each of these groups of SNPs, in the training set:

For each pair of SNPs within the group, the percentage of patients for which the two SNPs had the same number of copies of the minor allele (the edit distance between the two SNPs) was calculated. This was done separately for alcohol-dependent and non-alcohol-dependent patients.
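The pairwise quantity described above can be sketched as follows. The data layout (a dictionary mapping each SNP identifier to a list of minor-allele copy numbers, one per patient), the SNP names and the absence of missing genotypes are assumptions made for illustration.

```python
from itertools import combinations


def agreement_by_pair(genotypes, patient_idx):
    """Fraction of the given patients whose copy numbers match, for every SNP pair.

    genotypes   : dict mapping SNP id -> list of copy numbers (0, 1 or 2), one per patient
    patient_idx : indices of the patients to include (e.g. only the cases)
    """
    result = {}
    for a, b in combinations(sorted(genotypes), 2):
        same = sum(1 for i in patient_idx if genotypes[a][i] == genotypes[b][i])
        result[(a, b)] = same / len(patient_idx)
    return result


# Toy data: four patients, three hypothetical SNPs.
genotypes = {
    "snp_1": [0, 1, 2, 0],
    "snp_2": [0, 1, 0, 2],
    "snp_3": [2, 1, 2, 0],
}
is_case = [True, True, False, False]
case_idx = [i for i, flag in enumerate(is_case) if flag]
control_idx = [i for i, flag in enumerate(is_case) if not flag]

case_agreement = agreement_by_pair(genotypes, case_idx)
control_agreement = agreement_by_pair(genotypes, control_idx)
```

The number of pairs grows quadratically with the number of SNPs in a group, which is consistent with the need to process the SNPs in separate files.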

We utilized data from the Study of Addiction: Genetics and Environment (SAGE), which featured 3,829 subjects and considered 948,658 SNPs from across the human genome, as well as several demographic variables. The data included human samples from three prior studies; 30% of the individuals were African Americans and 70% were European Americans. The SAGE dataset includes 1,897 cases of alcohol dependence as defined by the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) and 1,932 alcohol-exposed non-dependents. SNPs out of Hardy-Weinberg equilibrium (P < 0.0001) had been removed; the Hardy-Weinberg equilibrium tests were run separately on the African Americans and the European Americans in order to identify any SNPs that were out of equilibrium in only one of the two groups. SNPs with minor allele frequency (MAF) below 0.01 or call rate below 98% were also removed from consideration, leaving a total of 934,128 SNPs. Finally, the 3,776 samples with a genotyping rate above 98% were retained.
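The MAF and call-rate filters described above can be sketched as follows. The sketch assumes diploid genotypes coded as 0, 1 or 2 copies of one allele with `None` for a missing call; the function names are hypothetical, and the Hardy-Weinberg test is not shown.

```python
def call_rate(calls):
    """Fraction of samples with a non-missing genotype for this SNP."""
    return sum(1 for c in calls if c is not None) / len(calls)


def minor_allele_frequency(calls):
    """Frequency of the rarer allele among the observed (non-missing) genotypes."""
    observed = [c for c in calls if c is not None]
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)


def passes_qc(calls, min_maf=0.01, min_call_rate=0.98):
    """Keep a SNP only if its call rate and MAF both reach the stated thresholds."""
    if all(c is None for c in calls):
        return False  # nothing observed, so no frequency can be estimated
    return call_rate(calls) >= min_call_rate and minor_allele_frequency(calls) >= min_maf
```

The same call-rate idea applies per sample rather than per SNP when deciding which individuals to retain.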

For each pair of SNPs, the ratio between their edit distance in alcohol-dependent patients and their edit distance in non-alcohol-dependent patients was calculated.

Pairs of SNPs for which the ratio was above or below set thresholds (several thresholds were tested) were selected.
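The ratio and threshold steps above can be sketched as follows. The threshold values, the function names and the handling of a zero denominator are assumptions made for illustration; the text does not state which thresholds were tested or how undefined ratios were treated.

```python
import math


def agreement_ratio(case_value, control_value):
    """Ratio of the case value to the control value for one SNP pair.

    Returns None when both values are zero (ratio undefined) and infinity
    when only the control value is zero. Both choices are assumptions.
    """
    if control_value == 0:
        return math.inf if case_value > 0 else None
    return case_value / control_value


def select_snps(case_agreement, control_agreement, low, high):
    """Collect the SNPs from every pair whose ratio falls below `low` or above `high`."""
    selected = set()
    for pair, case_value in case_agreement.items():
        ratio = agreement_ratio(case_value, control_agreement[pair])
        if ratio is not None and (ratio < low or ratio > high):
            selected.update(pair)
    return selected


# Toy per-pair values for four hypothetical SNPs.
toy_case = {("snp_1", "snp_2"): 0.9, ("snp_2", "snp_3"): 0.4, ("snp_3", "snp_4"): 0.6}
toy_control = {("snp_1", "snp_2"): 0.3, ("snp_2", "snp_3"): 0.5, ("snp_3", "snp_4"): 0.6}
chosen = select_snps(toy_case, toy_control, low=0.5, high=2.0)
```

Here only the first pair has a ratio outside the hypothetical band from 0.5 to 2.0, so `chosen` contains `snp_1` and `snp_2`. Tightening or loosening the band changes how many SNPs reach the network-building step.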

We would like to thank James Thomas for the previous attempts from which this work stems; John Rickert, Dan Karliner and Tom Kalvari for suggesting ideas; and Jonah Kallenbach, Alex Huang, Johnny Ho and Skanda Koppula for assisting with various aspects of the project. We would also like to acknowledge Kent Huynh for his contribution in implementing the data preparation pipeline and Aaron Merlob for critically reviewing a previous version of this work. This work was supported by grants from the NIDA (R21DA025168-02; G. Alterovitz), the NHGRI (R01HG004836-01; G. Alterovitz) and the NLM (R00LM009826-03; G. Alterovitz).

The values of the selected SNPs for all of the patients in the training set were extracted.

A Bayesian network based on the selected SNPs' values in the training set was built using the WEKA software, employing the K2 algorithm. This Bayesian network would serve as a predictor for alcohol dependence based on these SNPs.

The Bayesian network's accuracy was tested on the test set.
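Since the results are reported as AUROC, the evaluation step can be sketched as follows. This is a minimal pairwise implementation (equivalent to the Mann-Whitney statistic) written for illustration; it is not the routine used in the study, and the function name is hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0),
    counting ties as one half.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The function returns a fraction between 0 and 1, so the reported 72.5% corresponds to a value of 0.725, and 0.5 corresponds to chance-level ranking.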