Random Forest: Prediction of Genetic Susceptibility To Complex Diseases
A catalogue of all human SNPs is hoped to allow a genome-wide search for SNPs associated with
genetic diseases. Success stories have been reported for diseases caused by a single SNP or gene.
Some complex diseases, however, such as psychiatric disorders, are characterized by a
non-Mendelian, multi-factorial genetic contribution with a number of susceptibility genes interacting
with each other (Botstein and Risch, 2003). In general, it may be impossible to associate a single
SNP or gene with such a disease, because the disease may be caused by a mix of modifications of
alternative pathways. Furthermore, there are no reliable tools applicable to a large genomic range
that can rule out or confirm association with a disease. Although a susceptibility prediction may not
directly help to find specific disease-associated SNPs, it may be critical for disease prevention.
Indeed, knowing that an individual is (or is not) susceptible to a certain disease can greatly reduce
the cost of screening and preventive measures, or even help to avoid development of the disease
entirely, e.g., by changing diet.
We used the Random Forest algorithm to assess accumulated information, aiming to predict genotype
susceptibility to complex diseases with high accuracy and statistical power. The next
section describes this prediction method. In the last section we analyze the results. The highest
prediction accuracy achieved by Random Forest is 66.14% for Daly’s data and 65.05% for Johnson’s
data.
Problem Formulation
• Input Data
  • Training genotype set gi = (gi,j), i = 0,…,n-1; j = 1,…,m; gi,j ∈ {0,1,2}.
  • Disease status s(gi) ∈ {1, 2}, indicating whether gi, i = 0,…,n-1, is a case (1) or a control (2).
• Output Data
  • Disease status s(gt) of the test genotype gt.
An input or training set for a disease susceptibility prediction method is given as a set of genotypes
gi’s each having a disease status s(gi) (1 for case and 2 for control). For the test genotype gt, the
method should predict the disease status s(gt).
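The input format above can be sketched in Python as a list of genotype rows plus a parallel list of status labels. This is only an illustration: the function name validate_training_set, the constants CASE and CONTROL, and the toy genotypes are hypothetical and are not taken from the data sets used in this work.

```python
from typing import Sequence

CASE, CONTROL = 1, 2  # status coding from the problem formulation


def validate_training_set(genotypes: Sequence[Sequence[int]],
                          status: Sequence[int]) -> int:
    """Check the training-set format and return the number of SNPs m."""
    if not genotypes:
        raise ValueError("training set is empty")
    if len(genotypes) != len(status):
        raise ValueError("one disease status is required per genotype")
    m = len(genotypes[0])
    for g, s in zip(genotypes, status):
        if len(g) != m:
            raise ValueError("all genotypes must have the same number of SNPs")
        if any(x not in (0, 1, 2) for x in g):
            raise ValueError("genotype entries must be 0, 1 or 2")
        if s not in (CASE, CONTROL):
            raise ValueError("disease status must be 1 (case) or 2 (control)")
    return m


# Toy training set (made-up values): n = 3 genotypes, m = 4 SNPs.
train_genotypes = [[0, 1, 2, 0], [2, 2, 0, 1], [1, 0, 0, 0]]
train_status = [CASE, CONTROL, CASE]
num_snps = validate_training_set(train_genotypes, train_status)
```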
Random Forest
1. Overview
A random forest is a collection of CART-like trees following specific rules for tree growing, tree
combination, self-testing, and post-processing. Although many implementations of the
Random Forest classification algorithm are available, we decided to use Leo Breiman and Adele
Cutler’s original implementation, RF version 5.1.
We assume that the reader is familiar with the construction of single classification trees. Random
Forests grows many classification trees. To classify a test sample, its input vector is run
down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for
that class. The forest chooses the classification having the most votes (over all the trees in the forest).
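The voting step can be sketched as follows. This is a minimal illustration, not the code of RF version 5.1: each tree is modeled as a plain Python callable, and the three toy "trees" are hypothetical one-rule classifiers invented for the example.

```python
from collections import Counter
from typing import Callable, Sequence

# A "tree" is modeled as any function mapping a genotype to a status (1 or 2).
Tree = Callable[[Sequence[int]], int]


def forest_predict(trees: Sequence[Tree], genotype: Sequence[int]) -> int:
    """Return the class that receives the most votes over all trees."""
    votes = Counter(tree(genotype) for tree in trees)
    # most_common(1) gives the top class; a tie goes to the class voted first.
    return votes.most_common(1)[0][0]


# Hypothetical toy trees, each looking at a single SNP.
toy_trees = [
    lambda g: 1 if g[0] == 2 else 2,
    lambda g: 1 if g[1] >= 1 else 2,
    lambda g: 1 if g[2] == 0 else 2,
]
```

With these toy trees, the genotype [2, 0, 1] receives one vote for class 1 and two votes for class 2, so the forest predicts class 2.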
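At each node, m input variables (here, SNPs) are selected at random and the best split on these m
is used to split the node [1]; this m is distinct from the total number of SNPs in the problem
formulation. The forest error rate depends on the correlation between any two trees (higher
correlation increases the error) and on the strength of each individual tree (higher strength
decreases the error). Reducing m reduces both the correlation and the strength; increasing it
increases both. Somewhere in between is an "optimal" range of m, usually quite wide. This is the
only adjustable parameter to which random forests is somewhat sensitive.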
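The role of m can be illustrated with a small sketch, assuming (as in Breiman and Cutler's description [1]) that m is the number of variables drawn at random as split candidates at each node. The function name candidate_snps and the parameter name m_try are hypothetical; the sketch shows only the sampling step, not split selection.

```python
import random
from typing import List


def candidate_snps(num_snps: int, m_try: int, rng: random.Random) -> List[int]:
    """Draw m_try distinct SNP indices to consider for one node split."""
    if not 1 <= m_try <= num_snps:
        raise ValueError("m_try must be between 1 and the number of SNPs")
    # Sampling without replacement: each node sees a fresh random subset.
    return rng.sample(range(num_snps), m_try)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
node_candidates = candidate_snps(num_snps=10, m_try=3, rng=rng)
```

A small m_try makes the subsets seen by different trees overlap less, which lowers the correlation between trees but also gives each split fewer variables to choose from.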
Data Set
After each tree is built, all of the data are run down the tree, and proximities are computed for each
pair of cases. If two cases occupy the same terminal node, their proximity is increased by one. At the
end of the run, the proximities are normalized by dividing by the number of trees. Proximities are
used in replacing missing data, locating outliers, and producing illuminating low-dimensional views
of the data.
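The proximity computation described above can be sketched as follows, assuming the terminal node reached by every case in every tree is already known. The input layout leaf_ids[t][i] (terminal node of case i in tree t) and the function name are assumptions made for this illustration.

```python
from typing import List, Sequence


def proximity_matrix(leaf_ids: Sequence[Sequence[int]]) -> List[List[float]]:
    """Proximity of each pair of cases, normalized by the number of trees."""
    num_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for leaves in leaf_ids:  # one pass per tree
        for i in range(n):
            for j in range(n):
                # Same terminal node: proximity is increased by one.
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0
    # Normalize by dividing by the number of trees.
    return [[value / num_trees for value in row] for row in prox]


# Toy example: 2 trees, 3 cases (made-up terminal node labels).
toy_prox = proximity_matrix([[0, 0, 1], [3, 4, 4]])
```

In the toy example, cases 0 and 1 share a terminal node in one of the two trees, so their proximity is 0.5, while cases 0 and 2 never share one and have proximity 0.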
                     True case      True control
Predicted case       A              B              PPR = A/(A+B)
Predicted control    C              D              NPR = D/(C+D)
                     Sensitivity    Specificity    PR = (A+D)/(A+B+C+D)
                     A/(A+C)        D/(B+D)
Risk Rate (RR) = how many times more likely a predicted case is to be a true case than a
predicted control is.
RR = PPR/(1-NPR) = A*(C+D) / (C*(A+B))
RR does not depend on case/control skew.
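The quality measures can be computed from the four counts as in the sketch below. It assumes A and B are the true cases and true controls predicted as case, and C and D the true cases and true controls predicted as control; PPR = A/(A+B) is inferred from the RR formula. The function name and the example counts are hypothetical and are not results from this study.

```python
from typing import Dict


def quality_measures(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Quality measures from the 2x2 table; assumes no denominator is zero."""
    ppr = a / (a + b)  # fraction of predicted cases that are true cases
    npr = d / (c + d)  # fraction of predicted controls that are true controls
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "PR": (a + d) / (a + b + c + d),
        "PPR": ppr,
        "NPR": npr,
        "RR": ppr / (1 - npr),
    }


# Made-up counts for illustration only.
example = quality_measures(a=40, b=20, c=10, d=30)
```

For these counts the two forms of RR agree: PPR/(1-NPR) = (40/60)/(10/40) and A*(C+D)/(C*(A+B)) = 1600/600 both give about 2.67.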
Test Results
References
[1] “Random Forests”, Leo Breiman and Adele Cutler.
web: http://www.stat.berkeley.edu/users/breiman/RandomForests/
[2] “GeneSuscept: Detection of Genotype Susceptibility in Case/Control Studies”, Weidong Mao,
Nisar Hundewale, Stefan Gremalschi, Alexander Zelikovsky. Department of Computer
Science, Georgia State University, Atlanta, GA 30303
[3] “A Brief Overview to RandomForests™”, Dan Steinberg, Mikhail Golovnya, N. Scott Cardell,
Salford Systems.
web: http://www.salford-systems.com/