
Project Report

Random Forest
Prediction of Genetic Susceptibility to Complex Diseases

Name: Stefan Gremalschi


Course: Algorithms CSc4520/6520
Advisor: Dr. Alex Zelikovsky
Introduction

Recent improvements in the accessibility of high-throughput genotyping have brought a great deal of
attention to disease association and susceptibility studies. High-density maps of single nucleotide
polymorphisms (SNPs), as well as massive genotype data with large numbers of individuals and
SNPs, have become publicly available (Daly et al., 2001).

A catalogue of all human SNPs is hoped to allow a genome-wide search for SNPs associated with
genetic diseases. Success stories have been reported for diseases caused by a single SNP or gene.
But some complex diseases, such as psychiatric disorders, are characterized by a non-Mendelian,
multi-factorial genetic contribution, with a number of susceptibility genes interacting with
each other (Botstein and Risch, 2003). In general, it may be impossible to associate a single SNP
or gene with such a disease, because the disease may be caused by mixed modifications of alternative
pathways. Furthermore, there are no reliable tools, applicable to a given large genomic range, that
could rule out or confirm association with a disease. Even though answers to the above questions may
not explicitly help to find specific disease-associated SNPs, they may be critical for disease
prevention. Indeed, knowing that an individual is (or is not) susceptible to a certain disease would
allow greatly reducing the cost of screening and preventive measures, or even help to avoid disease
development entirely, e.g., by changing a diet.

We used the Random Forest algorithm to assess accumulated information with the aim of predicting
genotype susceptibility to complex diseases with significantly high accuracy and statistical power.
The next section describes this prediction method; the last section analyzes the results. The highest
prediction accuracy achieved by Random Forest is 66.14% for Daly’s data and 65.05% for Johnson’s
data, respectively.
Problem Formulation

Based on accumulated information, predict genotype susceptibility to complex diseases with
significantly high accuracy and statistical power.

• Input Data
• Training genotype set gi = (gi,j), i = 0,…,n-1; j = 1,…,m, with gi,j ∈ {0,1,2}.
• Disease status s(gi) ∈ {1, 2}, indicating whether genotype gi, i = 0,…,n-1, is a case (1) or a control (2).
• Output Data
• Disease status s(gt) of the test genotype gt.

The input, or training set, for a disease susceptibility prediction method is given as a set of
genotypes gi, each having a disease status s(gi) (1 for case and 2 for control). For the test
genotype gt, the method should predict the disease status s(gt).
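
To make this format concrete, the following minimal sketch (in Python; the library choice and the random toy data are our illustration, not the report’s actual Daly or Johnson data sets) builds a training set of exactly this shape:

    import numpy as np

    n, m = 100, 50                               # n training genotypes, m SNPs
    rng = np.random.default_rng(0)
    genotypes = rng.integers(0, 3, size=(n, m))  # g[i, j] in {0, 1, 2}
    status = rng.integers(1, 3, size=n)          # s(g_i) in {1, 2}: 1 = case, 2 = control
    g_test = rng.integers(0, 3, size=m)          # test genotype g_t; the task is to predict s(g_t)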

Random Forest

1. Overview
A random forest is a collection of CART-like trees following specific rules for tree growing, tree
combination, self-testing, and post-processing. Although many implementations of the Random
Forest classification algorithm are available, we decided to use Leo Breiman and Adele Cutler’s
original implementation of RF, version 5.1.

We assume that the user knows about the construction of single classification trees. Random Forests
grows many classification trees. To classify a test sample from an input vector, put the input vector
down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for
that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is grown as follows (a code sketch appears after the list):
• If the number of cases in the training set is N, sample N cases at random - but with
replacement, from the original data. This sample will be the training set for growing the tree.
• If there are M input variables, a number m<<M is specified such that at each node, m
variables are selected at random out of the M and the best split on these m is used to split the
node. The value of m is held constant during the forest growing.
• Each tree is grown to the largest extent possible. There is no pruning.
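
The following sketch mirrors these rules, assuming scikit-learn’s DecisionTreeClassifier as the CART-like base learner (an assumption for illustration; the report itself used Breiman and Cutler’s original code). It reuses the toy genotypes and status arrays from the earlier sketch.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def grow_tree(X, y, m, rng):
        """Grow one unpruned CART-like tree on a bootstrap sample."""
        N = X.shape[0]
        boot = rng.integers(0, N, size=N)              # N cases sampled with replacement
        tree = DecisionTreeClassifier(max_features=m)  # m random variables tried at each node
        tree.fit(X[boot], y[boot])                     # grown to full depth, no pruning
        return tree, boot

    rng = np.random.default_rng(0)
    tree, boot = grow_tree(genotypes, status, m=7, rng=rng)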

The forest error rate depends on two things:
• The correlation between any two trees in the forest. Increasing the correlation increases the
forest error rate.
• The strength of each individual tree in the forest. A tree with a low error rate is a strong
classifier. Increasing the strength of the individual trees decreases the forest error rate.

Reducing m reduces both the correlation and the strength. Increasing it increases both. Somewhere
in between is an "optimal" range of m - usually quite wide. This is the only adjustable parameter to
which random forests is somewhat sensitive.
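
To illustrate the search for this range, the sketch below sweeps m and scores each forest by its out-of-bag error, using scikit-learn’s RandomForestClassifier (a tooling assumption; the toy genotypes and status arrays come from the first sketch):

    from sklearn.ensemble import RandomForestClassifier

    for m in (1, 2, 5, 10, 20):
        rf = RandomForestClassifier(n_estimators=500, max_features=m,
                                    oob_score=True, random_state=0)
        rf.fit(genotypes, status)
        print(f"m = {m:2d}   oob error = {1.0 - rf.oob_score_:.3f}")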

[Figure: Random Forest workflow. The data set is split into 500 bootstrapped samples, one per tree.
Each node is split choosing only from a random subset of variables (mtry = 10), and trees are not
pruned. To classify a new observation, the majority vote of the forest is used.]
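
Tying the figure to the earlier sketches, a hand-rolled version of this workflow grows 500 bootstrapped, unpruned trees with grow_tree at mtry = 10 and classifies a new observation by majority vote (an illustrative reconstruction, not the report’s code):

    import numpy as np

    def forest_vote(trees, x):
        """Majority vote of the forest on a single input vector x."""
        votes = [int(t.predict(x.reshape(1, -1))[0]) for t in trees]
        return max(set(votes), key=votes.count)

    rng = np.random.default_rng(0)
    trees = [grow_tree(genotypes, status, m=10, rng=rng)[0]  # 500 bootstrapped samples
             for _ in range(500)]
    print(forest_vote(trees, g_test))                        # 1 (case) or 2 (control)
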
2. How Random Forests Work
When the training set for the current tree is drawn by sampling with replacement, about one-third of
the cases are left out of the sample. This oob (out of bag) data is used to get a running unbiased
estimate of the classification error as trees are added to the forest. It is also used to get estimates of
variable importance.

After each tree is built, all of the data are run down the tree, and proximities are computed for each
pair of cases. If two cases occupy the same terminal node, their proximity is increased by one. At the
end of the run, the proximities are normalized by dividing by the number of trees. Proximities are
used in replacing missing data, locating outliers, and producing illuminating low-dimensional views
of the data.
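
A sketch of this proximity computation, assuming a fitted scikit-learn forest whose apply() method returns the terminal-node index of every case in every tree:

    import numpy as np

    def proximities(forest, X):
        """Fraction of trees in which each pair of cases shares a terminal node."""
        leaves = forest.apply(X)                 # shape (n_cases, n_trees)
        n_cases, n_trees = leaves.shape
        prox = np.zeros((n_cases, n_cases))
        for t in range(n_trees):
            same = leaves[:, t][:, None] == leaves[None, :, t]
            prox += same                         # +1 for each pair in the same terminal node
        return prox / n_trees                    # normalize by the number of trees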

3. The out-of-bag (oob) error estimate

In random forests, there is no need for cross-validation or a separate test set to get an unbiased
estimate of the test set error. It is estimated internally, during the run, as follows (a code
sketch appears after the list):
• Each tree is constructed using a different bootstrap sample from the original data. About one-
third of the cases are left out of the bootstrap sample and not used in the construction of the
kth tree.
• Put each case left out in the construction of the kth tree down the kth tree to get a
classification. In this way, a test set classification is obtained for each case in about one-third
of the trees. At the end of the run, take j to be the class that got most of the votes every time
case n was oob. The proportion of times that j is not equal to the true class of n averaged
over all cases is the oob error estimate. This has proven to be unbiased in many tests.
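
A hand-rolled sketch of this estimate, built on the grow_tree helper from the earlier sketch (the bootstrap bookkeeping is our reconstruction; Breiman’s implementation does it internally):

    import numpy as np

    def oob_error(X, y, m, n_trees, seed=0):
        """Vote on each case only with the trees whose bootstrap sample left it out."""
        rng = np.random.default_rng(seed)
        N = X.shape[0]
        votes = np.zeros((N, 3))                    # columns 1 and 2 count votes for the two classes
        for _ in range(n_trees):
            tree, boot = grow_tree(X, y, m, rng)
            oob = np.setdiff1d(np.arange(N), boot)  # cases left out of this bootstrap sample
            pred = tree.predict(X[oob]).astype(int)
            votes[oob, pred] += 1
        seen = votes.sum(axis=1) > 0                # cases that were oob at least once
        voted = votes.argmax(axis=1)                # majority oob vote per case
        return np.mean(voted[seen] != y[seen])      # proportion of wrong majority votes
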
Confusion Matrix

                         Golden Standard (“Real”)
                         Case          Control         Prediction Rates
Predicted    Case        A             B               PPR = A/(A+B)
             Control     C             D               NPR = D/(C+D)
                         Sensitivity   Specificity     PR = (A+D)/(A+B+C+D)
                         = A/(A+C)     = D/(B+D)

The Risk Rate measures how many times more likely a predicted case is to be an actual case than a
predicted control is:
Risk Rate RR = PPR/(1-NPR) = A*(C+D) / (C*(A+B))
RR does not depend on the case/control skew.
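
These rates transcribe directly into code (variable names follow the table; the counts A, B, C, D would come from an actual prediction run):

    def confusion_rates(A, B, C, D):
        """Prediction rates from the confusion-matrix counts A, B, C, D."""
        sensitivity = A / (A + C)
        specificity = D / (B + D)
        PPR = A / (A + B)               # positive prediction rate
        NPR = D / (C + D)               # negative prediction rate
        PR = (A + D) / (A + B + C + D)  # overall prediction rate
        RR = PPR / (1 - NPR)            # risk rate = A*(C+D) / (C*(A+B))
        return sensitivity, specificity, PPR, NPR, PR, RR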

Test Results

                  Data Set I                          Data Set II
Population        Closest Neighbor   Random Forest   Closest Neighbor   Random Forest
Sensitivity       45.52              34.02           37.62              17.96
Specificity       63.29              85.18           64.46              92.79
PPR               45.52              57.64           67.19              59.48
NPR               63.29              68.54           34.82              65.76
PR                54.52              66.14           46.23              65.05
RR                 1.24               1.83            1.03               1.73

Daly et al. (Data Set I)

Maximum Risk Rate for major allele SNP: 1.41
Maximum Risk Rate for minor allele SNP: 2.69

Johnson et al. (Data Set II)

Maximum Risk Rate for major allele SNP: 2.05
Maximum Risk Rate for minor allele SNP: 2.26
Conclusions

• Complex diseases are associated with haplotypes
• Single-marker (SNP) statistical methods are not applicable to complex diseases
• We introduce universal methods for classification and disease discrimination
• Leave-one-out and randomization tests are used to validate the proposed algorithms
• Random Forest seems to have a bias towards healthy samples

References
[1] Leo Breiman and Adele Cutler, “Random Forests”.
    http://www.stat.berkeley.edu/users/breiman/RandomForests/
[2] Weidong Mao, Nisar Hundewale, Stefan Gremalschi, and Alexander Zelikovsky, “GeneSuscept:
    Detection of Genotype Susceptibility in Case/Control Studies”. Department of Computer Science,
    Georgia State University, Atlanta, GA 30303.
[3] Dan Steinberg, Mikhail Golovnya, and N. Scott Cardell, “A Brief Overview to RandomForests™”,
    Salford Systems.
    http://www.salford-systems.com/
