
The Random Forests method

The Random Forests model uses the bootstrap aggregation ("bagging") technique to repeatedly draw a
random sample (with replacement) of the data and fit a decision tree to each sample.
Mathematically, Random Forests can be expressed as:

\hat{f}_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{b}(x), \qquad B = \text{number of trees}
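
As an illustration, the averaging above can be sketched in Python with scikit-learn decision trees; the library, dataset and variable names here are assumptions for illustration, not the data used in this report:

```python
# Minimal sketch of bagging: bootstrap samples, one tree per sample,
# and the averaged prediction f_avg(x) = (1/B) * sum_b f_b(x).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

B = 100                                   # number of bagged trees
rng = np.random.default_rng(0)
probs = np.zeros(len(X))

for b in range(B):
    # bootstrap sample: draw n observations with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    probs += tree.predict_proba(X)[:, 1]

f_avg = probs / B                         # averaged class-1 probability
y_hat = (f_avg >= 0.5).astype(int)        # bagged prediction
```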

In addition to bagging, Random Forests randomly selects a subset of p predictors at each split. In
most cases, the model will by default randomly select ⌊√m⌋ predictors at each split, where m is the
total number of predictors. This is the most important difference between Random Forests and bagging.
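
For illustration, assuming a scikit-learn implementation (the report does not name its software), this per-split subset size corresponds to the max_features parameter, whose "sqrt" setting gives the ⌊√m⌋ default described above:

```python
# Illustrative sketch only: max_features controls how many predictors are
# considered at each split; "sqrt" selects floor(sqrt(m)) of the m predictors.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=1000, max_features="sqrt", random_state=0)
```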

The randomness of predictor subset selection further reduces variance by reducing the correlation
between the bagged trees. Indeed, a predictor with a high variable importance will not be
over-weighted, because the algorithm does not consider the full set of predictors at every split.

Random Forests have another advantage: increasing the number of trees in the forest does not cause
overfitting. Indeed, by the strong law of large numbers, the generalisation error converges as more
i.i.d. bagged trees are added.

Similar to bagging, the out-of-bag ("OOB") error provides an efficient alternative to cross-validation.
On average, each bagged tree makes use of around two-thirds of the observations and leaves out the
remaining third, which is not used to fit that tree. Hence, for a Random Forest of B bagged trees, each
observation will typically receive around B/3 out-of-bag predictions, which can be aggregated into a
validation score.
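
A minimal sketch of computing the OOB error, assuming scikit-learn and an illustrative dataset rather than the one used in this report:

```python
# oob_score=True scores each observation using only the trees whose
# bootstrap sample left that observation out.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=1000, oob_score=True,
                            random_state=0).fit(X, y)

oob_error = 1 - rf.oob_score_   # OOB classification error
```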

Fine-tuning of parameters

For our model, we can further improve prediction performance by fine-tuning two parameters: 1) the
maximum number of predictors, p, considered at each split; and 2) the maximum depth of every tree in
the forest. We will use the OOB classification error as the performance measure.
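
A sketch of this search, assuming scikit-learn; the dataset and the parameter grids below are illustrative placeholders, not the exact values explored in the report:

```python
# Track OOB error while varying max_features and max_depth separately.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

oob_error_by_p = {}
for p in range(1, 16):            # candidate number of predictors per split
    rf = RandomForestClassifier(n_estimators=1000, max_features=p,
                                oob_score=True, random_state=0).fit(X, y)
    oob_error_by_p[p] = 1 - rf.oob_score_

oob_error_by_depth = {}
for depth in range(2, 25):        # candidate maximum tree depths
    rf = RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                oob_score=True, random_state=0).fit(X, y)
    oob_error_by_depth[depth] = 1 - rf.oob_score_
```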

Figure 1: Maximum number of predictors at each split

Figure 2: Maximum depth of every tree in the forest


Both parameters were tested using a Random Forest with 1,000 bagged trees. The model reaches its
minimum OOB error when the maximum number of predictors at each split equals 9. Moreover, the OOB
error stabilises once the maximum depth of the trees reaches 18.

Fitting the model

We take the fine-tuned model, fit it to the training data set, and compute the confusion matrix and ROC
curve:
Using the Gini index, we can also rank the variables by importance:
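
A sketch of this step, assuming scikit-learn and an illustrative dataset in place of the actual training/test split; the fine-tuned settings (9 predictors per split, maximum depth 18, 1,000 trees) follow from the OOB search above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the fine-tuned forest on the training data.
rf = RandomForestClassifier(n_estimators=1000, max_features=9, max_depth=18,
                            random_state=0).fit(X_train, y_train)

# Confusion matrix and area under the ROC curve on the test data.
print(confusion_matrix(y_test, rf.predict(X_test)))
print(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

# Gini-based variable importance, ranked from most to least important.
ranked = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])
```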

The Random Forest model reached a prediction accuracy of almost 83%, with an area under the ROC curve
of 0.90. In addition, it produced fewer false positives and false negatives than the ridge/lasso
logistic regression and the neural network. It is therefore a better model not only in terms of
accuracy, but also in terms of minimising Type I and Type II errors.

Interestingly, we see that being likeable and attractive, having a good sense of humour and sharing
common interests are the key factors that influence a partner's decision. It is also worth highlighting
that although race does play a role in partner selection, it does not outweigh these personal qualities.
