Data Science Methods in Finance
R Tutorial 5
November 29, 2024
Important Instructions
• The purpose of this tutorial is for you to practise some of the key concepts of tree-based
methods.
• It should not be submitted, but we strongly encourage you to work through it before
the discussion on Friday.
Dataset and Setup
This assignment uses loan data from https://bondora.com/en/ and refers to the data file
“LoanData2021.RDS”. The ultimate goal of the assignment is to identify a good model for
predicting default. The dependent variable indicates whether the borrower has defaulted or
not. You can construct this variable with “default = ifelse(!is.na(DefaultDate), 1, 0)" and
“defaultFactor = factor(default, levels = c("1", "0"))". Please use the following variables for
the default predictions:
1. the age of the borrower (Age)
2. the borrower’s total income (IncomeTotal)
3. the gender of the borrower (0 male; 1 female) (Gender)
4. the total liabilities of the borrower (LiabilitiesTotal)
5. the borrower’s number of existing liabilities (ExistingLiabilities)
6. interest rate of the loan (Interest)
7. principal that still needs to be paid by the borrower (PrincipalBalance)
8. current loan duration in months (LoanDuration)
9. education of the borrower (1 Primary education; 2 Basic education; 3 Vocational edu-
cation; 4 Secondary education; 5 Higher education) (Education)
10. whether it’s a new credit customer or not (NewCreditCustomer)
11. amount the borrower received (Amount)
12. value of previous loans (AmountOfPreviousLoansBeforeLoan)
13. income after monthly liabilities (FreeCash)
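The setup described above could be sketched as follows. This is a minimal sketch assuming the file “LoanData2021.RDS” sits in your working directory and that the variable names match the list above:

```r
# Load the Bondora loan data (assumes "LoanData2021.RDS" is in the
# working directory).
loans <- readRDS("LoanData2021.RDS")

# Dependent variable: 1 if a DefaultDate is recorded, 0 otherwise.
loans$default <- ifelse(!is.na(loans$DefaultDate), 1, 0)
loans$defaultFactor <- factor(loans$default, levels = c("1", "0"))

# Keep only the outcome and the thirteen predictors listed above.
vars <- c("default", "defaultFactor", "Age", "IncomeTotal", "Gender",
          "LiabilitiesTotal", "ExistingLiabilities", "Interest",
          "PrincipalBalance", "LoanDuration", "Education",
          "NewCreditCustomer", "Amount",
          "AmountOfPreviousLoansBeforeLoan", "FreeCash")
loans <- loans[, vars]
```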
Question 1
1. Construct a table with descriptive statistics of the variables you use. Make sure to deal
with missing values (if there are any).
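A minimal sketch for this step, assuming the data frame “loans” from the setup above:

```r
# Descriptive statistics and a missing-value check.
summary(loans)
colSums(is.na(loans))    # number of missing values per variable

# One simple way to deal with missing values: drop incomplete rows.
loans <- na.omit(loans)
```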
2. Fit a logistic regression using only age as a predictor and interpret the coefficient on age.
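A sketch using base-R glm, assuming the data frame “loans” with the numeric 0/1 variable “default”:

```r
# Logistic regression of default on age alone.
logitAge <- glm(default ~ Age, data = loans, family = binomial)
summary(logitAge)
# The coefficient on Age is the change in the log-odds of default
# associated with a one-year increase in the borrower's age.
```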
3. Split the dataset into a training sample and a test set. Use 70% for training and 30%
for testing. Fit a logistic regression on the training set and report its accuracy based on
predictions on the test set. Recall that the model will output a probability of default,
which you need to turn into a default prediction (1 or 0), which can then be compared
with the actual defaults.
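The split-and-predict steps could be sketched as follows. The 70/30 proportions and the 0.5 cutoff follow the question; the seed value is an arbitrary choice for reproducibility:

```r
# 70/30 train/test split.
set.seed(1)
n <- nrow(loans)
trainIdx <- sample(seq_len(n), size = round(0.7 * n))
train <- loans[trainIdx, ]
test  <- loans[-trainIdx, ]

# Fit on the training set, predict probabilities on the test set.
logitFit <- glm(default ~ Age, data = train, family = binomial)
probs <- predict(logitFit, newdata = test, type = "response")

# Turn probabilities into 0/1 default predictions and compute accuracy.
pred <- ifelse(probs > 0.5, 1, 0)
mean(pred == test$default)    # test-set accuracy
```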
4. Add the interest variable and compare the estimated accuracy of the model with interest
and age with that of the model with only age. Interpret the coefficient on the interest
variable.
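The same exercise with Interest added, assuming the train/test objects from the previous step:

```r
# Logistic regression with both age and the interest rate.
logitFit2 <- glm(default ~ Age + Interest, data = train, family = binomial)
summary(logitFit2)

pred2 <- ifelse(predict(logitFit2, newdata = test,
                        type = "response") > 0.5, 1, 0)
mean(pred2 == test$default)   # compare with the age-only accuracy
```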
5. Fit a (large) classification tree on the training data using all the features in the
dataset. Plot the tree and report the number of terminal nodes as well as the four most
relevant variables used for splitting. Report the training and test error rates for the
tree. Interpret and compare to the errors from the logistic model.
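A sketch using the tree package (rpart would work as well), assuming the train/test objects from above:

```r
library(tree)

# Classification tree on all predictors (exclude the numeric copy of
# the outcome from the right-hand side).
treeFit <- tree(defaultFactor ~ . - default, data = train)

plot(treeFit)
text(treeFit, pretty = 0)
summary(treeFit)   # terminal nodes, split variables, training error rate

# Test error rate.
treePred <- predict(treeFit, newdata = test, type = "class")
mean(treePred != test$defaultFactor)
```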
6. Prune the tree and use 5-fold cross-validation to determine the optimal pruning parameter.
Use the error rate as the pruning criterion and compare to the error rates of the
tree from the previous question.
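A pruning sketch with cv.tree from the tree package, with K = 5 folds and the misclassification rate as the criterion, as the question asks:

```r
set.seed(1)
cvTree <- cv.tree(treeFit, FUN = prune.misclass, K = 5)
plot(cvTree$size, cvTree$dev, type = "b")   # CV error against tree size

# Prune to the size with the lowest cross-validated error.
bestSize <- cvTree$size[which.min(cvTree$dev)]
prunedFit <- prune.misclass(treeFit, best = bestSize)

prunedPred <- predict(prunedFit, newdata = test, type = "class")
mean(prunedPred != test$defaultFactor)      # test error of the pruned tree
```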
7. Use a random forest to improve predictions. For this, set the seed to 1 and fit 300 trees
using m = √p (rounded to the nearest integer), where m is the number of features considered
at each split and p is the total number of features. Provide a plot of the OOB estimate of
the test error against the number of trees grown. Report the error rate for the training
data, the OOB estimate of the test error rate, as well as the error rate for the test data.
Interpret the outcomes and compare to the accuracy of the other methods.
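A sketch with the randomForest package; with p = 13 predictors, m = round(√13) = 4:

```r
library(randomForest)

set.seed(1)
p <- 13
rfFit <- randomForest(defaultFactor ~ . - default, data = train,
                      ntree = 300, mtry = round(sqrt(p)))

plot(rfFit)   # OOB error estimate against the number of trees grown
rfFit         # printing the object reports the OOB error rate

# Training and test error rates.
mean(predict(rfFit, newdata = train) != train$defaultFactor)
mean(predict(rfFit, newdata = test)  != test$defaultFactor)
```

Note that passing the training data through newdata uses all trees for each observation, so the training error will be near zero; the OOB estimate is the more honest measure of out-of-sample performance.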
8. Provide a plot of the importance of the variables in the random forest and interpret.
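For the forest fitted above, the randomForest package provides this directly:

```r
varImpPlot(rfFit)    # importance plot for all predictors
importance(rfFit)    # mean decrease in Gini for each predictor
```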
9. Tune the random forest to select the optimal hyperparameter m. For this, you can
either set up a grid of different values of m yourself or use the caret package. If you use the
caret package, start by defining the trainControl object, define your validation method,
and add “preProc = c("center", "scale")" to the train object. Compare the in-sample
and out-of-sample accuracy of the tuned random forest with that of the other methods.
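A tuning sketch with the caret package; the particular grid of mtry values and the 5-fold CV choice are assumptions, not prescribed by the question:

```r
library(caret)

# Validation method: 5-fold cross-validation.
ctrl <- trainControl(method = "cv", number = 5)

# Candidate values for m (caret's "mtry" tuning parameter).
grid <- expand.grid(mtry = c(2, 4, 6, 8))

set.seed(1)
rfTuned <- train(defaultFactor ~ . - default, data = train,
                 method = "rf", ntree = 300,
                 trControl = ctrl, tuneGrid = grid,
                 preProc = c("center", "scale"))
rfTuned$bestTune   # the selected m

# In-sample and out-of-sample accuracy.
mean(predict(rfTuned, newdata = train) == train$defaultFactor)
mean(predict(rfTuned, newdata = test)  == test$defaultFactor)
```

Centering and scaling do not affect tree-based splits, so the preProc step is included here only because the question asks for it.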