
Stats 101C Final Project

--Team YYY
Yiklun Kei
Yongkai Zhu
Qianhao Yu
Data Cleaning Procedure
1. Delete Emergency Dispatch Code since it has only one level, and also
remove incident.ID and row.ID when running models.
2. Separate the Time variable into Hour, Min and Sec, or
3. Divide the Time variable into intervals. We tried either two levels
(day: (6 AM, 6 PM]; night: (6 PM, 6 AM]) or four levels (morning: [6 AM, 12 PM);
afternoon: [12 PM, 6 PM); evening: [6 PM, 12 AM); late night: [12 AM, 6 AM)).
4. Divide Dispatch.Sequence into four intervals based on the split points of a tree model.
5. Combine levels in factor variables, especially the ones that have many levels,
e.g. Unit.Type and Dispatch.Status. We combined all levels with fewer than
10000 entries, ending up with 7 levels for Unit.Type and 6 levels for Dispatch.Status.
6. Remove NAs, or replace them with the mean or median, in the training and testing data.
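Steps 3 and 5 can be sketched in base R as follows. This is only an illustration on toy data: the helper names `time_of_day` and `combine_rare_levels`, the "Other" label, and the toy threshold of 3 are assumptions made for the example, not the code we actually ran (the real threshold was 10000).

```r
# Hypothetical helper: bucket an hour of day (0-23) into the four levels of step 3.
time_of_day <- function(hour) {
  cut(hour,
      breaks = c(0, 6, 12, 18, 24),
      labels = c("late night", "morning", "afternoon", "evening"),
      right = FALSE)  # intervals are [0,6), [6,12), [12,18), [18,24)
}

# Hypothetical helper: merge every level with fewer than min_count rows into "Other".
combine_rare_levels <- function(x, min_count) {
  x <- as.character(x)
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  x[x %in% rare] <- "Other"
  factor(x)
}

# Toy data standing in for the real columns.
hours <- c(0, 5, 6, 11, 12, 17, 18, 23)
tod <- time_of_day(hours)

unit_type <- c(rep("ENGINE", 5), rep("RESCUE", 4), "BOAT", "HELI")
unit_type_combined <- combine_rare_levels(unit_type, min_count = 3)
table(unit_type_combined)
```

Passing the threshold as an argument makes it easy to compare different cut-offs for what counts as a rare level.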
Algorithms that we have tried
Random Forest (can handle variables with mixed types, and is robust to

Lasso (the model.matrix command accepts factor input, and the lasso performs
variable selection)

Artificial Neural Network (can generalize better than other algorithms, but
creates a complex model that is hard to explain and takes longer to train)
Random Forest
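library(tree)
# Single regression tree, used to find the split points on Dispatch.Sequence
tree.lafire <- tree(elapsed_time ~ ., lafire)
cv.X <- cv.tree(tree.lafire)
# Renamed so the result does not share a name with the prune.tree() function
pruned.lafire <- prune.tree(tree.lafire, best = 4)
new.predict <- predict(pruned.lafire, test)
tree.predict <- predict(tree.lafire, test)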

The only significant variable is Dispatch.Sequence. The tree divides this variable into 4 branches, and the prediction is identical for all observations within a branch.
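To make the "identical prediction within a branch" point concrete, here is a small base R sketch. The split points are the ones that appear in the subsetting code in the neural network section; the Dispatch.Sequence and response values are made up, and the group labels `g1` to `g4` are assumptions for the example.

```r
# Split points on Dispatch.Sequence (taken from the subsetting code in this deck).
split_points <- c(8.38764, 16.346, 26.9571)

# Toy Dispatch.Sequence values; the real column comes from the project data.
dispatch_seq <- c(1, 8, 9, 16, 17, 26, 27, 40)

# right = FALSE gives [-Inf, 8.38764), [8.38764, 16.346), and so on.
seq_group <- cut(dispatch_seq,
                 breaks = c(-Inf, split_points, Inf),
                 labels = c("g1", "g2", "g3", "g4"),
                 right = FALSE)

# A regression tree predicts the mean response within each leaf,
# so every row in the same group receives the same prediction.
elapsed <- c(100, 120, 300, 320, 500, 520, 900, 940)  # made-up responses
leaf_means <- tapply(elapsed, seq_group, mean)
pred <- as.numeric(leaf_means[as.character(seq_group)])
```

With only four distinct predicted values, the tree cannot separate observations inside a branch, which is one reason to fit further models within each branch.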

Best MSE Using Random Forest: 1446915.91501

Lasso Regression
x=model.matrix(elapsed_time~.,data=lafdtrain1[-c(1,2,5:8,10,12,15,16,17,18,21:36)])[,-1]
cv.out=cv.glmnet(x,y,alpha=1)

The variables used are: Dispatch.Sequence (cut into different levels), Unit.Type,
PPE.Levels, Dispatch.Status (with some levels combined), hour, minute, and second.

Best MSE Using Lasso:
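The model.matrix call above is what lets the lasso work with factor predictors. The sketch below shows the expansion on a toy data frame; the column names mirror ours but the values are made up, and glmnet itself is left out so the example needs only base R.

```r
# Toy data frame; column names mirror the deck, values are made up.
toy <- data.frame(
  elapsed_time = c(100, 250, 180, 400, 90, 310),
  Dispatch.Sequence = c(1, 3, 2, 5, 1, 4),
  Unit.Type = factor(c("ENGINE", "RESCUE", "ENGINE", "TRUCK", "RESCUE", "TRUCK"))
)

# model.matrix() turns each factor into 0/1 dummy columns (dropping one
# reference level); [, -1] then removes the intercept column, since
# glmnet fits its own intercept.
x <- model.matrix(elapsed_time ~ ., data = toy)[, -1]

# The response must line up row by row with x.
y <- toy$elapsed_time

colnames(x)
```

Note that model.matrix drops rows containing NAs by default, so y should be taken from the same NA-free rows before both are passed to cv.glmnet.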





Artificial Neural Network
lafire[,c(2,3,8,9,10)] <- scale(lafire[,c(2,3,8,9,10)])
# Four subsets defined by the tree's split points on Dispatch.Sequence
new1 <- lafire[which(lafire$Dispatch.Sequence < 8.38764),]
new2 <- lafire[which(lafire$Dispatch.Sequence >= 8.38764 & lafire$Dispatch.Sequence < 16.346),]
new3 <- lafire[which(lafire$Dispatch.Sequence >= 16.346 & lafire$Dispatch.Sequence < 26.9571),]
new4 <- lafire[which(lafire$Dispatch.Sequence >= 26.9571),]

library(nnet)
# The model formula is missing in the source; elapsed_time ~ . is assumed here
temp_model1 <- nnet(elapsed_time ~ ., data = new1,
size = 10, linout = T, skip = T, maxit = 10000, decay = 0.001)
temp_model2 <- nnet(elapsed_time ~ ., data = new2,
size = 10, linout = T, skip = T, maxit = 10000, decay = 0.001)
temp_model3 <- nnet(elapsed_time ~ ., data = new3,
size = 10, linout = T, skip = T, maxit = 10000, decay = 0.001)
temp_model4 <- nnet(elapsed_time ~ ., data = new4,
size = 10, linout = T, skip = T, maxit = 10000, decay = 0.001)

Best MSE Using Artificial Neural Network:

XGBoost (Regarding our best model in terms of smallest MSE)

Predictors Used: year (numeric), Dispatch.Sequence (numeric), Unit.Type (with all
41 levels), Dispatch.Status (with all 12 levels), PPE.Level (with 2 levels).

Dealing With Missing Values: Removed all of them from the training data using
na.omit, and replaced them with the mean in the testing data.
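The missing-value handling described above can be sketched in base R like this. The data frames are toy stand-ins, and the choice of fill value is an assumption: the example uses the mean of the observed test values, since the slide does not record whether the training or the testing mean was used.

```r
# Toy stand-ins for the training and testing data.
train <- data.frame(Dispatch.Sequence = c(1, NA, 3, 2),
                    elapsed_time = c(100, 200, NA, 150))
test <- data.frame(Dispatch.Sequence = c(2, NA, 4))

# Training data: drop every row that contains an NA.
train_clean <- na.omit(train)

# Testing data: every row needs a prediction, so fill NAs instead of dropping rows.
# Assumption: the fill value is the mean of the observed test values.
fill <- mean(test$Dispatch.Sequence, na.rm = TRUE)
test$Dispatch.Sequence[is.na(test$Dispatch.Sequence)] <- fill
```

Dropping rows is only an option for the training data; the testing data keeps all of its rows so that the submission has one prediction per row.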

Parameter Tuning: Used CV; started with the following range for each of the parameters:
max_depth 5~10, eta 0~0.3, gamma 0~0.2, subsample 0.5~0.8, colsample_bytree 0.6~0.9
XGBoost--Parameters Introduction
max_depth stands for the maximum depth of a tree; increasing this value makes
the model more complex. Default=6, range=[0,infinity).
eta stands for the step size shrinkage used in each update to prevent overfitting.
Default=0.3, range=[0,1].
gamma stands for the minimum loss reduction required to make a further partition on a
leaf node of the tree. The larger it is, the more conservative the algorithm will be.
Default=0, range=[0,infinity).
subsample stands for the subsample ratio of the training instances. Setting it to 0.5
means that XGBoost randomly samples half of the data instances to grow trees, which
helps prevent overfitting. Default=1, range=(0,1].
colsample_bytree stands for the subsample ratio of columns when constructing each tree.
Default=1, range=(0,1].
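As a rough illustration of the parameters above, the sketch below writes the defaults as a named list (the form passed as params in our tuning code) and then mimics the two sampling ratios with plain sample() calls. The sampling part is an illustration of what the ratios mean only, not XGBoost's internal code, and the values of n, p and the ratios are made up.

```r
# Default values as listed above, written as a named parameter list.
default_params <- list(max_depth = 6, eta = 0.3, gamma = 0,
                       subsample = 1, colsample_bytree = 1)

# Illustration only (not XGBoost internals): what the two sampling
# ratios mean for a data set with n rows and p columns.
set.seed(1)
n <- 1000
p <- 10
subsample <- 0.5
colsample_bytree <- 0.6

rows_for_tree <- sample(n, size = round(subsample * n))         # rows used to grow one tree
cols_for_tree <- sample(p, size = round(colsample_bytree * p))  # columns available to that tree

length(rows_for_tree)
length(cols_for_tree)
```

Because each tree sees a different random slice of rows and columns, the trees are less correlated with one another, which is the intuition behind using these ratios against overfitting.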
XGBoost (Continued; Code for Parameter Tuning)
# Truncated in the source: modelmtx1=sparse.model.matrix(elapsed_time~.- ... _time)
cv.nround = 100
cv.nfold = 5
best_min_rmse = Inf  # starting value assumed; not shown in the source
for (iter in 1:5) {
  param <- list(objective = "reg:linear",
                eval_metric = "rmse",
                max_depth = sample(5:10, 1),
                eta = runif(1, 0, .3),
                gamma = runif(1, 0.0, 0.2),
                subsample = runif(1,0.5,0.8),
                colsample_bytree = runif(1,0.6,0.9))
  # The function name and data argument are lost in the source; xgb.cv(data = train, ...) is assumed
  mdcv <- xgb.cv(data = train, params = param, nthread=6,
                 nfold=cv.nfold, nrounds=cv.nround, verbose = TRUE)
  min_rmse = min(mdcv$evaluation_log[,"test_rmse_mean"])
  min_rmse_index = which.min(as.matrix(mdcv$evaluation_log[,"test_rmse_mean"]))
  if (min_rmse < best_min_rmse) {
    best_min_rmse = min_rmse
    best_min_rmse_index = min_rmse_index
    best_param = param
  }
}
nround = best_min_rmse_index

XGBoost (Continued; Code for Running It and Doing Predictions)
Found Using CV: max_depth=9, eta=0.2, gamma=0.12, subsample=0.64

Training <- xgb.train(params = best_param, data = train, nrounds=35, watchlist =
list(train = train), verbose = TRUE, print_every_n = 1, nthread = 6)

# Columns for year, Dispatch.Sequence, Unit.Type, Dispatch.Status, PPE.Level
finaltest=lafdtest[,c(3,6,7,8,9)]



Best MSE Using XGBoost: 1394460.86615
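Since the tuning code tracks RMSE while our results are reported as MSE, the relation between the two may be worth spelling out. The sketch below is a minimal illustration: the helper name `mse` and the toy vectors are assumptions for the example, and only `best_mse` is a number from this deck.

```r
# Hypothetical helper: mean squared error between actual and predicted values.
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Toy values, not project data.
actual <- c(100, 200, 300)
predicted <- c(110, 190, 330)

toy_mse <- mse(actual, predicted)  # mean of the squared errors 100, 100, 900
toy_rmse <- sqrt(toy_mse)          # RMSE is the square root of MSE

# Best XGBoost MSE reported in this deck, converted to the RMSE scale.
best_mse <- 1394460.86615
best_rmse <- sqrt(best_mse)
```

RMSE is on the same scale as elapsed_time, which makes it easier to interpret than the MSE itself.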
XGBoost Importance Plot For This Case
Conclusion--Regarding the Variables
Year: Factor / Numerical / Not included in the model
Time: Separated / Not Separated / Not included in the model
Dispatch Sequence: Divided / Not Divided / Not included in the model
Unit Type & Dispatch Status: Levels Combined / Not Combined / Not included in the model
NAs in Training Data: Replaced / Omitted
Replaced With Mean (=2)

Conclusion--Regarding the Algorithms
Algorithms: Random Forest


Artificial Neural Network

Final Conclusion
In this case, XGBoost yields the smallest MSE among all the algorithms we
have tried (1394460.86615, versus 1446915.91501 for Random Forest), which is
consistent with its reputation as a go-to algorithm on Kaggle, the competitive
data science platform (as described on one of the XGBoost introduction
webpages we found).

What's more, it turns out that the data cleaning procedures we tried did
not necessarily improve our results: our best MSE was produced by
completely omitting the NAs in the training dataset and using the variables as
they were in the original dataset. For example, we tried applying XGBoost to
the new data with some levels of Unit.Type and Dispatch.Status combined,
but the results were not as good.
Possible Further Improvements
We should spend more time learning about and trying the
XGBoost algorithm.

Also, we should perhaps try more methods for data
cleaning; we might find a method that outperforms what
we have been trying for cleaning the data.
Thank You All For Watching

Wish You Best of Luck in All of Your
Future Endeavors