
Stats 101C Final Project

--Team YYY
Yiklun Kei
(allen29@g.ucla.edu)
Yongkai Zhu
(zlarry7497.hfbzmssm@gmail.com)
Qianhao Yu
(luckyhowellyu@gmail.com)
Data Cleaning Procedure
1. Delete Emergency Dispatch Code since it has only one level, and also
remove incident.ID and row.ID when running models.
2. Separate the Time variable into Hour, Min and Sec, or
3. Divide the Time variable into intervals. We tried either two levels
(day: (6 AM, 6 PM]; night: (6 PM, 6 AM]) or four levels (morning: [6 AM, 12 PM);
afternoon: [12 PM, 6 PM); evening: [6 PM, 12 AM); late night: [12 AM, 6 AM)).
4. Divide Dispatch.Sequence into four intervals based on the tree model
5. Combine levels in factor variables, especially the ones that have many levels,
e.g. Unit.Type and Dispatch.Status. We combined all levels with fewer than
10,000 entries, ending up with 7 levels for Unit.Type and 6 levels for
Dispatch.Status.
6. Remove NAs, or replace them with the mean or median, in the training and
testing data
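Steps 3 and 5 can be sketched in base R as follows. This is a minimal illustration on a made-up data frame, not the team's actual code: the `toy` data, the `hour` column, the helper `lump_rare`, and the threshold of 2 (standing in for 10,000) are all assumptions chosen for the example.

```r
# Hypothetical toy data; column names echo the slides, values are made up.
toy <- data.frame(
  hour = c(1, 7, 13, 19, 23, 5),
  Unit.Type = factor(c("A", "A", "A", "B", "C", "A"))
)

# Step 3: four time-of-day levels with left-closed intervals,
# [0,6) late night, [6,12) morning, [12,18) afternoon, [18,24) evening.
toy$period <- cut(toy$hour,
                  breaks = c(0, 6, 12, 18, 24),
                  labels = c("late night", "morning", "afternoon", "evening"),
                  right = FALSE)

# Step 5: lump every factor level with fewer than min_count rows into "Other".
lump_rare <- function(f, min_count) {
  counts <- table(f)
  rare <- names(counts)[counts < min_count]
  out <- as.character(f)
  out[out %in% rare] <- "Other"
  factor(out)
}
toy$Unit.Type2 <- lump_rare(toy$Unit.Type, min_count = 2)

print(toy)
```

Using `right = FALSE` makes each interval include its lower bound, which matches the four-level definition above; the two-level day/night split uses right-closed intervals instead and would need `right = TRUE`.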
Algorithms that we have tried
Random Forest (can handle variables of mixed types and is robust to
outliers)

Lasso (performs variable selection; the model.matrix command accepts factor
input)

Artificial Neural Network (can sometimes generalize better than other
algorithms, but produces a complex model that is hard to explain and takes
longer to train)
Random Forest
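library(tree)  # regression trees
tree.lafire <- tree(elapsed_time ~ ., lafire)
cv.X <- cv.tree(tree.lafire)  # cross-validate the tree size
# Stored as pruned.lafire so the result does not share a name with prune.tree()
pruned.lafire <- prune.tree(tree.lafire, best = 4)
new.predict <- predict(pruned.lafire, test)
tree.predict <- predict(tree.lafire, test)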

The only significant variable is Dispatch.Sequence. The tree is generated by
dividing this variable into 4 branches. Within each branch, the prediction is
identical.

Best MSE Using Random Forest: 1446915.91501
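
To illustrate what "within each branch, the prediction is identical" means, and how an MSE like the one above is computed, here is a small base R sketch on simulated data. The cut points are the Dispatch.Sequence thresholds that appear in the neural network code later in the deck; treating them as the tree's split points is an assumption, and the simulated `toy` data and the `mse` helper are hypothetical.

```r
set.seed(1)

# Simulated stand-in for the real data (assumption: values are made up).
toy <- data.frame(Dispatch.Sequence = runif(200, 1, 40))
toy$elapsed_time <- 100 * toy$Dispatch.Sequence + rnorm(200, sd = 50)

# Four branches defined by three thresholds on Dispatch.Sequence.
cuts <- c(-Inf, 8.38764, 16.346, 26.9571, Inf)
toy$branch <- cut(toy$Dispatch.Sequence, breaks = cuts, right = FALSE)

# A four-leaf regression tree predicts the mean response of each branch.
branch_mean <- tapply(toy$elapsed_time, toy$branch, mean)
toy$pred <- as.numeric(branch_mean[as.character(toy$branch)])

# Mean squared error between observed and predicted values.
mse <- function(actual, predicted) mean((actual - predicted)^2)

mse_tree <- mse(toy$elapsed_time, toy$pred)
mse_null <- mse(toy$elapsed_time, mean(toy$elapsed_time))
print(c(tree = mse_tree, null = mse_null))
```

Every row in the same branch receives the same prediction, so the fitted values take at most four distinct values.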


Lasso Regression
The variables used are: year, First.in.District, Dispatch.Sequence (cut into
different levels), Unit.Type, PPE.Level, Dispatch.Status (with some levels
combined), hour, minute, and elapsed_time.

library(glmnet)
lafdtrain1=lafdtrain[complete.cases(lafdtrain),]
x=model.matrix(elapsed_time~.,data=lafdtrain1[-c(1,2,5:8,10,12,15,16,17,18,21:36)])[,-1]
y=lafdtrain1$elapsed_time
# grid (the lambda sequence) is not defined on the slide
lasso.mod=glmnet(x,y,alpha=1,lambda=grid)
cv.out=cv.glmnet(x,y,alpha=1)
bestlam=cv.out$lambda.min
lafdtest$elapsed_time=NULL
newx=model.matrix(~.,lafdtest[-c(1,2,5,6,7,8,10,13,16:22)])[,-1]
lasso.pred=predict(lasso.mod,s=bestlam,newx=newx,type="response")
lassoprediction=data.frame(lafdtest$row.id,lasso.pred)

Best MSE Using Lasso: 1476606.03132
Artificial Neural Network
library(nnet)
lafire[,c(2,3,8,9,10)] <- scale(lafire[,c(2,3,8,9,10)])
# Split the data into four groups by Dispatch.Sequence
new1 <- lafire[which(lafire$Dispatch.Sequence < 8.38764),]
new2 <- lafire[which(lafire$Dispatch.Sequence >= 8.38764 & lafire$Dispatch.Sequence < 16.346),]
new3 <- lafire[which(lafire$Dispatch.Sequence >= 16.346 & lafire$Dispatch.Sequence < 26.9571),]
new4 <- lafire[which(lafire$Dispatch.Sequence >= 26.9571),]

# One network per group, all with the same settings
temp_model1 <- nnet(data = new1, elapsed_time~First.in.District+Dispatch.Sequence+PPE.Level,
size = 10, linout = TRUE, skip = TRUE, maxit = 10000, decay = 0.001)
temp_model2 <- nnet(data = new2, elapsed_time~First.in.District+Dispatch.Sequence+PPE.Level,
size = 10, linout = TRUE, skip = TRUE, maxit = 10000, decay = 0.001)
temp_model3 <- nnet(data = new3, elapsed_time~First.in.District+Dispatch.Sequence+PPE.Level,
size = 10, linout = TRUE, skip = TRUE, maxit = 10000, decay = 0.001)
temp_model4 <- nnet(data = new4, elapsed_time~First.in.District+Dispatch.Sequence+PPE.Level,
size = 10, linout = TRUE, skip = TRUE, maxit = 10000, decay = 0.001)

Best MSE Using Artificial Neural Network: 1474897.26758
XGBoost
(Our best model in terms of smallest MSE)

Predictors Used: year (numeric), Dispatch.Sequence (numeric), Unit.Type (with all
41 levels), Dispatch.Status (with all 12 levels), PPE.Level (with 2 levels).

Dealing With Missing Values: Removed all of them using na.omit in training data,
and replaced with mean in testing data.
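
A minimal base R sketch of this missing-value handling is shown below. The slides do not say whether the replacement mean was computed from the training or the testing data; using the training mean here is an assumption, and the data frames `train_toy` and `test_toy` are made up for illustration.

```r
# Hypothetical toy data with missing values.
train_toy <- data.frame(Dispatch.Sequence = c(1, 2, NA, 3),
                        elapsed_time = c(100, 200, 150, NA))
test_toy <- data.frame(Dispatch.Sequence = c(NA, 4, 2))

# Training data: drop every row that contains at least one NA.
train_clean <- na.omit(train_toy)

# Testing data: rows cannot be dropped because each one needs a prediction,
# so NAs are filled with a mean (assumption: the mean of the cleaned training column).
fill_value <- mean(train_clean$Dispatch.Sequence)
test_clean <- test_toy
test_clean$Dispatch.Sequence[is.na(test_clean$Dispatch.Sequence)] <- fill_value

print(train_clean)
print(test_clean)
```

The asymmetry is deliberate: omitting rows is only an option for the training data, since every test row must receive a prediction.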

Parameter Tuning: Used CV; started with the following range for each of the
parameters:

max_depth: 5~10
eta: 0~0.3
gamma: 0~0.2
subsample: 0.5~0.8
colsample_bytree: 0.6~0.9
XGBoost--Parameters Introduction
max_depth is the maximum depth of a tree; increasing this value makes the model
more complex. Default=6, range=[0,infinity).
eta is the step size shrinkage applied at each update to prevent overfitting.
Default=0.3, range=[0,1].
gamma is the minimum loss reduction required to make a further partition on a
leaf node of the tree. The larger it is, the more conservative the algorithm will be.
Default=0, range=[0,infinity).
subsample is the subsample ratio of the training instances. Setting it to 0.5
means that XGBoost randomly samples half of the training instances before growing
trees, which helps prevent overfitting.
Default=1, range=(0,1].
colsample_bytree is the subsample ratio of columns used when constructing each
tree (the per-level ratio is the separate colsample_bylevel parameter).
Default=1, range=(0,1].
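
To give a feel for what eta and subsample control, here is a toy boosting loop in base R. It is a simplified sketch, not XGBoost's actual algorithm (XGBoost uses second-order gradients and regularized trees): the helpers `fit_stump` and `boost`, the simulated data, and the chosen values are all assumptions made for illustration.

```r
# Fit a one-split regression stump to residuals r on a single predictor x.
# Returns a function that predicts for new x values.
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (s in sort(unique(x))[-1]) {
    left <- x < s
    sse <- sum((r[left] - mean(r[left]))^2) + sum((r[!left] - mean(r[!left]))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s, lo = mean(r[left]), hi = mean(r[!left]))
    }
  }
  function(newx) ifelse(newx < best$split, best$lo, best$hi)
}

# Boosting for squared error: each round fits a stump to the current residuals
# on a random subsample of rows and adds eta times its prediction.
boost <- function(x, y, nrounds, eta, subsample = 1) {
  pred <- rep(mean(y), length(y))
  for (i in seq_len(nrounds)) {
    idx <- sample(length(y), size = ceiling(subsample * length(y)))
    stump <- fit_stump(x[idx], (y - pred)[idx])
    pred <- pred + eta * stump(x)
  }
  pred
}

set.seed(42)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.1)
rmse <- function(a, b) sqrt(mean((a - b)^2))

rmse_baseline <- rmse(y, mean(y))
rmse_small_eta <- rmse(y, boost(x, y, nrounds = 20, eta = 0.05))
rmse_large_eta <- rmse(y, boost(x, y, nrounds = 20, eta = 0.3))
print(c(baseline = rmse_baseline, eta_0.05 = rmse_small_eta, eta_0.3 = rmse_large_eta))
```

With a fixed number of rounds, a smaller eta moves the fit more slowly toward the training data, which is why eta is usually tuned together with the number of rounds. Passing `subsample = 0.5` to `boost` makes each round see only half of the rows.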
XGBoost (Continued; Code for Parameter Tuning)
library(Matrix)   # sparse.model.matrix
library(xgboost)
modelmtx1=sparse.model.matrix(elapsed_time~.-1,data=traindata)
train=xgb.DMatrix(data=modelmtx1,label=traindata$elapsed_time)
best_min_rmse = Inf  # initialized so the first comparison in the loop works
for (iter in 1:5) {
  # Draw one random parameter set from the tuning ranges
  param <- list(objective = "reg:linear",  # named "reg:squarederror" in newer xgboost releases
                eval_metric = "rmse",
                max_depth = sample(5:10, 1),
                eta = runif(1, 0, .3),
                gamma = runif(1, 0.0, 0.2),
                subsample = runif(1,0.5,0.8),
                colsample_bytree = runif(1,0.6,0.9)
  )
  cv.nround = 100
  cv.nfold = 5
  mdcv <- xgb.cv(data=train, params = param, nthread=6,
                 nfold=cv.nfold, nrounds=cv.nround,verbose = TRUE)
  min_rmse = min(mdcv$evaluation_log[,"test_rmse_mean"])
  min_rmse_index = which.min(as.matrix(mdcv$evaluation_log[,"test_rmse_mean"]))
  # Keep the parameter set with the lowest cross-validated RMSE
  if (min_rmse < best_min_rmse) {
    best_min_rmse = min_rmse
    best_min_rmse_index = min_rmse_index
    best_param = param
  }
}
nround = best_min_rmse_index
XGBoost (Continued; Code for Running It and Doing Predictions)
Found Using CV: max_depth=9, eta=0.2, gamma=0.12, subsample=0.64,
colsample_bytree=0.62

Training <-xgb.train(params = best_param, data = train, nrounds=35,watchlist =
list(train = train),verbose = TRUE,print_every_n = 1,nthread = 6)

# Columns for year, Dispatch.Sequence, Unit.Type, Dispatch.Status, PPE.Level
finaltest=lafdtest[,c(3,6,7,8,9)]

sparsemtxtest=sparse.model.matrix(~.-1,data=finaltest)

testdata=xgb.DMatrix(data=sparsemtxtest)

prediction=predict(Training,testdata)

Best MSE Using XGBoost: 1394460.86615
XGBoost Importance Plot For This Case
Conclusion--Regarding the Variables
Year: Factor | Numerical | Not included in the model
Time: Separated | Not Separated | Not included in the model
Dispatch Sequence: Divided | Not Divided | Not included in the model
Unit Type & Dispatch Status: Levels Combined | Not Combined | Not included in the model
NAs in Training Data: Replaced | Replaced With Mean (=2) | Omitted

Conclusion--Regarding the Algorithms
Algorithms: Random Forest

Lasso

Artificial Neural Network

XGBoost
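
The four algorithms can be ranked directly from the best MSE values reported on the earlier slides. The short R sketch below does that arithmetic; the variable names are illustrative, and the RMSE column is simply the square root of each MSE, expressed in the same (unstated) units as elapsed_time.

```r
# Best MSE per algorithm, copied from the slides above.
best_mse <- c(
  "Random Forest" = 1446915.91501,
  "Lasso" = 1476606.03132,
  "Artificial Neural Network" = 1474897.26758,
  "XGBoost" = 1394460.86615
)

# Rank from best (smallest MSE) to worst.
ranking <- sort(best_mse)

# RMSE is on the scale of the response; pct_above_best is the relative MSE gap
# of each algorithm to the best one.
best_rmse <- sqrt(ranking)
pct_above_best <- 100 * (ranking / ranking[1] - 1)

print(round(data.frame(MSE = ranking,
                       RMSE = best_rmse,
                       pct_above_best = pct_above_best), 2))
```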
Final Conclusion
In this case, XGBoost yields the smallest MSE among all the algorithms we
have tried, which is consistent with its reputation as a go-to algorithm on the
Kaggle competitive data science platform (as described on one of the XGBoost
introduction webpages we read).

What's more, it turns out that the data cleaning procedures we tried did
not necessarily improve our results: our best MSE was produced by
completely omitting the NAs in the training dataset and using the variables as
they were in the original dataset. For example, we tried applying XGBoost to
the new data with some levels of Unit.Type and Dispatch.Status combined,
but the results were not as good.
Possible Further Improvements
We should spend more time learning about and trying the
XGBoost algorithm.

Also, we should perhaps try more methods for data
cleaning; we might find a method that outperforms what
we have been trying for cleaning the data.
Q&A

Thank You All For Watching
&
Wish You Best of Luck in All of Your Future Endeavors