2 Data
3 Method
3.1 OLS Model
3.2 Decision Tree
3.3 Regression Tree
3.4 Growing a Regression Tree
3.5 Bootstrap Aggregation (Bagging)
3.6 Random Forest
3.6.1 Effect of Tree depth
3.6.2 Effect of Forest size
4 Results
4.1 Descriptive Statistics
4.2 Variable evaluation
4.3 Model evaluation
5 Conclusions
6 Appendix I: Tables
2 Data
Data is 75 558 milking events spread over 9 months, starting jan 1 2015 and ending oct 1
2015. For each event several variables including but not limited to start time, animal id
number, duration of milking event, amount of milk and milk flow were recorded. Not all
of these variables are relevant for for the particular application of this thesis but they can
be interesting for other applications. Errors have also been recorded for each event. Events
where an error relating to milk yield have occurred is treated as missing value data and is
excluded from analysis. Data has been split in training data and testing data, the first 7
months are used as training data and the last 2 as testing data. Data is provided by the
Swedish University of Agricultural Sciences (Sveriges Lantbruksuniversitet).
3 Method
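3.1 OLS Model
Lilja and Keteris Eckerstedt (2016, p. 10) describe the variables and the most commonly used
model for lactation curves as follows:
$$
\begin{aligned}
y_{ij} ={} & \alpha_1(\mathrm{Date}_i) + \alpha_2(\mathrm{Milking}_i) + \beta_1 \mathrm{DIM}_i + \beta_2 \mathrm{DIM}_i^2 + \beta_3 \mathrm{DIM}_i^3 + \beta_4 \frac{1}{\mathrm{DIM}_i} \\
& + \beta_5(\mathrm{Milking}_i)\,\mathrm{DIM}_i + \beta_6(\mathrm{Milking}_i)\,\mathrm{DIM}_i^2 + \beta_7(\mathrm{Milking}_i)\,\mathrm{DIM}_i^3 \\
& + \beta_8(\mathrm{Milking}_i)\,\frac{1}{\mathrm{DIM}_i} + \alpha(\mathrm{Cow}_i) + \varepsilon_i
\end{aligned}
\tag{3.1.1}
$$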
”The variable days in milk (DIM) is defined as the number of days since latest calving. DIM
is not included in the original data but is calculated as the difference in days between most
recent calving date and the date for the milking event. DIM follows a characteristic curve
depending on time since calving, which is why different functions of this variable also are
included in the model. Milking is a fixed effect which allows for different levels for milk yield
as well as allowing for different slopes depending on time of day when the milking procedure
occurred. The variable date is included to control for day-specific occurrences. The effect of
the specific cow is modelled as a random effect.”
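The construction of DIM and its transformations can be sketched in R as follows. This is a minimal illustration on made-up rows, not the code used for the thesis: the data frame `toy` and its values are hypothetical, while the column names `Datum`, `Calving.Date`, `DIM2`, `DIM3` and `DIMdel` are borrowed from the appendix code, where `DIMdel` is assumed to be the reciprocal term 1/DIM.

```r
# Hypothetical milking events (values made up for illustration)
toy <- data.frame(
  Datum        = as.Date(c("2015-03-01", "2015-03-15", "2015-06-01")),
  Calving.Date = as.Date(c("2015-02-19", "2015-02-19", "2015-01-02"))
)

# Days in milk: days between the most recent calving and the milking event
toy$DIM <- as.numeric(toy$Datum - toy$Calving.Date)

# Functions of DIM that enter the model: square, cube and reciprocal
toy$DIM2   <- toy$DIM^2
toy$DIM3   <- toy$DIM^3
toy$DIMdel <- 1 / toy$DIM   # assumption: DIMdel denotes 1/DIM

print(toy)
```

The reciprocal term is only defined for DIM > 0, which is consistent with the appendix code keeping only events where `Datum > Calving.Date`.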
3.2 Decision Tree
A decision tree is a series of nodes, each asking a yes-or-no question about one of the features
in the dataset. If the answer is yes, one branch is followed; if the answer is no, the other
branch is followed. This is repeated until an end point, called a leaf, is reached. An example
is shown in figure 1. Here a company wants to know whether a person in their store is likely to
buy a computer. The first variable the tree looks at is age. If the person is between 18 and 25,
it is likely that they will buy the computer. If the person is not between these ages, we keep
asking questions until a leaf node is reached.
Figure 1: Example of a decision tree. Is a customer likely to buy a computer from a company
or not?
Decision trees are a very good off-the-shelf procedure for data mining, with relatively fast
computation times and good possibilities for interpretation (as long as the trees are small).
Trees also incorporate both numerical and categorical predictor variables (features) seamlessly
and handle missing values very well. Solutions are also invariant under strictly monotone
(increasing or decreasing) transformations of the individual predictors, which is a very useful
property for estimators of any kind. Many other machine learning methods, such as
nearest-neighbor methods or Support Vector Machines, do not have this invariance property.
Linear regression, the most common method of approximating relationships in data, is only
invariant under linear transformations. Feature selection is also part of the procedure for
building a tree, which makes trees highly insensitive to irrelevant features (Hastie,
Tibshirani, and Friedman 2009, p. 352).
Together, these properties are what make decision trees such a useful learner. Their one major
downside is that individual trees are not very accurate: they have high variance. This can be
remedied by boosting as well as by bootstrap aggregation (bagging). Bagging is explained in more
detail in section 3.5.
3.3 Regression Tree
Once the basics of a decision tree are understood, the next step is to move on to regression
trees. Regression trees are used to model the relationship between input variables and a
numerical outcome variable, in other words to create a regression function. Once a leaf is
reached a model is fitted, and once all of the leaves are computed the tree is complete and can
be used to predict the outcome variable for new observations. There is a choice of which model
to fit in each leaf; one could imagine either a linear or a quadratic model. The most common
choice, however, is the constant ck from equation 3.4.1, which is the average of the
observations in each of the new regions after a split. This method is visualized in figure 2.
Figure 2: Example of a regression tree with depth 1 (left) and depth 2 (right)
In the left panel a simple tree (commonly known as a stump) is fitted to some training data. We
see that the split is made somewhere between 0.8 and 1.0. A constant is fitted in each region,
and the combination of the left and right regions is now our estimate of the regression
function. The interpretation is that if the x-value of a new observation is to the left of the
split, then the predicted y-value is around −45. The right panel shows a tree where another
split has been made: the first split is at the same point, and each of the resulting regions is
in turn split into two. Our regression model now consists of four horizontal line segments.
This tree looks like a much better approximation of the data.
3.4 Growing a Regression Tree
A regression tree is grown from data consisting of p inputs and a response for each of N
observations in the following manner. The goal is to split the data into K regions and to fit a
constant to the observations in each region. The fitted function is then
$$ f(x) = \sum_{k=1}^{K} c_k \, I(x \in R_k) \tag{3.4.1} $$
where $I(x \in R_k)$ is the indicator function, equal to 1 when x is within the region $R_k$ and
zero otherwise, $R_1, R_2, \ldots, R_K$ are the regions, and $c_k$ is the constant that
minimizes the sum of squares $\sum (y_i - f(x_i))^2$ in region $R_k$. This means that the
estimator for this region is
$$ \hat{c}_k = \mathrm{average}(y_i \mid x_i \in R_k) \tag{3.4.2} $$
which is the mean of y within the region $R_k$.
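Equations 3.4.1 and 3.4.2 can be illustrated with a small R sketch for a single input variable, where the regions are intervals separated by split points. The helper `predict_piecewise` and the toy data are hypothetical and chosen to resemble the stump in figure 2; the split point is fixed by hand here rather than searched for.

```r
# Piecewise-constant regression function f(x) = sum_k c_k * I(x in R_k)
# for one input variable; regions are the intervals between split points.
predict_piecewise <- function(x, splits, constants) {
  # left.open = TRUE puts x <= split in the left region and x > split in the right
  region <- findInterval(x, splits, left.open = TRUE) + 1
  constants[region]
}

# Toy training data (made up for illustration)
x <- c(0.1, 0.4, 0.7, 1.2, 1.5, 1.9)
y <- c(-46, -44, -45, 10, 12, 14)

# One split at 0.9 gives two regions; c_k is the mean of y in each region
splits    <- 0.9
region    <- findInterval(x, splits, left.open = TRUE) + 1
constants <- as.numeric(tapply(y, region, mean))

constants                                           # -45 and 12
predict_piecewise(c(0.5, 1.0), splits, constants)   # -45 and 12
```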
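Finding the best regions $R_1, R_2, \ldots, R_K$ all at once is computationally very intensive,
borderline infeasible. The best we can do is to adopt a greedy method to partition the data,
splitting one node at a time and growing the tree that way. Starting with a splitting variable
j and split point s, we define the half-planes
$$ R_1(j,s) = \{x \mid x_j \le s\} \quad\text{and}\quad R_2(j,s) = \{x \mid x_j > s\} \tag{3.4.3} $$
and then seek the splitting variable j and split point s that solve
$$ \min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]. \tag{3.4.4} $$
For any choice of j and s, the inner minimization is solved by
$$ \hat{c}_1 = \mathrm{average}(y_i \mid x_i \in R_1(j,s)) \quad\text{and}\quad \hat{c}_2 = \mathrm{average}(y_i \mid x_i \in R_2(j,s)). \tag{3.4.5} $$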
In other words, we choose the split that minimizes the sum of squared errors for a constant fit
in each region. This is computed for every variable and every observed point xi until the best
pair (j, s) is found. Each split is relatively fast to compute.
This process is then performed iteratively for each node until the entire tree is built, that
is, until a stopping criterion is reached. The default for regression in the commonly used R
package randomForest (Cutler and Wiener 2015) is to keep splitting until the terminal nodes
reach a minimum size of 5 observations. There are other stopping criteria, but they are not
discussed here.
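The greedy search for a single split, as in equations 3.4.4 and 3.4.5, can be sketched in base R as below. The functions `sse`, `best_split_one` and `best_split` are hypothetical helpers written for illustration, and the toy data are made up; real implementations such as randomForest use faster compiled code and handle categorical variables as well.

```r
# Sum of squared errors around the mean, i.e. around the best constant fit
sse <- function(y) sum((y - mean(y))^2)

# Exhaustive search for the best split point s on one numeric variable x
best_split_one <- function(x, y) {
  values <- sort(unique(x))
  if (length(values) < 2) return(NULL)   # nothing to split on
  # Candidate split points: midpoints between consecutive observed values
  candidates <- (head(values, -1) + tail(values, -1)) / 2
  cost <- vapply(candidates,
                 function(s) sse(y[x <= s]) + sse(y[x > s]),
                 numeric(1))
  best <- which.min(cost)
  list(s = candidates[best], cost = cost[best])
}

# Search over all variables j (columns of X) and keep the best pair (j, s)
best_split <- function(X, y) {
  best <- list(j = NA, s = NA, cost = Inf)
  for (j in seq_len(ncol(X))) {
    cand <- best_split_one(X[[j]], y)
    if (!is.null(cand) && cand$cost < best$cost) {
      best <- list(j = names(X)[j], s = cand$s, cost = cand$cost)
    }
  }
  best
}

# Toy data: y jumps when x1 crosses 1, while x2 carries no information
X <- data.frame(x1 = c(0.2, 0.5, 0.8, 1.2, 1.6, 1.9),
                x2 = c(3, 1, 2, 3, 1, 2))
y <- c(-46, -44, -45, 10, 12, 14)

res <- best_split(X, y)
res$j      # "x1"
res$s      # 1, the midpoint between 0.8 and 1.2
res$cost   # 10, the remaining sum of squared errors
```

Growing a full tree amounts to applying `best_split` again to the observations in each of the two resulting regions until the stopping criterion is met.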
3.5 Bootstrap Aggregation (Bagging)
Before introducing the concept of bagging we first have to learn about bootstrapping. Suppose
we start out with a training set Z = (z1, z2, ..., zN), where zi = (xi, yi), fit a model to
this set, and now want to assess the accuracy of the model fit. Bootstrapping is a method to
achieve this when there is no out-of-sample data to test on. The bootstrapping process is to
randomly sample observations with replacement from the training set Z and then refit the
proposed model to each of these samples. This gives a bootstrap sampling distribution from
which, for example, a confidence interval for the model can be created. The most common
approach is to sample as many observations as there are in the original dataset, although it is
also possible to sample fewer (undersampling) or more (oversampling). Bagging averages the
predictions of the models fitted to the bootstrap samples, with the aim of reducing the
variance of the estimator as much as possible. The average of B independent identically
distributed random variables, each with variance $\sigma^2$, has variance
$$ \frac{1}{B}\sigma^2. \tag{3.5.1} $$
If the variables are identically distributed but not necessarily independent, with positive
pairwise correlation $\rho$, then the variance of the average is
$$ \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2. \tag{3.5.2} $$
As B grows large, the second term vanishes and the expression tends to
$$ \rho\sigma^2. \tag{3.5.3} $$
In other words, the variance does not approach zero with a larger number of bootstrap samples
as long as the models fitted to those samples are correlated. A remedy for this problem follows
in the next section, where we discuss random forests, a close relative of bagging in which the
correlation between the trees is reduced using randomness.
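The effect of correlation on the variance of an average can be made concrete by evaluating equation 3.5.2 for a few values of B. The function name `var_of_average` and the numbers used are illustrative assumptions only.

```r
# Variance of the average of B identically distributed variables with
# variance sigma2 and pairwise correlation rho (equation 3.5.2)
var_of_average <- function(B, sigma2, rho) {
  rho * sigma2 + (1 - rho) / B * sigma2
}

B <- c(1, 10, 100, 1000)

# Independent case (rho = 0): the variance shrinks towards zero
var_of_average(B, sigma2 = 4, rho = 0)     # 4, 0.4, 0.04, 0.004

# Correlated case (rho = 0.5): the variance levels out at rho * sigma2 = 2
var_of_average(B, sigma2 = 4, rho = 0.5)   # 4, 2.2, 2.02, 2.002
```

With correlated variables, averaging more of them quickly stops paying off, which is why reducing the correlation itself is the more effective route.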
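3.6 Random Forest
The idea behind random forest regression is to grow many de-correlated regression trees and
average them. The random forest algorithm for regression (Hastie, Tibshirani, and Friedman
2009, p. 588) is shown in algorithm 1.
Algorithm 1: Random forest for regression
1. For b = 1 to B:
   (a) Draw a bootstrap sample of size N from the training data.
   (b) Grow a tree $T_b$ on the bootstrap sample by repeating the following steps for each
       terminal node until the minimum node size is reached: select m of the p variables at
       random, pick the best variable and split point among the m, and split the node into two
       daughter nodes.
2. Output the ensemble of trees $T_1, \ldots, T_B$. The prediction at a new point x is
$$ \hat{f}^{B}_{\mathrm{rf}}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x). $$
This algorithm tries to reduce correlation by not only sampling observations (bootstrap
sampling) but also sampling which variables are considered at each split of a node, known as
feature sampling. The default for random forest regression is to choose
$m = \lfloor p/3 \rfloor$. Using a smaller value of m reduces the correlation between trees
and thus, by equation 3.5.2, reduces the variance of the average (Hastie, Tibshirani, and
Friedman 2009, p. 589).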
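Feature sampling at a single node can be sketched in a few lines of R. The helper `default_m` is hypothetical; the lower bound of 1 is an added safeguard so that at least one variable is always tried. The variable names are the predictor names from the appendix code, used here simply as a list of labels.

```r
# Number of candidate variables per split for regression: floor(p / 3),
# with a lower bound of 1 so that at least one variable is always tried
default_m <- function(p) max(floor(p / 3), 1)

default_m(7)   # 2
default_m(2)   # 1

# Feature sampling at one node: draw m of the p variables without
# replacement; only these are considered when searching for the best split
set.seed(1)
variables  <- c("Datum", "MilkingAm", "DIM", "DIM2", "DIM3", "DIMdel", "Djurnr")
m          <- default_m(length(variables))
candidates <- sample(variables, m)
candidates
```

A fresh draw is made at every node, so two trees grown on similar bootstrap samples will still tend to split on different variables.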
3.6.1 Effect of Tree depth
Tree depth is the number of levels of splits before a leaf is reached. As the depth of the
tree increases, the function described in equation 3.4.1 follows the data more and more
closely, since each of the K regions becomes narrower. This leads to a closer fit to the
training data for each tree. Trees that are too shallow suffer from underfitting: with only a
few regions, the tree is not able to capture the variation in the data. Trees that are too deep
have the opposite problem, called overfitting: an overfitted tree does not perform as well on
the test data as it did on the training data. The default in the randomForest package for
regression is a minimum terminal node size of 5 observations and no limit on the maximum number
of terminal nodes. Both of these parameters can be tuned for higher accuracy or lower
computation time.
3.6.2 Effect of Forest size
Increasing the forest size means averaging more and more trees together to obtain a
lower-variance estimator. Since adding more trees does not affect bias but reduces variance,
one cannot harm the fit by averaging too many trees. More trees do, however, increase
computation time, and the added benefit of each additional tree diminishes with forest size. It
is useful to look at the out-of-bag (OOB) error estimate and to stop adding trees once the
error stabilizes.
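One way to make "once the error stabilizes" operational is sketched below. The helper `stable_forest_size`, its `window` and `tol` arguments, and the error curve are all hypothetical. For a regression forest fitted with the randomForest package, the `mse` component of the fitted object should hold the OOB error after each tree, but that is an assumption worth checking against the package documentation.

```r
# Hypothetical helper: smallest forest size b for which growing `window`
# more trees lowers the error by less than `tol` (relative change)
stable_forest_size <- function(err, window = 25, tol = 0.01) {
  n <- length(err)
  if (n <= window) return(NA_integer_)
  for (b in seq_len(n - window)) {
    rel_gain <- (err[b] - err[b + window]) / err[b]
    if (rel_gain < tol) return(b)
  }
  NA_integer_   # the curve has not levelled out: grow more trees
}

# Made-up error curve that decays towards 1.2 as trees are added
trees <- 1:500
err   <- 1.2 + 2 / trees
stable_forest_size(err)
```

A return value of `NA` corresponds to a curve that has not yet levelled out at the right edge of the plot, in which case more trees should be grown.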
4 Results
4.1 Descriptive Statistics
The average milk yield per day is plotted in figure 3 (a)-(d). The lactation curves for the
four teats look fairly similar; the rear teats seem to release more milk than the front ones,
but this is the only obvious difference. The lactation curves are very steep in the beginning,
which is reasonable given the need to provide the calf with milk. The curves continue to climb
until around the 50-day mark and then slowly decline. After around 400 days all of the graphs
(a)-(d) show erratic behavior, which is explained by graph (e): there are not many milking
events past this point, which makes the average less stable than before and causes the sudden
jumps in the curves.
[Figure 3: panels (a)-(e)]
4.2 Variable evaluation
Figure 4 visualizes variable importance for the random forest regression model. %IncMSE is the
increase in Mean Squared Error when the information in the variable in question is removed by
randomly permuting its values. The error is calculated as the out-of-bag error for the ensemble
of trees. Increased node purity (IncNodePurity) measures how much the splits on the variable in
question reduce node impurity, which for regression is measured by the Residual Sum of Squares
(RSS). These measures are similar, but %IncMSE is generally the preferred measure since it is
easier to interpret. The decrease in RSS is also the criterion the method uses to choose
splits.
The most important variables across all teats are Milking (whether the cow was milked in the
morning or in the afternoon), Date (date of the milking event) and Cow (animal number). Table 1
displays the values for each of these variables. The table for increased node purity can be
found in Appendix I.
Figure 5 displays the MSE of the forest for an increasing number of trees. The graphs show that
the error seems to approach a stable level and that somewhere around 100-200 trees might be a
sufficient forest size to achieve close to maximum accuracy without spending unnecessary
computational power on more trees. If the graph has not yet leveled out when reaching the far
right of the x-axis, it is advisable to increase the number of trees in the forest.
Figure 4: Percent decrease in MSE and node purity when a variable is included for
consideration in a tree.
4.3 Model evaluation
As tables 2 and 3 show, the random forest method outperforms the standard OLS method, with on
average a 26% lower MSE and a 14% lower RMSE. Both represent a substantial decrease in model
error. The MSE and RMSE values for the OLS model were computed by Lilja and Keteris Eckerstedt
(2016). It is also worth noting that the OLS model was fitted on months 4-9 and tested on
months 10-11, whereas the random forest regression model was trained on months 1-7 and tested
on months 8-9. The data for months 10-11 was unfortunately lost at the time of writing this
thesis, so a more direct comparison cannot be made.
Table 1: Percent decreased Mean Squared Error when a variable is included in analysis for
each variable and each teat.
Table 2: MSE per teat for the random forest and OLS models.
                      LF          RF          LR          RR          Average
randomForest          1.192044    1.104153    1.803447    1.648640    1.437071
OLS                   1.550466    1.517615    2.336136    2.424136    1.957088
Absolute Difference   -0.358422   -0.413462   -0.532689   -0.775496   -0.520017
Percent Decrease      23.12%      27.24%      22.80%      31.99%      26.29%

Table 3: RMSE per teat for the random forest and OLS models.
                      LF          RF          LR          RR          Average
randomForest          1.091808    1.050787    1.342925    1.283994    1.192379
OLS                   1.245177    1.231915    1.527966    1.555905    1.390241
Absolute Difference   -0.153369   -0.181128   -0.185041   -0.271911   -0.197862
Percent Decrease      12.32%      14.70%      12.11%      17.48%      14.15%
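The percent decreases in the tables can be recomputed from the MSE values as a check. The sketch below copies the per-teat MSE values from the first table; the helper name `pct_decrease` is hypothetical.

```r
# MSE per teat, copied from the MSE table above
mse_rf  <- c(LF = 1.192044, RF = 1.104153, LR = 1.803447, RR = 1.648640)
mse_ols <- c(LF = 1.550466, RF = 1.517615, LR = 2.336136, RR = 2.424136)

# Relative decrease in error when going from OLS to random forest, in percent
pct_decrease <- function(new, old) 100 * (old - new) / old

pct_mse  <- pct_decrease(mse_rf, mse_ols)
pct_rmse <- pct_decrease(sqrt(mse_rf), sqrt(mse_ols))

round(pct_mse, 2)          # 23.12 27.24 22.80 31.99
round(mean(pct_mse), 2)    # 26.29, the mean of the per-teat percentages
round(mean(pct_rmse), 2)   # 14.15

# Comparing the averaged errors instead gives a slightly different figure
round(pct_decrease(mean(mse_rf), mean(mse_ols)), 2)   # 26.57
```

The reported averages of 26.29% and 14.15% thus correspond to the mean of the four per-teat percentages rather than to the percent decrease of the averaged errors.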
5 Conclusions
Random forests seem to outperform OLS when it comes to predicting lactation curves. The
average decrease in MSE when using random forests instead of OLS is 26.3%. The data used to
estimate the OLS and random forest models are not exactly the same, but a large portion of the
data does overlap.
The most important variables for determining milk yield are whether the cow was milked during
the morning or the afternoon, the date of the milking (likely explained by seasonal or other
day-to-day variation) and which cow was being milked. Removing any one of these from the
analysis would increase MSE by over 200%.
Default values were used when performing the random forest analysis. In future work, a good
starting point for tuning parameters would be a forest size between B = 100 and B = 200,
starting with the smaller value in the interest of computational time and then increasing the
number of trees if the OOB error does not seem stable. More variables could also be included
for testing: since random forests have a powerful variable selection feature built in,
including seemingly irrelevant variables should not have a noticeable effect on prediction
performance.
For a more direct comparison with Lilja and Keteris Eckerstedt (2016), the OLS regression could
be rerun on the same dataset as the random forest to obtain more comparable results.
Finally, Bayesian methods could be used for tuning the depth, number of trees and number of
features of the random forest to make the process more automated.
References
Cutler, Fortran original by Leo Breiman and Adele and R. port by Andy Liaw and Matthew
Wiener (2015). randomForest: Breiman and Cutler’s Random Forests for Classification
and Regression. url: https://cran.r-project.org/web/packages/randomForest/index.html.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman (2009). The Elements of Statistical
Learning. Springer Series in Statistics. DOI: 10.1007/978-0-387-84858-7. New York, NY:
Springer New York. isbn: 978-0-387-84857-0 978-0-387-84858-7. url:
http://link.springer.com/10.1007/978-0-387-84858-7.
Lilja, Mathias and Ilse Keteris Eckerstedt (2016). The influence of teat wash failure on milk
yield in dairy cows. url: http://www.diva-portal.org/smash/record.jsf?pid=diva2:901092.
6 Appendix I: Tables
library("plyr")
set.seed(1111)
# Fixing problem with several rows for the same
# milking event depending on the number of calves
# for the specific cow
merged.data <- subset(merged.data, Datum > Calving.Date)
# Function for keeping the maximum value
# Give every month the same column names as dok04
col.names <- colnames(dok04)
colnames(dok05) <- col.names
colnames(dok06) <- col.names
colnames(dok07) <- col.names
colnames(dok08) <- col.names
colnames(dok09) <- col.names
# Combining the different months into one
detach(package:plyr)
library(dplyr)
test <- rbind(dok08, dok09)
# Excluding all observations where something
# (except for teatwash) went bad
test <- test[is.na(test$Avspark), ]
test <- test[is.na(test$Ofullstndig), ]
test <- test[is.na(test$Spenar.hittas.ej), ]
# Several errors for the same milking event are
# possible
test <- test %>% arrange(Djurnr, Datum, MilkingAm)
i <- 2
while (i <= nrow(test)) {
  same.event <- test$Djurnr[i] == test$Djurnr[i - 1] &&
    test$Datum[i] == test$Datum[i - 1] &&
    test$MilkingAm[i] == test$MilkingAm[i - 1]
  if (same.event) {
    # Keep the maximum of each error dummy, then drop the duplicate row
    test$dummyVF[i - 1] <- max(test$dummyVF[i], test$dummyVF[i - 1])
    test$dummyHF[i - 1] <- max(test$dummyHF[i], test$dummyHF[i - 1])
    test$dummyVB[i - 1] <- max(test$dummyVB[i], test$dummyVB[i - 1])
    test$dummyHB[i - 1] <- max(test$dummyHB[i], test$dummyHB[i - 1])
    test <- test[-i, ]
  } else {
    i <- i + 1
  }
}
test$Djurnr <- as.numeric(test$Djurnr)
############################## dataset test manipulation complete
# install.packages('randomForest')
library(randomForest)
library(gplots)
set.seed(1111)
# Run random forest regression on each of the
# teats.
forestVF <- randomForest(VF.4 ~ Datum + MilkingAm +
DIM + DIM2 + DIM3 + DIMdel + MilkingAm * DIM +
MilkingAm * DIM2 + MilkingAm * DIM3 + MilkingAm *
DIMdel + Djurnr, data = train, importance = TRUE,
keep.forest = TRUE)
forestHF <- randomForest(HF.4 ~ Datum + MilkingAm +
DIM + DIM2 + DIM3 + DIMdel + MilkingAm * DIM +
MilkingAm * DIM2 + MilkingAm * DIM3 + MilkingAm *
DIMdel + Djurnr, data = train, importance = TRUE,
keep.forest = TRUE)
forestVB <- randomForest(VB.4 ~ Datum + MilkingAm +
DIM + DIM2 + DIM3 + DIMdel + MilkingAm * DIM +
MilkingAm * DIM2 + MilkingAm * DIM3 + MilkingAm *
DIMdel + Djurnr, data = train, importance = TRUE,
keep.forest = TRUE)
forestHB <- randomForest(HB.4 ~ Datum + MilkingAm +
DIM + DIM2 + DIM3 + DIMdel + MilkingAm * DIM +
MilkingAm * DIM2 + MilkingAm * DIM3 + MilkingAm *
DIMdel + Djurnr, data = train, importance = TRUE,
keep.forest = TRUE)
# Number of trees
plot(forestVF)
plot(forestVB)
plot(forestHF)
plot(forestHB)
# MSE:
mean((predict(forestVF, test) - test$VF.4)^2)
mean((predict(forestVB, test) - test$VB.4)^2)
mean((predict(forestHF, test) - test$HF.4)^2)
mean((predict(forestHB, test) - test$HB.4)^2)
# RMSE:
sqrt(mean((predict(forestVF, test) - test$VF.4)^2))
sqrt(mean((predict(forestVB, test) - test$VB.4)^2))
sqrt(mean((predict(forestHF, test) - test$HF.4)^2))
sqrt(mean((predict(forestHB, test) - test$HB.4)^2))
# importance
importance(forestVF)
importance(forestHF)
importance(forestVB)
importance(forestHB)
# importance plots
varImpPlot(forestVF)
varImpPlot(forestHF)
varImpPlot(forestVB)
varImpPlot(forestHB)
# End of code