
Can methods of machine learning be used to better predict lactation curves for bovines?


Andreas Östling

Supervisor: Måns Thulin


Department of Statistics
Uppsala University
Abstract
A random forest is compared to an OLS model for predicting lactation curves for cows.
Both methods have been estimated and tested using data from the period 2015-01 to
2015-09. Random forests outperform OLS on the test data, decreasing MSE by approximately
26% compared to the OLS model. The data is provided by Sveriges Lantbruksuniversitet
and includes 75 558 milking events from 320 cows. The date of the milking, the time of
day when the milking occurred and which cow was milked were found to be important
variables for accurate predictions.
Contents
1 Introduction

2 Data

3 Method
3.1 OLS Model
3.2 Decision Tree
3.3 Regression Tree
3.4 Growing a Regression Tree
3.5 Bootstrap Aggregation (Bagging)
3.6 Random Forest
3.6.1 Effect of Tree depth
3.6.2 Effect of Forest size

4 Results
4.1 Descriptive Statistics
4.2 Variable evaluation
4.3 Model evaluation

5 Conclusions

6 Appendix I: Tables

7 Appendix II: R Code


1 Introduction
Agriculture has always had a close relationship with statistical analysis, whether the goal is to
produce the maximum amount of wheat from a field or to find the correct treatment for sick animals.
This thesis aims to better understand and predict lactation curves for cows. How much
milk a cow produces depends on many factors, such as feed, time of year and how long it has
been since the cow last calved. With both profit and animal health in mind, it is useful to be able to
predict these curves in order to maximize utility, even out production and provide better care for the
animals.
Previous work has used mixed linear models to estimate the relationship between milk
production and the available variables. The aim of this thesis is to answer the question: Can
methods of machine learning be used to better predict lactation curves for bovines?
The field of machine learning is vast, and random forests were chosen as the tool to answer
this question. Random forests have several advantages, the most prominent being
that they are easy to implement and that it is much easier to understand what the method does
in each step compared to other "black-box" methods such as neural networks.

2 Data
The data consists of 75 558 milking events spread over nine months, starting 1 January 2015 and
ending 1 October 2015. For each event, several variables were recorded, including but not limited
to start time, animal id number, duration of the milking event, amount of milk and milk flow. Not all
of these variables are relevant for the particular application of this thesis, but they can
be interesting for other applications. Errors have also been recorded for each event. Events
where an error relating to milk yield occurred are treated as missing data and are
excluded from the analysis. The data has been split into training data and testing data: the first seven
months are used as training data and the last two as testing data. The data is provided by the
Swedish University of Agricultural Sciences (Sveriges Lantbruksuniversitet).

3 Method

3.1 OLS Model

Lilja and Keteris Eckerstedt (2016, p. 10) describe the variables and the most commonly used
model for lactation curves as follows:
y_{ij} = \alpha_1(\mathrm{Date}_i) + \alpha_2(\mathrm{Milking}_i) + \beta_1 \mathrm{DIM}_i + \beta_2 \mathrm{DIM}_i^2 + \beta_3 \mathrm{DIM}_i^3 + \beta_4 \frac{1}{\mathrm{DIM}_i}
        + \beta_5(\mathrm{Milking}_i)\mathrm{DIM}_i + \beta_6(\mathrm{Milking}_i)\mathrm{DIM}_i^2 + \beta_7(\mathrm{Milking}_i)\mathrm{DIM}_i^3
        + \beta_9(\mathrm{Milking}_i)\frac{1}{\mathrm{DIM}_i} + \alpha(\mathrm{Cow}_i) + \epsilon_{ij}    (3.1.1)

y_{ij}: Observed milk yield (kg)
Cow_i: Identification number for the specific cow
Date_i: Date of milking
DIM_i: Days in milk (number of days since calving)
Milking_i: Classification of milking according to time of day (AM or PM)
\epsilon_{ij}: Residual (kg)
i = observed cow
j ∈ {left front teat, right front teat, left rear teat, right rear teat}

”The variable days in milk (DIM) is defined as the number of days since latest calving. DIM
is not included in the original data but is calculated as the difference in days between most
recent calving date and the date for the milking event. DIM follows a characteristic curve
depending on time since calving, which is why different functions of this variable also are
included in the model. Milking is a fixed effect which allows for different levels for milk yield
as well as allowing for different slopes depending on time of day when the milking procedure
occurred. The variable date is included to control for day-specific occurrences. The effect of
the specific cow is modelled as a random effect.”
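As an illustration, a minimal sketch of how a model of this form could be fitted in R is given below. This is not the code used by Lilja and Keteris Eckerstedt (2016); the column names (VF.4, Datum, MilkingAm, DIM, Djurnr) are assumed to match the variables created in Appendix II, and the cow effect is here treated as fixed for simplicity, whereas the original specification models it as a random effect (which would require a mixed-model package such as lme4).

# Sketch of a least-squares fit corresponding to equation 3.1.1 for the left
# front teat, with the cow effect as a fixed factor (an assumption; the
# original model uses a random cow effect)
ols_lf <- lm(VF.4 ~ factor(Datum) + MilkingAm + factor(Djurnr) +
               DIM + I(DIM^2) + I(DIM^3) + I(1/DIM) +
               MilkingAm:DIM + MilkingAm:I(DIM^2) +
               MilkingAm:I(DIM^3) + MilkingAm:I(1/DIM),
             data = train)
summary(ols_lf)  # estimates of the alpha and beta terms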

3.2 Decision Tree

A decision tree is a series of nodes, each asking a yes or no question about one of the features
in the dataset. If the answer is yes then one branch is followed, if the answer is no then the
other branch is followed. This is repeated until the end point is reached. The end point is
called a leaf. An example of this is found in figure 1.

Figure 1: Example of a decision tree. Is a customer likely to buy a computer from a company
or not?

In this example, a company wants to know whether a person in their store is likely to buy a
computer. The first variable the tree looks at is age. If the person is between 18 and 25,
the tree predicts that they will buy the computer. If the person is not between these ages,
further questions are asked until a leaf node is reached.
Decision trees are a very good off-the-shelf procedure for data mining, with relatively fast
computation times and possibilities for interpretation as long as the trees are small. Trees
also incorporate both numerical and categorical predictor variables (features) seamlessly
and handle missing values very well. Solutions are also invariant under strictly monotone
(increasing or decreasing) transformations of the inputs, which is a very useful property for
estimators of any kind. Many other machine learning methods, such as nearest-neighbour
methods or support vector machines, do not have this invariance property; linear regression,
the most common method of approximating relationships in data, is only invariant under
linear transformations. Feature selection is also part of the procedure of building a tree, which
makes trees highly insensitive to irrelevant features/variables (Hastie, Tibshirani, and Friedman
2009, p. 352).
All of these properties together are what make decision trees such a good learner. The main
downside of decision trees is that individually they are not very accurate: they have high
variance. This can be remedied by boosting as well as by bootstrap aggregation (bagging),
which is explained in more detail in section 3.5.
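To make the idea concrete, the sketch below grows a small classification tree with the rpart package on made-up data mimicking the computer-store example in figure 1; the variables age, income and buys_computer are hypothetical and not part of the data used in this thesis.

# Hypothetical data resembling the example in figure 1
library(rpart)

set.seed(1)
shoppers <- data.frame(
    age = sample(18:70, 200, replace = TRUE),
    income = factor(sample(c("low", "medium", "high"), 200, replace = TRUE))
)
# Made-up rule: young customers tend to buy a computer
shoppers$buys_computer <- factor(ifelse(shoppers$age <= 25, "yes",
    sample(c("yes", "no"), 200, replace = TRUE, prob = c(0.3, 0.7))))

# Each internal node asks a yes/no question about one feature; each leaf holds
# a predicted class
tree_fit <- rpart(buys_computer ~ age + income, data = shoppers, method = "class")
print(tree_fit)
predict(tree_fit, data.frame(age = 22, income = "high"), type = "class")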

3.3 Regression Tree

The next step, once we understand the basics of a decision tree, is to move on to regression
trees. Regression trees are used to model a relationship between input variables and a
numerical outcome variable, in other words to create a regression function. Once a leaf is
reached a model is fitted, and once all of the leaves are computed the tree is complete
and can be used to predict the outcome variable for new observations. There is
a choice of what model to fit in each leaf; one could imagine a linear or quadratic
model. The most common approach, however, is to calculate c_k from equation 3.4.1, which is to
take the average of all observations in each of the new regions after a split. This most
common method is visualized in figure 2.

Figure 2: Example of a regression tree with depth 1 (left) and depth 2 (right)

In the left panel a simple tree (commonly known as a stump) is fitted to some training data.
We see that the split is made somewhere between 0.8 and 1.0. A constant is fitted in each
region, and the combination of the left and right regions is now our estimate of the regression
function. The interpretation is that if the value on the x-axis of a new observation is to the left
of the split, the predicted value on the y-axis would be somewhere around −45. The right panel
shows a tree where another split has been made: the first split is at the same point, and each of
those regions is in turn split into two regions, so the regression model now consists of four
horizontal lines. The tree to the right looks like a much better estimate of what the data looks like.
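A small sketch of the same idea in R, on simulated step-shaped data chosen to resemble figure 2 (an assumption; the figure in this thesis was not necessarily produced this way). A stump (one split) and a depth-two tree are fitted with the rpart package, and each leaf predicts the average of its region as in equation 3.4.2.

library(rpart)

set.seed(2)
x <- runif(200, 0, 2)
y <- ifelse(x < 0.9, -45, 10) + rnorm(200, sd = 5)  # step-shaped training data
d <- data.frame(x = x, y = y)

# Depth 1 ("stump"): one split, two constant regions
stump <- rpart(y ~ x, data = d, method = "anova",
               control = rpart.control(maxdepth = 1, cp = 0))
# Depth 2: each region is split again, giving up to four constant regions
tree2 <- rpart(y ~ x, data = d, method = "anova",
               control = rpart.control(maxdepth = 2, cp = 0))

# The fitted regression functions are piecewise constant (horizontal lines)
ord <- order(x)
plot(x, y)
lines(x[ord], predict(stump, d)[ord], col = "red")
lines(x[ord], predict(tree2, d)[ord], col = "blue")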

3.4 Growing a Regression Tree

The following description is based on chapter 9.2 in The Elements of Statistical Learning
(Hastie, Tibshirani, and Friedman 2009).

A regression tree from data with p inputs and a response, for each of N observations, is grown
in the following manner. The data consist of

(x_i, y_i), \quad i = 1, 2, \dots, N, \quad \text{with} \quad x_i = (x_{i1}, x_{i2}, \dots, x_{ip}).

The goal is to split the data into K regions and in each region fit a constant to the observations
in that region. The fitted function is then

f(x) = \sum_{k=1}^{K} c_k I(x \in R_k)    (3.4.1)

where I(\cdot) is the indicator function, equal to 1 when x is within the region R_k and zero
otherwise, R_1, R_2, \dots, R_K are the regions, and c_k is the constant minimizing the sum of squares
\sum (y_i - f(x_i))^2 within region R_k. This means that the estimator for this region is

\hat{c}_k = \mathrm{average}(y_i \mid x_i \in R_k)    (3.4.2)

which is the mean of y within the region R_k.
Finding the best partition R_1, R_2, \dots, R_K all at once is very computationally intensive,
borderline infeasible. The best we can do is adopt a greedy method to partition the data. This is done
by splitting one node at a time and growing the tree that way. Starting with a splitting variable
j and split point s, we define the half-planes

R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}    (3.4.3)

and then seek the splitting variable j and split point s that solve

\min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big].    (3.4.4)

For any choice of j and s, the inner minimization is solved by

\hat{c}_1 = \mathrm{average}(y_i \mid x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \mathrm{average}(y_i \mid x_i \in R_2(j, s)).    (3.4.5)

In other words, we make the split which minimizes the sum of the squared errors for a
constant fit in each region. This is computed for every variable and every point x_i until the
best pair (j, s) is found. Each split is relatively fast to compute.
This process is then performed recursively for each node until the entire tree is built. The
process continues until it reaches some stopping criterion. The default in the
commonly used R package randomForest (Cutler and Wiener 2015) is to keep splitting until the
minimum node size of 5 observations is reached. There are other methods
for stopping, but they will not be discussed here.
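The greedy search over (j, s) in equations 3.4.3-3.4.5 can be written out directly. The function below is a bare-bones sketch of a single split search (it is not the implementation used by randomForest, and it is written for clarity rather than speed).

# Find the best split (j, s) by minimizing the sum of squared errors of a
# constant fit in each half-plane, as in equation 3.4.4
best_split <- function(X, y) {
    best <- list(j = NA, s = NA, sse = Inf)
    for (j in seq_len(ncol(X))) {
        for (s in sort(unique(X[, j]))) {
            left <- y[X[, j] <= s]
            right <- y[X[, j] > s]
            if (length(left) == 0 || length(right) == 0) next
            # Inner minimization: the constants are the region means (eq. 3.4.5)
            sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
            if (sse < best$sse) best <- list(j = j, s = s, sse = sse)
        }
    }
    best
}

# Example on simulated data: the search should pick j = 1 and s close to 0.5
set.seed(3)
X <- matrix(runif(200), ncol = 2)
y <- ifelse(X[, 1] > 0.5, 3, -3) + rnorm(100)
best_split(X, y)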

3.5 Bootstrap Aggregation (Bagging)

Before introducing the concept of bagging we first have to learn about bootstrapping. Suppose
we start out with a training set Z = (z_1, z_2, \dots, z_N), where z_i = (x_i, y_i), and fit a model to
this set. We now want to assess the accuracy of the model fit. Bootstrapping is a method
to achieve this when there is no out-of-sample data to test on. The bootstrapping process
is to randomly sample observations with replacement from the training set Z and then
refit the proposed model to each of these samples, so that the bootstrap sampling distribution
can be used, for example, to construct a confidence interval for the model. The most common
choice is to sample as many observations as there are in the original dataset; it is also possible
to sample fewer (undersampling) or more (oversampling). Once we have our bootstrap samples we want
to reduce the variance of our estimator as much as possible. The average of B independent,
identically distributed random variables, each with variance \sigma^2, has variance

\frac{1}{B}\sigma^2.    (3.5.1)

If the variables are identically distributed, but not necessarily independent, with pairwise
correlation \rho, then the variance of the average is

\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2.    (3.5.2)

This expression simplifies to

\rho\sigma^2    (3.5.3)

when B grows large. In other words, the variance will not approach zero as more and more
bootstrapped estimators are averaged, as long as the bootstrap samples are correlated. A remedy
to this problem follows in the next section, where we discuss random forests, a relative of
bagging in which the correlation between the trees is reduced using additional randomness.
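The following is a minimal sketch (not the procedure used for the results in this thesis) of bagging regression trees in R: B bootstrap samples are drawn with replacement, a tree is fitted to each with rpart, and the predictions are averaged. The averaged estimator is the one whose variance is described by equation 3.5.2.

library(rpart)

set.seed(4)
n <- 200
d <- data.frame(x = runif(n, 0, 2))
d$y <- sin(3 * d$x) + rnorm(n, sd = 0.3)  # simulated training data

B <- 100
trees <- vector("list", B)
for (b in 1:B) {
    # Bootstrap: sample n observations with replacement from the training set
    boot <- d[sample(n, n, replace = TRUE), ]
    trees[[b]] <- rpart(y ~ x, data = boot, method = "anova")
}

# Bagged prediction: average the B tree predictions at new points
new_x <- data.frame(x = seq(0, 2, by = 0.1))
pred_matrix <- sapply(trees, predict, newdata = new_x)
bagged_pred <- rowMeans(pred_matrix)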

3.6 Random Forest

The idea behind random forest regression is to grow many uncorrelated regression trees and
average them. The random forest algorithm (Hastie, Tibshirani, and Friedman 2009,
p. 588) is shown in algorithm 1:

Algorithm 1 Random Forest for Regression

1. For b = 1 to B:

   (a) Draw a bootstrap sample Z of size N from the training data.

   (b) Grow a random-forest tree T_b to the bootstrapped data, by recursively repeating
       the following steps for each terminal node of the tree, until the minimum node
       size n_min is reached.
       i. Select m variables at random from the p variables.
       ii. Pick the best variable/split-point among the m.
       iii. Split the node into two daughter nodes.

2. Output the ensemble of trees \{T_b\}_1^B.

To make a prediction at a new point x:

\hat{f}^B_{rf}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)

This algorithm reduces correlation not only by sampling observations (bootstrap sampling)
but also by sampling which variables should be considered at each split of a node, known as
feature sampling. The default value for random forest regression is m = ⌊p/3⌋. Using a smaller
value for m will reduce the correlation between trees and thus, by equation 3.5.2, reduce the
variance of the average (Hastie, Tibshirani, and Friedman 2009, p. 589).
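In the randomForest package the feature-sampling parameter m corresponds to the mtry argument. The sketch below is a simplified version of the calls in Appendix II (main effects only, with parameter values written out explicitly for illustration):

library(randomForest)

# Random forest for the left front teat with m = floor(p/3); with the p = 7
# predictors below this gives mtry = 2, which is also the package default for
# regression
forestVF <- randomForest(VF.4 ~ Datum + MilkingAm + Djurnr +
                           DIM + DIM2 + DIM3 + DIMdel,
                         data = train,
                         mtry = floor(7/3),  # variables tried at each split
                         ntree = 500,        # number of trees B (the default)
                         importance = TRUE)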

3.6.1 Effect of Tree depth

Tree depth is the number of levels of splits before a leaf is reached. As the depth of the
tree increases, the function described in equation 3.4.1 becomes a finer and finer approximation
(closer to a smooth function), since each of the K regions becomes thinner and thinner. This leads
to a closer fit to the training data for each tree. Trees that are too shallow will have problems with
underfitting: the tree will not be able to capture the variation in the data if only a few constants are
fitted. Trees that are too deep have the opposite problem, called overfitting. An overfitted tree will
not perform as well on the test data as it did on the training data. The default in the randomForest
package is to keep splitting until the minimum terminal node size of 5 observations is reached,
with no limit on the maximum number of terminal nodes. Both of these parameters can be tuned
for higher accuracy or lower computation time.
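In randomForest these two parameters are exposed as nodesize and maxnodes. A short sketch of how they could be changed from their defaults (the values below are only illustrative, not the settings used in this thesis):

# Shallower trees: require larger terminal nodes and cap the number of leaves,
# trading some fit for lower computation time and less risk of overfitting
shallow_forest <- randomForest(VF.4 ~ Datum + MilkingAm + Djurnr +
                                 DIM + DIM2 + DIM3 + DIMdel,
                               data = train,
                               nodesize = 20,  # default for regression is 5
                               maxnodes = 64)  # default is no limit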

3.6.2 Effect of Forest size

Increasing the forest size means averaging more and more trees together to obtain a lower-variance
estimator. Since adding more trees reduces variance without affecting bias, one cannot
average too many trees. More trees do, however, increase computation time, and the added
benefit of calculating a larger number of trees diminishes with forest size. It is useful to look
at the OOB (out-of-bag) error estimate and to stop adding trees once the error stabilizes.
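For a regression forest fitted with randomForest, the out-of-bag MSE after each added tree is stored in the fitted object, so this check can be made directly (a sketch, assuming a fitted forest such as forestVF from Appendix II):

# OOB mean squared error as a function of the number of trees
plot(forestVF$mse, type = "l",
     xlab = "Number of trees", ylab = "OOB MSE")
plot(forestVF)  # equivalent built-in plot, used in Appendix II
# If the curve is flat over, say, the last 100 trees, adding more trees is
# unlikely to improve accuracy noticeably
tail(forestVF$mse, 5)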

4 Results

4.1 Descriptive Statistics

The average milk yield per day is plotted in figure 3 (a)-(d). The lactation curves for each
of the four teats look fairly similar; the rear teats seem to release more milk than the
front ones, but this is the only obvious difference. The lactation curves are very steep in the
beginning, which is reasonable in order to provide the calf with milk. The curve continues
to climb until around the 50-day mark, after which it slowly declines. After around 400 days
all of the graphs (a)-(d) show some erratic behaviour, which can be explained by looking at
graph (e): there are not many milking events past this point, which makes the average
less stable than before and causes the sudden jumps in the curve.

4.2 Variable evaluation

In figure 4 we see graphs visualizing variable importance for the random forest regression
model. The graphs showing %IncMSE display the increase in out-of-bag mean squared error
when the values of the variable in question are randomly permuted, averaged over all trees in
the forest; a large value indicates that the forest relies heavily on that variable.
Figure 3: Milk yield of each teat (panels (a)-(d)) as well as the distribution of days in milk (panel (e)).

Increased node purity (IncNodePurity) measures the total reduction in the residual sum of squares
(RSS) obtained from splits on the variable, summed over all trees; the reduction in RSS is also the
criterion the method uses when choosing splits. The two measures are similar, but %IncMSE is
generally the preferred measure since it is easier to interpret.
The most important variables across all teats are Milking (whether the cow was milked in the
morning or afternoon), Date (the date of the milking event) and Cow (animal number). Table 1
displays the importance values for each of these variables. The table for increased node purity
can be found in Appendix I.
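The values reported in table 1 and in Appendix I can be extracted from a fitted forest with the importance() function, as in the sketch below (this assumes the forests from Appendix II, which were fitted with importance = TRUE):

# type = 1 gives the permutation importance (%IncMSE),
# type = 2 gives the increase in node purity (IncNodePurity)
importance(forestVF, type = 1)
importance(forestVF, type = 2)
varImpPlot(forestVF)  # graphical summary of both measures, as in figure 4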
Figure 5 displays the MSE of the forest as the number of trees increases. The graphs all show
that the error seems to reach a stable level asymptotically, and that somewhere around
100-200 trees might be a sufficient forest size to achieve close to maximum accuracy while not
spending unnecessary computational power on calculating more trees. If the graph has not
levelled out by the far right of the x-axis, it is advisable to increase the number of trees when
calculating the forest.

Figure 4: %IncMSE and increase in node purity for each variable, for each of the four teats (panels (a)-(d)).

4.3 Model evaluation

As we can see in both table 2 and table 3, the random forest method outperforms the
standard OLS method, with an average decrease of 26% in MSE and 14% in RMSE. Both of these
values represent quite a substantial decrease in model error. The MSE and RMSE values for the
OLS model are taken from Lilja and Keteris Eckerstedt (2016). It is further worth
noting that the OLS model was fitted on months 4-9 and tested on months 10-11, while the
random forest regression model was trained on months 1-7 and tested on months 8-9. The
data for months 10-11 was unfortunately lost at the time of writing this thesis, and a more
direct comparison can therefore not be made.
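The test-set errors in tables 2 and 3 are computed as in the following sketch for the left front teat (mirroring the code in Appendix II):

# MSE and RMSE on the held-out test months for the left front teat
pred_VF <- predict(forestVF, newdata = test)
mse_VF <- mean((pred_VF - test$VF.4)^2)
rmse_VF <- sqrt(mse_VF)
c(MSE = mse_VF, RMSE = rmse_VF)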

Table 1: Variable importance (%IncMSE, the increase in out-of-bag MSE when the variable is
permuted) for each variable and each teat.

Left Forward Right Forward Left Rear Right Rear


Milking 299.64 389.16 398.20 428.49
Date 277.87 270.47 221.94 259.64
Cow 254.27 256.13 220.19 253.87
1/DIM 42.59 37.84 37.21 36.02
DIM2 41.05 37.53 38.50 34.59
DIM 40.69 37.15 39.43 35.45
DIM3 40.14 36.33 38.34 32.33

Figure 5: Number of trees versus the error of the forest, one panel ((a)-(d)) per teat.

Table 2: MSE for randomForest regression versus OLS model

LF RF LR RR Average
randomForest 1.192044 1.104153 1.803447 1.648640 1.437071
OLS 1.550466 1.517615 2.336136 2.424136 1.957088
Absolute Difference -0.358422 -0.413462 -0.532689 -0.775496 -0.520017
Percent Decrease 23.12% 27.24% 22.80% 31.99% 26.29%

Table 3: RMSE for randomForest regression versus OLS model

LF RF LR RR Average
randomForest 1.091808 1.050787 1.342925 1.283994 1.192379
OLS 1.245177 1.231915 1.527966 1.555905 1.390241
Absolute Difference -0.153369 -0.181128 -0.185041 -0.271911 -0.197862
Percent Decrease 12.32% 14.70% 12.11% 17.48% 14.15%

5 Conclusions
Random forests seem to outperform OLS when it comes to predicting lactation curves. The
average decrease in MSE when using random forests instead of OLS is 26.3%. The data used to
estimate the OLS and random forest models is not exactly the same, but a large portion
of the data does overlap.
The most important variables for determining milk yield are whether the cow was milked during
the morning or afternoon, the date of the milking (likely explained by seasonal variation
or other day-to-day variation) and which cow was being milked. According to the importance
measure, removing any of these from the analysis would increase MSE by over 200%.
Default values were used when performing the random forest analysis. In future work, a good
starting point for tuning would be to use B = 100 to B = 200 trees, starting with the smaller
value in the interest of computation time and increasing the number of trees if the OOB error
does not seem stable. More variables could also be included for testing; since random forests
have a powerful variable selection feature built in, testing seemingly irrelevant variables should
not have a noticeable effect on prediction performance.
For a more direct comparison to Lilja and Keteris Eckerstedt (2016), the OLS regression could
be rerun on the same dataset as the random forest to obtain more comparable results.
Finally, Bayesian methods could be used for tuning the parameters tree depth, number of trees
and number of features of the random forest, to make the process more automated.

References
Cutler, Adele, Leo Breiman (Fortran original), Andy Liaw and Matthew Wiener (R port) (2015).
randomForest: Breiman and Cutler's Random Forests for Classification and Regression. R package.
url: https://cran.r-project.org/web/packages/randomForest/index.html.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman (2009). The Elements of Statistical
Learning. Springer Series in Statistics. New York, NY: Springer. doi: 10.1007/978-0-387-84858-7.
isbn: 978-0-387-84858-7. url: http://link.springer.com/10.1007/978-0-387-84858-7.
Lilja, Mathias and Ilse Keteris Eckerstedt (2016). The influence of teat wash failure on milk
yield in dairy cows. url: http://www.diva-portal.org/smash/record.jsf?pid=diva2:901092.

6 Appendix I: Tables

Table 4: Variable importance (increase in node purity, IncNodePurity) for each variable and each teat.

            Left Forward   Right Forward   Left Rear   Right Rear
Milking          8461.28         9333.18    17910.72     17689.89
Date            14930.69        15001.80    25491.96     25298.57
Cow             32481.43        31605.40    54801.64     50121.69
1/DIM            8469.49         9026.62    10285.82     15199.59
DIM2             7330.99         8596.52    12253.45     12114.23
DIM              6900.88         7915.62    13049.42     12641.00
DIM3             7285.31         9187.04    12789.96     12109.29

7 Appendix II: R Code


# install.packages('plyr')

library("plyr")

set.seed(1111)

# Data manipulation: importing two different datasets, one with variables for
# the milking process and the other one with background variables
ARM01 <- read.csv("AllaMjAMR201501.csv", skip = 1,
sep = ";", dec = ",", stringsAsFactors = FALSE,
na.strings = c("", "NA"))
lakt <- read.csv("All_laktations.csv", sep = ",", stringsAsFactors = FALSE)

# Converting factors to date (hours:minutes included to differentiate between
# different events on the same day)
ARM01$StarttidTime <- as.POSIXct(as.character(ARM01$Starttid),
format = "%Y-%m-%d %H:%M")
lakt$Calving.Date <- as.Date.character(lakt$Calving.Date)
ARM01$Datum <- as.Date.character(ARM01$Starttid)
# Merging the datasets by Animal ID
merged.data <- merge(ARM01, lakt, by.x = "Djurnr",
by.y = "Animal.Id", all.x = TRUE)

# Fixing problem with several rows for the same milking event depending on the
# number of calves for the specific cow
merged.data <- subset(merged.data, Datum > Calving.Date)
# Keep, for each milking event, the row with the most recent calving date
merged.data <- ddply(merged.data, .(Djurnr, StarttidTime),
    function(d) d[which.max(d$Calving.Date), ])
# Create a new variable (Days in milk)
merged.data$DIM <- as.numeric(merged.data$Datum - merged.data$Calving.Date)
# Then create functions of this variable (DIM^2, DIM^3, 1/DIM)
merged.data$DIM2 <- (merged.data$DIM)^2
merged.data$DIM3 <- (merged.data$DIM)^3
merged.data$DIMdel <- 1/(merged.data$DIM)

# Creating dummy for milking AM or PM (AM = TRUE, PM = FALSE)
merged.data$Hour <- format(merged.data$StarttidTime,
"%H")
merged.data$Hour <- as.numeric(merged.data$Hour)
merged.data$MilkingAm <- merged.data$Hour < 12

# Merging into final dataset
dok01 <- merged.data
# Creating the columns for the dummy teatwash
dok01$dummyVF <- 0
dok01$dummyHF <- 0
dok01$dummyVB <- 0
dok01$dummyHB <- 0
dok01 <- subset(dok01, !(is.na(VF.4) | is.na(HF.4) |
is.na(VB.4) | is.na(HB.4)))

# Once data manipulation for all months is complete, run the following code to
# create the training and testing datasets

# Same column names in all monthly datasets
names = colnames(dok04)
colnames(dok01) = names
names = colnames(dok04)
colnames(dok02) = names
names = colnames(dok04)
colnames(dok03) = names
names = colnames(dok04)

colnames(dok05) = names
names = colnames(dok04)
colnames(dok06) = names
names = colnames(dok04)
colnames(dok07) = names
names = colnames(dok04)
colnames(dok08) = names
names = colnames(dok04)
colnames(dok09) = names
# Combining the different months into one dataset

detach(package:plyr)
library(dplyr)

train <- rbind(dok01, dok02, dok03, dok04, dok05, dok06, dok07)
# Excluding all observations where something
# (except for teatwash) went bad
train <- train[is.na(train$Avspark), ]
train <- train[is.na(train$Ofullstndig), ]
train <- train[is.na(train$Spenar.hittas.ej), ]
# Several errors for the same milking event are
# possible
train <- train %>% arrange(Djurnr, Datum, MilkingAm)
i <- 2
while (i <= dim(train)[1]) {
if (train$Djurnr[i] == train$Djurnr[i - 1] & train$Datum[i] ==
train$Datum[i - 1] & train$MilkingAm[i] ==
train$MilkingAm[i - 1]) {
train$dummyVF[i - 1] <- max(train$dummyVF[i],
train$dummyVF[i - 1])
train$dummyHF[i - 1] <- max(train$dummyHF[i],
train$dummyHF[i - 1])
train$dummyVB[i - 1] <- max(train$dummyVB[i],
train$dummyVB[i - 1])
train$dummyHB[i - 1] <- max(train$dummyHB[i],
train$dummyHB[i - 1])
train <- train[-i, ]
} else {
i <- i + 1
}
}
train$Djurnr <- as.numeric(train$Djurnr)
############################## dataset train manipulation complete

test <- rbind(dok08, dok09)
# Excluding all observations where something
# (except for teatwash) went bad
test <- test[is.na(test$Avspark), ]
test <- test[is.na(test$Ofullstndig), ]
test <- test[is.na(test$Spenar.hittas.ej), ]
# Several errors for the same milking event are
# possible
test <- test %>% arrange(Djurnr, Datum, MilkingAm)
i <- 2
while (i <= dim(test)[1]) {
if (test$Djurnr[i] == test$Djurnr[i - 1] & test$Datum[i] ==
test$Datum[i - 1] & test$MilkingAm[i] == test$MilkingAm[i -
1]) {
test$dummyVF[i - 1] <- max(test$dummyVF[i],
test$dummyVF[i - 1])
test$dummyHF[i - 1] <- max(test$dummyHF[i],
test$dummyHF[i - 1])
test$dummyVB[i - 1] <- max(test$dummyVB[i],
test$dummyVB[i - 1])
test$dummyHB[i - 1] <- max(test$dummyHB[i],
test$dummyHB[i - 1])
test <- test[-i, ]
} else {
i <- i + 1
}
}
test$Djurnr <- as.numeric(test$Djurnr)
############################## dataset test manipulation complete

# Once all data manipulation is complete, run the following to perform the
# random forest regression as well as some descriptive statistics

# install.packages('randomForest')

library(randomForest)
library(gplots)

set.seed(1111)

# Run random forest regression on each of the
# teats.
forestVF <- randomForest(VF.4 ~ Datum + MilkingAm +
DIM + DIM2 + DIM3 + DIMdel + MilkingAm * DIM +
MilkingAm * DIM2 + MilkingAm * DIM3 + MilkingAm *
DIMdel + Djurnr, data = train, importance = TRUE,
keep.forest = TRUE)
forestHF <- randomForest(HF.4 ~ Datum + MilkingAm +
DIM + DIM2 + DIM3 + DIMdel + MilkingAm * DIM +
MilkingAm * DIM2 + MilkingAm * DIM3 + MilkingAm *
DIMdel + Djurnr, data = train, importance = TRUE,
keep.forest = TRUE)
forestVB <- randomForest(VB.4 ~ Datum + MilkingAm +
DIM + DIM2 + DIM3 + DIMdel + MilkingAm * DIM +
MilkingAm * DIM2 + MilkingAm * DIM3 + MilkingAm *
DIMdel + Djurnr, data = train, importance = TRUE,
keep.forest = TRUE)
forestHB <- randomForest(HB.4 ~ Datum + MilkingAm +
DIM + DIM2 + DIM3 + DIMdel + MilkingAm * DIM +
MilkingAm * DIM2 + MilkingAm * DIM3 + MilkingAm *
DIMdel + Djurnr, data = train, importance = TRUE,
keep.forest = TRUE)

# Number of trees
plot(forestVF)
plot(forestVB)
plot(forestHF)
plot(forestHB)

# MSE:
mean((predict(forestVF, test) - test$VF.4)^2)
mean((predict(forestVB, test) - test$VB.4)^2)
mean((predict(forestHF, test) - test$HF.4)^2)
mean((predict(forestHB, test) - test$HB.4)^2)

# RMSE:
sqrt(mean((predict(forestVF, test) - test$VF.4)^2))
sqrt(mean((predict(forestVB, test) - test$VB.4)^2))
sqrt(mean((predict(forestHF, test) - test$HF.4)^2))
sqrt(mean((predict(forestHB, test) - test$HB.4)^2))

# importance
importance(forestVF)
importance(forestHF)
importance(forestVB)

importance(forestHB)

# importance plots
varImpPlot(forestVF)
varImpPlot(forestHF)
varImpPlot(forestVB)
varImpPlot(forestHB)

# Plot of days in milk versus yield for the left front teat in the test data
plot(test[, "DIM"], test[, "VF.4"])
total = rbind(train, test)

# Histogram of days in milk to see the distribution
hist(total[, "DIM"], breaks = 20, xlab = "Days in milk",
main = "Histogram of Days in milk")

# Plot the means for all 4 teats
plotmeans(total[, "VF.4"] ~ total[, "DIM"], xlab = "Days in milk",
bars = FALSE, n.label = FALSE, ylim = c(0, 6),
col = "white", ccol = "blue", ylab = "Left Forward",
main = "Means of left forward teat milk yield per day after calving")
plotmeans(total[, "HF.4"] ~ total[, "DIM"], bars = FALSE,
n.label = FALSE, ylim = c(0, 6), col = "white",
ccol = "blue", xlab = "Days in milk", ylab = "Right Forward",
main = "Means of right forward teat milk yield per day after calving")
plotmeans(total[, "VB.4"] ~ total[, "DIM"], bars = FALSE,
n.label = FALSE, ylim = c(0, 6), col = "white",
ccol = "blue", xlab = "Days in milk", ylab = "Left Rear",
main = "Means of left rear teat milk yield per day after calving")
plotmeans(total[, "HB.4"] ~ total[, "DIM"], bars = FALSE,
n.label = FALSE, ylim = c(0, 6), col = "white",
ccol = "blue", xlab = "Days in milk", ylab = "Right Rear",
main = "Means of right rear milk yield per day after calving")

# xlab and ylab didn't work for the plots above; print another one just to be
# able to copy the labels into an image editor
plotmeans(HB.4 ~ DIM, data = total, ylim = c(0, 6),
    col = "white", ccol = "blue", xlab = "Days in milk",
    ylab = "Mean Yield",
    main = "Means of right rear teat milk yield per day after calving")

# End of code

