
SOR 0511

Statistics in Bioinformatics

Project: Part 2

Compiled by: Tristan Camilleri (254280M)

Department of Statistics and Operations Research


University of Malta

20 June 2021
Table of Contents
Q1 – Principal Component Analysis.......................................................................................................3
Appendix I – PCA code.......................................................................................................................7
Q2 – Cluster analysis..............................................................................................................................9
Value of k.........................................................................................................................................10
Hierarchical clustering.....................................................................................................................12
Practical applications of clustering analysis to Bioinformatics.........................................................13
Appendix I........................................................................................................................................14
Q3. Experimental design......................................................................................................................16
a. Genomic size in crustaceans....................................................................................................16
Appendix I – R-script....................................................................................................................20
b. Leukemia treatments – gene expression.................................................................................22
Appendix I – R-script....................................................................................................................25
c. Power analysis.........................................................................................................................26
d. Complete block design.............................................................................................................26
Appendix I – R-script....................................................................................................................28
Q4 – Regularization in Regression.......................................................................................................29
a. Model fitting............................................................................................................................29
Ridge Regression Model..............................................................................................................30
LASSO Regression Model.............................................................................................................30

Table 1 Importance of components.......................................................................................................3


Table 2 Coefficients of the linear combinations for the first three principal components....................4
Table 3 Means of variables by group.....................................................................................................5
Table 4 Correlation matrix for the variables..........................................................................................5

Figure 1 Scree plot showing the variances explained by the principal components versus the principal
component number for standardised variables (centred and scaled)...................................................4
Figure 2 Scree plot showing the variances explained by the principal components versus the principal
component number for mean-centred variables..................................................................................4
Figure 3 Plot for PC1 and PC2 and Biplot for the first two principal components.................................5
Figure 4 Visualisation of correlation matrix (table 4).............................................................................6
Figure 5 (left) Heat map of euclidean distances obtained using factoextra; (right) Heat map of
Euclidean distances and relevant dendrogram obtained using pheatmap............................................9
Figure 6 Plot of Number of clusters k vs Total Within Sum of Square.................................................10
Figure 7 Plot of Number of clusters k vs Average width of silhouette.................................................10
Figure 8 Plot of number of clusters k vs Gap statistic..........................................................................11
Figure 9 k means cluster plot with k = 2 for the two variables X and Y................................................11
Figure 10 Cluster dendrogram.............................................................................................................12
Figure 11 Agglomerative clustering performed using AGNES..............................................................12

Q1 – Principal Component Analysis
The code to conduct the principal component analysis, presented in Appendix I to this section, is
executed and the following output is obtained.

> PCA$sdev^2

[1] 1.28289365 0.68440462 0.08184448 0.07763829 0.07589231 0.07290600 0.06936940

[8] 0.06725458 0.06299673 0.06086336 0.05397146 0.04802909 0.04529197 0.04094270

[15] 0.03469158

> plot(PCA, type = "l")

> summary(PCA)

Table 1 Importance of components

                        PC1     PC2     PC3      PC4      PC5      PC6      PC7      PC8
Standard deviation      1.133   0.8273  0.28608  0.27864  0.27549  0.27001  0.26338  0.25933
Proportion of Variance  0.465   0.2481  0.02966  0.02814  0.02751  0.02642  0.02514  0.02438
Cumulative Proportion   0.465   0.7130  0.74271  0.77085  0.79836  0.82479  0.84993  0.87431

                        PC9      PC10     PC11     PC12     PC13     PC14     PC15
Standard deviation      0.25099  0.24671  0.23232  0.21916  0.21282  0.20234  0.18626
Proportion of Variance  0.02283  0.02206  0.01956  0.01741  0.01642  0.01484  0.01257
Cumulative Proportion   0.89714  0.91920  0.93876  0.95617  0.97259  0.98743  1.00000

From the scree plots presented in figures 1 and 2, and the variances listed in table 1, the principal
components that would be retained are PC1 and PC2. This is because:

1. The two principal components represent 71.3% of the cumulative variance;


2. Principal components with eigenvalues smaller than the average eigenvalue explain less
variation than a single explanatory variable does on average. Because the variables were only
mean-centred and not scaled to unit variance, the average eigenvalue here is approximately
0.18 rather than 1, and PC1 (1.28) and PC2 (0.68) are the only components that exceed it. It
could be debated whether PC3 should also be retained, since it is the next-largest component,
but its eigenvalue (0.08) falls below this threshold and its proportion of variance is less than
3%, so it was decided not to consider this principal component. A short check of this rule is
sketched after this list.
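The check below is a minimal sketch, assuming the PCA object created by the prcomp() call in
Appendix I; it compares each eigenvalue against the average eigenvalue and reports the cumulative
variance.

# Eigenvalue check for component retention (assumes the PCA object from Appendix I)
eig <- PCA$sdev^2                 # eigenvalues (variances of the principal components)
mean(eig)                         # average eigenvalue, roughly 0.18 for these mean-centred data
which(eig > mean(eig))            # components exceeding the average: PC1 and PC2
round(cumsum(eig) / sum(eig), 3)  # cumulative proportion of variance explained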

Rotation (n x k) = (15 x 15):


Table 2 Coefficients of the linear combinations for the first three principal components

PC1 PC2 PC3


X1 -0.532 0.486 0.137
X2 0.310 0.442 -0.095
X3 0.106 0.136 -0.216
X4 -0.066 -0.134 0.129
X5 -0.287 -0.000 -0.330
X6 0.044 -0.057 0.054
X7 -0.130 0.123 -0.082
X8 -0.049 -0.454 -0.170
X9 -0.500 -0.011 -0.031
X10 0.223 -0.039 0.223
X11 0.058 -0.223 0.538
X12 0.095 0.392 0.035
X13 0.381 0.252 -0.360
X14 -0.099 -0.100 -0.440
X15 0.188 -0.170 -0.076

From the PCA output, and considering only the first two principal components, these components are
given by (coefficients rounded to two decimal places, completed from Table 2):

Y1 = −0.53X1 + 0.31X2 + 0.11X3 − 0.06X4 − 0.29X5 + 0.04X6 − 0.13X7 − 0.05X8 − 0.50X9 + 0.22X10 + 0.06X11 + 0.10X12 + 0.38X13 − 0.10X14 + 0.19X15

Y2 = 0.49X1 + 0.44X2 + 0.14X3 − 0.13X4 − 0.00X5 − 0.06X6 + 0.12X7 − 0.45X8 − 0.01X9 − 0.04X10 − 0.22X11 + 0.39X12 + 0.25X13 − 0.10X14 − 0.17X15

The expressions for the selected principal components indicate that X1 (-0.53) is the variable that
contributes most to the first principal component, followed closely by X9 (-0.50). Variables X13
(0.38) and X2 (0.31) have a smaller effect in the opposite direction.

The second principal component is mostly influenced by variable X1 (0.49), supported by the effect
of X2 (0.44) and X12 (0.39) and contrasted by X8 (-0.45).
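The same ranking can be read off programmatically by sorting the absolute loadings; a small sketch,
again assuming the PCA object from Appendix I:

# Variables ranked by the absolute size of their loadings on PC1 and PC2
round(sort(abs(PCA$rotation[, 1]), decreasing = TRUE), 3)  # PC1: X1 and X9 dominate
round(sort(abs(PCA$rotation[, 2]), decreasing = TRUE), 3)  # PC2: X1, X8, X2 and X12 dominate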
Table 3 Means of variables by group.

GROUP X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15

1 -0.064 1.201 0.470 -0.335 -0.470 -0.066 0.073 -0.853 -0.858 0.326 -0.315 0.872 1.024 -0.334 0.101

2 0.565 0.268 0.168 -0.123 0.069 -0.119 0.141 -0.460 0.135 -0.079 -0.234 0.380 0.126 -0.024 -0.128

3 0.047 0.848 0.345 -0.214 -0.315 -0.081 -0.006 -0.639 -0.486 0.174 -0.253 0.586 0.641 -0.177 0.017

Figure 1 Scree plot showing the variances explained by the principal components versus the principal
component number for standardised variables (centred and scaled).

Figure 2 Scree plot showing the variances explained by the principal components versus the principal
component number for mean-centred variables.
With respect to the first principal component (PC1) and a review of the means of the variables
calculated by group, the members of group 2 tend to have a larger average value for X1 (0.565)
when compared to groups 1 (-0.064) and 3 (0.047). On average, the variable X9 for group 2 (0.135) is
considerably larger than for groups 1 (-0.858) and 3 (-0.486). On average, the variable X13 is larger
for group 1 (1.024) when compared to groups 2 (0.126) and 3 (0.641). Similarly, albeit to a lesser
extent, the mean of the variable X2 for group 1 (1.201) is larger than that of groups 2 (0.268) and 3
(0.848).
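The per-group means in Table 3 can also be reproduced in a single call rather than variable by
variable; a short sketch assuming the PCA_TC data frame from Appendix I, with Group declared as a
factor:

# Means of all variables by group in one call
group_means <- aggregate(. ~ Group, data = PCA_TC, FUN = mean)
print(group_means, digits = 3)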

Figure 3 Plot for PC1 and PC2 and Biplot for the first two principal components

From the biplot in figure 3 it can be observed that the most influential variables are X1, X9, X2 and
X13, since they have the longest arrows. Moreover, X9 seems to affect only the first principal
component and not the second one. X1, X2 and X13 all seem to have a (seemingly similar) impact
on both principal components. X1 appears to be uncorrelated with X2 (-0.17) and X4 (0.02), since the
corresponding arrows are at roughly 90˚ to each other. The variable X9 is likely to be positively
correlated with X1 (0.72), and X2 with X13 (0.74). This can also be seen in figure 4.
Table 4 Correlation matrix for the variables

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15


X1 1.00
X2 -0.17 1.00
X3 -0.12 0.48 1.00
X4 0.02 -0.44 -0.21 1.00
X5 0.62 -0.49 -0.27 0.21 1.00
X6 -0.23 0.02 -0.01 -0.09 -0.19 1.00
X7 0.56 -0.08 -0.06 -0.09 0.39 -0.19 1.00
X8 -0.34 -0.59 -0.31 0.37 0.08 0.16 -0.21 1.00
X9 0.72 -0.58 -0.36 0.23 0.69 -0.19 0.46 0.14 1.00
X10 -0.64 0.40 0.25 -0.16 -0.61 0.10 -0.40 0.00 -0.70 1.00
X11 -0.44 -0.25 -0.11 0.14 -0.20 0.10 -0.30 0.36 -0.18 0.22 1.00
X12 0.19 0.64 0.34 -0.34 -0.25 -0.06 0.14 -0.66 -0.24 0.14 -0.37 1.00
X13 -0.44 0.74 0.43 -0.40 -0.64 0.03 -0.29 -0.42 -0.72 0.53 -0.08 0.45 1.00
X14 0.14 -0.43 -0.22 0.17 0.31 -0.03 0.10 0.25 0.34 -0.30 0.02 -0.27 -0.38 1.00
X15 -0.69 0.11 0.11 0.00 -0.48 0.25 -0.37 0.25 -0.53 0.43 0.35 -0.11 0.36 -0.12 1.00
Figure 4 Visualisation of correlation matrix (table 4)

The high dimensionality of bioinformatics data presents a unique challenge in its analysis. Principal
component analysis (PCA) is a useful technique that can be used to reduce the dimensions of the
data being considered without substantial loss of information. PCA can be used on a diverse range
of bioinformatics data, such as:

1. Gene expression data, for the construction of linear combinations of gene expressions, these
being the principal components (PCs). The PCs can effectively explain the variation in gene
expression and may lower its dimensionality;
2. Protein tertiary structure prediction from amino acid sequences. PCA can be used in protein
refinement models in order to establish a low-dimensional space in which the sampling (and
optimization) is carried out via a particle swarm optimizer (PSO). The reduced space is found
via PCA performed on a set of low-energy protein models previously found using different
optimization techniques.
Appendix I – PCA code

PCA_TC <- SOR0511DataforPCATristan # rename dataframe so that the original data is not modified

summary(PCA_TC)

PCA_TC$Group <-factor(PCA_TC$Group) # declare variable 'Group' as a categorical variable


PCA <- prcomp(PCA_TC[,1:15],center=TRUE, scale=FALSE)
PCA
PCA$sdev^2

plot(PCA, type = "l")


abline(h = 0.1, col="red", lty=5) # plot a line at eigenvalue = 0.1
legend("topright", legend=c("Eigenvalue = 0.1"),
col=c("red"), lty=5, cex=0.6)

summary(PCA)

attach(PCA_TC)

aggregate(X1~Group, FUN=mean)
aggregate(X2~Group, FUN=mean)
aggregate(X3~Group, FUN=mean)
aggregate(X4~Group, FUN=mean)
aggregate(X5~Group, FUN=mean)
aggregate(X6~Group, FUN=mean)
aggregate(X7~Group, FUN=mean)
aggregate(X8~Group, FUN=mean)
aggregate(X9~Group, FUN=mean)
aggregate(X10~Group, FUN=mean)
aggregate(X11~Group, FUN=mean)
aggregate(X12~Group, FUN=mean)
aggregate(X13~Group, FUN=mean)
aggregate(X14~Group, FUN=mean)
aggregate(X15~Group, FUN=mean)

library(ggfortify)
autoplot( prcomp(PCA_TC[,1:15],center = TRUE), data = PCA_TC, colour = 'Group')

library(devtools)
install_github("vqv/ggbiplot") ## downloading from a repository
library(ggbiplot)

g <- ggbiplot(PCA, obs.scale = 1, var.scale = 1,
              groups = Group, ellipse = TRUE,
              circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position = 'top')
g

library(corrplot)
corr_matrix <- round(cor(PCA_TC[,1:15]),2)
corrplot(corr_matrix, type = "upper")

Q2 – Cluster analysis

The dataset consists of 1000 observations of 2 variables.

          X        Y
Min.     -1.694   -2.0778
1st Qu.   1.051    0.9979
Median    2.878    3.1110
Mean      3.011    3.0574
3rd Qu.   5.020    5.0588
Max.      7.765    7.8851

The data is then scaled and the following descriptive statistics are obtained.

    Min   Med   Mean  SD  Max
X   -2.1  -0.1  0     1   2.2
Y   -2.3   0.0  0     1   2.1

The Euclidean distance is then calculated and a six-by-six matrix is extracted to view some of the
results obtained, as shown below.

1 2 3 4 5 6
1 0.00 0.82 0.87 0.37 0.92 1.11
2 0.82 0.00 0.88 0.62 0.37 0.39
3 0.87 0.88 0.00 1.04 0.61 0.83
4 0.37 0.62 1.04 0.00 0.86 0.99
5 0.92 0.37 0.61 0.86 0.00 0.24
6 1.11 0.39 0.83 0.99 0.24 0.00

Figure 5 (left) Heat map of Euclidean distances obtained using factoextra; (right) Heat map of Euclidean distances and relevant dendrogram
obtained using pheatmap

The heat map presented in figure 5 (left) presents the Euclidean distances for all the observation
combinations. The plot indicates that there are regions of high dissimilarity (marked in blue) and
regions of low dissimilarity (marked in red). A value of 0 (darkest red) is obtained when the two
observations being compared have a Euclidean distance of 0, whilst a value of 5 (darkest blue)
indicates the highest Euclidean distance calculated between two observations. Based on these
values, an attempt at basic clustering is obtained. This is better represented in figure 5 (right),
whereby a dendrogram is applied to the heat map. This cluster dendrogram represents the
hierarchical relationship between the observations being considered.

Value of k
a. Elbow method

The results of the Elbow method to determine k, as represented in figure 6, indicate that 2 is the
optimal number of clusters, as it appears to be the bend in the “elbow”.

Figure 6 Plot of Number of clusters k vs Total Within Sum of Square

b. Silhouette method

Figure 7 Plot of Number of clusters k vs Average width of silhouette

The results of the Silhouette method to determine k, as represented in figure 7, show that 2 clusters
maximise the average silhouette values.

c. Gap statistic method

Similarly to the previous two methods of determining the value of k for the scaled dataset, the gap
statistic method suggests a value of two clusters (figure 8).

Figure 8 Plot of number of clusters k vs Gap statistic

A value of 2 shall be used as k to perform the cluster analysis.

The two clusters have a size of 500 observations each. From the two clusters that can be observed in
figure 9, it can be said that there is good separation between the two clusters with the observations
being located around the centers of the clusters and no overlap between the clusters. This also
indicates that the choice of two clusters is considered to be adequate.
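A quick numerical check of this separation can be obtained from the silhouette widths for k = 2; a
sketch assuming the km.data and dist.eucl objects from Appendix I:

# Silhouette check for the k = 2 solution
library(cluster)
km.data$size                                   # 500 observations in each cluster
sil <- silhouette(km.data$cluster, dist.eucl)  # silhouette width of every observation
summary(sil)                                   # average widths close to 1 indicate good separation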

Hierarchical clustering
Taking into consideration the relatively large sample size of 1,000 found in the dataset it is decided
to proceed with an agglomerative clustering using Ward’s method.

Figure 9 k means cluster plot with k = 2 for the two variables X and Y

Figure 10 Cluster dendrogram

The cluster dendrogram presented in figure 10 tallies with the results obtained from k means
clustering, whereby the two clusters appear to have an equal distribution. The decreasing height of
the jumps between stages indicates that the clusters become progressively closer to each other.

Moreover, the large height of the first jump indicates that the two major clusters were assessed to be
far apart. This also tallies with the complete separation between the members of the two clusters
identified in k means clustering.

Figure 11 Agglomerative clustering performed using AGNES

The hierarchical clustering was repeated using AGNES. An agglomerative coefficient of 0.99 was
obtained; values close to 1 suggest a strong clustering structure, as also indicated in figure 11.
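The choice of linkage can also be checked by comparing agglomerative coefficients across several
methods; a sketch assuming the clust.scaled matrix from Appendix I:

# Agglomerative coefficients for several linkage methods
library(cluster)
linkages <- c("average", "single", "complete", "ward")
sapply(linkages, function(m) agnes(clust.scaled, method = m)$ac)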

Practical applications of clustering analysis to Bioinformatics


In view of the data richness of the bioinformatics fields, clustering has become highly important in
the analysis of data. The aim of clustering is to group data objects into a set of disjoint clusters, such
that objects within a cluster have high similarity to each other, while objects in separate classes are
more dissimilar. Different clustering techniques have been used extensively in the analysis of
molecular biology, genetics, genomics and proteomics data.

The application of clustering algorithms to gene expression data is proving to be useful in functional
genomics. The large volume of genes, as well as the complexity of biological networks, renders the
processing and interpretation of data (often consisting of millions of measurements) a highly
complex task. The use of clustering techniques is a very useful tool in the data mining process to
reveal natural structures and identify interesting patterns within the data.

The clustering of gene expression data has proven useful in revealing the natural structure inherent
in such data, understanding gene functions, cellular processes and subtypes of cells, mining useful
information from noisy data, and understanding gene regulation.

The clustering of gene expression data is pivotal in homology identification, with very practical
applications, such as vaccine development.

Appendix I

R-script used for the above analysis

clust <- SOR0511DataforclusteringTristan # renaming of dataset

library(stats)
library(factoextra)
library(pheatmap)
library(cluster)
summary(clust)

clust.scaled <- scale(clust) # creating a scaled version of the dataset

desc_stats <- data.frame(
  Min  = apply(clust.scaled, 2, min),    # minimum
  Med  = apply(clust.scaled, 2, median), # median
  Mean = apply(clust.scaled, 2, mean),   # mean
  SD   = apply(clust.scaled, 2, sd),     # standard deviation
  Max  = apply(clust.scaled, 2, max)     # maximum
)
desc_stats <- round(desc_stats, 1)

desc_stats

dist.eucl <- dist(clust.scaled, method = "euclidean")

round((as.matrix(dist.eucl))[1:6, 1:6],2)

fviz_dist(dist.eucl)

pheatmap(dist.eucl, scale = "row")

# elbow method to determine k

fviz_nbclust(clust.scaled, kmeans, method = "wss")

# silhouette method to determine k

fviz_nbclust(clust.scaled, kmeans, method = "silhouette")

# gap-statistic method to determine k

fviz_nbclust(clust.scaled, kmeans, method = "gap_stat")

# Cluster analysis

km.data <- kmeans(clust.scaled, 2, nstart = 25) # k = 2

km.data$size # Cluster size

km.data$centers # Cluster means

fviz_cluster(km.data, data = clust.scaled, palette = "jco", ggtheme = theme_minimal())

# Hierarchical clustering - Agglomerative

# Ward's method

hc <- hclust(dist(clust.scaled), method = "ward.D2")

# dist(clust.scaled) uses Euclidean distance as default distance measure

fviz_dend(hc, cex = 0.5, k = 2, palette = "jco")

#AGNES

hccomp2 <- agnes(dist(clust.scaled), method = "complete")

hccomp2$ac

fviz_dend(hccomp2, cex = 0.5, k = 2, palette = "jco")

fviz_dend(hccomp2, k = 2, horiz = TRUE, rect = TRUE, rect_fill = TRUE,
          rect_border = "jco", k_colors = "jco", cex = 0.1)

Q3. Experimental design

a. Genomic size in crustaceans

An ANOVA model was computed on the dataset for the three crustaceans (the factor). The following
results were obtained:

> summary(anovamodel)

Call:

lm(formula = genome.size ~ crust, data = gen.size)

Residuals:

Min 1Q Median 3Q Max


-0.38000 -0.27625 -0.04125 0.32937 0.45750

Coefficients:

               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     1.0500    0.1795      5.851   0.000244 ***
crustcopepods  -0.5375    0.2538     -2.118   0.063271 .
crustisopods    0.9775    0.2538      3.851   0.003899 **
---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3589 on 9 degrees of freedom

Multiple R-squared: 0.8028, Adjusted R-squared: 0.7589

F-statistic: 18.32 on 2 and 9 DF, p-value: 0.0006721

Dummy variables are created for copepods and isopods, with barnacles serving as the reference
(baseline) category and therefore absorbed into the intercept. At a significance level of 0.05, the
p-value of the F-test (0.00067) indicates that there is a statistically significant difference between the
mean genomic sizes of the three crustacean groups.
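The dummy coding used by R can be inspected directly; a sketch assuming the gen.size data frame
built in Appendix I:

# Treatment (dummy) contrasts: barnacles is the reference level (the row of zeros)
contrasts(factor(gen.size$crust))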

The model that is obtained is therefore the following:

Yi = 1.05 − 0.538F2i + 0.978F3i + εi

where Yi is the log10 of the genome size, F2i is 1 when the ith observation is a copepod and 0
otherwise, and F3i is 1 when the ith observation is an isopod and 0 otherwise.
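As an illustration of how the fitted equation translates into group means, the following sketch
(assuming the anovamodel object from Appendix I) recovers the mean genome size of each group:

# Fitted group means implied by the model coefficients
coef(anovamodel)
predict(anovamodel, newdata = data.frame(crust = c("barnacles", "copepods", "isopods")))
# expected: 1.05 (barnacles), 1.05 - 0.5375 = 0.5125 (copepods), 1.05 + 0.9775 = 2.0275 (isopods)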

Using the aov() function, the following ANOVA table is obtained:

           Df  Sum Sq  Mean Sq  F value  Pr(>F)
crust       2    4.72   2.3598    18.32  0.000672 ***
Residuals   9    1.16   0.1288
---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This reaffirms the fact that there is a significant difference (p-value 0.000672) between the groups of
crustaceans at a significance level of 0.05.

Shapiro-Wilk normality test

data: anovamodel$residuals

W = 0.87087, p-value = 0.06703

The Shapiro-Wilk normality test on the model residuals gives a p-value of 0.067, so at a significance
level of 0.05 the null hypothesis of normality cannot be rejected. The normality assumption for
ANOVA testing is therefore satisfied.

crust: barnacles

Shapiro-Wilk normality test

data: dd[x, ]

W = 0.9627, p-value = 0.7959

-------------------------------------------------------------------------

crust: copepods

Shapiro-Wilk normality test

data: dd[x, ]

W = 0.8593, p-value = 0.2577

-------------------------------------------------------------------------

crust: isopods

Shapiro-Wilk normality test

data: dd[x, ]

W = 0.79569, p-value = 0.09471

The Shapiro-Wilk normality test was also performed on each group, and the p-values for the three
groups (0.80, 0.26 and 0.09) are all higher than 0.05. The null hypothesis cannot be rejected, and
therefore all three groups can be taken to follow a normal distribution.

Breusch-Godfrey test for serial correlation of order up to 1

data: anovamodel

LM test = 0.67606, df = 1, p-value = 0.4109

The null and alternative hypothesis for the Breusch-Godfrey test are as follows:

H0 (null hypothesis): There is no autocorrelation at any order less than or equal to p.

H1 (alternative hypothesis): There exists autocorrelation at some order less than or equal to p.

Since the p-value (0.4109) is higher than the significance level of 0.05, we cannot reject the null
hypothesis, and therefore there is no significant autocorrelation between the residuals.

studentized Breusch-Pagan test

data: anovamodel

BP = 1.171, df = 2, p-value = 0.5568

Since the p-value (0.5568) of the Breusch-Pagan test is higher than the significance level of 0.05, the
null hypothesis cannot be rejected, which means that the data is homoscedastic.

Therefore, all the assumptions behind the model are considered as satisfied.

Through the calculation of the leverages, Cook's distances and studentised residuals, no outliers were
identified.

A post hoc analysis was carried out to assess the pairwise comparisons, as shown below:

Bonferroni method

Pairwise comparisons using t tests with pooled SD

data: genome.size and crust

Barnacles Copepods
Copepods 0.18981 -
Isopods 0.01170 0.00063

At a significance level of 0.05, the barnacles-isopods (0.012) and copepods-isopods (0.0006)
comparisons are considered to be statistically significantly different. The following is the
homogeneous subgroup arising from the post-hoc test at the 0.05 level of significance:

 Barnacles, Copepods

Tukey method

95% family-wise confidence level

factor levels have been ordered

Fit: aov(formula = genome.size ~ crust, data = gen.size)

                      diff     lwr         upr        p adj
barnacles-copepods    0.5375   -0.1711386  1.246139   0.1409662
isopods-copepods      1.5150    0.8063614  2.223639   0.0005512
isopods-barnacles     0.9775    0.2688614  1.686139   0.0098060

At a significance level of 0.05, the p-values for the isopods-copepods comparison (0.0006) and the
isopods-barnacles comparison (0.0098), indicate that there is a significant difference. The
comparison of barnacles-copepods (p-value = 0.14) indicates that there is no significant difference
between these two groups. The results are in agreement with those obtained from the Bonferroni
method.

Scheffe method

Study: output ~ "crust"

Scheffe Test for genome.size

Mean Square Error : 0.1288389

means genome.size std r Min Max


Barnacles 1.0500 0.3275159 4 0.67 1.40
Copepods 0.5125 0.3423814 4 0.25 0.97
Isopods 2.0275 0.4025233 4 1.65 2.40

Alpha: 0.05 ; DF Error: 9

Critical Value of F: 4.256495

Minimum Significant Difference: 0.7405417

Means with the same letter are not significantly different.

genome.size groups
Isopods 2.0275 a
Barnacles 1.0500 b
Copepods 0.5125 b

The mean genome size for barnacles, copepods and isopods are 1.05, 0.51 and 2.03 respectively. At
a significance level of 0.05, isopods belong to group A, while barnacles and copepods belong to
group B. Therefore, there is a significant difference between isopods and barnacles or copepods,
however, there is no significant difference between barnacles and copepods. This is also in
agreement with the results obtained from the previous two methods.
Appendix I – R-script

library(lmtest)

library(MASS)

library(agricolae)

copepods <- data.frame(crust= rep("copepods", 4), genome.size = c(0.25, 0.25, 0.58, 0.97))

barnacles <- data.frame(crust= rep("barnacles", 4), genome.size = c(0.67, 0.9, 1.23, 1.4))

isopods <- data.frame(crust= rep("isopods", 4), genome.size = c(1.71, 2.35, 2.4, 1.65))

gen.size <- rbind(copepods, barnacles, isopods)

attach(gen.size)

#fitting one-way ANOVA model

anovamodel<-lm(genome.size~crust,data=gen.size)

#summary of ANOVA model

summary(anovamodel)

#ANOVA table for ANOVA model

output<-aov(genome.size~crust, data=gen.size)

summary(output)

#checking normality of ANOVA model residuals

shapiro.test(anovamodel$residuals)

by(genome.size, crust, shapiro.test)

#checking no serial correlation and homoscedasticity assumption of ANOVA model

bgtest(anovamodel)

bptest(anovamodel)

#calculating leverage values

Leverages <- hatvalues(anovamodel,type='rstandard')

n <- length(genome.size)

p<-3

Cutofflev <- (2*p)/n

which(Leverages > Cutofflev)

#calculating Cook's distance

Cook <- cooks.distance(anovamodel,type='rstandard')

which(Cook > 1)

#Studentized residuals

sranovamodel<-studres(anovamodel)

which(abs(sranovamodel)>2)

#post hoc analysis

#bonferroni method

pairwise.t.test(genome.size, crust, p.adjust.method = "bonferroni",
                pool.sd = TRUE, paired = FALSE, alternative = "two.sided")

#tukey method

TukeyHSD(output, "crust", ordered = TRUE, conf.level=0.95)

#scheffe method

scheffe.test(output,"crust", alpha=0.05, group=TRUE,console=TRUE)

b. Leukemia treatments – gene expression

The library ibd is used to assess whether the experiment is a balanced incomplete block design. For
the purpose of the function bibd, the number of treatments (n) is 4, the number of blocks (b) is 4 –
where the methods represents the block -; the size of each block (k) is 3 and the number of times
r (k −1)
each treatment appears in the design (r) is 3. Since λ= ; λ = 2. Therefore, each pair appears
( n−1 )
together twice, making it a balanced incomplete block design.
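A quick arithmetic check in R, using the design parameters stated above:

# lambda must be a whole number for a balanced incomplete block design
n <- 4  # number of treatments
b <- 4  # number of blocks (methods)
k <- 3  # block size
r <- 3  # replications of each treatment
lambda <- r * (k - 1) / (n - 1)
lambda  # 2: every pair of treatments appears together in two blocks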

When the number of treatments to be compared is large, a larger number of blocks is needed to
accommodate all the treatments. This results in a large amount of resources (financial, human, etc.)
being required to run the experiment. Incomplete block designs can be used in cases where there is a
limitation in the number of available experimental units, whereby it would not be possible to
accommodate all treatments in every block. The advantage of incomplete block designs is that each
block receives only some of the selected treatments rather than all of them. This may be necessary,
for example, to minimise the number of animals used in a toxicological assessment, whereby for
reasons of cost and animal welfare the number of animals available for a study is limited. The
efficiency of such incomplete block designs is, in general, not much less than the efficiency of a
complete block design, which makes them acceptable and desirable in certain situations.

The following balanced incomplete block design is obtained when using the above parameters (4, 4,
3, 3, 2) in the bibd() command:

design [1] [2] [3]


Block-1 1 3 4
Block-2 2 3 4
Block-3 1 2 4
Block-4 1 2 3

By running the ANOVA test, the following ANOVA table is obtained.

Anova Table (Type III tests)

Response: Gene.Expression

                   Sum Sq  Df    F value  Pr(>F)
(Intercept)        8948.7   1  13767.198  8.528e-10 ***
factor(Method)       66.1   3     33.889  0.0009528 ***
factor(Treatment)    22.7   3     11.667  0.0107387 *
Residuals             3.3   5

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

At a statistical significance level of 0.05, it can be observed that both the blocking factor, method
(p-value = 0.00095), and the type of treatment (p-value = 0.011) have a significant effect on gene
expression.

The following linear model is then obtained:

lm(formula = Gene.Expression ~ factor(Method) + factor(Treatment),

data = genexp)

Residuals:

1 2 3 4 5 6 7
7.500e-01 1.250e-01 -8.750e-01 -3.750e-01 3.750e-01 1.088e-14 -7.500e-01
8 9 10 11 12
-1.250e-01 8.750e-01 -3.750e-01 3.750e-01 1.035e-14

Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)          72.2500    0.6158    117.334  8.53e-10 ***
factor(Method)2       2.1250    0.6982      3.043  0.02864 *
factor(Method)3      -4.7500    0.6982     -6.803  0.00105 **
factor(Method)4      -0.8750    0.6982     -1.253  0.26554
factor(Treatment)2    0.2500    0.6982      0.358  0.73492
factor(Treatment)3    0.6250    0.6982      0.895  0.41173
factor(Treatment)4    3.6250    0.6982      5.192  0.00349 **
---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8062 on 5 degrees of freedom

Multiple R-squared: 0.9599, Adjusted R-squared: 0.9117

F-statistic: 19.94 on 6 and 5 DF, p-value: 0.002396

Yi = 72.25 + 2.125F2i − 4.75F3i − 0.875F4i + 0.25B2i + 0.625B3i + 3.625B4i + εi

where Yi is the gene expression level; F2i is 1 when method 2 is used and 0 otherwise; F3i is 1 when
method 3 is used and 0 otherwise; F4i is 1 when method 4 is used and 0 otherwise; B2i is 1 when
treatment 2 is administered and 0 otherwise; B3i is 1 when treatment 3 is administered and 0
otherwise; and B4i is 1 when treatment 4 is administered and 0 otherwise. It can also be seen that
the coefficient estimates for the dummy variables for Method 2 (p = 0.029), Method 3 (p = 0.001)
and Treatment 4 (p = 0.003) are statistically significant at a 0.05 significance level.
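To illustrate how the equation is used, the sketch below (assuming the model2 object from
Appendix I) computes the predicted mean gene expression for one hypothetical combination,
Method 3 with Treatment 4:

# Predicted mean gene expression for Method 3, Treatment 4
b <- coef(model2)
unname(b["(Intercept)"] + b["factor(Method)3"] + b["factor(Treatment)4"])
# 72.25 - 4.75 + 3.625 = 71.125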

shapiro.test(model2$residuals)

Shapiro-Wilk normality test

data: model2$residuals

W = 0.96945, p-value = 0.905

The Shapiro-Wilk normality test on the model residuals gives a p-value of 0.905, so at a significance
level of 0.05 the null hypothesis of normality cannot be rejected. The normality assumption for
ANOVA testing is therefore satisfied.

Breusch-Godfrey test for serial correlation of order up to 1

data: model2

LM test = 0.18499, df = 1, p-value = 0.6671

Since the p-value (0.6671) is higher than the significance level of 0.05, we cannot reject the null
hypothesis, and therefore there is no significant autocorrelation between the residuals.

bptest(model2)

Studentized Breusch-Pagan test

data: model2

BP = 10.312, df = 6, p-value = 0.1121

Since the p-value (0.1121) of the Breusch-Pagan test is higher than the significance level of 0.05, the
null hypothesis cannot be rejected, which means that the data is homoscedastic.

Therefore, all the assumptions behind the model are considered as satisfied.

Appendix I – R-script
library(ibd)

library(lmtest)

#balanced incomplete block design

bibd(4, 4, 3, 3, 2)

attach(genexp)

#balanced incomplete block design - ANOVA table

model1<-aov.ibd(Gene.Expression~factor(Method)+factor(Treatment),data=genexp)

model1

#balanced incomplete block design - block treatment model

model2<-lm(Gene.Expression~factor(Method)+factor(Treatment),data=genexp)

summary(model2)

#checking normality of ANOVA model residuals

shapiro.test(model2$residuals)

#checking no serial correlation and homoscedasticity assumption of ANOVA model

bgtest(model2)

bptest(model2)

c. Power analysis

i. For the purpose of this one-way ANOVA, given that the effect size (f) is 0.25, the number
of levels (k) is 5, the sample size per treatment (n) is 39, and the significance level is
0.05; using the command:

pwr.anova.test(f=0.25, k=5, n=39, sig.level = 0.05)

yields a power of 0.798 or 79.8%.

ii. Retaining the significance level of 0.05, f of 0.25 and k of 5, the sample size per treatment
needed to obtain a power of 0.95 is determined using the following command:

pwr.anova.test(f=0.25, k=4,power=0.95,sig.level=0.05)

This yields a sample size of 69.67 per treatment group, which is rounded up to 70. Since the
number of treatment groups is 5, a total sample of 350 would be needed.

NOTE: the command pwr.anova.test() is available in the pwr package which is loaded through the
following command – library(pwr)

d. Complete block design

A complete block design is an experiment in which participants are divided into blocks and then
assigned to the different treatments or conditions in such a manner that each treatment level
appears exactly once in each block. In the patients dataset, each patient is considered as a block,
with each patient receiving each of the three treatments exactly once. Therefore, it can be
considered a (randomised) complete block design.

Non-parametric tests are more suitable to assess ordinal data such as the response/ranking of the
treatment found in the dataset. Therefore, non-parametric tests are considered to be a more
appropriate option for the analysis of the data found in the patients dataset.

Since each patient is treated with the three different treatments and their response to each
treatment is ranked by the same patient, the Friedman test is considered to be an appropriate test.
Since the rating of one treatment is most likely related to the rating that a subject gives to another
treatment, the treatments can be considered as three related samples.

By performing the Friedman test on the dataset, the following is obtained:

Friedman rank sum test

data: value, factor(variable) and factor(Patient)

Friedman chi-squared = 10.75, df = 2, p-value = 0.004631

At a significance level of 0.05, the p-value obtained from the Friedman test (0.0046) indicates that
there is a significant difference between the treatments.

The Bonferroni method is used to assess whether there are any pairwise differences. The following
results are obtained:

Pairwise comparisons using Wilcoxon signed rank test with continuity correction.

data: patients.long$value and factor(patients.long$variable)

Treatment.1 Treatment.2
Treatment.2 0.456 -
Treatment.3 0.035 0.120
P value adjustment method: Bonferroni

The Bonferroni adjusted p-values indicate that at a significance level of 0.05, there is a significant
difference between Treatment 1 and Treatment 3 (0.035). However, there is no significant difference
between Treatment 1 and Treatment 2 (0.456) and Treatment 2 and Treatment 3 (0.120). Since
there is no significant difference in the medians for treatments 1 and 2, treatment 3 is the treatment
that is giving rise to the overall significant difference observed from the Friedman test. Therefore, it
can be concluded that Treatment 3 is the best treatment method.

Since there is a significant difference between treatments 3 and 1, but no significant difference
between treatments 3 and 2, it can be concluded that treatment 1 is the worst treatment. This is
also apparent from the p-value for the treatment 1 vs treatment 2 comparison (0.456), which is
larger than that for treatment 2 vs treatment 3 (0.120), indicating that the median rating of
treatment 2 is closer to that of treatment 1 than to that of treatment 3. The above is also reflected in
a visual assessment of the scores, with patients frequently giving treatment 1 a rank of 3 (worst) and
mostly giving treatment 3 a rank of 1 (best).

Appendix I – R-script

#nonparametric complete block design

library(stats)

library(reshape2)

patients.long <- melt(patients, id.vars=c("Patient")) # converting the data into three columns

attach(patients.long)

#non parametric randomized block design using Friedman test

friedman.test(value,factor(variable),factor(Patient))

pairwise.wilcox.test(patients.long$value, factor(patients.long$variable),
                     p.adjust.method = "bonferroni", paired = TRUE, alternative = "two.sided")

Q4 – Regularization in Regression

a. Model fitting

The dataset swiss was divided into a training set (observations 1 to 40) and a test set
(observations 41 to 47). The following code is used:

#Regression Analysis

rm(list=ls(all=TRUE)) # this clears all existing variables

set.seed(123)

library(glmnet)
library(dplyr)
library(psych)
data(swiss)

swiss <- datasets::swiss

summary(swiss)
View(swiss)
y <- swiss %>% select(Fertility)%>% as.matrix()

x <- swiss %>% select(-Fertility)

# Creating the training set

yTr<-y[1:40]
xTr<-x[1:40,]
# Test set
yTest<-y[41:47]
xTest<-x[41:47,]

The training and testing datasets were then centred and standardised using the following code:

# Centering the response variable


yTrc<-scale(yTr, center = mean(yTr), scale = FALSE)
yTrc<-as.matrix(yTrc)

yTestc<-scale(yTest, center = mean (yTr), scale = FALSE)


yTestc<-as.matrix(yTestc)

# Standardising the data matrix


xTrs<-scale(xTr, center = colMeans(xTr), scale = TRUE)
xTrs<-as.matrix(xTrs)
xTests<-scale(xTest, center = colMeans(xTr), scale = apply(xTr, 2, sd))
xTests<-as.matrix(xTests)

The lambda value was then determined, as follows:

lambdas <- seq(0,5, length.out=100)


lambdas

Ridge Regression Model

ridge_swiss <- cv.glmnet(xTrs, yTrc, alpha=0, lambda = lambdas, standardize = FALSE, nfolds = 10)

lambda_swiss <- ridge_swiss$lambda.min


lambda_swiss

When the above code was run with a lambda range between 0 and 1, a lambda value of 1 (the upper
end of the range) was obtained. The range was therefore widened to 0 to 5, which resulted in a
lambda value of 1.414.
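The cross-validation curve behind this choice can be inspected directly; a sketch assuming the
ridge_swiss object fitted above:

# Inspecting the cross-validation results used to select lambda
plot(ridge_swiss)        # CV error against log(lambda)
ridge_swiss$lambda.min   # lambda minimising the CV error (1.414 here)
ridge_swiss$lambda.1se   # a more conservative choice: largest lambda within one SE of the minimum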

Using the above lambda value, a model was fitted as follows, which was then used to obtain an
estimate of the MSEP.

# Fitting RR model on the training set using lambda_swiss


model_swiss <- glmnet(xTrs, yTrc, alpha = 0, lambda = lambda_swiss)

# Compute the MSEP


y_hat_swiss <- predict(model_swiss, xTests)
MSEP_swissRR <- (1/length(yTestc)) * t(yTestc - y_hat_swiss) %*% (yTestc - y_hat_swiss)
MSEP_swissRR

This resulted in a mean square error value of prediction of 194.74.

The parameters for the model are obtained through the following:

# parameter estimates
model_swiss$beta

5 x 1 sparse Matrix of class "dgCMatrix"


s0
Agriculture -1.929267
Examination -3.369175
Education -1.598661
Catholic 3.871317
Infant.Mortality 2.830565

Therefore, the model equation would be:

Yi = 3.87F1i − 3.37F2i + 2.83F3i − 1.93F4i − 1.60F5i

where F1i is the variable Catholic, F2i is the variable Examination, F3i is the variable Infant
Mortality, F4i is the variable Agriculture and F5i is the variable Education.

The two most influential parameters on fertility are Catholic (positively correlated) and Examination
(negatively correlated).

LASSO Regression Model

The above steps were adapted and repeated to obtain the LASSO Regression Model.

# LASSO Regression Model

lambdas <- seq(0,5, length.out = 100)


lasso_swiss <- cv.glmnet(xTrs, yTrc, alpha = 1, lambda = lambdas, nfolds = 10)

lambda_swiss <- lasso_swiss$lambda.min


lambda_swiss
A lambda value of 0 is obtained.

# Fitting the LASSO model on the training set using the lambda swiss value
model_swiss2 <- glmnet(xTrs, yTrc, alpha = 1, lambda = lambda_swiss, standardize = TRUE)

# MSEP computation
y_hat_swiss <- predict(model_swiss2, xTests)
MSEP_swissLasso <- (1/length(yTestc)) * t(yTestc - y_hat_swiss) %*% (yTestc - y_hat_swiss)
MSEP_swissLasso

# parameter estimates
model_swiss2$beta

This resulted in a mean square error value of prediction of 174.68 and the following parameters:
s0
Agriculture -2.893201
Examination -3.741029
Education -2.027840
Catholic 4.399095
Infant.Mortality 2.919209

Yi = 4.40F1i − 3.74F2i + 2.92F3i − 2.89F4i − 2.03F5i

where F1i is the variable Catholic, F2i is the variable Examination, F3i is the variable Infant
Mortality, F4i is the variable Agriculture and F5i is the variable Education.

Similar to the ridge regularised regression model, the two most influential parameters on fertility are
Catholic (positively correlated) and Examination (negatively correlated).

b. Since the mean square error of prediction obtained from the LASSO regression model (174.68) is
smaller than that obtained from the Ridge regression model (194.74), the LASSO regularised
regression model has the better predictive power of the two models. This is also supported by
the difference in lambda values.

c. Since the lambda value for the LASSO regression model (0) is smaller than the lambda value
obtained for the ridge model (1.414), the LASSO regularised regression model has the higher
predictive performance, since it introduces the least bias into the model; with lambda = 0 the
penalty term vanishes and the fit essentially coincides with an unpenalised least-squares fit, as
sketched below.
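The sketch below (assuming the xTrs, yTrc and model_swiss2 objects defined above) compares the
LASSO coefficients at lambda = 0 with an unpenalised least-squares fit on the same standardised
training data; the two should essentially coincide.

# With lambda = 0 the LASSO penalty vanishes, so the fit is essentially ordinary least squares
ols_fit <- lm(as.numeric(yTrc) ~ xTrs - 1)  # no intercept: response centred, predictors standardised
coef(ols_fit)
coef(model_swiss2)  # LASSO coefficients at lambda = 0, for comparison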

