
1B Mathematical and Computational Biology

Weeks 1-2 Notes

High dimensional data are common in biology. How can we visualize these datasets?

We’ll work with a dataset from randomly selected individuals who had their blood tested for
evidence of antibodies against different pathogens.

Note that at this stage, we do not expect you to have the coding skills to recreate these plots
(although you should have them shortly). Instead, you should try to understand
conceptually how we visualize high dimensional data. I include some code snippets to illustrate the steps.

I will first bring in the data and have a look at the first 10 rows.

The pd.set_option call allows me to display up to 15 columns.
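A minimal sketch of what this code might look like (the file name here is a placeholder, not the actual course file):

    import pandas as pd

    # Display up to 15 columns when printing a dataframe
    pd.set_option("display.max_columns", 15)

    # Read in the antibody dataset (hypothetical file name)
    df = pd.read_csv("antibody_data.csv")

    # Look at the first 10 rows
    print(df.head(10))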

This is what we get:


We see there is information on age, sex and the antibody response to 13 pathogens. To
begin with, let’s look at histograms for each column.
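A sketch of how this can be done with pandas and matplotlib, assuming the dataframe df from above:

    import matplotlib.pyplot as plt

    # One histogram per numeric column, each with 15 bins
    df.hist(bins=15, figsize=(12, 10))
    plt.tight_layout()
    plt.show()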

This gives me one histogram per column, each with 15 bins.

We can next look at the pairwise correlation to give us a better sense of how related pairs of
responses are. We will just focus on a subset of pathogens for now.
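One way to look at the pairwise relationships is a scatter matrix of the selected columns (the column names below are assumptions, not the actual names in the dataset):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical column names for the four dengue serotypes and Japanese encephalitis
    subset = df[["DENV1", "DENV2", "DENV3", "DENV4", "JEV"]]

    pd.plotting.scatter_matrix(subset, figsize=(10, 10))
    plt.show()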
We can see that some pairs are largely uncorrelated while others are more correlated. For
example, the four dengue serotypes appear to give similar responses – this is to be expected
as they are ultimately quite similar viruses.

We can quantify the correlation between pairs and visualise that too.
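For example, a sketch using the correlation matrix of the same subset (assuming the subset dataframe from above):

    import matplotlib.pyplot as plt

    corr = subset.corr()   # pairwise Pearson correlation coefficients

    plt.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    plt.colorbar(label="Correlation")
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.show()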
PCA and multidimensional scaling

We would like to reduce the number of dimensions we work with. This will aid our ability to
work with complex data. We will look at two approaches to do this: Principal Components
Analysis (PCA) and Multidimensional scaling (MDS).

Principal Components Analysis

• Principal Components Analysis allows us to represent correlated data from many
dimensions in fewer orthogonal (uncorrelated) dimensions, ordered by the
magnitude of variance explained.

We will use our example of a dataset where different antibody assays have been run on the
same samples. These are for the four dengue viruses and Japanese encephalitis. Both
hemagglutination inhibition assays and plaque reduction neutralisation tests have been used.

The slides provide a good visual guide as to what the process is doing using a toy example of
10 points from two of the columns only: the HI titers against dengue serotype 1 (DENV-1) and
serotype 2 (DENV-2).

We first center each column by subtracting the mean of the column.


We would then like to identify the line that goes through the origin and explains most of the
variance in the data.

For each candidate line, we project points onto that line and then calculate the sum of
squared distances from the projected points to the origin.

We want the line that has the maximum sum of squared distances. So in this example, the
following line has the maximum sum of squared distances (giving a value of 26.9 versus 4.2
above). We call this line the first principal component (or PC1). We can identify the eigenvalue
of this line as SS/(n−1) = (d₁² + d₂² + d₃² + d₄² + … + d₁₀²)/(n−1), where SS is the sum of squared
distances (dᵢ being the distance of each projected point from the origin) and n is the number of
points. The eigenvalue provides the magnitude of variability along this Principal Component.
Here we have an eigenvalue of 26.9/(10−1) = 2.99.
The values in the x and y directions that give a unit increase along this line give the
eigenvector for PC1. The eigenvector therefore gives the direction of the Principal
Component.

The 2nd Principal Component is the orthogonal direction with the next highest amount of
variance. As we only have two dimensions in our data, this is simply the orthogonal line to PC1
that goes through the origin.

We can now reproject our data into the two orthogonal directions.

We can use the eigenvalues to characterize the variance explained by each PC. Variance
explained for a PC = eigenvalue for that PC/sum of all eigenvalues. In this example 93% of the
variance is explained by the first principal component and 7% by the second principal
component.

Note that we have not ‘lost’ any information (yet). This is simply a ‘reprojection’ of data
(coordinates) into a new system, ordered by proportion of variance explained.

We can now repeat this for the whole dataset. We can then plot the variance explained by
each dimension. We see that over half of the variance can be explained by a single Principal
Component.
We could decide that we are comfortable focusing on 70% of the variance – and just work
with the first two dimensions. This will help visualising the data. The biggest negative is that
it becomes harder to interpret the axes – so e.g., a one unit difference in PC1 is harder to
interpret than a one unit increase in a titer against DENV-1.
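A sketch of how this can be done with scikit-learn, assuming the antibody titres are the numeric columns of df (the column names dropped here are assumptions):

    import numpy as np
    from sklearn.decomposition import PCA

    # Keep only the antibody titre columns (hypothetical column names for age and sex)
    X = df.drop(columns=["age", "sex"])

    pca = PCA()                       # centres each column before the decomposition
    scores = pca.fit_transform(X)     # coordinates of each sample in PC space

    # Proportion of variance explained by each principal component
    print(pca.explained_variance_ratio_)
    print(np.cumsum(pca.explained_variance_ratio_))

    # Keep just the first two PCs for plotting
    pc12 = scores[:, :2]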

Identifying eigenvalues and eigenvectors

More formally, the eigenvectors of a (square) matrix are the non-zero vectors which, when
multiplied by the matrix, result in a scalar multiple of those vectors.

𝑨𝒗 = 𝜆𝒗

Where A is our n x n square matrix of interest. A vector v (dimensions n x 1) is said to be an
eigenvector of A if and only if Av = λv for some scalar λ (the eigenvalue).

We can rewrite this to:


𝑨𝒗 − 𝜆𝑰𝒗 = 𝟎

Where I is an n x n identity matrix (the diagonal is 1s and the rest are 0s) and 𝟎 is a zero vector
(just zeros). This becomes:

(𝑨 − 𝜆𝑰)𝒗 = 𝟎

For this to have a non-zero solution, the determinant must be equal to 0:

|𝑨 − 𝜆𝑰| = 0

We can solve this ‘characteristic equation’ for all values of 𝜆, which are the eigenvalues. We can
then substitute each value of 𝜆 into (𝑨 − 𝜆𝑰)𝒗 = 𝟎 and solve for 𝒗.
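In practice we rarely solve the characteristic equation by hand. As a quick sketch, numpy can compute eigenvalues and eigenvectors directly (the matrix below is illustrative, not one from the notes):

    import numpy as np

    # Illustrative symmetric 2 x 2 matrix
    A = np.array([[2.0, 1.5],
                  [1.5, 2.0]])

    eigenvalues, eigenvectors = np.linalg.eig(A)
    print(eigenvalues)        # the lambdas that solve |A - lambda*I| = 0
    print(eigenvectors)       # columns are the corresponding eigenvectors v

    # Check the defining property A v = lambda v for the first eigenpair
    v = eigenvectors[:, 0]
    print(np.allclose(A @ v, eigenvalues[0] * v))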

Multidimensional scaling

In some experiments, we can only measure similarity or dissimilarity. For example, is the response
to two stimuli similar or different? This is frequent in psychophysical experiments, preference
surveys, etc.
Multidimensional scaling (MDS) attempts to preserve pairwise distances. It attempts to
construct a configuration of n points in Euclidean space by using information about the
distances between the n patterns.

Usually this is done by minimizing the “stress” (the discrepancy between the original and reduced
distances).

For example we can take the pairwise distances between different cities (here in France) –
from that we can identify where to place the cities relative to each other. In this case, as the
original dots come from essentially a 2D space (ignoring the curvature of the earth) – the
stress will be very close to 0.

              Alencon  Calais  Le Mans  Orleans  Paris  Reims
Calais            365
Le Mans            49     419
Orleans           167     433      138
Paris             220     293      204      135
Reims             355     278      340      270    146
Rouen             146     215      199      210    140    284

[Map: the seven cities (Calais, Rouen, Reims, Paris, Alencon, Le Mans, Orleans) placed by MDS so that their relative positions match the pairwise distances; scale bar 100 km.]

Note, we have no idea where on earth these dots are – we just know their position relative
to each other.

The steps we need to take to conduct MDS are as follows:

1. Take all the points in our dataset and place them in a low-dimensional space by
giving each of them a set of coordinates (so e.g., in the example above, this could be 2
coordinates – an X and a Y).

2. Using Pythagoras’ theorem, calculate the Euclidean (straight line) distance between all
pairs of points. This creates something that we call a dissimilarity matrix.

3. We next compare the dissimilarity matrix we obtained with the equivalent matrix for
the input points (i.e., the raw data). This is done by calculating something we call the
‘stress function’. This could be, for example, the sum of squared differences between the
two sets of distances.

4. Finally, we adjust the coordinates and see if we can obtain positions that further
minimize the stress.
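A sketch of how metric MDS following these steps can be run with scikit-learn on a precomputed distance matrix; here the distances are those between the four UK locations used in the worked example below:

    import numpy as np
    from sklearn.manifold import MDS

    # Pairwise distances (Cambridge, Oxford, Hinxton, Southampton)
    D = np.array([[  0.0, 107.9,  12.8, 179.4],
                  [107.9,   0.0, 105.4,  95.1],
                  [ 12.8, 105.4,   0.0, 171.4],
                  [179.4,  95.1, 171.4,   0.0]])

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)   # 2D coordinates for each point
    print(coords)
    print(mds.stress_)              # discrepancy between original and fitted distances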

Classical MDS is based upon Euclidean distances. In classical MDS, we can identify the optimal
locations of points using eigenvalues and eigenvectors. The coordinate matrix X that we are
interested in obtaining can be identified from eigenvalue decomposition of B = X X’ (we call
this a scalar product matrix).

We can obtain B from the dissimilarity matrix D by ‘double centering’ the squared dissimilarities:
B = −½ J D⁽²⁾ J, where J = I – n⁻¹ 1 1’, I is an n x n identity matrix (i.e., a diagonal of 1s with 0s
elsewhere), n is the number of points we have and 1 is an n x 1 column vector of 1s.

Let’s use an example of the distances between four places in the UK:

              Camb.   Oxford  Hinxton  South’n
Cambridge        0     107.9     12.8    179.4
Oxford         107.9     0      105.4     95.1
Hinxton         12.8   105.4      0      171.4
Southampton    179.4    95.1    171.4      0

This represents our dissimilarity matrix D. The squared distances are then:

D^{(2)} = \begin{pmatrix} 0 & 11642.41 & 163.84 & 32184.36 \\ 11642.41 & 0 & 11109.16 & 9044.01 \\ 163.84 & 11109.16 & 0 & 29377.96 \\ 32184.36 & 9044.01 & 29377.96 & 0 \end{pmatrix}

The J matrix is:

J = I - \frac{1}{4}\,\mathbf{1}\mathbf{1}' = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} - \frac{1}{4}\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 0.75 & -0.25 & -0.25 & -0.25 \\ -0.25 & 0.75 & -0.25 & -0.25 \\ -0.25 & -0.25 & 0.75 & -0.25 \\ -0.25 & -0.25 & -0.25 & 0.75 \end{pmatrix}

The B matrix is therefore:

B = -\frac{1}{2} J D^{(2)} J = \begin{pmatrix} 5152.544 & -2193.040 & 4653.168 & -7612.671 \\ -2193.040 & 2103.786 & -2343.871 & 2433.125 \\ 4653.167 & -2343.871 & 4317.631 & -6626.928 \\ -7612.671 & 2433.125 & -6626.928 & 11806.474 \end{pmatrix}

We can extract the two eigenvalues from this as 21569.013 and 1809.803 (in fact as the
underlying data comes from two dimensions, the remaining eigenvalues will be 0). The
eigenvectors are:

E_1 = \begin{pmatrix} 0.488 \\ -0.199 \\ 0.438 \\ -0.728 \end{pmatrix}, \quad E_2 = \begin{pmatrix} 0.066 \\ -0.832 \\ 0.309 \\ 0.457 \end{pmatrix}
The coordinates we want are in X, which can be calculated by multiplying the eigenvectors by
the square roots of the eigenvalues:

X = \begin{pmatrix} 0.488 & 0.066 \\ -0.199 & -0.832 \\ 0.438 & 0.309 \\ -0.728 & 0.457 \end{pmatrix} \begin{pmatrix} \sqrt{21569.013} & 0 \\ 0 & \sqrt{1809.803} \end{pmatrix} = \begin{pmatrix} 71.72 & 2.80 \\ -29.19 & -35.38 \\ 64.38 & 13.13 \\ -106.90 & 19.44 \end{pmatrix}

We can plot these coordinates and we see the relative positions are what we would expect:
[Plot: the MDS coordinates (coords[,1] on the x-axis, coords[,2] on the y-axis), showing the relative positions of Cambridge, Hinxton, Oxford and Southampton.]

We can also calculate the distances between the new points – which is identical to the
distances between the original points.

D_{new} = \begin{pmatrix} 0 & 107.9 & 12.8 & 179.4 \\ 107.9 & 0 & 105.4 & 95.1 \\ 12.8 & 105.4 & 0 & 171.4 \\ 179.4 & 95.1 & 171.4 & 0 \end{pmatrix}

In this case the value of the minimized stress function would be zero (i.e., we have found a
perfect match).
Just to double-check, we can compute X X’ to recover our B matrix.
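A sketch of the classical MDS calculation above in numpy; it should reproduce the coordinates (possibly up to sign flips of the eigenvectors, which do not change the relative positions):

    import numpy as np

    # Pairwise distances (Cambridge, Oxford, Hinxton, Southampton)
    D = np.array([[  0.0, 107.9,  12.8, 179.4],
                  [107.9,   0.0, 105.4,  95.1],
                  [ 12.8, 105.4,   0.0, 171.4],
                  [179.4,  95.1, 171.4,   0.0]])
    n = D.shape[0]

    # Double centering: B = -1/2 * J * D^(2) * J, with J = I - (1/n) * 1 1'
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J

    # Eigendecomposition of the symmetric matrix B, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep the two positive eigenvalues and form the coordinates X = E * sqrt(Lambda)
    X = eigvecs[:, :2] * np.sqrt(eigvals[:2])
    print(np.round(X, 2))

    # The pairwise distances between the new points match the original distances
    D_new = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    print(np.round(D_new, 1))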

Relationship between PCA and MDS


MDS only requires a dissimilarity matrix (or distance matrix) as input, whereas PCA works
with the underlying data matrix. In some instances, when working with MDS, we may have
access to the underlying data (e.g., locations on a map, rather than just the distances
between them) from which we calculate Euclidean distances. In this case, classical MDS gives
the same result as PCA.
Hypothesis testing

• A statistical hypothesis test assesses the consistency of a ‘null hypothesis’ with
observed data.
• The test is often interpreted using a p-value.
• p-value (p): the probability of obtaining a result equal to or more extreme than the one
observed in the data, assuming the null hypothesis is true.

p-values are very common throughout biology, so getting familiar with them is important.

• In interpreting the p-value of a significance test, you specify a significance level (α).
• A common value for α is 5%.
• The p-value is compared to α. A significance test is claimed to be “statistically
significant” if the p-value is less than the significance level. This means that the null
hypothesis is rejected.
• You have two options:
• p ≤ α: reject H0
• p > α: fail to reject H0

To begin with, let’s work with an example where we can get a direct measure of the p-value.
We will work with a binomial problem. To help illustrate the problem, let’s work with a coin.

We have a coin and would like to know whether it’s biased to land on heads.

• We use sample data to assess whether the population proportion (here, the probability
of heads, denoted π below) for a dichotomous variable equals a specific value. This means
we will conduct an experiment where we flip the coin n times and record each time the
side it lands on. We will then use that information to assess whether the coin is biased.
• The null hypothesis (often denoted H0) is that the probability of heads is 0.5.
• The alternative hypothesis is that the probability of heads is greater than 0.5.
• Assumptions:
• Random samples
• Independent observations
• The variable of interest is binary (only two possible outcomes).
• The number of trials, n, is fixed ahead of time.

We can use the probability mass function to calculate the probability of observing X heads.
Note that the p-value is the probability of obtaining a result equal to or more extreme than
was observed in the data. We therefore need to consider the probabilities attached to the more
extreme values too:

p\text{-value} = \sum_{x=k}^{n} P(X = x \mid n, \pi)

where k is the observed number of heads.
If we got 15 heads from 20 coin flips, we would get a p-value of 0.021. Using an alpha value
of 0.05, this means we have evidence to reject the null hypothesis that the probability of
getting heads is 0.5, in favour of the alternative hypothesis that the coin has a greater
probability of giving heads.

Often we’re not so interested in the direction. We just want to assess whether
the coin is biased. The alternative hypothesis is then ‘different from 0.5’.

In this toy example, you would need to consider getting 15 or more heads, or 15 or more
tails. In this binomial case, this is equivalent to simply doubling the p-value we got before
(giving 0.04).
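A sketch of the same calculation with scipy (binomtest needs a reasonably recent scipy; otherwise the tail sum can be computed directly from the binomial pmf):

    from scipy.stats import binomtest

    # One-sided test: is the probability of heads greater than 0.5?
    res_greater = binomtest(k=15, n=20, p=0.5, alternative="greater")
    print(res_greater.pvalue)    # approximately 0.021

    # Two-sided test: is the coin biased in either direction?
    res_two_sided = binomtest(k=15, n=20, p=0.5, alternative="two-sided")
    print(res_two_sided.pvalue)  # approximately 0.04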

Type I and type II errors

The p-value is ultimately a probability and so given the p-value, we could make an error in
our interpretation. There are two types of error:

• Type I error: probability of rejecting the null hypothesis when it is true. Usually, it is
the significance level of the test. It is denoted as α
• Type II error: probability of not rejecting the null hypothesis when it is false. It is
denoted as β.

Decreasing one type of error increases the other, so in practice we fix the type I error and
choose the test that minimizes type II error.

Statistical power
Statistical power is the probability that the test correctly rejects the null hypothesis when the
null hypothesis is false, i.e., the probability of a true positive result.
The greater the statistical power, the lower the probability of making a Type II (false
negative) error. The power of a test is one minus the probability of a Type II error (power = 1 − β).

We want experiments to have high statistical power. Experiments with too low statistical
power will lead to invalid conclusions.

• Low Statistical Power: Large risk of committing Type II errors, e.g. a false negative.
• Experiments are often designed to have a statistical power of 80% (or greater). This
means a 20% probability of encountering a Type II error.
• Note this is different from the 5% probability of encountering a Type I error.

One sample T test

The purpose of the One Sample T Test is to determine whether sample observations could have
come from a population with a specified mean. It is typically implemented on small samples.
The test statistic is:

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

where x̄ is the sample mean, μ₀ the hypothesized mean, s the sample standard deviation and n
the sample size. The numerator gives you an estimate of the signal (i.e., how far away our
observed mean is from the hypothesized value) and the denominator gives you an estimate of
the noise (i.e., how variable the data are).

Let’s apply this approach to an experimental dataset where seven mice were given
treatment following trauma. The duration of their survival was then recorded.
[Histogram: frequency of survival times (days) for the seven treated mice, on a scale from 0 to 200 days.]

It is known that mean survival for healthy (untreated) mice is 129 days. We would like to know
whether the treated mice have the same survival as healthy mice.

The mean survival of these seven mice is 86.9 and the standard deviation is 66.8. We can
plug these numbers into our t-test equation above and we get:
t = \frac{86.9 - 129}{66.8/\sqrt{7}} = -1.67

So there is 1.67 times as much signal as noise. So is there a significant reduction in survival?

To answer this question, we use T distributions. The T distribution provides the density
function under the null distribution (in this case that the mean survival is 129 days). The
distribution depends on the sample size, with smaller samples having fatter tails.

• Thicker tails indicate that t-values are more likely to be far from zero even when the
null hypothesis is correct. The changing shapes are how t-distributions factor in the
greater uncertainty when you have a smaller sample.
• Once sample sizes get to ~30, they become very close to the normal distribution.

[Plot: t-distribution densities for 4, 7 and 30 degrees of freedom compared with the normal distribution.]

So going back to our example, we put a vertical line at our observed signal-to-noise ratio on
a t-distribution for a sample size of 7 (i.e., 6 degrees of freedom).

We then calculate the area under the curve beyond this value to extract our p-value (in this
case 0.073). If we use an alpha value of 0.05, this says that there is no evidence of reduced
survival, i.e., we fail to reject the null hypothesis of no difference.
As before, we may just want to know if there has been a significant difference in survival
(without caring whether it’s higher or lower). In this case we would place two lines on the t-
distribution, on either side of 0, and add up both extreme areas.

The hypothesis is now whether our observation is greater or less than the hypothesized
value (in either direction). Again, there is no evidence to support a significant change in
survival.
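A sketch of the same test with scipy, using the survival times listed in the bootstrap section below (the 'alternative' argument needs a reasonably recent scipy):

    from scipy import stats

    surv = [94, 197, 16, 38, 99, 141, 23]   # survival times in days

    # One-sided test: is mean survival less than 129 days? (t is about -1.67, p about 0.073)
    print(stats.ttest_1samp(surv, popmean=129, alternative="less"))

    # Two-sided test: is mean survival different from 129 days?
    print(stats.ttest_1samp(surv, popmean=129))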

The key assumptions of the T-test are:


• The data are continuous (not discrete).
• The data (approximately) follow a normal probability distribution.
• The test sample is a random sample from the population. Each individual in the
population has an equal probability of being selected in the sample.
• Observations are independent of each other
• No extreme outliers in the dependent variable

Bootstrap

As we see from our dataset, the data may not be approximately normally distributed.
[Histogram: survival times (days) for the seven treated mice, repeated from above.]

In this circumstance, we can use a bootstrap approach.

With bootstrapping, we repeatedly resample the dataset with replacement to create many
new ‘bootstrap’ datasets. Each of these datasets has its own mean. Each data point has an
equal probability of being included in a resampled dataset. ‘With replacement’ means we
can select a data point more than once within a resampled dataset. The resampled datasets
are the same size as the original dataset.

For example, let’s do 5,000 bootstrap iterations of the mouse data and calculate the mean
each time.
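A sketch of what this resampling might look like:

    import numpy as np

    rng = np.random.default_rng(42)
    surv = np.array([94, 197, 16, 38, 99, 141, 23])

    # Resample with replacement 5,000 times, keeping the mean of each resample
    boot_means = np.array([rng.choice(surv, size=len(surv), replace=True).mean()
                           for _ in range(5000)])

    # boot_means can now be plotted as a histogram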

[Histogram: bootstrap distribution of the mean survival time across the 5,000 resamples.]

We will learn in the next lecture about how we can use this to create bootstrap confidence
intervals.

We can use the bootstrap process for hypothesis testing too. We cannot simply resample the
data, though, as the empirical distribution would not be consistent with H0 (its mean differs
from the hypothesized mean). Instead, we first transform our dataset (the zᵢs) as follows:

\tilde{z}_i = z_i - \bar{z} + \mu_0

where zᵢ is each original data point, z̄ is the mean of our data and μ₀ is the hypothesized mean
of the null distribution. The mean of the transformed values z̃ᵢ will be μ₀.

We repeatedly resample from z̃₁, z̃₂, …, z̃ₙ with replacement. Each time, we recompute the test
statistic:

t(\tilde{z}^*) = \frac{\operatorname{mean}(\tilde{z}^*) - \mu_0}{\operatorname{sd}(\tilde{z}^*)/\sqrt{n}}

The p-value is then the proportion of bootstrap test statistics that are equal to or more extreme
than the test statistic computed from the original data.

Going back to our example with the mouse data. The original data is: 94 197 16 38 99 141
23 (mean=86.9).
• Transformed data (z̃ᵢ): 136.1 239.1 58.1 80.1 141.1 183.1 65.1 (mean = 129; the observed
test statistic from the original data is −1.67)
• Resample 1: 80.1 80.1 80.1 183.1 136.1 239.1 136.1 (mean = 133.6, test statistic =
0.20)
• Resample 2: 65.1 183.1 58.1 136.1 239.1 183.1 183.1 (mean = 149.7, test statistic =
0.82)
• Resample 3: 58.1 183.1 80.1 136.1 80.1 239.1 239.1 (mean = 145.1, test statistic =
0.56)
• …

We repeat this process many times. We can then plot a histogram of the result:

We obtain a p-value of 0.096 for the one-tailed test. This is not too different from the value
obtained using the t-distribution. In practice, the t-distribution often does quite well for these
kinds of questions and can be robust to data that do not look very normal.
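A sketch of the whole bootstrap test:

    import numpy as np

    rng = np.random.default_rng(42)
    z = np.array([94, 197, 16, 38, 99, 141, 23])
    mu0 = 129                                    # hypothesized mean under H0

    def t_stat(x, mu0):
        # Signal-to-noise ratio: (sample mean - hypothesized mean) / standard error
        return (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))

    t_obs = t_stat(z, mu0)                       # about -1.67

    # Shift the data so that its mean equals mu0 (consistent with H0)
    z_tilde = z - z.mean() + mu0

    # Resample the shifted data and recompute the test statistic each time
    boot_t = np.array([t_stat(rng.choice(z_tilde, size=len(z_tilde), replace=True), mu0)
                       for _ in range(5000)])

    # One-tailed p-value: proportion of bootstrap statistics as or more extreme than t_obs
    print(np.mean(boot_t <= t_obs))              # roughly 0.1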
Method of moments

In biology, we are often presented with some experimental data that we would like to use to
estimate some characteristics (or parameters).

Let’s pretend we have a coin that we toss ten times; the results are 0, 0, 1, 1, 0, 1, 1, 1, 1, 1
(where 1 is a head and 0 a tail). We would like to know the probability that this coin lands
on heads.

One way to estimate this, is through something called ‘Method of Moments’. This involves
equating ‘sample moments’ with ‘theoretical moments’:

E(X^k) is the kth theoretical moment of the distribution about the origin, for k = 1, 2, …

M_k = \frac{1}{n}\sum_{i=1}^{n} X_i^k is the kth sample moment, for k = 1, 2, …

So we start with k = 1 and equate the first (k = 1) sample moment about the origin,
M_1 = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}, with the first theoretical moment E(X).

Next, we do k = 2 and equate the second (k = 2) sample moment about the origin,
M_2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2, with the second theoretical moment E(X^2).

We continue (k=3, k=4…), until we have as many parameters as we have equations. We can
then solve for the parameters.

Bernoulli example

Going back to our coin, we see that each coin flip comes from a Bernoulli distribution with
parameter p. We know that Bernoulli distributions have a theoretical mean of p. Therefore in
this example, the first theoretical moment 𝐸(𝑋) = 𝑝. There is only one parameter, so we only
need one equation:
p = \frac{1}{n}\sum_{i=1}^{n} X_i

This is a very simple example: we don’t need to do any rearranging, and the estimate is simply:
\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i

Note that we’ve put a ‘hat’ on the p to indicate that it is an estimate of p.


In our example, we got seven heads out of 10 coin flips, therefore our estimate for p is
7/10=0.7.

Normal distribution example

Normal distributions have two parameters – the mean (μ) and the variance (σ²). We want to
write our theoretical moments in terms of μ and σ. For the first moment, we know that
E(X) = μ. The second moment, E(X²), is a little harder. However, remember that the variance
of a random variable is Var(X) = E(X²) − E(X)². We can rearrange this to:

Var(X) + E(X)² = E(X²)

And we know that for normal distributions, Var(X) = σ² and E(X) = μ. Therefore:

E(X²) = σ² + μ²

If we now equate the first two sample moments with their theoretical moments, we have our
two equations:

M₁ = μ

M₂ = σ² + μ²

We’d like to be able to estimate μ and σ, so let’s rearrange these equations by substituting
μ = M₁ into the second equation:

M₂ = σ² + M₁²

σ² = M₂ − M₁²

We can now plug in the sample moments for M₁ and M₂ to get our estimates for the mean
and the variance:

\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i

\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)^2
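A quick numerical sketch of both method-of-moments examples (the data in the normal example are made up for illustration):

    import numpy as np

    # Bernoulli example: the ten coin flips
    flips = np.array([0, 0, 1, 1, 0, 1, 1, 1, 1, 1])
    p_hat = flips.mean()                      # first sample moment = 0.7

    # Normal example with illustrative data
    x = np.array([4.1, 5.3, 6.0, 4.8, 5.6])
    m1 = x.mean()                             # first sample moment
    m2 = np.mean(x ** 2)                      # second sample moment
    mu_hat = m1
    sigma2_hat = m2 - m1 ** 2                 # note: this is the 1/n (biased) variance
    print(p_hat, mu_hat, sigma2_hat)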
Generalized linear models

You should have covered GLMs in either 1A mathematical biology or during the lecture series
provided in the summer materials (available on Moodle). Here we just provide some example
python code on how to run GLMs as it will be useful, especially for the miniprojects.

The first model is assessing the relationship between sex (male/female) and requiring ICU.
The second model also includes age as an additive term. The final model includes an
interaction too.
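A sketch of what the three models might look like with statsmodels, assuming a dataframe df with columns icu (0/1), sex and age (the column names are assumptions):

    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Logistic regression (binomial GLM) of ICU admission on sex
    m1 = smf.glm("icu ~ sex", data=df, family=sm.families.Binomial()).fit()

    # Adding age as an additive term
    m2 = smf.glm("icu ~ sex + age", data=df, family=sm.families.Binomial()).fit()

    # Including a sex-by-age interaction
    m3 = smf.glm("icu ~ sex * age", data=df, family=sm.families.Binomial()).fit()

    print(m1.summary())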

Calling ‘.summary()’ on a fitted model will provide a summary of the model results:

Of particular interest are normally the coefficient estimates. In this example, we see that sex
has a coefficient estimate of 0.59. This is a log odds ratio: males have log odds of requiring
ICU that are 0.59 higher than females (95% CI: 0.31–0.88), corresponding to an odds ratio of
roughly exp(0.59) ≈ 1.8.
