
Computational Genomics Midterm

MSCBIO 2070/02-710
Spring 2015

March 25, 2015

This exam has 8 questions, for a total of 100 points.

Name: ______________________________________________________________________

Instructions:
• Write clearly. You only need to provide explanations in the places that specifically ask
for it.

• If you need more room to work out your answer to a question, use the back of the page.
Make sure to indicate that we should check the back of the page for the rest of your
answer.

• This exam is open book. Calculators are allowed, but no computers, PDAs, or other
communication devices.

• You have 1 hour and 30 minutes. Good luck!

No.  Topic                              Max. Score   Your Score
A    Evolutionary Distances             18
B    Molecular Evolution                12
C    Phylogeny                          12
D    Scoring Matrices                   10
E    Normalization, DE genes, and NGS   16
F    Multiple Hypothesis Testing        10
G    Clustering                         12
H    Time Series                        10

A. Evolutionary Distances [18 points]

Consider the alignment of an ancestral sequence S0 and a descendent sequence S1:

S0: GCCGTCAGAAATTTAGCACTGATCACAGCCTCGTCTCTGA
S1: GCCCTCAGGGAATTAGCACTAATCATAACTCCGTCTGTGT

1. [3 points] Are the events S0 = A and S1 = G independent?


How about the events S0 = A and S1 = A, are they independent?

Answer: First compile the frequency table (rows: base in S1, columns: base in S0):

            S0
         A  G  C  T
     A   7  2  0  1
S1   G   2  5  1  0
     C   0  1  9  1
     T   1  0  2  8

As there are 10 A’s among the 40 bases of S0, P(S0 = A) = 10/40 = ¼.
Similarly, there are 8 G’s among the 40 bases of S1, so P(S1 = G) = 8/40 = ⅕.

Of the 40 aligned pairs, 2 have S0 = A and S1 = G, hence P(S0 = A and S1 = G) =
2/40 = 1/20.

P(S0 = A and S1 = G) = 1/20 = P(S0 = A) * P(S1 = G) => independent.

P(S1 = A) = 10/40 = ¼ and P(S0 = A and S1 = A) = 7/40, which is not equal to
P(S0 = A) * P(S1 = A) = 1/16, hence the events S0 = A and S1 = A are not independent.
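
The independence check above can be reproduced directly from the two sequences. This is a minimal sketch; the helper name `independent` is an illustrative assumption, and exact fractions are used so that the equality test is not affected by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction

S0 = "GCCGTCAGAAATTTAGCACTGATCACAGCCTCGTCTCTGA"
S1 = "GCCCTCAGGGAATTAGCACTAATCATAACTCCGTCTGTGT"

n = len(S0)
pairs = Counter(zip(S0, S1))      # joint counts of (S0 base, S1 base)
c0, c1 = Counter(S0), Counter(S1)  # marginal counts for each sequence


def independent(a, b):
    """True when P(S0=a and S1=b) equals P(S0=a) * P(S1=b) exactly."""
    joint = Fraction(pairs[(a, b)], n)
    return joint == Fraction(c0[a], n) * Fraction(c1[b], n)


print(independent("A", "G"))  # True
print(independent("A", "A"))  # False
```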

2. [3 points] Write the transition matrix for the conditional probabilities of base
substitutions from S0 to S1.

Answer: Transition matrix (the entry in row i, column j is P(S1 = i | S0 = j); rows and
columns are ordered A, G, C, T):

0.7    0.25   0.0    0.1
0.2    0.625  0.083  0.0
0.0    0.125  0.75   0.1
0.1    0.0    0.167  0.8
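
As a sketch of how these conditional probabilities follow from the alignment, each joint count can be divided by the number of times the S0 base occurs. The variable names here are illustrative assumptions; the base order A, G, C, T follows the frequency table.

```python
from collections import Counter

S0 = "GCCGTCAGAAATTTAGCACTGATCACAGCCTCGTCTCTGA"
S1 = "GCCCTCAGGGAATTAGCACTAATCATAACTCCGTCTGTGT"
BASES = "AGCT"

pairs = Counter(zip(S0, S1))  # counts of (S0 base, S1 base)
s0_totals = Counter(S0)       # how often each base occurs in S0

# matrix[i][j] = P(S1 = BASES[i] | S0 = BASES[j]); every column sums to 1.
matrix = [
    [pairs[(src, dst)] / s0_totals[src] for src in BASES]
    for dst in BASES
]

for dst, row in zip(BASES, matrix):
    print(dst, " ".join(f"{p:.3f}" for p in row))
```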

3. [2 points] How many mutations did you find in the list of base pairs? Use this data to
compute Jukes-Cantor distance between S0 and S1.

Answer: out of 40 sites, 11 have mutated which means

p = fraction of sites that disagree in comparing S0 and S1 = 11/40

JC-distance = -¾ ln(1 - (4/3)p) = -¾ ln(1 - (4/3)(11/40)) = 0.34
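
The same computation as a short sketch, counting mismatched sites and applying the Jukes-Cantor formula quoted above (variable names are illustrative):

```python
import math

S0 = "GCCGTCAGAAATTTAGCACTGATCACAGCCTCGTCTCTGA"
S1 = "GCCCTCAGGGAATTAGCACTAATCATAACTCCGTCTGTGT"

# Number of aligned sites where the two sequences differ.
mismatches = sum(a != b for a, b in zip(S0, S1))
p = mismatches / len(S0)

# Jukes-Cantor distance: -(3/4) ln(1 - (4/3) p)
d_jc = -0.75 * math.log(1 - 4 * p / 3)

print(mismatches, round(d_jc, 2))  # 11 0.34
```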

4. [2 points] In the JC model what is an appropriate value for α?

Answer: α denotes the rate of observable substitutions over one time step.
11 out of 40 have undergone mutation from S0 to S1. We can take α = 11/40 here.

5. [6 points] How many mutations are transitions and how many are transversions? If you
have to use the 2-parameter Kimura model, what would the transition matrix be? Recall the
two-parameter Kimura model uses a Markov matrix where the mutation rate for transitions is
β, the mutation rate for each transversion is γ, and self-transitions (diagonal entries) are
given by 1 - β - 2 γ. Also, compute the Kimura 2-parameter distance between S0 and S1.
Recall, 2-parameter Kimura distance is given by (-1/2) ln(1 - 2 β – γ) – (1/4) ln(1 - 2 γ).

Answer: Transitions: 7 mutations (within A/G or within C/T); transversions: 4 (from A/G
to C/T and vice versa).
Kimura 2-parameter model: β = fraction of transitions = 7/40 and 2γ = fraction of
transversions, hence γ = (4/40)/2 = 2/40.

For the transition matrix each diagonal entry = 1 - β - 2γ = 29/40.

Transition matrix (rows and columns ordered A, G, C, T):
29/40 7/40 2/40 2/40
7/40 29/40 2/40 2/40
2/40 2/40 29/40 7/40
2/40 2/40 7/40 29/40

Kimura distance, evaluated with the observed fraction of transitions (7/40) and the total
fraction of transversions (4/40):

Kimura distance = (-1/2) ln(1 - 2(7/40) - 4/40) - (1/4) ln(1 - 2(4/40))
= (-1/2) ln(0.55) - (1/4) ln(0.8) ≈ 0.355
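
A minimal sketch of the counting and the distance computation. It assumes the distance formula is evaluated with P, the observed fraction of transitions, and Q, the observed total fraction of transversions (that is, 2γ); this is the substitution that reproduces the value in the answer. The names `is_transition`, `P`, and `Q` are illustrative.

```python
import math

S0 = "GCCGTCAGAAATTTAGCACTGATCACAGCCTCGTCTCTGA"
S1 = "GCCCTCAGGGAATTAGCACTAATCATAACTCCGTCTGTGT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """A substitution within the purines (A, G) or within the pyrimidines (C, T)."""
    return a != b and ((a in PURINES) == (b in PURINES))


transitions = sum(is_transition(a, b) for a, b in zip(S0, S1))
transversions = sum(a != b and not is_transition(a, b) for a, b in zip(S0, S1))

n = len(S0)
P = transitions / n    # assumed: observed transition fraction
Q = transversions / n  # assumed: observed total transversion fraction (2 * gamma)

d_k2p = -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)
print(transitions, transversions, round(d_k2p, 3))
```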

6. [2 points] Which of the JC distance and Kimura distance is likely to be a more
reasonable measure? Justify.

Answer: Kimura. β and γ are different, with β being 3.5 times γ.
If JC is assumed, then β = γ (each equal to α/3).
Kimura is more likely to be reasonable.

B. Molecular Evolution – Part 2 [12 points]

1. [3 points] Given a portion of the aligned sequences of a protein coding region of a
gene from an organism (GENE2) and its evolutionary ancestor (GENE1), what are the
numbers of synonymous and non-synonymous mutations? See Lecture 4 Slide 23 for an
example of synonymous vs. non-synonymous mutations. The codon table is shown
below.

GENE1: AGA-GTA-GGA-CTT-GCT-ACA-TCC
GENE2: AGC-GAA-GGG-CTT-TCT-ACG-TTC

Answer:
        R - V - G - L - A - T - S
GENE1: AGA-GTA-GGA-CTT-GCT-ACA-TCC
GENE2: AGC-GAA-GGG-CTT-TCT-ACG-TTC
        S - E - G - L - S - T - F

Synonymous mutations = 2 (GGA→GGG, ACA→ACG), non-synonymous mutations = 4 (R→S, V→E,
A→S, S→F).
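
The translation and the counts can be checked with a short sketch. It assumes the standard genetic code, built here from the conventional T, C, A, G codon ordering; the names `CODON_TABLE`, `protein1`, and `protein2` are illustrative. Each differing codon in this example differs at exactly one nucleotide, so counting per codon equals counting per mutation.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

GENE1 = "AGA-GTA-GGA-CTT-GCT-ACA-TCC".split("-")
GENE2 = "AGC-GAA-GGG-CTT-TCT-ACG-TTC".split("-")

protein1 = "".join(CODON_TABLE[c] for c in GENE1)
protein2 = "".join(CODON_TABLE[c] for c in GENE2)

synonymous = nonsynonymous = 0
for c1, c2 in zip(GENE1, GENE2):
    if c1 == c2:
        continue  # codon unchanged, no mutation
    if CODON_TABLE[c1] == CODON_TABLE[c2]:
        synonymous += 1      # same amino acid
    else:
        nonsynonymous += 1   # amino acid changed

print(protein1, protein2, synonymous, nonsynonymous)
```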

2. [3 points] Are the mutations from GENE1 to GENE2 advantageous, deleterious, or
neutral (see Lecture 4 Slide 24)? Explain.

Answer: dn/ds = 4/2 = 2 > 1, therefore the mutations are likely advantageous.

3. [3 points] Does your answer in the previous question support a selection based or neutral
theory of evolution (where the rates of non-synonymous and synonymous mutations are
the same)? Explain.

Answer: dn/ds ≠ 1, therefore this gene sequence favors a selection based theory of evolution.

4. [3 points] Why does the simple measure of similarity between these two sequences
underestimate the evolutionary distance between them?

Answer: Multiple substitutions at a single site are counted only as a single mutation and hidden
mutations are not counted at all.

C. Phylogeny [12 points]

OTU: noncommittal term used for objects of study (be they species, populations or individuals)

1. [2 points] In a rooted ultrametric tree with 4 OTUs (A, B, C, D), the distance between the
root and A is equal to the distance between the root and C.

TRUE FALSE
Answer: TRUE

2. [2 points] In a rooted additive tree with 4 OTUs (A, B, C, D), the distance between the
root and A is equal to the distance between the root and C.

TRUE FALSE
Answer: FALSE

3. [1 point] UPGMA produces ultrametric trees.

TRUE FALSE
Answer: TRUE

4. [1 point] Neighbor-Joining produces ultrametric trees.

TRUE FALSE
Answer: FALSE

5. [3 points] GENE1 is found in species A and B, but not in C, D, E, and F. GENE2 is
found in species C and D, but not in the other species. Which of the following rooted
phylogenetic trees is supported by these findings? There may be more than one tree that
could fit this data.

Answer: Trees 1 and 2.

6. [3 points] Using maximum parsimony, reconstruct the ancestral nucleotide at the internal
nodes of the following tree by labeling the ancestors at each node, 1-7. If more than one
nucleotide is possible, indicate which they are.

Answer:
1. G/T
2. A/T
3. T/G
4. A
5. G/T/A
6. T
7. T

D. Scoring Matrices [10 points]

Alignment scores are log-odds scores:

S_ab = (1/λ) log( p_ab / (f_a f_b) )

We will use the term target frequencies to denote collectively: p_ab, f_a, f_b.

If we expect to find a and b aligned together in homologous sequences more often than we
expect them to occur by chance (p_ab > f_a f_b), then the odds ratio is greater than one and the
score is positive. Positive scores mean conservative substitutions, and negative scores indicate
non-conservative substitutions. But this definition is purely statistical, with no relation to
biochemistry. Keep this in mind as you solve the questions below.

1. [4 points] In BLOSUM62 you will find that tryptophan pairs (W/W) score +11 while
leucine pairs (L/L) score only +4. In other words, the identity pairs (W/W, L/L, ..) do not
all get the same score. Explain why this might be the case for W/W and L/L.

Answer: It depends on the ratios p_LL/(p_L)^2 vs p_WW/(p_W)^2, where p_L and p_W are the
background frequencies of L and W. Since both scores are positive, it is clear that
p_LL > (p_L)^2 and p_WW > (p_W)^2. If p_LL = p_WW and p_L > p_W, then we can see
why s_WW is larger than s_LL. It is also possible that p_LL > p_WW and p_L > p_W but the
ratio still favors s_WW over s_LL.

As it turns out, in the homologous alignment data that BLOSUM62 was trained on, p_LL =
0.0371 > p_WW = 0.0065, but leucine (p_L = 0.099) is found far more frequently than
tryptophan (p_W = 0.013).
2. [6 points] Let’s make up a DNA score matrix where we want to optimize the matrix for
finding 88% identity elements. Assume all mismatches are equally probable and the
composition of both alignments and background sequences is uniform at 25% for each
nucleotide. Assuming λ = 0.25, what is the score you will assign for a match (such as
AA, GG, CC, TT) and what is the score you will assign for a mismatch (such as AG, CT
and so on) (hint: round up the scores where convenient).

Answer:
Match probability: set p_AA and so on = 0.88/4 = 0.22
Mismatch probability: set p_AG and so on = 0.12/12 = 0.01 for each of 12 mismatches
Background probability = 0.25
With λ = 0.25, the factor 1/λ = 4 (natural log):
Match score = 4 ln(0.22/(0.25^2)) = 4 ln(3.52) ≈ 5.03 = ~5
Mismatch score = 4 ln(0.01/(0.25^2)) = 4 ln(0.16) ≈ -7.33 = ~(-7)
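
The arithmetic can be sketched as follows, assuming the natural logarithm and a scale factor of 1/λ = 4, which is what reproduces the scores of about +5 and -7. Variable names are illustrative.

```python
import math

identity = 0.88    # target fraction of identical positions
lam = 0.25         # lambda from the question
background = 0.25  # uniform base composition

p_match = identity / 4            # each of the 4 identical pairs: 0.22
p_mismatch = (1 - identity) / 12  # each of the 12 mismatched pairs: 0.01

# S_ab = (1/lambda) * ln(p_ab / (f_a * f_b))
match_score = math.log(p_match / background**2) / lam
mismatch_score = math.log(p_mismatch / background**2) / lam

print(round(match_score), round(mismatch_score))  # 5 -7
```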

E. Normalization, DE genes and NGS [16 points]

1. Let C be the set of cancer samples in our data and H be the set of healthy samples. We
know that prior to normalization, expression values (or read counts for a RNA-Seq
experiment) for gene A in all samples for C are higher than values for gene B in these
samples whereas values for gene A in all samples of H are lower than the values of B in
these samples.

Denote by Ci (A) the normalized value for gene A in cancer cell i and Ci (B) the
normalized value for B in that cell. Circle all answers that can apply (of course, you will
be penalized for circling answers that cannot be true).

1. [2 points] Using scale factor normalization
a. Ci (A) > Ci (B)
b. Ci (A) < Ci (B)
c. Ci (A) = Ci (B)
Answer: a. Scale factor normalization is a linear transformation and maintains the original
relationship between the values.

2. [2 points] Using invariant set normalization
a. Ci (A) > Ci (B)
b. Ci (A) < Ci (B)
c. Ci (A) = Ci (B)
Answer: a. Invariant set normalization applies a non-decreasing function that is strictly
increasing for different values (though the slope can differ between locations).

Now assume we used scale factor normalization. We observe that after normalization, A and B
have the same standard deviation across all cells and within each population (Cancer and
Healthy). Denote by pA the p-value we obtained for A using a statistical test, and by pB the
p-value we obtained for B. Which of the following holds (choose the most accurate answer)?

3. [3 points] If we used a t-test to compute the p-value then:
a. pB < pA
b. pB ≤ pA
c. pB > pA
d. Impossible to tell
Answer: c. Since the difference in the means is greater for A, and the standard
deviation is the same, its p-value would be more significant.

4. [3 points] If we performed randomization tests using the same random sets for both
genes (i.e. in each randomized setting we are computing the parameters for both
genes and the p-value was based on this randomization):
a. pB < pA
b. pB ≥ pA
c. pB > pA
d. Impossible to tell
Answer: d. By the very nature of this test it is stochastic and in this case it
could be that in some cases we will see a difference that is lower than the
difference we see for B across healthy and cancer cells but higher than the
difference we see for A.

Now assume we performed scale factor normalization and consider two other genes, Z and X.
Let AvC(Z) denote the average expression of gene Z in cancer cells and AvH(Z) denote its
average in healthy cells. Assume |AvC(Z) – AvH(Z)| > |AvC(X) – AvH(X)| and that Z and X have
the same variance in both cell types. Answer TRUE / FALSE and briefly explain below.

5. [3 points] Using log likelihood ratio test, the p-value for Z is lower (more significant)
than the p-value for X

TRUE FALSE

Answer: TRUE. Since the variance is the same, and so are the number of samples
and DOF, the only thing that matters is the difference in means which is more
significant for Z.

6. [3 points] Using SAM the p-value for Z is lower (more significant) than the p-value
for X

TRUE FALSE

Answer: FALSE. The question does not tell us what the actual expression levels are for
X and Z. SAM includes a correction term for lowly expressed genes and this
can lower the significance of Z even if the average difference for it is
larger.

F. Multiple Hypothesis Testing [10 points]

1. Assume we have 5 samples from cancer patients, X samples from healthy patients and we
are measuring N genes. We found a group of genes A that all have a differential p-value
< 0.001.

a. [4 points] If we used a randomization test to compute the p-values, what is the
minimum number of healthy samples we have in our cohort? Briefly explain.

Answer: We need at least 8 healthy samples, since we need to be able to select at least
1000 subsets of samples and (13 choose 5) = 1287 > 1000 while (12 choose 5) = 792 < 1000.
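
A small sketch of the counting argument: increase the number of healthy samples until the number of distinct ways to pick the 5 "cancer" labels exceeds 1000. It uses `math.comb` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
import math

cancer = 5
healthy = 0

# A p-value below 0.001 needs more than 1000 distinct relabelings of the samples.
while math.comb(cancer + healthy, cancer) <= 1000:
    healthy += 1

print(healthy, math.comb(cancer + healthy, cancer))  # 8 1287
```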

b. [6 points] Assume we used a t-test for computing the p-value. If we know that the
FDR for genes in A is 0.01 (1%), and that the Bonferroni corrected p-value for genes in
A is at most 0.05, what is the size of N? What is the size of A?

Answer:

N = 50. If the Bonferroni corrected p-value is 0.05 and the uncorrected is 0.001, then the
number of genes is 0.05/0.001 = 50.

A = 5. If we have a total of 50 genes, we expect 50 × 0.001 = 0.05 genes to have a p-value
of < 0.001 by chance. The FDR is this expected number of false positives divided by the size
of A. Since the actual FDR is 1/5 of that value (0.01), we have 0.05/0.01 = 5 genes in A.

G. Clustering [12 points]

Figure: Three clustering results for the question below.

Select all the clustering method(s) that will lead to the results in the Figure above. Fill in the
table below by marking T if the clustering method can lead to these results and F if it cannot.

                                              Figure (a)   Figure (b)   Figure (c)
Gaussian mixture model                        T            F            F
k-means                                       T            F            F
Hierarchical clustering with single linkage   T            F            T

H. Time Series [10 points]

Given a set of n gene expression control points over time (no duplicate time points), quadratic
spline fitting constructs n−1 piecewise second-order polynomials between the points. The splines
need to satisfy the following criteria:

• Each spline needs to pass through its left-most and right-most control points.
• At each interior control point, the splines to the left and right of that point should be
continuous and have an equal first derivative at that point.

Let S1 = a x^2 + b x + c and S2 = d x^2 + e x + f be the two quadratic splines that end (S1)
and start (S2) in the same point (see Figure below).

Figure: Quadratic splines for the questions below.

1. [5 points] How many equations are defined by control point 2 in the figure? Write all
these equations.

Answer:

Two equations are defined by this point (x_i is the x-coordinate of control point 2):

d x_i^2 + e x_i + f = a x_i^2 + b x_i + c
2 a x_i + b = 2 d x_i + e

2. [5 points] How many free parameters do we need to fit in order to obtain n − 1 splines?
Briefly explain.

Answer: For each control point we have 3 so a total of (n-2)*3. Each equation constrains 2 of
the 3 parameters of the spline on the right. So a total of 3 for the first spline + 1 for all the other
splines leading to: 3+1*(n-2) = n+1.
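
To illustrate the counting argument, the sketch below takes the left spline S1 = a x^2 + b x + c and a freely chosen d, then solves the continuity and first-derivative equations at the shared point x_i for e and f. It is only meant to show that these two equations fix 2 of the 3 parameters of the right spline, leaving 1 free. The function name `right_spline` and the numeric values are illustrative assumptions.

```python
def right_spline(a, b, c, x_i, d):
    """Return (e, f) of S2 = d x^2 + e x + f that joins S1 smoothly at x_i."""
    # Equal first derivatives: 2 a x_i + b = 2 d x_i + e
    e = 2 * a * x_i + b - 2 * d * x_i
    # Continuity: a x_i^2 + b x_i + c = d x_i^2 + e x_i + f
    f = a * x_i**2 + b * x_i + c - d * x_i**2 - e * x_i
    return e, f


# Hypothetical left spline, shared control point, and free parameter d.
a, b, c = 1.0, -2.0, 0.5
x_i = 2.0
d = -3.0
e, f = right_spline(a, b, c, x_i, d)

s1_value = a * x_i**2 + b * x_i + c
s2_value = d * x_i**2 + e * x_i + f
s1_slope = 2 * a * x_i + b
s2_slope = 2 * d * x_i + e
print(e, f, s1_value == s2_value, s1_slope == s2_slope)
```

Any other value of d gives a different right spline that still satisfies both equations, which is the single remaining free parameter per additional spline.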
