MSCBIO 2070/02-710
Spring 2015
Name: ______________________________________________________________________
Instructions:
Write clearly. You only need to provide explanations in the places that specifically ask
for it.
If you need more room to work out your answer to a question, use the back of the page.
Make sure to indicate that we should check the back of the page for the rest of your
answer.
This exam is open book. Calculators are allowed, but no computers, PDAs, or other
communication devices.
NGS
F Multiple Hypothesis Testing 10
G Clustering 12
H Time Series 10
A. Evolutionary Distances [18 points]
S0: GCCGTCAGAAATTTAGCACTGATCACAGCCTCGTCTCTGA
S1: GCCCTCAGGGAATTAGCACTAATCATAACTCCGTCTGTGT
As there are 10 A's among the 40 bases of S0, P(S0 = A) = 10/40 = ¼.
P(S1 = G) = 8/40 = ⅕.
Of the 40 aligned pairs, there are 2 pairs for which S0 = A and S1 = G, hence P(S0 = A and S1 = G) =
2/40 = 1/20.
P(S1 = A) = 10/40 = ¼ and P(S0 = A and S1 = A) = 7/40, which is not equal to P(S0 = A) * P(S1 = A) = 1/16,
hence the events S0 = A and S1 = A are not independent.
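The counts used above can be checked mechanically. The following is a minimal sketch that tallies marginal and joint base frequencies over the aligned positions; the helper names `marginal`, `joint` and `are_independent` are illustrative and not part of the exam.

```python
from fractions import Fraction

S0 = "GCCGTCAGAAATTTAGCACTGATCACAGCCTCGTCTCTGA"
S1 = "GCCCTCAGGGAATTAGCACTAATCATAACTCCGTCTGTGT"


def marginal(seq, base):
    """Fraction of positions in seq that hold the given base."""
    return Fraction(seq.count(base), len(seq))


def joint(seq_a, seq_b, base_a, base_b):
    """Fraction of aligned positions with base_a in seq_a and base_b in seq_b."""
    pairs = list(zip(seq_a, seq_b))
    hits = sum(1 for x, y in pairs if x == base_a and y == base_b)
    return Fraction(hits, len(pairs))


def are_independent(seq_a, seq_b, base_a, base_b):
    """True when the joint frequency equals the product of the marginals."""
    product = marginal(seq_a, base_a) * marginal(seq_b, base_b)
    return joint(seq_a, seq_b, base_a, base_b) == product


print(marginal(S0, "A"), marginal(S1, "G"), joint(S0, S1, "A", "G"))
print(are_independent(S0, S1, "A", "A"))
```

Exact fractions are used instead of floats so that the independence check is a strict equality rather than a tolerance comparison.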
2. [3 points] Write the transition matrix for the conditional probabilities of base
substitutions from S0 to S1.
3. [2 points] How many mutations did you find in the list of base pairs? Use this data to
compute Jukes-Cantor distance between S0 and S1.
Answer: α denotes the rate of observable substitutions over one time step.
11 out of 40 have undergone mutation from S0 to S1. We can take α = 11/40 here.
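As a rough check of the numbers, the sketch below counts the mismatched positions and plugs the mismatch fraction into the commonly used Jukes-Cantor formula d = -(3/4) ln(1 - (4/3) p). Treating the exam's α = 11/40 as the mismatch fraction p is an assumption about the intended notation, and the function names are illustrative.

```python
import math

S0 = "GCCGTCAGAAATTTAGCACTGATCACAGCCTCGTCTCTGA"
S1 = "GCCCTCAGGGAATTAGCACTAATCATAACTCCGTCTGTGT"


def count_mismatches(seq_a, seq_b):
    """Number of aligned positions where the two sequences differ."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def jukes_cantor_distance(p):
    """Jukes-Cantor distance for a mismatch fraction p (requires p < 3/4)."""
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


mismatches = count_mismatches(S0, S1)
p = mismatches / len(S0)  # assumption: alpha in the answer plays the role of p
print(mismatches, jukes_cantor_distance(p))
```

The distance comes out larger than the raw mismatch fraction because the formula corrects for sites that may have changed more than once.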
5. [6 points] How many mutations are transitions and transversions? If you have to use the
2-parameter Kimura model, what would the transition matrix be? Recall the two-
parameter Kimura model uses a Markov matrix where the mutation rate for transitions is
β, mutation rate for each transversion is γ and self-transitions (diagonal entries) are
given by 1 - β - 2 γ. Also, compute the Kimura 2-parameter distance between S0 and S1.
Recall, 2-parameter Kimura distance is given by (-1/2) ln(1 - 2 β – γ) – (1/4) ln(1 - 2 γ).
Answer: Of the 11 mutations, 7 are transitions and 4 are transversions.
Kimura: beta and gamma are different, with beta = 7/40 and gamma = 2/40, so beta is 3.5 times gamma.
If JC is assumed, then beta = gamma (each equal to alpha/3).
Kimura is more likely to be reasonable.
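The transition and transversion counts behind this answer can be reproduced with a short script. The sketch below also evaluates the distance formula quoted in the question, substituting the observed transition fraction for β and the observed total transversion fraction for γ; that substitution is an assumption, and the function names are illustrative.

```python
import math

S0 = "GCCGTCAGAAATTTAGCACTGATCACAGCCTCGTCTCTGA"
S1 = "GCCCTCAGGGAATTAGCACTAATCATAACTCCGTCTGTGT"

PURINES = {"A", "G"}  # C and T are the pyrimidines


def count_transitions_transversions(seq_a, seq_b):
    """Return (transitions, transversions) over the aligned positions."""
    transitions = transversions = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        # A transition stays within purines or within pyrimidines.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


def kimura_distance(p_ts, q_tv):
    """-(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q) for fractions P and Q."""
    return -0.5 * math.log(1 - 2 * p_ts - q_tv) - 0.25 * math.log(1 - 2 * q_tv)


ts, tv = count_transitions_transversions(S0, S1)
beta = ts / len(S0)          # transition fraction
gamma = tv / len(S0) / 2     # per-type transversion fraction (two types per base)
print(ts, tv, beta / gamma)
print(kimura_distance(ts / len(S0), tv / len(S0)))
```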
B. Molecular Evolution – Part 2 [12 points]
1. [3 points] Given a portion of the aligned sequences of a protein coding region of a
gene from an organism (GENE2) and its evolutionary ancestor (GENE1), what are the
numbers of synonymous and non-synonymous mutations? See Lecture 4 Slide 23 for an
example of synonymous vs. non-synonymous mutations. The codon table is shown
below.
GENE1: AGA-GTA-GGA-CTT-GCT-ACA-TCC
GENE2: AGC-GAA-GGG-CTT-TCT-ACG-TTC
Answer: 2 synonymous and 4 non-synonymous mutations.
        R - V - G - L - A - T - S
GENE1: AGA-GTA-GGA-CTT-GCT-ACA-TCC
GENE2: AGC-GAA-GGG-CTT-TCT-ACG-TTC
        S - E - G - L - S - T - F
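The classification can be verified by translating both sequences and comparing codon by codon. The sketch below assumes the standard genetic code, built from the usual compact TCAG-ordered amino acid string; the names `CODON_TABLE`, `translate` and `classify_mutations` are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

GENE1 = "AGA-GTA-GGA-CTT-GCT-ACA-TCC".split("-")
GENE2 = "AGC-GAA-GGG-CTT-TCT-ACG-TTC".split("-")


def translate(codons):
    """One-letter amino acid string for a list of codons."""
    return "".join(CODON_TABLE[c] for c in codons)


def classify_mutations(codons_a, codons_b):
    """Return (synonymous, non_synonymous) counts over changed codons.

    Counts one mutation per changed codon, which is adequate here because
    every changed codon differs at exactly one nucleotide.
    """
    synonymous = non_synonymous = 0
    for ca, cb in zip(codons_a, codons_b):
        if ca == cb:
            continue
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


print(translate(GENE1), translate(GENE2))
print(classify_mutations(GENE1, GENE2))
```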
2. [3 points] Are the mutations from GENE1 to GENE2 advantageous, deleterious, or
neutral (see Lecture 4 Slide 24)? Explain.
Answer: dn/ds = 4/2 = 2 > 1, therefore the mutations are likely advantageous.
3. [3 points] Does your answer in the previous question support a selection based or neutral
theory of evolution (where the rates of non-synonymous and synonymous mutations are
the same)? Explain.
Answer: dn/ds ≠ 1, therefore this gene sequence favors a selection based theory of evolution.
4. [3 points] Why does the simple measure of similarity between these two sequences
underestimate the evolutionary distance between them?
Answer: Multiple substitutions at a single site are counted only as a single mutation and hidden
mutations are not counted at all.
C. Phylogeny [12 points]
OTU: noncommittal term used for objects of study (be they species, populations or individuals)
1. [2 points] In a rooted ultrametric tree with 4 OTUs (A, B, C, D), the distance between the
root and A is equal to the distance between the root and C.
TRUE FALSE
Answer: TRUE
2. [2 points] In a rooted additive tree with 4 OTUs (A, B, C, D), the distance between the
root and A is equal to the distance between the root and C.
TRUE FALSE
Answer: FALSE
TRUE FALSE
Answer: TRUE
TRUE FALSE
Answer: FALSE
6. [3 points] Using maximum parsimony, reconstruct the ancestral nucleotide at the internal
nodes of the following tree by labeling the ancestors at each node, 1-7. If more than one
nucleotide is possible, indicate which they are.
Answer:
1. G/T
2. A/T
3. T/G
4. A
5. G/T/A
6. T
7. T
D. Scoring Matrices [10 points]
If we expect to find a and b aligned together in homologous sequences more often than we
expect them to occur by chance (p_ab > f_a f_b), then the odds ratio is greater than one and the score
is positive. Positive scores mean conservative substitutions, and negative scores indicate non-
conservative substitutions. But this definition is purely statistical, with no relation to
biochemistry. Keep this in mind as you solve the questions below.
1. [4 points] In BLOSUM62 you will find that tryptophan pairs (W/W) score +11 while
leucine pairs (L/L) score only +4. In other words, the identity pairs (W/W, L/L, ..) do not
all get the same score. Explain why this might be the case for W/W and L/L.
Answer: It depends on the ratio: p_LL/(p_L)^2 vs p_WW/(p_W)^2. Since both scores are positive, it is
clear that p_LL > (p_L)^2 and p_WW > (p_W)^2. If p_LL = p_WW and p_L > p_W, then we can see
why s_WW is larger than s_LL. It is also possible that p_LL > p_WW and p_L > p_W but the
ratio favors s_WW over s_LL.
As it turns out, in the homologous alignment data that BLOSUM62 was trained on, p_LL =
0.0371 > p_WW = 0.0065, but the background frequency p_L = 0.099 is much higher than
p_W = 0.013, so the odds ratio is larger for W/W.
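To see how these frequencies translate into the two scores, the sketch below computes rounded log-odds scores in half-bit units, i.e. 2 * log2(p_aa / (f_a * f_a)), which is the scaling BLOSUM62 uses. The function name `half_bit_score` is illustrative.

```python
import math


def half_bit_score(p_pair, f_a, f_b):
    """Rounded log-odds score in half-bit units: 2 * log2(p_ab / (f_a * f_b))."""
    return round(2 * math.log2(p_pair / (f_a * f_b)))


# Frequencies quoted in the answer above.
s_LL = half_bit_score(0.0371, 0.099, 0.099)
s_WW = half_bit_score(0.0065, 0.013, 0.013)
print(s_LL, s_WW)
```

Although L/L pairs are observed more often than W/W pairs, leucine's high background frequency makes the chance expectation in the denominator much larger, which lowers its score.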
2. [6 points] Let’s make up a DNA score matrix where we want to optimize the matrix for
finding 88% identity elements. Assume all mismatches are equally probable and the
composition of both alignments and background sequences is uniform at 25% for each
nucleotide. Assuming λ = 0.25, what is the score you will assign for a match (such as
AA, GG, CC, TT) and what is the score you will assign for a mismatch (such as AG, CT
and so on) (hint: round up the scores where convenient).
Answer:
Match probability: set p_AA and so on = 0.22
Mismatch probability: set p_AG and so on = 0.01 for each of 12 mismatches
Background probability = 0.25
Match score = (1/λ) ln(0.22/(0.25^2)) = 4 ln(3.52) = ~5
Mismatch score = (1/λ) ln(0.01/(0.25^2)) = 4 ln(0.16) = ~(-7)
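The arithmetic can be reproduced in a few lines. This sketch assumes the score is the natural-log odds ratio divided by λ, which reproduces the rounded values in the answer; the function name `log_odds_score` is illustrative.

```python
import math

IDENTITY = 0.88      # target fraction of identical aligned pairs
LAMBDA = 0.25        # scale factor given in the question
BACKGROUND = 0.25    # uniform nucleotide frequency

# Four match pairs share the identity mass; twelve ordered mismatch pairs share the rest.
p_match = IDENTITY / 4
p_mismatch = (1 - IDENTITY) / 12


def log_odds_score(p_pair, f_a, f_b, lam):
    """Score = ln(p_ab / (f_a * f_b)) / lambda, before rounding."""
    return math.log(p_pair / (f_a * f_b)) / lam


match_score = log_odds_score(p_match, BACKGROUND, BACKGROUND, LAMBDA)
mismatch_score = log_odds_score(p_mismatch, BACKGROUND, BACKGROUND, LAMBDA)
print(round(match_score), round(mismatch_score))
```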
E. Normalization, DE genes and NGS [16 points]
1. Let C be the set of cancer samples in our data and H be the set of healthy samples. We
know that prior to normalization, expression values (or read counts for an RNA-Seq
experiment) for gene A in all samples for C are higher than values for gene B in these
samples whereas values for gene A in all samples of H are lower than the values of B in
these samples.
Denote by Ci (A) the normalized value for gene A in cancer cell i and Ci (B) the
normalized value for B in that cell. Circle all answers that can apply (of course, you will
be penalized for circling answers that cannot be true).
4. [3 points] If we performed randomization tests using the same random sets for both
genes (i.e. in each randomized setting we are computing the parameters for both
genes and the p-value was based on this randomization):
a. pB < pA
b. pB ≥ pA
c. pB > pA
d. Impossible to tell
Answer: d. A randomization test is stochastic by nature. Some randomized
settings may produce a difference that is lower than the difference we see
for B across healthy and cancer cells but higher than the difference we see
for A.
Now assume we performed scale factor normalization and consider two other genes, Z and X.
Let AvC(Z) denote the average expression of gene Z in cancer cells and AvH(Z) denote its
average in healthy cells. Assume |AvC(Z) - AvH(Z)| > |AvC(X) - AvH(X)| and that Z and X have
the same variance in both cell types. Answer TRUE / FALSE and briefly explain below.
5. [3 points] Using a log-likelihood ratio test, the p-value for Z is lower (more significant)
than the p-value for X
TRUE FALSE
Answer: TRUE. Since the variance is the same, and so are the number of samples
and DOF, the only thing that matters is the difference in means which is more
significant for Z.
6. [3 points] Using SAM the p-value for Z is lower (more significant) than the p-value
for X
TRUE FALSE
Answer: FALSE. The question does not tell us the actual expression levels of
X and Z. SAM includes a correction term for lowly expressed genes, and this
can lower the significance of Z even if its average difference is
larger.
F. Multiple Hypothesis Testing [10 points]
1. Assume we have 5 samples from cancer patients, X samples from healthy patients and we
are measuring N genes. We found a group of genes A that all have a differential p-value
< 0.001.
b. [6 points] Assume we used a t-test for computing the p-value. If we know that the
FDR for genes in A is 0.01%, and that the Bonferroni corrected p-value for genes in
A is at most 0.05, what is the size of N? What is the size of A?
Answer:
N = 50. If the Bonferroni corrected p-value is .05 and the uncorrected is 0.001, then the
number of genes is .05/.001 = 50.
A = 5. If we have a total of 50 genes, we expect .05% of the genes to have a p-value of <
0.001. Since we know the actual FDR is 1/5 of that (0.01%) then we have 5 genes in A.
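The two steps of this answer can be written out as arithmetic. The sketch below reads 0.05 as the expected number of false positives (N times the p-value threshold) and 0.01 as the FDR expressed as a fraction; both readings are assumptions about the intended units, and the function names are illustrative.

```python
def bonferroni_num_tests(corrected_p, raw_p):
    """Number of tests implied by corrected_p = raw_p * N."""
    return round(corrected_p / raw_p)


def group_size_from_fdr(num_tests, raw_p, fdr):
    """Group size implied by FDR = expected false positives / group size."""
    expected_false_positives = num_tests * raw_p
    return round(expected_false_positives / fdr)


N = bonferroni_num_tests(0.05, 0.001)
A = group_size_from_fdr(N, 0.001, 0.01)  # assumption: FDR read as the fraction 0.01
print(N, A)
```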
G. Clustering [12 points]
Select all the clustering method(s) that will lead to the results in the Figure above. Fill in the
table below by marking T if the clustering method can lead to these results and F if it cannot.
k-means T F F
linkage
H. Time Series [10 points]
Given a set of n gene expression control points over time (no duplicate time points), quadratic
spline fitting constructs n−1 piecewise second-order polynomials between the points. The splines
need to satisfy the following criteria:
Each spline needs to pass through its left-most and right-most control points.
At each interior control point, the splines to the left and right of that point should be
continuous and have an equal first derivative at that point.
Let S1 = a x^2 + b x + c and S2 = d x^2 + e x + f be the two quadratic splines that end (S1) and start (S2)
in the same point (see Figure below).
1. [5 points] How many equations are defined by control point 2 in the figure? Write all
these equations.
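Answer: Three equations. With control point 2 at (x_2, y_2):
a x_2^2 + b x_2 + c = y_2 (S1 passes through the point)
d x_2^2 + e x_2 + f = y_2 (S2 passes through the point)
2a x_2 + b = 2d x_2 + e (equal first derivatives)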
2. [5 points] How many free parameters do we need to fit in order to obtain n − 1 splines?
Briefly explain.
Answer: For each interior control point we have 3 equations, so a total of (n-2)*3. The equations at a
control point constrain 2 of the 3 parameters of the spline on its right. So we need a total of 3 for the
first spline + 1 for each of the other splines, leading to: 3 + 1*(n-2) = n+1.
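The way a shared control point ties the right-hand spline to the left-hand one can be illustrated numerically. In the sketch below, continuity and the equal-slope condition fix two of the three coefficients of S2, and the remaining one is set by requiring S2 to pass through the next control point. The function name `next_spline` and the example points are hypothetical.

```python
def next_spline(prev_coeffs, x_shared, y_shared, x_next, y_next):
    """Coefficients (d, e, f) of S2 = d x^2 + e x + f, given S1 = (a, b, c).

    S2 must pass through (x_shared, y_shared), match the slope of S1 there,
    and pass through (x_next, y_next).
    """
    a, b, _c = prev_coeffs
    slope = 2 * a * x_shared + b              # S1'(x_shared)
    h = x_next - x_shared
    # Write S2(x) = y_shared + slope*(x - x_shared) + d*(x - x_shared)^2,
    # then solve S2(x_next) = y_next for the single remaining unknown d.
    d = (y_next - y_shared - slope * h) / (h * h)
    e = slope - 2 * d * x_shared
    f = y_shared - slope * x_shared + d * x_shared ** 2
    return d, e, f


def evaluate(coeffs, x):
    """Value of the quadratic with the given coefficients at x."""
    q2, q1, q0 = coeffs
    return q2 * x * x + q1 * x + q0


# Hypothetical example: S1 = x^2 passes through (0, 0) and (1, 1).
S1_COEFFS = (1.0, 0.0, 0.0)
S2_COEFFS = next_spline(S1_COEFFS, 1.0, 1.0, 2.0, 2.0)
print(S2_COEFFS, evaluate(S2_COEFFS, 1.0), evaluate(S2_COEFFS, 2.0))
```

Writing S2 in terms of (x - x_shared) makes the two shared-point conditions hold by construction, so only one coefficient is left to determine.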