Professional Documents
Culture Documents
ABSTRACT
ng
do
gi
Motivation: The standard genetic code translates 61 codons into
ng co
ar
ch
pi n-
20 amino acids using fewer than 61 transfer RNAs (tRNAs). This
ap do
A
N
m co
is possible because of the tRNA’s ability to ‘wobble’ at the third
-tR
i
61 23-45 20
nt
AA
base to decode more than one codon. Although the anticodon–
A
codon mapping of tRNA to mRNA is a prerequisite for certain Codons tRNAs Amino acids
codon usage indices and can contribute to the understanding
i341
A.C.Roth
and that rare codon readings can be detected in the codon bias. T is constant for all models. In agreement with observations, intra-specific
Therefore, we apply binary anticodon–codon interaction of the codon usage bias (Cannarozzi et al., 2010; Tuller et al., 2010), the probability
tRNA and mRNA binding. The assessment of the relative binding of staying in the same state is higher. We capture this by adding a linear term
frequencies is not addressed. Codon usage can have a large influence to the diagonal, which puts more weight on the reuse of rare tRNA. The reuse
the rate of translation (Ingolia et al., 2009) and unequal reading of term is pre-estimated from data, with the intercept α = 0.065, and the gradient
β = −0.087, by an iterative procedure that can be described as follows.
different codons by the same tRNA is known to influence codon
Starting with an initial guess of transition probabilities (bases on tRNA
bias (Curran, 1995; Ran and Higgs, 2010). On the other hand, frequencies), the Viterbi algorithm computes the most likely sequence of
experimental evidence in yeast shows, that all codons are translated states, given the observed sequences of codons. From the inferred states, we
with similar speed (Qian et al., 2012). count the number of transitions and the parameters are updated accordingly,
(3) We divide all possible binary mappings into two mutually and the procedure is repeated until the transition probabilities converge. By
exclusive subsets of codon readings. This is necessary because this procedure, we found that reuse appears to be more pronounced for tRNA
highly degenerated amino acids can have a large number of coding for rare codons and therefore the weight is linearly increased for
mappings (in the order of billions). The large number of possibilities these. The values of the transition probabilities are evaluated by introducing
can lead to implausible readings randomly having a higher score normally distributed noise and verifying the degree of reproducibility of the
than the true reading. Therefore, we restrict the solution space to results.
The initial state vector π = [π1 π2 ] is computed from the eigen
a set of mappings that are plausible, that is, mappings that have
decomposition of the transition matrix, which gives the limiting tRNA usage.
i342
Codon count
11
R−squared = 0.9995 Ala−AGC
p−val= 0.0102
tRNA 58952 35580 47988 18336
tRNA
count
11 1 1 0 0
5
5 0 0 1 1 Ala−UGC
that can be read by a given tRNA. The sum of codon counts xi that is read
from the counts of consecutive synonymous codons at the instances of
by the ith tRNA is computed by
the amino acid regardless of the distance between them for all sequences
xi = nc codonc , meeting the requirements (see below). The counts are measured in z-scores,
c∈tRNAi that is, they are compared with expected usage of codon occurrences. This
where nc is a normalization factor to account for multiple codon reading by is done by subtracting the expected value from the observed counts and
tRNAi . The tRNA is allocated a fraction of codon counts based on the total dividing by the standard deviation, Z = (X −E[X])/σX , assuming that the
number of different synonymous codons the tRNA can read so that each codon counts are independent events and follow a binomial distribution. The
tRNA pick up the correct proportion of codon counts. The normalization z-transformed matrix give the observations in standard deviations away from
factor is computed by the mean. As expected, the diagonal has large positive values as expected
from prevalent biases (e.g. codon usage bias, nucleotide distribution,
rc
nc = , etc.). More interesting are the positive entries at the off-diagonals. These
t rt entries often coincide with known codon reading of isoacceptor tRNA
where r is the fraction of different codons that the tRNA can read. For and codons read by the same tRNA appear as block structures in the
example, codon c is read by tRNAi that can read four codons. Codon c can matrix (Fig. 5). These block structures are exploited to predict the codon
also be read by another tRNA that reads three codons. Consequently, the reading.
normalization factor then becomes nc = (1/4)/(1/4+13) = 3/7. A potential codon reading is evaluated as follows. The sum s of the counts
Now, we have the summed codon counts xi for a specific codon reading of codon pairs read by the same tRNA is computed by summing the elements
and the corresponding tRNA counts tRNAi as given from the genomic data. of the matrix of observed counts X, according to the codon reading being
According to the assumed dependence, the fit of a codon reading matrix is evaluated. If an element of the matrix is read by the same tRNA, the count
computed by the regression model is added, if not, it is subtracted. The diagonal that represents the bias for
tRNAi ∼ γ xi2 +i . reusing the same codon at consecutive instances is ignored. This procedure
is described by
The model uses only the quadratic term and the intercept always passes
through the origin. The value of the shape parameter γ is estimated from the s= Cij Xij , i = j,
i j
data during the regression. Thus, there is only one degree of freedom and it
is possible to fit two points, as in the case for amino acids with two tRNA where Cij is +1 if the codons are read by the same tRNA and −1 if not.
isoacceptors. The error term i represents the deviations of the observed The same procedure is performed for the matrix of ‘expected’ counts of
values from the expected value (i.e. the residual variance that is not explained consecutive codons. Thereafter, the z-transformation gives the total score
by the model). Here, we have chosen to evaluate the fit of the model by the by subtracting the total expected count from the total observed count and
coefficient of determination R2 . dividing by the total standard deviation. This provides the procedure to
evaluate different reading matrices. The codon reading matrix that gives
3.3 Autocorrelation of synonymous codons the highest total z-score is the predicted anticodon–codon mapping.
i343
A.C.Roth
Normalized Score
Leading codon
4 RESULTS AND DISCUSSION of the distribution. Of the three methods, the HMM has plausible
In this section, we will validate the methods and benchmark them, solutions that generally rank high in the distribution.
first by checking where the most plausible codon readings are in Next, we compare how the predictions compare against random.
the solution space. Then, we use a consensus procedure to see how Randomizing the codon usage using the global codon usage of the
often the three different methods agree with each other. Finally, organism removes the signal of codon reading. The randomization
we compare the predictions against experimental evidence. Using for CC and HMM is produced by taking the original amino acid
these methods and experimental evidence, we suggest an improved sequences and drawing random synonymous codons according to
prediction for codon reading in yeast. the codon frequencies. The REG method is randomized by drawing
from a skewed normal distribution based on the codon usage of
the genome. The best plausible solution is compared with the best
4.1 Plausible codon readings rank high in the solution plausible solutions of randomized data.
space To compute the significance of our solutions, we compare 99
To investigate the evolutionary signal in the codon usage bias due randomized sample points with the real solution. We use empirical
to the physiological constraints of tRNA decoding, we look at P-values based on a Monte Carlo procedure, which is based on rank
how the subset of plausible codon readings rank in comparison and do not rely on assumptions of the distribution. The P-values
with other binary codon readings. The plausible codon readings are calculated by p̂ = (r +1)/(n+1), where n is the total number
are based on arguments of nucleotide chemistry and on trends of simulations and r is the number of simulations that are equal
in the observed anticodon–codon mappings (the possibilities are or greater to the actual, non-randomized data value. Table 1 lists
depicted in Fig. 2B). The distributions of the scores of the three the empirical P-values of the solutions for amino acids in yeast.
methods for alanine in yeast are illustrated in Figure 6, where The solutions for the HMM method are all significant with P-values
the subsets of plausible codon readings are indicated by black from 0.01 to 0.02, which express that the solutions of HMM ranks
squares and the other implausible solutions by gray points. The first or second in comparison with the simulated data. This is not the
scores are normalized to range from 0 (lowest) to 1 (highest) case for CC and REG, which produce only a few solutions that are
by sn = (s−mins)/(maxs−mins). The best plausible solution for significantly different from random assignment. In particular for the
alanine ranks in the top of the distribution for all three methods. For atypical codon reading of arginine, the CC method performs poorly
REG and HMM, all plausible solutions are in the upper half of the where the majority of the simulated solutions have better scores then
distribution, whereas in the CC methods have a larger spread. the actual data. Although the block structures of the CC method
Repeating this for all non-trivial amino acids in yeast shows that often agree with known cases, CC tend to have less predictive
the highest scoring plausible codon readings tend to be at the top power than REG and HMM, likely from the inherent limitations
of the distributions (see Fig. S3). For CC, REG and HMM, all the of the CC method to predict cases broad codon readings where the
plausible solutions are in 25, 58 and 75% of the upper half part of tRNAs can read multiple codons. In summary, these observations
the distribution, respectively. The method of codon correlation has are indications of the signal of tRNA decoding properties in coding
the largest variation and the subset is often spread through the rest sequences.
i344
Number of predictions
Table 1. Results of randomization for non-trivial amino acids in yeast
Diffr. to random +/–
i345
A.C.Roth
2nd 2nd
Table 2. Evaluation of methods 1st T C A G 3rd
IGA T
Phe
Cys
Tyr
Anticodon Org Read WP CC REG HMM GmAA GψA GCA C
Ser
T
Trp Stop
mcm5 UCUa A
5
ncm UmAA
Arg S.c R 5
ncm UGA
Stop
Leu
Gln mcm5 s2 UUGa S.c R m5CAA CGA CmCA G
Glu mcm5 s2 UUCa S.c A T
AGG ICG
Gly mcm5 UCCa S.c R
His
Leu UAGa S.c N GAG GUG C
Leu
Pro
Arg
C
Pro ncm5 UGGa S.c N UAG ncm5UGG mcm5s2UUG A
Gln
Ser IGAa S.c D CUG CCG G
Ser ncm5 UGAa S.c R
T
Thr ncm5 UGUa S.c R IAU IGU
Asn
Ser
Val IACa S.c Y C
Ile
GUU GCU
Thr
A
Val ncm5 UACa S.c R ψAψ ncm5UGU mcm5s2UUU mcm5UCU A
cmo5 UGCe
Arg
Lys
Ala S.e N NA
Met
G
cmo5 UGUe CAUinit CAU CAU CUU CCU
Thr S.e V NA
Val cmo5 UACe S.e N NA T
Asp
Leu cmnm5 UAAd E.c R NA GUC GCC C
cmo5 UAGc
Gly
Ala
Val
Leu E.c D NA G
ncm5UAC ncm5UGC mcm5s2UUC mcm5UCC A
Val cmo5 UACb E.c N NA
Glu
Val mnm5 UUUf T.t R NA CAC CUC CCC G
i346
conditions of the cell (Dong et al., 1996). Genes expressed under signal that coincides with known codon reading of tRNAs. For this,
different conditions might have different codon usages and different we are using three sequence-based methods: an HMM, regression
tRNA pools that they are adapted to (Elf et al., 2003). Together, and CC. The predicted codon–anticodon mappings often agree with
the codon usage is shaped by these tRNA-related effects and their experimental results, in particular for the HMM method, which
decoding properties. There are two cases where the CC method is is better than the other two methods and outperforms the method
not able to handle. One is in the hypothetical case where all tRNAs of wobble parsimony. Furthermore, all three different methods
can read all codons for an amino acid. The other is that the method significantly agree on the predictions and the predictions generally
is of limited use for amino acids that are decoded by two codons. rank high in the solution space.
Associated with the methods are crucial assumptions made to
4.5.2 REG The regression method is an intuitively appealing aid the prediction, because the solution space is enormous. As we
method based on the correlation of tRNA abundance and codon show, the results based on these assumptions are sufficient to make
usage bias (Ikemura, 1981). However, there are some potential predictions that agree with experimental results. This provides a
shortcomings of the REG method. Transfer RNA is not constitutively valuable basis for further studies which can improve both the model
expressed and the abundance levels may fluctuate depending on and the assumptions.
regulation, which is ignored. Also, some organisms that have We proceed to exploit the detectable patterns of tRNA decoding to
low-tRNA copy numbers, which decreases the resolution of the predict codon reading in yeast. We use experimentally determined
i347
A.C.Roth
Guthrie,C. and Abelson,J. (1982) Organization and expression of tRNA genes in Ogle,J.M. and Ramakrishnan,V. (2005) Structural insights into translational fidelity.
Saccharomyces cerevisiae. In: The Molecular Biology of the Yeast Saccharomyces: Annu. Rev. Biochem., 74, 129–177.
Metabolism and Gene Expression, volume 11B, pages 487–528. Cold Spring Harbor Parker,J. (1989) Errors and alternatives in reading the universal genetic code. Microbiol.
Laboratory, Cold Spring Harbor, N.Y. Rev., 53, 273–98.
Ibba,M. and Söll,D. (2000) Aminoacyl-tRNA synthesis. Annu. Rev. Biochem., 69, Percudani,R. et al. (1997) Transfer RNA gene redundancy and translational selection
617–650. in Saccharomyces cerevisiae. J. Mol. Biol., 268, 322–330.
Ikemura,T. (1981) Correlation between the abundance of Escherichia coli transfer RNAs Percudani,R. (2001) Restricted wobble rules for eukaryotic genomes. Trends. Genet.,
and the occurrence of the respective codons in its protein genes. J. Mol. Biol., 146, 17, 133–135.
1–21. Plotkin,J.B. and Kudla,G. (2011) Synonymous but not the same: the causes and
Ingolia,N.T. et al. (2009) Genome-wide analysis in vivo of translation with nucleotide consequences of codon bias. Nat. Rev. Genet., 12, 32–42.
resolution using ribosome profiling. Science, 324, 218–223. Qian,W. et al. (2012) Balanced codon usage optimizes eukaryotic translational
Johansson,M.J.O. et al. (2008) Eukaryotic wobble uridine modifications promote a efficiency. PLoS Genet., 8, e1002603.
functionally redundant decoding system. Mol. Cell Biol., 28, 3301–3312. Rabiner,L. (1989) A tutorial on hidden, Markov models and selected applications in
Karamyshev,A.L. et al. (2004) Transient idling of posttermination ribosomes ready to speech recognition. Proc. IEEE, 77, 267–296.
reinitiate protein synthesis. Biochimie, 86, 933–938. Ran,W. and Higgs,P.G. (2010) The influence of anticodon-codon interactions
Kersey,P.J. et al. (2010) Ensembl genomes: extending ensembl across the taxonomic and modified bases on codon usage bias in bacteria. Mol. Biol. Evol., 27,
space. Nucl. Acids Res., 38, D563–D569. 2129–2140.
Kramer,E.B. and Farabaugh,P.J. (2007) The frequency of translational misreading errors Schulman,L.H. (1991) Recognition of tRNAs by aminoacyl-tRNA synthetases. Prog.
in E. coli is largely determined by tRNA competition. RNA, 13, 87–96. Nucl. Acid Res. Mol. Biol., 41, 23–87.
i348