
Biochemical and Biophysical Research Communications 315 (2004) 1097–1103

BBRC
www.elsevier.com/locate/ybbrc

Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species
D. Bharanidharan, G. Ramya Bhargavi, Kavitha Uthanumallian, and N. Gautham*
Department of Crystallography and Biophysics, University of Madras, Guindy Campus, Chennai 600 025, India
Received 24 January 2004

Abstract

We studied the correlations between amino acid composition and mononucleotide and dinucleotide frequencies in 115 bacterial genomes of varying G + C content. Observed amino acid frequencies were compared with those expected from the actual mononucleotide and dinucleotide frequencies. Both mononucleotide and dinucleotide frequencies correlate well with the amino acid frequency, with dinucleotide frequencies doing so better. Despite the strong correlations, some of the observed amino acid frequencies, in particular for Arg, Val, Asp, Glu, Ser, and Cys, were consistently different from predicted values in all genomes. We suggest that this variation from predicted values is a consequence of selection pressure at the level of amino acids, while the close correspondence to the predictions in residues such as Thr, Phe, Lys, and Asn arises only from mutation and selection pressure at the level of the nucleic acid sequences.
© 2004 Elsevier Inc. All rights reserved.

Keywords: Mononucleotide; Dinucleotide; Amino acid composition; Selection pressure

* Corresponding author. Fax: +91-44-2235-2494. E-mail address: gautham@unom.ac.in (N. Gautham).
0006-291X/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.bbrc.2004.01.129

The relationship between the nucleotide base composition of a genome and the amino acid composition of the corresponding proteome has been the subject of many studies [1–16]. It is well known that the nucleotide composition, as measured by the G + C content, shows extreme variation across different genomes. In bacterial species alone, it ranges from 22.5% to 72.0% [17]. Within any particular genome the bias extends across all regions of it, both coding and non-coding [2,8,9]. In the coding regions, one possible consequence of this bias is variation in the amino acid composition of the corresponding proteome. To illustrate this using an extreme example, a genome consisting of only G and C, i.e., 100% G + C content, can code only for Gly, Ala, Arg, and Pro (or GARP) amino acids (on the basis of the universal genetic code). In real genomes, with less extreme, though still biased base composition, the effects are not as strong as the above example anticipates, though they are still noticeable [7,10,11,13]. One of the reasons for the attenuation in the effect is that the variation in the nucleotide composition is most pronounced at the third codon position, resulting often in a synonymous codon [5]. Such variation thus remains silent. However, not all the variation is of this variety. There are also smaller compositional differences in the first and second codon positions. Together with the non-synonymous changes at the third position, these variations lead to marked differences in the amino acid frequencies [4,12–15]. Singer and Hickey [13] have demonstrated a correlation between the nucleotide composition and the amino acid composition in 21 completely sequenced eubacterial and archaeal genomes. They have noticed that protein sequences in G + C rich genomes contain a greater proportion of the GARP amino acids, all of whose codons are also G + C rich. Likewise, G + C poor genomes code mostly for the FYMINK group of amino acids that have G + C poor codons. Thus, nucleotide composition affects both codon choice and amino acid composition.

The relationship between nucleotide and amino acid frequencies is modulated by compositional preferences in the proteome, arising from structural and functional constraints. Accordingly, some amino acids occur at less than the rates expected from a random distribution of the 20 possible residues, or from a random distribution
of the 64 possible codons. Others occur at higher than expected rates [18]. Based on the theoretically predicted amino acid compositions derived from the nucleotide frequencies of a set of 59 bacterial genomes, Lobry [10] showed that in these genomes the observed frequencies of Arg, Ser, Pro, and Cys were significantly less than expected, while for Ala, Lys, Asp, and Glu it was more than expected. Others have reported similar, but not identical, results [3,12]. Some of this variation can be explained as a consequence of selection pressure at the level of protein structure and function, i.e., the structural and functional constraints mentioned above. For example, both Cys and Pro strongly influence protein tertiary structure, the former through disulphide bridges, and the latter by terminating helices and introducing turns. However, similar reasons cannot be easily attributed for the discrepancy from expected composition in the other amino acids. It has been speculated [3,19] that this discrepancy is a consequence of the dinucleotide composition, rather than mononucleotide frequencies. For example, Arg is coded by six triplet codons, four of which begin with the dinucleotide CG. This particular dinucleotide is known to be suppressed in many genomes, including some of the bacterial genomes [20]. This behaviour may lead to suppression of codons for Arg and hence to an under-representation of that amino acid. Elevation and suppression of dinucleotide frequencies is a well-studied phenomenon since Bird [21] reported the strong suppression of CG in animal DNA. Other studies since then [3,20,22,23] have reiterated the differences between observed dinucleotide frequencies and those expected on the basis of the nucleotide composition. So much so, these frequencies are often used as pattern recognition tools in annotating DNA sequences [24,25]. However, no previous study has rigorously analysed the link between dinucleotide and amino acid composition. In this paper, we explore this relationship in bacterial genomes, by theoretically predicting amino acid composition given a specific dinucleotide and mononucleotide composition. A comparison between the theoretical composition so obtained, and the one actually observed in different bacterial proteomes shows that the observed amino acid percentages correlate well with those predicted on the basis of the dinucleotide frequencies, just as they do with those predicted on the basis of mononucleotide frequencies. However, despite the overall correlation between observations and predictions, some of the amino acids, especially Arg, Val, Asp, Glu, Ser, and Cys occur at quite different frequencies from those predicted. The pattern of variation for these residues is consistently the same in all the genomes studied, at different G + C content. The other amino acids, in particular Thr, Phe, Lys, and Asn, follow the predicted values very faithfully. We suggest that the former effect, i.e., variation from predicted values, is a consequence of selection pressure at the level of amino acids, while the latter, i.e., close correspondence to the predictions, arises only from mutation and selection pressure at the level of the nucleic acid sequences.

Materials and methods

Source of the data. The data used in the analysis consisted of the gene sequences of 115 bacterial genomes and the corresponding protein sequences, obtained from the KEGG database [17]. Only sequences labelled as coding sequences (CDS) were used. Hypothetical and putative protein sequences were eliminated. We analysed 345,388 gene sequences, with a minimum of 480 sequences in the case of Mycoplasma genitalium, and a maximum of 8317 sequences in the case of Bradyrhizobium japonicum. The average length of the protein sequences is 368 amino acids. The chosen sequences were used to evaluate the mononucleotide, dinucleotide, and amino acid composition of each gene, which was then averaged to give the overall relative frequencies for the genome. The expected relative frequencies were calculated, as explained below, for each gene individually and the values were averaged for each genome.

Calculation of expected relative amino acid frequencies. Expected amino acid frequencies based on mononucleotide frequencies of a genome were calculated as suggested by Lobry [10]. However, instead of calculating the expected amino acid composition solely as a function of the G + C content, we have used the actual relative frequencies of all four nucleotides, $P_A$, $P_C$, $P_G$, and $P_T$, in each gene to calculate the corresponding expected relative frequencies for the 20 amino acids. The expected frequency of occurrence of each amino acid was obtained as the sum of the expected frequencies of occurrence of its codons. Thus, for example, for Gln

$$P_{\mathrm{Gln}} = P_{CAA} + P_{CAG}. \qquad (1)$$

$P_{CAA}$ and $P_{CAG}$ are expressed in terms of the independent probability of the three bases in the codon, normalised to take the three stop codons TAA, TAG, and TGA into account, e.g.,

$$P_{CAA} = \frac{P_C P_A P_A}{1 - (P_T P_A P_A + P_T P_A P_G + P_T P_G P_A)}. \qquad (2)$$

Expected amino acid frequencies based on observed dinucleotide frequencies were calculated in a similar manner. Again, the relative codon frequencies were first obtained. For example, the probability of occurrence of the codon CAA was calculated as the product of three numbers: the probability of C at first position, the conditional probability of A in the second position given C in the first, and the conditional probability of A in the third position given A in the second. The first number is the sum of the relative frequencies of the dinucleotides that have C in the second position, i.e.,

$$P_C = P_{AC} + P_{TC} + P_{GC} + P_{CC}, \qquad (3)$$

where $P_{AC}$, etc., are the relative dinucleotide frequencies of AC, etc. observed in the gene. This way of calculating what is essentially the mononucleotide probability for C ensures that the numbers are correctly normalised. Also, we take into consideration the fact that the first base of a codon occurs at the second position of the dinucleotide that begins at the end of the previous codon. The second number is calculated as

$$P_{A|C} = P_{CA}/(P_{CA} + P_{CT} + P_{CG} + P_{CC}). \qquad (4)$$

Similarly the third number is

$$P_{A|A} = P_{AA}/(P_{AA} + P_{AT} + P_{AG} + P_{AC}). \qquad (5)$$

Again, the amino acid probabilities were calculated as the sum of the probabilities of the respective codons, normalised to allow for the stop codons.
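
As an illustration of the mononucleotide-based prediction of Eqs. (1) and (2), the following is a minimal Python sketch, not the code used in this study. The function name `expected_aa_from_mono` and the input numbers are assumptions made for the example; the codon table is the standard genetic code, and amino acids are returned by one-letter code. The second call reproduces the extreme case described in the Introduction, where a genome of 100% G + C content can code only for Gly, Ala, Arg, and Pro.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks the stop codons TAA, TAG, TGA.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_aa_from_mono(p):
    """Expected amino acid frequencies from base frequencies p = {'A': .., 'C': .., 'G': .., 'T': ..}."""
    # Independent probability of the three bases in each codon (numerator of Eq. 2).
    codon_p = {c: p[c[0]] * p[c[1]] * p[c[2]] for c in CODON_TABLE}
    # Total probability of the three stop codons (subtracted in the denominator of Eq. 2).
    stop_p = sum(v for c, v in codon_p.items() if CODON_TABLE[c] == "*")
    expected = {}
    for codon, v in codon_p.items():
        aa = CODON_TABLE[codon]
        if aa != "*":
            # Eq. 1: an amino acid's frequency is the sum over its codons.
            expected[aa] = expected.get(aa, 0.0) + v / (1.0 - stop_p)
    return expected


# Hypothetical inputs: equal base frequencies, and a genome made only of G and C.
uniform = expected_aa_from_mono({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
gc_only = expected_aa_from_mono({"A": 0.0, "C": 0.5, "G": 0.5, "T": 0.0})

print(round(uniform["Q"], 4))  # Gln has 2 of the 61 sense codons
print({aa: f for aa, f in gc_only.items() if f > 0})  # only G, A, R, P remain
```

In the study the calculation is applied to each gene and the results are averaged per genome; the sketch handles a single set of base frequencies only.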
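
The dinucleotide-based prediction of Eqs. (3)–(5) can be sketched in the same way. This is again a hedged illustration rather than the authors' implementation: the function name `expected_aa_from_di`, the toy dinucleotide frequencies, and the choice to return zero for a conditional probability whose leading base never occurs are all assumptions of the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks the stop codons TAA, TAG, TGA.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_aa_from_di(d):
    """Expected amino acid frequencies from dinucleotide frequencies d = {'AA': .., 'AC': .., ...}."""
    # Eq. 3: probability of a base, summed over dinucleotides that have it in second position.
    first = {y: sum(d[x + y] for x in BASES) for y in BASES}
    # Eqs. 4 and 5: conditional probability of base y following base x.
    cond = {}
    for x in BASES:
        row = sum(d[x + z] for z in BASES)
        for y in BASES:
            cond[x + y] = d[x + y] / row if row > 0 else 0.0  # assumption: 0 if x is absent
    codon_p = {c: first[c[0]] * cond[c[:2]] * cond[c[1:]] for c in CODON_TABLE}
    stop_p = sum(v for c, v in codon_p.items() if CODON_TABLE[c] == "*")
    expected = {}
    for codon, v in codon_p.items():
        aa = CODON_TABLE[codon]
        if aa != "*":
            expected[aa] = expected.get(aa, 0.0) + v / (1.0 - stop_p)
    return expected


# Toy inputs (not data from the study): uniform dinucleotides, and the same with CG halved.
uniform_di = {x + y: 1 / 16 for x in BASES for y in BASES}
low_cg = dict(uniform_di, CG=1 / 32)
total = sum(low_cg.values())
low_cg = {k: v / total for k, v in low_cg.items()}

flat = expected_aa_from_di(uniform_di)
suppressed = expected_aa_from_di(low_cg)
print(round(flat["R"], 4), round(suppressed["R"], 4))
```

With uniform dinucleotide frequencies the result coincides with the mononucleotide-based prediction for equal base frequencies. Lowering CG in the toy input lowers the predicted Arg frequency, which is the kind of effect the dinucleotide-based prediction is designed to capture.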

Calculation of the expected relative dinucleotide frequencies. The ratios of the observed relative dinucleotide frequency to the one expected from the mononucleotide relative frequencies were calculated as follows. For a dinucleotide XY, this ratio is given as

$$R_{XY} = P_{XY}/(P_X P_Y), \qquad (6)$$

where $P_X$ denotes the probability of occurrence of the nucleotide X and $P_{XY}$ is the probability of occurrence of the dinucleotide XY in the gene. This ratio has also been referred to as genomic signature [20].

Statistical characterisation of composition. The distance between the observed and predicted composition of amino acid for a given amino acid aa was calculated as:

$$D_{aa} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_{aa,i}(\mathrm{obs}) - P_{aa,i}(\mathrm{pred})\right)^2}, \qquad (7)$$

where $P_{aa,i}(\mathrm{obs})$ and $P_{aa,i}(\mathrm{pred})$ are the frequencies of the observed and predicted amino acid compositions, respectively, and the summation is over all the genomes (n) included in the calculations. The statistical significance of this distance was calculated using a two-tailed paired t test.

Results and discussion

Fig. 1 shows the average composition of each amino acid in all 115 genomes. Cys, Trp, His, and Met have the least occurrences. Moreover, the low percentage is constant in all genomes, as indicated by the low value of the standard deviation. Clearly, the formation of disulphide bridges by Cys is responsible for its scarcity. Again, the low percentage of Met may be because it is principally an initiation amino acid [16]. However, similar reasons are not available to explain the low percentage of His and Trp. The hydrophobic amino acids Leu, Ala, and Ile have the highest average composition, as well as a large range. Lys, Asn, and Arg also have a large range of compositions.

Fig. 1. Average composition of each amino acid in all 115 bacterial genomes.

Fig. 2 shows observed and predicted values for the 115 bacterial genomes for all twenty amino acids. For clarity, the predicted values are plotted as lines. The first four figures are for GARP residues that have G + C rich codons, while the last six are for FYMINK residues that have A + T rich codons. The rest are classified as neutral [11]. The overall behaviour seen in the figures has been described earlier [13]. The observed frequencies of GARP amino acids increase with G + C content of the genome, while those of FYMINK amino acids decrease. For a majority of the amino acids, the observed values closely follow the predicted ones. Since the predictions are based on genome composition, this probably attests to the strong influence of mutation pressure in designing the genome. However, selection pressure by its action on mono and dinucleotide composition must also be considered [20,26]. Other common features of the data are evident in Table 1, which gives the correlation coefficients and the distance between the observed and predicted amino acid frequencies. Both G + C rich amino acids and G + C poor amino acids (except for Met) have strong correlations between the observations and predictions, though this is not the case for the neutral amino acids. For these amino acids, the correlation coefficients vary from −0.17 to 0.74 in relation to the mononucleotide-based predictions, and from −0.01 to 0.84 for the dinucleotide-based predictions. In general, the compositions of the neutral amino acids correlate better with the dinucleotide frequencies than the mononucleotide frequencies. The differences between the observed and predicted composition, as measured by the distance, are less than 2.9 percentage points for all amino acids, except Arg. Considering however that there are 20 amino acids, and the frequency of each is very approximately 5%, even these seemingly small differences may be highly significant.

Of the GARP amino acids (Figs. 2A–D) Pro and Arg show a similar pattern of behaviour: at low G + C content (22–32%) the observed values are the same as the predicted values, while at high G + C content (62–72%) they are significantly different. For Pro, the distance D between observed and predicted composition for the highest G + C content genomes is 6.44 percentage points (P < 10^-14) for mononucleotide frequency-based predictions. For dinucleotide frequency-based predictions D = 4.27 (P < 10^-10). Likewise for Arg, D = 6.8 (P < 10^-19) for mononucleotide-based predictions, and 7.7 (P < 10^-14) for dinucleotide-based predictions. The other two amino acids in this group, Gly and Ala, follow an almost identically opposite pattern, with predicted and observed percentages being the same at high G + C content and significantly different at low G + C content (D = 2.5 and 2.3, P < 10^-12 and <10^-13 for Gly; D = 2.9 and 2.5, P < 10^-11 and P < 10^-11 for Ala). These results are best interpreted as a consequence of selection at the amino acid level, rather than mutation or selection at the gene level. This is borne out by the fact that the composition of these four amino acids does not change appreciably between genomes. For example, in the case of Arg, the observed percentages range from 2.8% to 8.3% only, while the predicted values have a much larger range, from 5.2% to 15.5% in the case of mononucleotide frequency-based predictions, and from 4.7% to 16.7% in the case of dinucleotide
frequency-based predictions. Similar values are seen for the other three amino acids that belong to this group.

Fig. 2. The observed and predicted values in 115 bacterial genomes for all 20 amino acids. The figures (A–D), GARP residues, have G + C rich codons, while the figures (O–T), FYMINK residues, have A + T rich codons. The rest are neutral.

Is the generally lower percentage of Arg seen in the proteomes a consequence of CG suppression? Our results show that this is not so. The dinucleotide ratio, Table 2, gives the extent of CG suppression in the coding regions of genomes obtained as a ratio of the observed to predicted values, which were derived from the theoretical sequences on the basis of the mononucleotide frequencies. The average CG ratio in all the genomes is 0.95. The value ranges from 0.50 to 1.33. Table 2 also gives the dinucleotide ratios in different bins of G + C content. Clearly, there is no correlation between CG suppression (or elevation) and G + C content on the one hand, and CG suppression and the suppression of Arg, on the other. Moreover, from the codon usage tables for Arg for all bacterial genomes [27] we see that codons containing CG are not discriminated against. Thus, it is not the case that the Arg residues that we do observe in the proteome are all coded for by CG deficient codons, and that the reason for the low values of Arg is that the number of such codons for this residue is only a small fraction of the six that code for it. To summarize this discussion, Arg occurs at lower than expected rates in the bacterial proteomes, and this suppression is not a consequence of CG suppression.

Table 1
The correlation coefficients (r), distance (D, in percentage), with significance levels, between the observed and predicted amino acid frequencies, based on mononucleotide composition and dinucleotide composition

Amino acid   Mononucleotide-based          Dinucleotide-based
             r       D      P value        r       D      P value
Gly           0.93   1.71   >5%             0.92   1.57   <10^-2
Ala           0.95   2.90   <10^-6          0.95   2.20   <10^-3
Arg           0.90   4.47   <10^-5          0.88   4.38   <10^-4
Pro           0.90   2.85   <10^-2          0.92   1.95   <10^-2
Trp           0.72   0.56   <10^-5          0.70   0.64   <10^-6
Val           0.12   1.20   <10^-2          0.48   1.95   <10^-5
Ser          −0.11   2.90   <10^-4          0.15   2.48   <10^-3
Thr           0.35   0.87   <10^-4          0.70   0.52   >5%
His           0.56   0.74   <10^-7          0.63   0.81   <10^-5
Asp          −0.17   2.03   <10^-7          0.57   2.07   <10^-12
Glu           0.74   2.75   <10^-10         0.84   2.61   <10^-10
Leu          −0.07   1.60   <10^-1         −0.01   1.89   >5%
Gln           0.33   0.95   <10^-3          0.43   0.82   <10^-2
Cys           0.38   2.03   <10^-5          0.09   2.54   <10^-7
Phe           0.80   1.01   <10^-2          0.88   1.32   >5%
Tyr           0.87   1.29   <10^-1          0.86   0.95   >5%
Met           0.12   0.88   <10^-3          0.51   0.60   <10^-3
Ile           0.94   1.60   <10^-1          0.97   1.45   <10^-3
Asn           0.93   0.77   >5%             0.94   0.68   >5%
Lys           0.95   1.56   <10^-5          0.98   1.01   >5%

Table 2
Ratio of the observed to predicted values of the dinucleotide frequencies

G+C %     AA    AT    AG    AC    TA    TT    TG    TC    GA    GT    GG    GC    CA    CT    CG    CC
22–27     1.18  1.18  0.83  0.88  0.68  1.26  1.00  1.10  1.08  0.75  0.99  1.06  1.00  1.00  1.01  1.01
27–32     1.14  1.14  0.85  0.84  0.67  1.06  1.09  1.06  1.04  0.80  1.00  1.20  1.07  0.94  1.00  1.02
32–37     1.11  1.00  0.97  0.81  0.68  1.07  1.08  1.02  1.09  0.82  1.03  1.25  0.96  1.07  0.83  1.00
37–42     1.18  1.00  0.88  0.84  0.72  1.18  1.08  0.94  0.98  0.84  0.92  1.29  1.00  0.98  0.94  0.96
42–47     1.13  1.10  0.83  0.93  0.65  1.07  1.16  1.00  1.06  0.85  0.94  1.18  1.09  0.95  0.92  0.92
47–52     1.13  1.03  0.81  0.92  0.62  1.03  1.19  1.00  1.01  0.91  0.87  1.28  1.10  0.96  0.99  0.84
52–57     1.16  1.01  0.82  0.95  0.68  1.04  1.18  1.05  1.00  0.94  0.86  1.12  1.03  1.02  0.95  0.87
57–62     1.20  1.03  0.80  0.88  0.74  1.18  1.17  0.90  0.91  0.88  0.96  1.22  1.00  0.99  1.01  0.90
62–67     1.16  0.99  0.87  0.89  0.67  1.11  1.18  0.93  1.00  0.86  0.93  1.24  1.12  1.00  1.00  0.96
67–72     1.14  1.21  0.79  0.90  0.72  1.00  1.24  0.83  0.94  0.85  0.87  1.18  1.10  0.83  1.02  0.89
Average   1.16  1.04  0.86  0.88  0.69  1.10  1.13  0.98  1.01  0.85  0.94  1.22  1.04  0.98  0.95  0.95

The G + C content was separated into ten bins at intervals of 5%, from 22% to 72%. The values were averaged for each bin.

The case of Pro is somewhat different. Only in genomes of high G + C content is the frequency markedly lower than the expected value. However, as in the case of Arg, this may be more because of the selection pressure towards a more or less constant percentage of the amino acid. In any case, the dinucleotide ratio for CC (all four codons for Pro start with this dinucleotide) is close to 1.0 in all the genomes (Table 2), including those with high G + C content, and there is no relationship between this dinucleotide frequency and Pro composition. Of the other two amino acids in this
group, Gly behaves like Pro and has a fairly constant percentage in all the genomes, no matter what the G + C content. It is notable that the two amino acids at the opposite ends of the spectrum of conformational flexibility both have such similar behaviour. That is, both are nearly constant, and both follow the predicted pattern, except Pro in high G + C genomes, and Gly in low G + C genomes. Ala has a large variation in composition between the genomes, ranging from 3.6% to 13.7%. The predicted values for this amino acid also have a large range, and the predicted patterns of variation follow the observed pattern closely.

We next discuss the amino acids that are neutral with respect to the G + C content of their respective codons. This is the largest of the three classes and contains the following ten amino acids: Cys, Ser, Val, Asp, Glu, Trp, His, Gln, Thr, and Leu. The pattern of variation in the amino acid composition is of three types. Cys and Ser occur with lower frequencies than expected in all bacterial genomes. For Cys the distance D = 2.3 and 2.5 (with significance levels P < 10^-9 and 10^-7) for predictions based on mononucleotide and dinucleotide frequencies, respectively. For Ser, D = 2.9 and 2.5 (P < 10^-4 and 10^-4) for the two predictions, respectively. Cys, with two codons, has a strong effect on the three-dimensional structure of the protein, and if it were to occur with a higher proportion, it could lead to misfolding through formation of non-native disulphide bridges. Ser has six codons, four of which begin with the dinucleotide TC, and its pattern of variation is reminiscent of Arg, which also has six codons. At low G + C content the frequency is almost identical to that predicted. At high G + C content, the difference is the largest. Like the relationship between Arg and the dinucleotide CG, the under-representation of Ser is not a consequence of TC suppression. Table 2 shows that, in fact, there is no suppression of TC in any of the G + C bins. Again like Arg, and indeed like many of the other amino acids, there is an apparent need for a constant proportion of this amino acid, no matter what the G + C content of the genome. The second pattern of variation is seen in Val, Asp, and Glu. These three residues occur with significantly higher frequency than predicted in all genomes (D = 1.2 and 1.9, P < 10^-2 and 10^-5 for Val; D = 2.0 and 2.1, P < 10^-7 and 10^-12 for Asp; and D = 2.8 and 2.6, P < 10^-10 and <10^-10 for Glu). The third pattern of variation is seen in Trp, His, Gln, and Thr. The frequencies of these amino acids are not different from the predicted values (average D < 1.0 and average P < 10^-4 for both mononucleotide- and dinucleotide-based predictions) except Leu, which shows an elevation in high G + C content genomes (D = 3.5 and 3.7, P < 10^-9 and 10^-13).

The final class of amino acids are the FYMINK group with G + C poor codons. As seen in Figs. 2O–T, the amino acid compositions in this group have a close and inverse relationship with the G + C content of the genome. At low G + C content, the frequency is high, while at high G + C content it is low. The composition of Ile is significantly different from predicted values at high G + C content (D = 2.6 and 2.3, P < 10^-11 and P < 10^-15). At low G + C content Tyr shows different values (D = 2.5 and 1.2, P < 10^-10 and P < 10^-4). Lys and Asn occur at almost exactly the same percentages as predicted, and so does Phe, though to a lesser extent.

In conclusion, the above results may be classified in two different ways. First, we consider the correlation between observed and predicted values. In the G + C poor and the G + C rich amino acids the two values are highly correlated in all genomes considered here. Among the neutral amino acids, the values for Trp and Glu are highly correlated. Thr and His show correlation only with the dinucleotide-based predictions, and Asp to a lesser extent. Others show no correlation. Second, we consider whether the predicted values are close to the observed values. Here there are three possibilities: the observed values could be higher than those predicted (elevation); they could be lower (suppression) or they could be approximately the same. Val, Asp, and Glu belong to the first category, Pro, Arg, Cys, and Ser to the second and others to the third.

These results show that, to a large extent, nucleotide frequencies decide the amino acid composition. However, the amino acid composition of the proteome is not decided entirely by nucleotide frequencies, and shows the effect of selection pressure, with some, such as Pro, Arg, Cys, and Ser, occurring at consistently less than predicted values, and some, such as Val, Asp, and Glu, occurring always at greater than predicted values. The amino acid composition of the proteome is thus decided partly by mutation pressure (mononucleotide frequencies) and partly by selection pressure (dinucleotide frequencies) on the genome, and partly by selection pressure on the proteome. This statement may be interpreted in two ways. One, the observed percentage of occurrence of each amino acid is the resultant of a combination of random (mutation) and directed (selection) forces. It then follows that while the proteome requires a certain definite amount of each amino acid, random variations of this amount are tolerated. The second interpretation assumes that the set of twenty amino acids is degenerate, and requirements for a particular physical or chemical property in the protein may be satisfied by more than one residue. Therefore, changes in the genome towards higher or lower G + C content are followed by changes favouring one or another set of amino acids in the proteome. Our own results, as well as other, earlier work [3,10,13,19], in general favour the second interpretation. These patterns in the amino acid compositions may thus be one facet of the wider degeneracy seen in biological systems [28].
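
The two measures on which the classification above rests, the correlation coefficient r and the distance D of Eq. (7), answer different questions, and a short Python sketch may make the distinction concrete. It is an illustration under stated assumptions, not the analysis code of this study: the function names and the four-genome toy data are invented, and only the t statistic of the paired t test is computed, since the P value would additionally need the t distribution with n − 1 degrees of freedom.

```python
import math


def distance(obs, pred):
    """Eq. (7): root-mean-square difference between observed and predicted values."""
    n = len(obs)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def paired_t(obs, pred):
    """t statistic of a paired t test on the per-genome differences."""
    d = [o - p for o, p in zip(obs, pred)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)


# Toy percentages for one amino acid in four hypothetical genomes.
obs = [4.0, 5.0, 6.0, 7.0]
pred = [5.0, 7.0, 9.0, 11.0]

print(pearson_r(obs, pred), distance(obs, pred), paired_t(obs, pred))
```

In the toy data the predictions track the observations perfectly, yet they overshoot more and more as the values rise, so r is high while D is large. This is the situation reported for Arg in Table 1, where r = 0.90 goes together with the largest distance of any residue.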
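
The dinucleotide ratios of Eq. (6), which underlie Table 2 and the discussion of CG and TC suppression, can likewise be sketched in a few lines of Python. The sketch rests on assumptions that the text does not specify: dinucleotides are counted in overlapping windows along a single sequence, the sequence contains only A, C, G, and T, and the function name `dinucleotide_ratios` and the toy sequence are made up for the example.

```python
from collections import Counter


def dinucleotide_ratios(seq):
    """Return R_XY = P_XY / (P_X * P_Y) for all 16 dinucleotides (Eq. 6)."""
    seq = seq.upper()
    mono = Counter(seq)
    # Assumption: overlapping dinucleotides, i.e. positions (0,1), (1,2), (2,3), ...
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = len(seq)
    n_di = len(seq) - 1
    ratios = {}
    for x in "ACGT":
        for y in "ACGT":
            expected = (mono[x] / n_mono) * (mono[y] / n_mono)
            observed = di[x + y] / n_di
            # The ratio is undefined when either base is absent; None is returned then.
            ratios[x + y] = observed / expected if expected > 0 else None
    return ratios


toy = dinucleotide_ratios("AACC")
print(toy["AC"], toy["CA"], toy["GG"])
```

A ratio below 1 indicates suppression of the dinucleotide relative to the mononucleotide expectation and a ratio above 1 indicates elevation, which is how the entries of Table 2 are read.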

Acknowledgments

We acknowledge the financial assistance from the following organisations of the Government of India: CSIR under the NMITLI project, UGC under the SAP programme, and DST under the FIST programme.

References

[1] N. Sueoka, Correlation between base composition of deoxyribonucleic acid and amino acid composition and protein, Proc. Natl. Acad. Sci. USA 47 (1961) 1141–1149.
[2] G. Bernardi, G. Bernardi, Compositional constraints and genome evolution, J. Mol. Evol. 24 (1986) 1–11.
[3] R. Hanai, A. Wada, The effects of guanine and cytosine variation on dinucleotide frequency and amino acid composition in the human genome, J. Mol. Evol. 27 (1988) 321–325.
[4] G. D'Onofrio, D. Mouchiroud, B. Aissani, C. Gauter, G. Bernardi, Correlations between the compositional properties of human genes, codon usage, and amino acid composition of proteins, J. Mol. Evol. 32 (1991) 504–510.
[5] A. Wada, Compliance of genetic code with base-composition deflecting pressure, Adv. Biophys. 28 (1992) 135–158.
[6] D.W. Collins, T.H. Jukes, Relationship between G + C in silent sites of codons and amino acid composition of human proteins, J. Mol. Evol. 36 (1993) 201–213.
[7] T.D. Porter, Correlation between codon usage, regional genomic nucleotide composition, and amino acid composition in the cytochrome P-450 gene superfamily, Biochim. Biophys. Acta 1261 (1995) 394–400.
[8] G. Bernardi, The human genome: organization and evolutionary history, Annu. Rev. Genet. 29 (1995) 445–476.
[9] H. Musto, S. Caccio, H. Rodriguez-Maseda, G. Bernardi, Compositional constraints in the extremely GC-poor genome of Plasmodium falciparum, Mem. Inst. Oswaldo Cruz 92 (1997) 835–841.
[10] J.R. Lobry, Influence of genomic G + C content on average amino-acid composition of proteins from 59 bacterial species, Gene 205 (1997) 309–316.
[11] P.G. Foster, L.S. Jermiin, D.A. Hickey, Nucleotide composition bias affects amino acid content in proteins coded by animal mitochondria, J. Mol. Evol. 44 (1997) 282–288.
[12] V. Wilquet, M. Van de Casteele, The role of the codon first letter in the relationship between genomic GC content and protein amino acid composition, Res. Microbiol. 150 (1999) 21–32.
[13] G.A.C. Singer, D.A. Hickey, Nucleotide bias causes a genome-wide bias in the amino acid composition of proteins, Mol. Biol. Evol. 17 (2000) 1581–1588.
[14] A.B. de Miranda, F. Alvarez-Valin, K. Jabbari, W.M. Degrave, G. Bernardi, Gene expression, amino acid conservation, and hydrophobicity are the main factors shaping codon preferences in Mycobacterium tuberculosis and Mycobacterium leprae, J. Mol. Evol. 50 (2000) 45–55.
[15] R.D. Knight, S.J. Freeland, L.F. Landweber, A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes, Genome Biol. 2 (2001), research0010.1–0010.13.
[16] D.P. Kreil, C.A. Ouzounis, Identification of thermophilic species by the amino acid compositions deduced from their genomes, Nucleic Acids Res. 29 (2001) 1608–1615.
[17] M. Kanehisa, S. Goto, KEGG: kyoto encyclopedia of genes and genomes, Nucleic Acids Res. 28 (2000) 27–30.
[18] T.H. Jukes, R. Holmquist, H. Moise, Amino acid composition of proteins: selection against the genetic code, Science 189 (1975) 50–51.
[19] S. Karlin, L. Brocchieri, A. Bergman, J. Mrazek, A.J. Gentles, Amino acid runs in eukaryotic proteomes and disease associations, Proc. Natl. Acad. Sci. USA 99 (2002) 333–338.
[20] S. Karlin, C. Burge, Dinucleotide relative abundance extremes: a genomic signature, Trends Genet. 11 (1995) 283–290.
[21] A.P. Bird, DNA methylation and the frequency of CpG in animal DNA, Nucleic Acids Res. 8 (1980) 1499–1504.
[22] F.D. Amicis, S. Marchetti, Intercodon dinucleotides affect codon choice in plant genes, Nucleic Acids Res. 28 (2000) 3339–3345.
[23] A.F. Gentles, S. Karlin, Genome-scale compositional comparisons in eukaryotes, Genome Res. 11 (2001) 540–546.
[24] P. Schattner, Searching for RNA genes using base-composition statistics, Nucleic Acids Res. 30 (2002) 2076–2082.
[25] N. Echols, P. Harrison, S. Balasubramanian, N.M. Luscombe, P. Bertone, Z. Zhang, M. Gerstein, Comprehensive analysis of amino acid and nucleotide composition in eukaryotic genomes, comparing genes and pseudogenes, Nucleic Acids Res. 30 (2002) 2515–2523.
[26] H. Akashi, R.M. Kliman, A. Eyre-Walker, Mutation pressure, natural selection, and the evolution of base composition in Drosophila, Genetica 102 (1998) 49–60.
[27] Y. Nakamura, T. Gojobori, T. Ikemura, Codon usage tabulated from international DNA sequence databases: status for the year 2000, Nucleic Acids Res. 28 (2000) 292.
[28] G.M. Edelman, J. Gally, Degeneracy and complexity in biological systems, Proc. Natl. Acad. Sci. USA 98 (2001) 13763–13768.