Professional Documents
Culture Documents
Abstract
Computational prediction of eukaryotic promoters from the nucleotide sequence is one of the most attractive
problems in sequence analysis today, but it is also a very dicult one. Thus, current methods predict in the order of
one promoter per kilobase in human DNA, while the average distance between functional promoters has been
estimated to be in the range of 30±40 kilobases. Although it is conceivable that some of these predicted promoters
correspond to cryptic initiation sites that are used in vivo, it is likely that most are false positives. This suggests that
it is important to carefully reconsider the biological data that forms the basis of current algorithms, and we here
present a review of data that may be useful in this regard. The review covers the following topics: (1) basal
transcription and core promoters, (2) activated transcription and transcription factor binding sites, (3) CpG islands
and DNA methylation, (4) chromosomal structure and nucleosome modi®cation, and (5) chromosomal domains and
domain boundaries. We discuss the possible lessons that may be learned, especially with respect to the wealth of
information about epigenetic regulation of transcription that has been appearing in recent years. # 1999 Elsevier
Science Ltd. All rights reserved.
0097-8485/99/$ - see front matter # 1999 Elsevier Science Ltd. All rights reserved.
PII: S 0 0 9 7 - 8 4 8 5 ( 9 9 ) 0 0 0 1 5 - 7
192 A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207
Table 1
Servers and software for promoter ®nding
General gene®nders with pol-II detection and other feature detectors (MARs, CpG-islands)
GENSCAN (Burge and Karlin, 1997) http://CCR-081.mit.edu/GENSCAN.html
GRAIL Matis et al., 1996; Uberbacher et al., 1996) http://compbio.ornl.gov/Grail-1.3/
MAR-Finder (Singh, 1997) http:/www.ncgr.org/MarFinder/
WebGene (Milanesi et al., 1996; Milanesi and Rogozin, in press). http://itba.mi.cnr.it/webgene/
scription of genes involved in cell growth is one of the consider to be relevant for computational biologists
major causes of all forms of cancer. involved in construction of promoter ®nding algor-
Several dierent algorithms for the prediction of ithms, and also make suggestions for how this infor-
promoters, transcriptional start points, and transcrip- mation may be put to use. We will emphasize the
tion factor binding sites in eukaryotic DNA sequence concept of epigenetic regulation, (i.e., the `modulation
now exist (Bucher et al., 1996; Fickett and of gene expression achieved by mechanisms superim-
Hatzigeorgiou, 1997, Table 1). Not included in Table 1 posed upon that conferred by primary DNA sequence'
are general methods for discovery of patterns in biose- (Gasser et al., 1998)), since there is now a wealth of
quences. For a recent review see Brazma et al., (1998). evidence that such mechanisms play an important role
Although current algorithms perform much better than in all transcriptional control (see e.g., Simpson, 1991;
the earlier attempts, it is probably fair to say that per- Eden and Cedar, 1994; Paranjape et al., 1994; Geyer,
formance is still far from satisfactory. Thus, the gen- 1997; Gottesfeld and Forbes, 1997; Grunstein, 1997;
eral picture is that when promoter prediction Tsukiyama and Wu, 1997; Cavalli and Paro, 1998;
algorithms are used under conditions where they ®nd a Gasser et al., 1998; Gregory and HoÈrz, 1998; Kellum
reasonable percentage of promoters, then the amount and Elgin, 1998; Lu and Eissenberg, 1998).
of falsely predicted promoters (false positives) is far
too high. Existing methods predict in the order of one
promoter per kilobase, while it is estimated that the 2. Basal transcription and core promoters
human genome on average only contains one gene per
30±40 kilobases (Antequera and Bird, 1993). Promoter 2.1. A core promoter is a binding site for RNA
prediction algorithms have recently been thoroughly polymerase and general transcription factors
reviewed by Fickett and Hatzigeorgiou (1997); also see
Bucher et al., (1996), with regard to both function and Eukaryotes have three dierent RNA-polymerases
relative performance, and we will not attempt to repeat that are responsible for transcribing dierent subsets
that work here but refer interested readers to these of genes: RNA-polymerase I transcribes genes encod-
papers. Rather, we will review biological data that we ing ribosomal RNA, RNA-polymerase II (which we
A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207 193
Fig. 1. Core-promoter complexed with RNA-polymerase II and general transcription factors. Shown core promoter elements are
the TATA-box (TATA, usually around ÿ30), the initiator (Inr, around the start point), and the downstream promoter element
(DPE, around +30). The DPE is present in some TATA-less, Inr-containing promoters (Burke and Kadonaga, 1997).
will focus on in this review) transcribes genes encoding 2.2. More than one class of minimal promoters exist
mRNA and certain small nuclear RNAs, while RNA-
polymerase III transcribes genes encoding tRNAs and Is it, then, this minimal set of sequences required for
other small RNAs (Huet et al., 1982; Breant et al., basal transcription that computational biologists
1983; Allison et al., 1985). should strive to recognize? One problem with this con-
RNA polymerase II (pol-II) consists of more than cept is that, in fact, there is not just one class of mini-
ten subunits, some of which are partly homologous to mal promoters that have the above characteristics
the a, b, and b 0 subunits of the RNA polymerase in (Smale, 1994a). One important class of minimal (or
Escherichia coli (Allison et al., 1985). The eukaryotic core) promoters only consists of a TATA-box, which
pol-II enzyme is not in itself capable of speci®c tran- directs transcriptional initiation at a position about 30
scriptional initiation in vitro, but needs to be sup- bp downstream. The TATA-box has the consensus
plemented with a set of so-called general transcription TATAAAA which appears to be conserved between
factors (GTFs) (Wasylyk, 1988; Zawel and Reinberg, most eukaryotes although several mismatches are
1993; Orphanides et al., 1996). The most important of allowed (Hahn et al., 1989; Singer et al., 1990; Wobbe
and Struhl, 1990). It is bound by a subunit of TFIID
these factors are TFIIA, TFIIB, TFIID, TFIIE,
known as TBP (the TATA binding protein)
TFIIF, and TFIIH. The GTFs are named after the
(Hernandez, 1993; Burley and Roeder, 1996). It has
order in which they were puri®ed and discovered.
been found that TBP is present in the pre-initiation
Together, these factors, most of which consist of mul-
complexes with all three RNA polymerases
tiple subunits, contain approximately 30 polypeptides.
(Hernandez, 1993; Burley and Roeder, 1996; Lee and
Only when the polymerase and all the general tran-
Young, 1998) although new in vitro data suggest that
scription factors (with the possible exception of
under some speci®c circumstances it may be possible
TFIIA) are assembled on a piece of DNA transcription to initiate transcription without TBP (Wieczorek et al.,
can commence. Although the assembly of this so-called 1998), or with a TBP related factor (TRF)
pre-initiation complex was originally described to (Buratowski, 1997; Hansen et al., 1997). In addition to
occur in an ordered stepwise manner (Orphanides et TBP, TFIID also contains a number of TBP associated
al., 1996), newer experiments suggest that transcrip- factors (TAFs) (Tanese and Tjian, 1993; Goodrich and
tional initiation in vivo normally involves binding of a Tjian, 1994; Tansey and Herr, 1997).
holoenzyme complex including pol-II and many or all A second class of minimal promoters do not contain
of the GTFs in a single step (Greenblatt, 1997). The any TATA box (and are, therefore, referred to as
minimal pol-II promoter is typically de®ned as the set TATA-less). In these promoters, the exact position of
of sequences that is sucient for assembly of such a the transcriptional start point may instead be con-
pre-initiation complex, and for exactly specifying the trolled by another basic element known as the initiator
point of transcriptional initiation in vitro (Fassler and (Inr) (Smale, 1994a, 1997). The Inr is positioned so
Gussin, 1996, Fig. 1). Transcription that is initiated by that it surrounds the transcription start point, and has
this minimal set of proteins is referred to as basal tran- the (rather loose) consensus PyPyAN[TA]PyPy where
scription. Py is a pyrimidine (C or T), and N is any nucleotide
194 A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207
(Bucher, 1990; Smale, 1994b). The ®rst A is at the Lorch, 1995; Fassler and Gussin, 1996; Gottesfeld and
transcription start site, and the pyrimidine positioned Forbes, 1997; Grunstein, 1997)Ðinstead it is mainly
just upstream of that is often cytosine. Promoters con- an in vitro phenomenon that is useful for the analysis
taining only an Inr are typically somewhat weaker of the basal transcriptional machinery. But what is the
than TATA-containing promoters (Smale, 1997). reason for thisÐwhy are cryptic core promoters,
Interestingly, TBP has been shown also to participate which must exist in abundance in any genome, not uti-
in transcription initiated on TATA-less promoters lized in the living cell? At least part of the explanation
(Smale, 1997). In addition to the two promoter-classes appears to rely on the three-dimensional structure of
mentioned above, there are also promoters which have DNA in the nucleus, which has been found to have a
both TATA and Inr elements, and promoters that generally repressing eect on transcription (Paranjape
have neither (Smale, 1994a). et al., 1994; Kornberg and Lorch, 1995; Gottesfeld and
Another promoter element, which was recently dis- Forbes, 1997; Grunstein, 1997). We will return to chro-
covered in both human and Drosophila, is present in mosomal structure and the implications for promoter
some TATA-less, Inr-containing promoters about 30 activity and promoter prediction below. However, ®rst
bp downstream of the transcriptional start point we will consider the phenomenon of transcriptional ac-
(Burke and Kadonaga, 1997). This element, which is tivation.
known as the downstream promoter element (DPE),
appears to be a downstream analog of the TATA box
in that it assists the Inr in controlling precise transcrip-
tional initiation. 3. Activated transcription and regulatory sequences
2.3. Core promoters and promoter prediction 3.1. Control of transcription by transcription factors
binding to regulatory elements
What does all this mean for promoter prediction?
First of all, it is obvious that there is not one single It has been found that in order to sustain transcrip-
type of core promoter. Instead, several combinations tion in vivo, a core promoter needs additional short
of at least three small elements are capable of directing regulatory elements (Paranjape et al., 1994; Kornberg
assembly of the pre-initiation complex and sustaining and Lorch, 1995; Gottesfeld and Forbes, 1997;
basal transcription in vitro. Secondly, it appears that Grunstein, 1997). These elements are located at vary-
sequence downstream may also aect promoter func- ing distances from the transcriptional start point.
tion. This is important to keep in mind, since some Thus, some regulatory elements (so-called proximal el-
promoter prediction methods focus on the upstream ements) are adjacent to the core promoter, while other
part of the promoter only. Thirdly, it is obvious that elements can be positioned several kilobases upstream
under any circumstance the elements described above or downstream of the promoter (so-called enhancers).
cannot be the only determinants of promoter function. Both types of elements are binding sites for proteins
For instance, a sequence conforming perfectly to the (transcription factors) that increase the level of tran-
Inr consensus will appear purely by chance approxi- scription from core promoters (Wasylyk, 1988;
mately once every 512 bp in random sequence. Johnson and McKnight, 1989; Mitchell and Tjian,
Furthermore, in one study it was found that applying 1989; McKnight and Yamamoto, 1992; Zawel and
Buchers TATA box weight matrix (Bucher, 1990) to a Reinberg, 1993; Fassler and Gussin, 1996). This
set of mammalian non-promoter DNA sequences, phenomenon is referred to as activated transcription.
resulted in an average of one predicted TATA box Proteins that repress transcription by binding to simi-
every 120 bp (Prestridge and Burks, 1993). Although lar DNA elements also exist. Unlike the small number
this may in part be caused by the somewhat simpli®ed of GTFs, there are several thousands of dierent tran-
description of a binding site that is implicit in a pos- scription factors able to bind to regulatory elements.
ition weight matrix, it is nevertheless clear that there In fact, it has been estimated that factors involved in
are far more perfect TATA-boxes in, for instance, the transcriptional regulation make up several percent of
human genome than there are promoters. Since there the proteins encoded in the vertebrate genome.
is good evidence that such promiscuous transcriptional One very important way transcription factors
initiation does not take place in vivo (at least not to achieve transcriptional activation is by recruitment of
this degree), it is obvious that there are other import- the basal transcriptional machinery to the promoter
ant factors besides the core promoter elements. This is through protein±protein interactions, either directly or
further supported by the fact that basal transcription through adaptor proteins (Pugh and Tjian, 1990;
(de®ned as transcription initiated by only pol-II and Tanese and Tjian, 1993; Stargell and Struhl, 1996;
the GTFs on a minimal promoter) is practically non- Ptashne and Gann, 1997). In the case of binding sites
existent in vivo (Paranjape et al., 1994; Kornberg and located far from the promoter, it is believed that the
A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207 195
protein±protein interactions involve looping out of the that similar elements will be found purely by chance
intervening DNA (Adhya, 1989; Matthews, 1992). all over the genome. In accordance with this qualitat-
Regulatory regions, controlling the transcription of ive evaluation, statistical analysis also indicates that
eukaryotic genes, typically contain several transcription the density of (currently known) regulatory elements
factor binding sites strung out over a large region. does not contain sucient information to discriminate
Some of these individual binding sites are able to bind between promoters and non-promoters (Prestridge and
several dierent members of a family of transcription Burks, 1993; Zhang, 1998a).
factors, or perhaps dierent dimeric complexes of re- Although this may seem disheartening, it is import-
lated monomers. Which particular factor that binds to ant to remember that in the cell, after all, promoters
a given site, therefore, not only relies on the binding are recognized correctly by the transcriptional appar-
site, but also on what factors are available for binding atus. In some form the necessary information must
in a given cell type at a given time. It is by the modu- therefore be present. But what, then, is the reason for
lar and combinatorial nature of transcriptional regula- the poor performance? Firstly, one fundamental reason
tor regions that it is possible to precisely control the may be that in most computational approaches, pro-
temporal and spatial expression patterns of the tens of moters are being searched for in single stranded
thousands of genes present in higher eukaryotes sequence. The transcription apparatus, however, is
(Schirm et al., 1987; Dynan, 1989; Diamond et al., designed to deal with chromatin, not single stranded
1990; Lamb and McKnight, 1991; McKnight and DNA, as template (This situation is very dierent
Yamamoto, 1992). Thus, any given gene will typically from a search for, say, the correct exon-intron organiz-
have its very own pattern of binding sites for transcrip- ation in a gene, where the biological object being pro-
tional activators and repressors ensuring that the gene cessed by the splicing machinery is in fact single
is only transcribed in the proper cell type(s) and at the stranded pre-mRNA.) In a promoter prediction algor-
proper time during development. Other genes are ithm all this boils down to a proper way of represent-
expressed only in response to extracellular stimuli such ing the essential symmetries in double stranded versus
as for instance blood sugar level or viral infection, single stranded sequence. It is possible that the use of
while still others are expressed more or less constitu- strand-invariant encodings of DNA sequences may be
tively in most cell types. The latter class of genes helpful in this regard (Baldi et al., 1998). One example
include those encoding proteins involved in basal where such symmetries are known to be important, is
metabolism, and are sometimes referred to as `house- in connection with transcription factor binding sites
keeping genes'. Transcription factors themselves are, of that are functional in both orientations. As we
course, also subject to similar transcriptional regu- suggested above, another possibility is that perhaps it
lation, thereby forming transcriptional cascades and is necessary to reconsider the nature of the questions
feed-back control loops. Striking and beautiful we are asking. Is the problem in the form `discriminate
examples of the complexity of transcriptional regu- between all promoters and all non-promoters' possibly
lation include, for instance, the Drosophila even-skipped too large for any single algorithm? Perhaps speci®c
gene (Small et al., 1991; Jackle and Sauer, 1993), and sub-classes of promoters need speci®c sub-algorithms
the human b-globin gene (Evans et al., 1990; Minie et that look for speci®c signals in order to be detected,
al., 1992; Crossley and Orkin, 1993; Higgs and Wood, and the bigger problem can then only be solved by pie-
1993). cing together many such methods? The prediction of
muscle-speci®c promoters is one step in this direction
3.2. Transcription factor binding sites and promoter (Fickett, 1996a, b; Wasserman and Fickett, 1998). Of
prediction course, it is also conceivable that there is enough infor-
mation in the binding sites, but that we simply have
While this is all very nice and interesting from a bi- not yet ®gured out how to properly integrate the sig-
ologist's point of view, it seems to spell big trouble for nals. Another possibility is that there is some element
promoter prediction. Not only are there thousands of of the transcriptional regulation that cannot be directly
transcriptional regulators, many of which have recog- deduced from the DNA sequence, but instead relies on
nition sequences that are not yet characterized, but mechanisms operating at other levels. In fact, it is well
any given sequence element might be recognized by known that higher level chromatin folding represses
dierent factors in dierent cell types. Alternatively, a general transcription, and that unfolding of DNA
perfect consensus binding site near a promoter might through the action of histone acetyltransferases is im-
never be bound because the corresponding factor is portant for transcriptional activation (Paranjape et al.,
not present under the set of circumstances where the 1994; Kornberg and Lorch, 1995; Grunstein, 1997).
gene is transcribed. As in the case of core promoters, We will return to possible ways of attacking this par-
the fact that the regulatory elements are short and not ticular problem below.
completely conserved in sequence furthermore means Under all circumstances, it seems like a good idea to
196 A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207
look for additional signals (besides transcription factor (Li et al., 1992). Mutant mouse embryos display sig-
binding sites) that are correlated with the presence of ni®cantly lower levels of DNA methylation and die in
transcription starts. Of course, the fact that a signal is mid-gestation. It is believed that this phenotype is re-
helpful in determining whether a promoter is genuine, lated to the fact that DNA methylation in vertebrates
does not necessarily mean that the signal is involved in has been found to have a repressing eect on transcrip-
transcriptional regulation. Thus, the presence of a tional initiation, possibly mediated by the binding
downstream coding region is helpful in identifying pro- of a speci®c methyl-CpG binding protein (Boyes and
moters in bacteria, but the coding region generally has Bird, 1992; Eden and Cedar, 1994). It is likely to be
no in¯uence on the transcriptional process. Rather, it side-eects from the lack of methylation-mediated
is the evolutionary history of bacteria that has resulted repression that cause mutant mouse embryos to die.
in very compact genomes where promoters are located Based on these observations it has been suggested that
in short intergenic regions. A feature that may be cor- general DNA methylation may have evolved as a way
related with promoters in vertebrate genomes for simi- of reducing background noise transcription, and that
larly historical reasons, are the so-called CpG islands it made the subsequent development of the complex
that we will discuss in the next section. vertebrate lineage possible (Bird, 1993).
4.4. CpG islands and promoter prediction some core particle which consists of approximately 146
bp of DNA wrapped around an octamer composed of
The correlation between CpG islands and promoters two molecules each of the four core histones (H2A,
may, therefore, be historical rather than functional, H2B, H3, and H4) (Richmond et al., 1984; Luger et
but it is nevertheless likely to be useful in connection al., 1997). Higher order structures are formed by fold-
with promoter prediction. In fact, this might be just ing of nucleosomal arrays and are stabilized by inter-
the kind of global signal we mentioned above. There is action with other nuclear proteins including perhaps
currently no publicly available promoter ®nding soft- the linker histone H1 (Wole et al., 1997;
ware that utilizes this correlation, but Junier, Krogh, Ramakrishnan, 1997).
and Bucher are currently developing such an algorithm
that combines a search for CpG islands with an 5.2. Chromatin represses transcription
HMM-based detection of core promoter elements
(Krogh, personal communication). Furthermore, CpG This densely packed state limits the accessibility of
island detection can be performed using a feature in the DNA for the basal transcriptional apparatus and
the WebGene server (Milanesi and Rogozin, in press, has been found to inhibit transcriptional initiation in
Table 1) and is also available as a feature in the vivo (Paranjape et al., 1994; Kornberg and Lorch,
GRAIL gene ®nder, although it is not currently used 1995; Gottesfeld and Forbes, 1997; Grunstein, 1997).
as a signal for promoter ®nding in that method (Matis Compared to naked DNA, chromatin is therefore in a
et al., 1996; Uberbacher et al., 1996, Table 1). All state of transcriptional repression. This is presumably
these methods de®ne CpG islands according to one reason why cryptic core promoters seem to be
Gardiner-Garden and Frommer (1987). According to practically inactive in living cells, and is also important
this de®nition, a CpG island is a region that (1) is for the very tight regulation of gene expression in vivo.
more than 200 bp long, (2) has more than 50% G+C Thus, while activator proteins typically increase tran-
(i.e., pG pC > 0:5), and (3) has a CpG dinucleotide scription from a naked DNA template around ten-
frequency that is at least 0.6 of that expected on the fold, the activation seen in vivo can be a thousand-fold
basis of the nucleotide content of the region (i.e., or more. Hence, derepression of transcription by par-
pCpG > 0:6 pC pG ). However, as mentioned above tial unfolding of chromatin constitutes an important
the phenomenon is only useful for analysis of ver- part of gene regulation (Tsukiyama and Wu, 1997;
tebrate genomes, meaning that we will under all cir- Davie, 1998; Mizzen and Allis, 1998; Turner, 1998;
cumstances have to employ alternative methods in Workman and Kingston, 1998), and several transcrip-
connection with the non-vertebrate multicellular eukar- tion factors and transcriptional co-activators have been
yotic model organisms (e.g., Drosophila, Dictyostelium shown to work by disrupting or remodeling chromatin
and C. elegans ) currently being sequenced. structure (Brownell and Allis, 1995; Brownell et al.,
It has been noted that, at least in warm-blooded ver- 1996; Mizzen et al., 1996; Ogryzko et al., 1996; Pazin
tebrates, there is a correlation between CpG islands and Kadonaga, 1997; Mizzen and Allis, 1998; Turner,
and GC-rich isochores (Isochores are long genomic 1998; Workman and Kingston, 1998). Besides the gen-
segments with homogeneous base-composition, found erally repressive eect of chromatin on transcription,
in vertebrate genomes. They are divided into dierent there are also several known cases where precisely
families that are characterized by having dierent GC- positioned nucleosomes are directly involved in tran-
levels (Bernardi, 1993; Bernardi, 1995).) However, the scriptional regulation (Richard-Foy and Hager, 1987;
causality of this correlation is somewhat unclear Simpson, 1991; Hayes and Wole, 1992; Lu et al.,
(AõÈ ssani and Bernardi, 1991a; AõÈ ssani and Bernardi, 1994; Simpson et al., 1994; Wole, 1994; Zhu and
1991b; Cross et al., 1991; Cross and Bird, 1995). Thiele, 1996).
(Caption opposite).
A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207 199
Calladine et al., 1988; Bolshoy et al., 1991; Simpson, ing to suggest that this is a signal for positioning
1991; Hunter, 1993; Goodsell and Dickerson, 1994; Lu nucleosomes right at the transcriptional start point.
et al., 1994; Brukner et al., 1995a; Bolshoy, 1995; Iyer Positioning of nucleosomes near the transcriptional
and Struhl, 1995; Wole and Drew, 1995; Hunter, start point could be related to the tight regulation of
1996; Ioshikhes et al., 1996; Widom, 1996; Zhu and gene expression that is often observed in vivo. We
Thiele, 1996; Liu and Stein, 1997). This means that it suggest the use of this structural pro®le as one extra
is perhaps possible to capture essential features of even signal for promoter ®nding, and are currently in the
the epigenetic parts of transcriptional regulation process of developing such an algorithm. Brie¯y our
through structural analysis of the DNA sequence. We method is based on two sensors: one that detects the
suggest that the additional use of such signals is likely overall structural pro®le (the high-to-low bendability
to improve the performance of promoter prediction al- shift) and another that looks for a periodic sequence
gorithms (Benham, 1996; Karas et al., 1996; Pedersen pattern.
et al., 1998). It has been noted that a DNA-bendability pro®le
As one example of how the use of second-order averaged over all possible heptamers conforming to
characteristics of the sequence (in this case DNA bend- the very loose consensus sequence of the initiator el-
ability) can be used to investigate promoters, we will ement (PyPyA+1N[TA]PyPy) displays a single distinct
brie¯y describe some recent results from our group high-bendability peak at position +1. This is caused
(Pedersen et al., 1998). By analyzing a large set of by the fact that all eight triplets described by the sub-
unrelated i.e., non-sequence similar) human pol-II pro- consensus PyA+1N have high bendability (Pedersen et
moter sequences using sequence-dependent models of al., 1998). It was, therefore, tentatively suggested that
DNA structure, we have recently found what appears at least part of the sequence requirements for a func-
to be a general structural feature that is present in a tional Inr are of a structural nature. Based on this and
majority of the investigated promoter sequences (Fig. similar observations, it is tempting to suggest that
2). Speci®cally, computational analysis using three some of the sequence heterogeneity that is seen in tran-
independent models of DNA ¯exibility (Satchwell et scription factor binding sites, may in reality represent
al., 1986; Brukner et al., 1990, 1995a, b; Hassan and a more conserved structural motif underneath (Karas
Calladine, 1996) shows that a set of promoters with et al., 1996; Grove et al., 1998; de Souza and Ornstein,
low sequence similarity displays an average tendency 1998).
for low bendability upstream of the TATA-box, and
high bendability downstream of the transcriptional 5.4. Chromosomal domains and domain boundaries
start point. Within the downstream region there are
strong indications of periodic sequence and bendability An additional level of chromosomal structure that
patterns in phase with the DNA helical pitch. This per- may be relevant for promoter function is the somewhat
iodic pattern is very similar to that known from X-ray controversial organization of eukaryotic chromosomes
structures of the nucleosome core particle and tabula- into very large loops through attachment to a protein-
tions of preferred sequence locations on nucleosomes. aceous matrix (Laemmli et al., 1992; Cremer et al.,
These results, therefore, indicate that on average the 1993; Saitoh and Laemmli, 1993; Vazquez et al., 1993;
DNA in the region downstream of the start point in a Dillon and Grosveld, 1994; Bode et al., 1995;
large set of unrelated promoters is able to assume a Gardiner, 1995). The DNA sequences that bind to
macroscopically curved structure (e.g., to be wrapped nuclear matrix in vitro (and thus de®ne the base of
around protein) very similar to that of DNA in a these loops) are called Matrix or Scaold Attachment
nucleosome. Since the length of the high bendability Regions (MARs/SARs) (Laemmli et al., 1992; Bode et
region is approximately the same as the length of al., 1995). It has been suggested that DNA loops may
DNA wrapped around a histone octamer, it is tempt- correspond to units of gene regulation, i.e., regions
Fig. 2. Average bendability pro®les of the human promoter sequences. Position +1 corresponds to the transcriptional start point.
(a) A non-redundant set of 624 human promoters was aligned using a hidden Markov model, and the average bendability for each
position in the promoter calculated using DNase I-derived bendability parameters (Brukner et al., 1995a). Larger values correspond
to higher bendability (or propensity for major groove compressibility). The two peaks around position ÿ30 are caused by TATA-
box containing promoters. The pro®le has been smoothed by calculating a running average with a window of size 20. (b) Average
¯exibility pro®le calculated from the aligned promoters using a trinucleotide model based on preferred sequence location on nucleo-
somes (Satchwell et al., 1986). Lower values correspond to more ¯exible sequences which have less preference for being positioned
speci®cally. (c) Average ¯exibility pro®le based on propeller twist values from X-ray crystallography of DNA oligomers (Hassan
and Calladine, 1996). Higher (less negative) propeller-twist corresponds to higher ¯exibility. Pro®les (b) and (c) have been smoothed
by calculating a running average with a window of size 30. Note that all three pro®les show a tendency for higher ¯exibility down-
stream of the transcriptional start point. From (Pedersen et al., 1998)
200 A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207
guidelines for when a prediction should be accepted. Uberbacher et al., 1996; Burge and Karlin, 1997;
Most will probably agree that if a method consistently Lukashin and Borodovsky, 1998), and it would then
predicts start sites to be within relatively few bp of the be natural to also include other feature detectors nor-
annotated sites then the algorithm is doing very well. mally used in general gene ®nding (coding DNA, splice
This view also makes good biological sense: in many sites, poly-A signals, etc.).
cases promoters do display more than one start. In our experience, the simultaneous use of signals at
Furthermore, there are probably many cases where both the local and global levels has a consistently ben-
only one of several start sites is reported in available e®cial eect on the performance of feature detectors.
databasesÐa situation which necessarily must result in Examples include, the prediction of splice sites, signal
some degree of ambiguity in prediction methods con- peptides, translation start sites, and glycosylation sites
structed from these data. (Brunak et al., 1991; Hansen et al., 1995, 1998;
Korning et al., 1996; Nielsen et al., 1996, 1997;
6.2. Epigenetic control and long-range interactionsÐ Pedersen and Nielsen, 1997; Tolstrup et al., 1997).
more signals for promoter ®nding Presumably, the reason behind this behavior is that
instead of having a ®xed, absolute threshold for when
Another fundamental question is whether there is, in a putative local signal is genuine, global signals are in
fact, enough information present in the local DNA eect able to modulate the cut-o for the local signal.
sequence to de®ne promoters. Thus, statistical analysis Consider, for instance, the case of predicting signal
of the density of transcription factor binding sites peptide cleavage sites in protein sequences: obviously
suggests that this alone is insucient to unambigu- the likelihood that a putative cleavage site is func-
ously discriminate between promoters and non-promo- tional, depends on whether it is adjacent to a signal
ters (Prestridge and Burks, 1993; Zhang, 1998a). From peptide-like sequence (positioned upstream) or not. In
a biological point of view, this may seem to be a this context, it is perhaps also interesting that some
strange thoughtÐthe vertebrate cell, after all, is researchers have found that within any given promoter
capable of correctly controlling the expression of sev- sequence, it is usually the strongest signal that is cor-
eral tens of thousands of genes. However, such a situ- rect, regardless of the absolute strength (Hutchinson,
ation could arise due to epigenetic control of 1996; Audic and Claverie, 1998).
transcription, i.e., regulation by signals superimposed
upon the primary DNA sequence. In accordance with 6.3. Choice of data sets for construction of algorithms
this, it is known that chromatin folding and unfolding
do indeed play a very important role in transcriptional Careful selection of the data set that is used for con-
control, as does DNA methylation (Boyes and Bird, structing and testing a promoter ®nding algorithm is
1992; Eden and Cedar, 1994; Paranjape et al., 1994; of course of the utmost importance. Many existing
Kornberg and Lorch, 1995; Gottesfeld and Forbes, methods use recognition of transcription factor binding
1997; Grunstein, 1997). The fact that transcriptional sites, based on lists of sites present in databases such
initiation can be dependent on long range interactions as TRANSFAC (Wingender et al., 1998). Since this
between factors bound at promoters and other factors presumably is very similar to what the cell does, it is
bound at distant sites, further suggests that local DNA an intuitively pleasing approach and one that seems
sequence alone may not be sucient to de®ne a pro- likely to succeed. However, some of the sites present in
moter. these databases may have only limited experimental
We believe that one way of approaching this pro- evidenceÐa situation which could seriously aect the
blem is to include the use of more global signals in ad- performance of algorithms that use them. It is always
dition to the local signals mainly used by most current advisable to personally check the original references to
algorithms (TATA-box, Inr, CCAAT-box, etc.). sites of interest, although this will prove to be very
Exactly what global signal(s) to use is an open ques- hard work if more than a few sites are to be included
tion, but based upon the data reviewed here, two inter- in an analysis. There are undoubtedly also biologically
esting new candidates emerge: (1) CpG islands, which based problems regarding the use of lists of transcrip-
in vertebrate genomes often are correlated with promo- tion factor binding sites. Thus, it is well known that if
ter position, and (2) chromosomal domain boundaries a given transcription factor binds to sites present in a
or domain insulators, which may in some cases delimit set of, e.g., 50 dierent promoters, then the sequences
transcriptional units in genomic DNA. The use of sev- of these dierent sites may display considerable diver-
eral signals simultaneously might be implemented in a sity. The same will also be the case for the strength
manner similarly to the `grammatical' approaches (binding energies) of the corresponding protein-DNA
known from general gene ®nding methods interactions. For instance, the protein may in some
(Borodovsky and McIninch, 1993; Krogh et al., 1994; cases interact with other proteins that enable it to bind
Matis et al., 1996; Rosenblueth et al., 1996; to very weak sites. Under all circumstances: if all 50
202 A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207
binding sites are included in a database and sub- selected from the parameters of the extreme-value dis-
sequently used to design a method for recognizing the tribution which the resulting alignment scores follow
site, then the information present in the weak binding (Pedersen, et al. in preparation).
sites can be said to be overemphasized, resulting per-
haps in a method that predicts far too many false posi- 6.4. Perspectives: DNA structural analysisÐmore levels
tives. It is an interesting thought that binding energies to one signal?
in the dierent protein-DNA interactions could be
used to weight the dierent sequences when construct- There are indications that DNA sequence-dependent
ing recognition algorithms. Unfortunately, it is likely three-dimensional structure may be important for tran-
that the dierent conditions used in dierent exper- scriptional regulation both at the level of single bind-
imental setups will make such comparisons practically ing sites and at the level of entire promoter regions
useless. Furthermore, any truly successful method (Lisser and Margalit, 1994; Karas et al., 1996;
should, of course, be able to also predict the weaker Benham, 1996; Liu and Stein, 1997; Grove et al., 1998;
sites, so under-estimating their importance may prove Pedersen et al., 1998; de Souza and Ornstein, 1998).
problematic as well. We have no simple answer to this Thus, it is possible that sequence heterogeneity of the
problem. dierent binding sites of a transcription factor may, to
An alternative to using lists of transcription factor some degree, re¯ect an underlying conserved DNA
binding elements, is to rely on annotated transcription structure. Furthermore, it is also possible that the
start sites. The presence of an initiation site must same is true for sequence surrounding the transcrip-
necessarily imply that the surrounding sequence has tional start point. It may, therefore, be relevant to
promoter activity. One advantage to this approach is explicitly assist promoter ®nding algorithms in recog-
that usually there is very little doubt about the validity nizing structural features of DNA sequences. One way
of the experimental evidence for a transcriptional in- of doing this is to encode DNA sequences in the form
itiation site. The eukaryotic promoter database (EPD) of DNA structural parameters for all overlapping tri-
is based on such experimentally determined transcrip- plets or dinucleotides (Karas et al., 1996; Baldi and
tion starts and is furthermore curated, meaning that all Brunak, 1998; Pedersen et al., 1998). Preliminary inves-
included start sites have been checked with the original tigations suggest that such structural encodings do
literature (Perier et al., 1998). A new web-interface for indeed contain sucient information to recognize fea-
accessing this database seems very useful. It is of tures that are normally considered to be present at the
course also possible to extract sequences directly from sequence level (Baldi and Brunak, 1998; Tolstrup, per-
general nucleotide databases such as GenBank, sonal communication). Furthermore, it is an intriguing
EMBL, or the DNA Data Bank of Japan (Benson et possibility that this may be a way of approaching `epi-
al., 1998; Stoesser et al., 1998; Tateno et al., 1998) by genetic' signals such as chromatin folding and nucleo-
looking for feature keys which indicate that an in- some positioning from the sequence.
itiation site is present e.g., `mRNA' and `prim_tran-
script'). In this way it is usually possible to collect
larger data sets than when using EPD, but it is natu- Acknowledgements
rally also a tradeo with respect to the quality of the
data. AGP and SB are supported by a grant from the
How much sequence to include is an open question, Danish National Research Foundation. The work of
but based on biological data it at least seems that it is PB and YC is in part supported by an NIH SBIR
important to include sequence from both sides of the grant to Net-ID, Inc. We thank Lise Homann for
start point. The amount of sequence also depends on critical comments on this manuscript.
which prediction approach is chosen. Thus, it is
obvious that the use of global signals such as CpG
islands or MARs necessitates the inclusion of more References
¯anking DNA.
Once a set of sequences has been extracted it may be Adhya, S., 1989. Multipartite genetic control elements: com-
relevant to reduce the redundancy or homology pre- munication by DNA looping. Annu. Rev. Genet 23, 227±
sent within the data set. This is especially important 250.
when statistically based methods (including neural net- AõÈ ssani, B., Bernardi, G., 1991a. CpG islands: features and
works) are used, since statistics will otherwise be distribution in the genomes of vertebrates. Gene 106, 173±
biased for the over represented sequences that are pre- 183.
sent in all databases. In this regard, we have developed AõÈ ssani, B., Bernardi, G., 1991b. CpG islands, genes, and iso-
an automatic redundancy reduction technique that is chores in the genomes of vertebrates. Gene 106, 185±195.
based on the use of pairwise alignments and a cut-o Allison, L.A., Moyle, M., Shales, M., Ingles, C.J., 1985.
A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207 203
Extensive homology among the largest subunits of eukar- Approaches to the discovery of patterns in biosequences.
yotic and prokaryotic RNA polymerases. Cell 42, 599±610. J. Comput. Biol. 5 (2), 279±305.
Antequera, F., Bird, A., 1993. Number of CpG islands and Breant, B., Huet, J., Sentenac, A., Fromageot, P., 1983.
genes in human and mouse. Proc. Natl. Acad. Sci. USA Analysis of yeast RNA polymerases with subunit-speci®c
90, 11995±11999. antibodies. J. Biol. Chem 258, 11968±11973.
Audic, S., Claverie, J.-M., 1997. Detection of eukaryotic pro- Brownell, J.E., Allis, C.D., 1995. An activity gel assay detects
moters using Markov transition matrices. Computers a single catalytically active histone acetyltransferase subu-
Chem 21, 223±227. nit in Tetrahymena macronuclei. Proc. Natl. Acad. Sci.
Audic, S., Claverie, J.M., 1998. Visualizing the competitive USA 92, 6364±6368.
recognition of TATA-boxes in vertebrate promoters. Brownell, J.E., Zhou, J., Ranalli, T., Kobayashi, R.,
Trends Genet 14, 10±11. Edmondson, D.G., Roth, S.Y., Allis, C.D., 1996.
Baldi, P., Brunak, S., 1998. Bioinformatics: The Machine Tetrahymena histone acetyltransferase A: a homologue to
Learning Approach. MIT Press, Cambridge, MA. yeast Gcn5p linking histone acetylation to gene activation.
Baldi, P., Chauvin, Y., Brunak, S., Gorodkin, J., Pedersen, Cell 84, 843±851.
A.G., 1998. Computational applications of DNA struc- Brukner, I., Jurukovski, V., Savic, A., 1990. Sequence-depen-
tural scales. ISMB 6. dent structural variations of DNA revealed by DNase I.
Benham, C.J., 1996. Computation of DNA structural variabil- Nucleic Acids Res 18, 891±894.
ityÐa new predictor of DNA regulatory sequences. Brukner, I., Sanchez, R., Suck, D., Pongor, S., 1995a.
CABIOS 12, 375±381. Sequence-dependent bending propensity of DNA as
Benson, D.A., Boguski, M.S., Lipman, D.J., Ostell, J., revealed by DNase I: parameters for trinucleotides.
Ouellette, B.F., 1998. Genbank. Nucleic Acids Res 26, 1± EMBO J 14, 1812±1818.
7. Brukner, I., Sanchez, R., Suck, D., Pongor, S., 1995b.
Trinucleotide models for DNA bending propensity: com-
Bernardi, G., 1993. The isochore organization of the human
parison of models based on DNase I digestion and nucleo-
genome and its evolutionary historyÐa review. Gene 135,
some positioning data. J. Biomol. Struct. Dyn 13, 309±
57±66.
317.
Bernardi, G., 1995. The human genome: organization and
Brunak, S., Engelbrecht, J., Knudsen, S., 1991. Prediction of
evolutionary history. Annu. Rev. Genet 29, 445±476.
human mRNA donor and acceptor sites from the DNA
Bird, A., 1980. DNA methylation and the frequency of CpG
sequences. J. Mol. Biol 220, 49±65.
in animal DNA. Nucleic Acids Res 8, 1499±1504.
Bucher, P., 1990. Weight matrix descriptions of four eukaryo-
Bird, A., 1993. Functions for DNA methylation in ver- tic RNA polymerase II promoter elements derived from
tebrates. Cold Spring Harbor Symp. Quant. Biol 58, 281± 502 unrelated promoter sequences. J. Mol. Biol 212, 563±
285. 578.
Bird, A., Tate, P., Nan, X., Campoy, J., Meehan, R., Cross, Bucher, P., Fickett, J.W., Hatzigeorgiou, A., 1996.
S., Tweedle, S., Charlton, J., Macleod, D., 1995. Studies Computational analysis of transcriptional regulatory el-
of DNA methylation in animals. J. Cell Sci. Suppl 19, 37± ements: a ®eld in ¯ux. CABIOS 12, 361±362.
39. Buratowski, S., 1997. Multiple TATA-binding factors come
Bode, J., Schlake, T., RiÂ;os-RamiÂrez, M., Mielke, C., back into style. Cell 91, 13±15.
Stengert, M., Kay, V., Klehr-Wirth, D., 1995. Scaold/ Burge, C., Karlin, S., 1997. Prediction of complete gene struc-
matrix-attached regions: structural properties creating tures in human genomic DNA. J. Mol. Biol 268, 78±94.
transcriptionally active loci. Int. Rev. Cyt 162A, 389±454. Burke, T.W., Kadonaga, J.T., 1997. The downstream promo-
Bolshoy, A., 1995. CC dinucleotides contribute to the bending ter element, DPE, is conserved from Drosophila to humans
of DNA in chromatin. Nature Struct. Biol 2, 447±448. and is recognized by TAFII60 of Drosophila. Genes Dev
Bolshoy, A., McNamara, P., Harrington, R.E., Trifonov, 11, 3020±3031.
E.N., 1991. Curved DNA without A-A: experimental esti- Burley, S.K., Roeder, R.G., 1996. Biochemistry and structural
mation of all 16 DNA wedge angles. Proc. Natl. Acad. biology of transcription factor IID (TFIID). Annu. Rev.
Sci. USA 88, 2312±2316. Biochem 65, 769±799.
Bonifer, C., Hecht, A., Sauressig, H., Winter, D.M., Sippel, Calladine, C.R., Drew, H.R., McCall, M.J., 1988. The intrin-
A.E., 1991. Dynamic chromatin: the regulatory domain or- sic structure of DNA in solution. J. Mol. Biol 201, 127±
ganization of eukaryotic gene loci. J. Cell. Biochem 47, 137.
99±108. Cavalli, G., Paro, R., 1998. Chromo-domain proteins: linking
Borodovsky, M., McIninch, J., 1993. Genemark: Parallel gene chromatin structure to epigenetic regulation. Curr. Opin.
recognition for both DNA strands. Comput. Chem 17, Cell Biol 10, 354±360.
123±133. Chen, Q.K., Hertz, G.Z., Stormo, G.D., 1997. Promfd 1.0: a
Boulikas, T., 1995. Chromatin domains and prediction of computer program that predicts eukaryotic pol II promo-
MAR sequences. Int. Rev. Cytol 162A, 279±388. ters using strings and IMD matrices. CABIOS 13, 29±35.
Boyes, J., Bird, A., 1992. Repression of genes by DNA meth- Chen, Q.K., Stormo, G.Z.H.G.D., 1995. Matrix search 1.0: a
ylation depends on CpG density and promoter strength: computer program that scans DNA sequences for tran-
evidence for involvement of a methyl-CpG binding pro- scriptional elements using a database of weight matrices.
tein. EMBO J 11, 327±333. Comput. Appl. Biosci 11, 563±566.
Brazma, A., Jonassen, I., Eidhammer, I., Gilbert, D., 1998. Coulondre, C., Miller, J.H., Farabough, P.J., Gilbert, W.,
204 A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207
1978. Molecular basis of base substitution hotspots in Gardiner, K., 1995. Human genome organization. Curr.
Escherichia coli. Nature 274, 775±780. Opin. Genet. Dev 5, 315±322.
Craig, J.M., Bickmore, W.A., 1994. The distribution of CpG Gardiner-Garden, M., Frommer, M., 1987. CpG islands in
islands in mammalian chromosomes. Nature Genetics 7, vertebrate genomes. J. Mol. Biol 196, 261±282.
376±382. Gasser, S.M., Paro, R., Stewart, F., Aasland, R., 1998.
Cremer, T., Kurz, A., Zirbel, R., Dietzel, S., Rinke, B., Epigenetic control of transcription. Introduction: the gen-
SchroÈck, E., Speichel, M.R., Mathieu, U., Jauch, A., etics of epigenetics. Cell. Mol. Life Sci 54, 1±5.
Emmerich, P., Schertan, H., Ried, T., Cremer, C., Lichter, Geyer, P.K., 1997. The role of insulator elements in de®ning
P., 1993. Role of chromosome territories in the functional domains of gene expression. Curr. Opin Gen. Dev 7, 242±
compartmentalization of the cell nucleus. Cold Spring 248.
Harbor Symp. Quant. Biol 58, 777±792. Goodrich, J.A., Tjian, R., 1994. TBP-TAF complexes: selec-
Cross, S., Kovarik, P., Schmidtke, J., Bird, A., 1991. tivity factors for eukaryotic trancription. Curr. Opin. Cell
Nonmethylated islands in ®sh genomes are GC-poor. Biol 6, 403±409.
Nucleic Acids Res 19, 1469±1474. Goodsell, D.S., Dickerson, R.E., 1994. Bending and curvature
Cross, S.H., Bird, A., 1995. CpG islands and genes. Curr. calculations in B-DNA. Nucleic Acids Res 22, 5497±5503.
Opin. Genet. Dev 5, 309±314. Gottesfeld, J.M., Forbes, D.J., 1997. Mitotic repression of the
Crossley, M., Orkin, S.H., 1993. Regulation of the beta-globin transcriptional machinery. TIBS 22, 197±202.
locus. Curr. Opin. Genet. Dev 3, 232±237. Greenblatt, J., 1997. RNA polymerase II holoenzyme and
Davie, J.R., 1998. Covalent modi®cation of histones: ex- transcriptional regulation. Curr. Opin. Cell Biol 9, 310±
pression from chromatin templates. Curr. Opin. Gen. Dev 319.
8, 173±178. Gregory, P.D., HoÈrz, W., 1998. Life with nucleosomes: chro-
de Souza, O.N., Ornstein, R.L., 1998. Inherent DNA curva- matin remodelling in gene regulation. Curr. Opin. Cell.
ture and ¯exibility correlate with TATA box functionality. Biol 10, 339±345.
Biopolymers 46, 403±415. Grove, A., Galeone, A., Yu, E., Mayol, L., Geiduschek, E.P.,
Diamond, M.I., Miner, J.N., Yoshinaga, S.K., Yamamoto, 1998. Anity, stability and polarity of binding of the
K.R., 1990. Transcription factor interactions: selectors of TATA binding protein governed by ¯exure of the TATA
positive or negative regulation from a single DNA el- box. J. Mol. Biol 282, 731±739.
ement. Science 249, 1266±1274. Grunstein, M., 1997. Histone acetylation in chromatin struc-
ture and transcription. Nature 389, 349±352.
Dickerson, R.E., Drew, H., 1981. Structure of a B-DNA
dodecamer. II. In¯uence of base sequence on helix struc- Hagerman, P.J., 1984. Evidence for the existence of stable
ture. J. Mol. Biol 149, 761±786. curvature of DNA in solution. Proc. Natl. Acad. Sci. USA
81, 4632±4636.
Dillon, N., Grosveld, F., 1994. Chromatin domains as poten-
Hahn, S., Buratowski, S., Sharp, P.A., Guarente, L., 1989.
tial units of eukaryotic gene function. Curr. Opin. Genet.
Yeast TATA-binding protein TFIID binds to TATA el-
Dev 4, 260±264.
ements with both consensus and nonconsensus sequences.
Drew, H.R., Travers, A.A., 1985. DNA bending and its re-
Proc. Natl. Acad. Sci. USA 86, 5718±5722.
lation to nucleosome positioning. J. Mol. Biol 186, 773±
Hansen, J., Lund, O., Tolstrup, N., Gooley, A., Williams, K.,
790.
Brunak, S., 1998. NetOglyc: Prediction of mucin type O-
Dynan, W.S., 1989. Modularity in promoters and enhancers.
glycosylation sites based on sequence context and surface
Cell 58, 1±4.
accessibility. Glycoconjugate Journal 15, 115±130.
Eden, S., Cedar, H., 1994. Role of DNA methylation in the Hansen, J.E., Lund, O., Engelbrecht, J., Bohr, H., Nielsen,
regulation of transcription. Curr. Opin. Genet. Dev 4, J.O., Hansen, J.-E.S., Brunak, S., 1995. Prediction of O-
255±259. glycosylation of mammalian proteins: speci®city patterns
Evans, T., Felsenfeld, G., Reitman, M., 1990. Control of glo- of UDP-GalNAC: polypeptide N-acetylgalactosaminyl-
bin gene transcription. Annu. Rev. Cell Biol 6, 95±124. transferase. Biochem J 308, 801±813.
Fassler, J.S., Gussin, G.N., 1996. Promoters and basal tran- Hansen, S.K., Takada, S., Jacobson, R.H., Lis, J.T., Tjian,
scription machinery in eubacteria and eukaryotes: con- R., 1997. Transcription properties of a cell type-speci®c
cepts, de®nitions, and analogies. Meth. Enz 273, 3±29. TATA-binding protein, TRF. Cell 91, 71±83.
Fickett, J.W., 1996a. Coordinate positioning of MEF2 and Hassan, M.A.E., Calladine, C.R., 1996. Propeller-twisting of
myogenin binding sites. Gene 172, GC19±GC32. base-pairs and the conformational mobility of dinucleotide
Fickett, J.W., 1996b. Quantitative discrimination of MEF2 steps in DNA. J. Mol. Biol 259, 95±103.
sites. Mol. Cell. Biol 16, 437±441. Hayes, J.J., Wole, A.P., 1992. The interaction of transcrip-
Fickett, J.W., Hatzigeorgiou, A.G., 1997. Eukaryotic promo- tion factors with nucleosomal DNA. BioEssays 14, 597±
ter recognition. Genome Res 7, 861±878. 603.
Fraser, P., Grosveld, F., 1998. Locus control regions, chroma- Hecht, A., Strahl-Bolsinger, S., Grunstein, M., 1996.
tin activation and transcription. Curr. Opin. Cell Biol 10, Spreading of transcriptional repressor Sir3 from telomeric
361±365. hetrochromatin. Nature 383, 92±96.
Frech, K., Danescu-Mayer, J., Werner, T., 1997. A novel Hernandez, N., 1993. TBP, a universal eukaryotic transcrip-
method to develop highly speci®c models for regulatory tion factor? Genes Dev 7, 1291±1308.
units detects a new LTR in genbank which contains a Higgs, D.R., Wood, W.G., 1993. Understanding erythroid
functional promoter. J. Mol. Biol 270, 674±687. dierentiation. Curr. Biol 3, 548±550.
A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207 205
Holliday, R., 1993. Epigenetic inheritance based on DNA Scaold-associated regioons: cis-acting determinants of
methylation. EXS 64, 452±468. chromatin structural loops and functional domains. Curr.
Huet, J., Sentenac, A., Fromageot, P., 1982. Spot-immunode- Opin. Gen. Dev 2, 275±285.
tection of conserved determinants in eukaryotic RNA Lamb, P., McKnight, S.L., 1991. Diversity and speci®city in
polymerases. Study with antibodies to yeast RNA poly- transcriptional regulation: the bene®ts of heterotypic
merases subunits. J. Biol. Chem 257, 2613±2618. dimerization. Trends Biochem. Sci 16, 417±422.
Hunter, C.A., 1993. Sequence-dependent DNA structure: the Lee, T.I., Young, R.A., 1998. Regulation of gene expression
role of base stacking interactions. J. Mol. Biol 230, 1025± by TBP associated proteins. Genes Dev 12, 1398±1408.
1054. Li, E., Bestor, T.H., Jaenisch, R., 1992. Targeted mutation of
Hunter, C.A., 1996. Sequence-dependent DNA structure. the DNA methyltransferase gene results in embryonic leth-
Bioessays 18, 157±162. ality. Cell 69, 915±926.
Hutchinson, G.B., 1996. The prediction of vertebrate promo- Lisser, S., Margalit, H., 1994. Determination of common
ter regions using dierential hexamer frequency analysis. structural features in Escherichia coli promoters by compu-
CABIOS 12, 391±398. ter analysis. Eur J Biochem 223, 823±830.
Ioshikhes, I., Bolshoy, A., Derenshteyn, K., Borodovsky, M., Liu, K., Stein, A., 1997. DNA sequence encodes information
Trifonov, E.N., 1996. Nucleosome DNA sequence pattern for nucleosome array formation. J. Mol. Biol 270, 559±
revealed by multiple alignment of experimentally mapped 573.
sequences. J. Mol. Biol 262, 129±139. Lu, B.Y., Eissenberg, J.C., 1998. Time out: developmental
Iyer, V., Struhl, K., 1995. Poly (dA:dT), a ubiquitous promo- regulation of heterochromatic silencing in. Drosophila.
ter element that stimulates transcription via its intrinsic Cell. Mol. Life Sci 54, 50±59.
DNA structure. EMBO J 14, 2570±2579. Lu, Q., Wallrath, L.L., Elgin, S.C.R., 1994. Nucleosome posi-
Jackle, H., Sauer, F., 1993. Transcriptional cascades tioning and gene regulation. J. Cell. Biochem 55, 83±92.
Drosophila. Curr. Opin. Cell Biol 5, 505±512. Luger, K., Maeder, A.W., Richmond, R.K., Sargent, D.F.,
Johnson, P.F., McKnight, S.L., 1989. Eukaryotic transcrip- Richmond, T.J., 1997. Crystal structure of the nucleosome
tion regulatory proteins. Ann. Rev. Biochem 58, 799±839. core particle at 2.8 AÊ resolution. Nature 389, 251±260.
Jones, P.A., Rideout, W.M., Shen, J.-C., Spruck, C.H., Tsai, Lukashin, A., Borodovsky, M., 1998. GeneMark.hmm: new
Y.C., 1992. Methylation, mutation and cancer. BioEssays solutions for gene ®nding. Nucleic Acids Res 26, 1107±
14, 33±36. 1115.
Karas, H., KnuÈppel, R., Schultz, W., Sklenar, H., Macleod, D., Charlton, J., Mullins, J., Bird, A., 1994. Sp 1
WIngender, E., 1996. Combining structural analysis of sites in the mouse aprt gene promoter are required to pre-
DNA with search routines for the detection of transcrip- vent methylation of the CpG islands. Genes Dev 8, 2282±
tion regulatory elements. CABIOS 12, 441±446. 2292.
Karpen, G.H., 1994. Position-eect variegation and the new Matis, S., Xu, Y., Shah, M., Guan, X., Einstein, J.R., Mural,
biology of heterochromatin. Curr. Opin. Genet. Dev 4, R., Uberbacher, E.C., 1996. Detection of RNA polymerase
281±291. II promoters and polyadenylation sites in human dna
Kellum, R., Elgin, S.C.R., 1998. Chromatin boundaries: punc- sequence. Comput. Chem 20, 135±140.
tuating the genome. Curr. Biol 8, R521±R524. Matthews, K.S., 1992. DNA looping. Microbiol. Rev 56,
Klug, A., Jack, A., Viswamitra, M.A., Kennard, O., Shakked, 123±136.
Z., Steitz, T.A., 1979. A hypothesis on a speci®c sequence- McGhee, J.D., Felsenfeld, G., 1980. Nucleosome structure.
dependent conformation of DNA and its relation to the Annu. Rev. Biochem 49, 1115±1156.
binding of the lac-repressor protein. J. Mol. Biol 131, 669± McKnight, S.L., Yamamoto, K.R., 1992. Transcriptional
680. Regulation. In: Cold Spring Harbor Laboratory Press, .
Knudsen, S. Promoter 1.0: a novel algorithm for the recog- Plainview, New York.
nition of pol-II promoter sequences, submitted. Mihaly, J., Hogga, I., Barges, S., Galloni, M., Mishra, R.K.,
Kondrakhin, Y.V., Kel, A.E., Romashchenko, N.A.K.A.G., Hagstrom, K., MuÈller, M., Schedl, P., Sipos, L., Gausz, J.,
Milanesi, L., 1995. Eukaryotic promoter recognition by Gyurkovics, H., Karch, F., 1998. Chromatin domain
binding sites for transcription factors. Comput. Appl. boundaries in the bithorax complex. Cell. Mol. Life Sci 54,
Biosci 11, 477±488. 60±70.
Kornberg, R.D., 1977. Structure of chromatin. Annu. Rev. Milanesi, L., Muselli, M., Arrigo, P., 1996. Hamming cluster-
Biochem 46, 931±954. ing method for signals prediction in 5' and 3' regions of
Kornberg, R.G., Lorch, Y., 1995. Interplay between chroma- eukaryotic genes. CABIOS 12, 399±404.
tin structure and transcription. Curr. Opin. Cell. Biol 7, Milanesi, L., Rogozin, I.B. in press. Prediction of human gene
371±375. structure. Bishop, M.J. (ed.) Guide to Human Genome
Korning, P.G., Hebsgaard, S.M., Rouze, P., Brunak, S., 1996. Computing. Academic Press. Cambridge.
Cleaning the GenBank Arabidopsis thaliana data set. Minie, M., Clark, D., Trainor, C., Evans, T., Reitman, M.,
Nucleic Acids Res 24, 316±320. Hannon, R., Gould, H., Felsenfeld, G., 1992.
Krogh, A., Brown, M., Mian, I.S., Sjolander, K., Haussler, Developmental regulation of globin gene expression. J.
D., 1994. Hidden Markov models in computational bi- Cell Sci. Suppl 16, 15±20.
ology: Applications to protein modeling. J. Mol. Biol 235, Mitchell, P.J., Tjian, R., 1989. Transcriptional regulation in
1501±1531. mammalian cells by sequence-speci®c DNA binding pro-
Laemmli, U.K., KaÈas, E., Poljak, L., Adachi, Y., 1992. teins. Science 245, 371±378.
206 A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207
Mizzen, C.A., Allis, C.D., 1998. Linking histone acetylation sequencing speci®c neural networks for promoter and
to transcriptional regulation. Cell. Mol. Life Sci 54, 6±20. splice site recognition. In: Hunter, L., Hunter, T.E. (Eds.),
Mizzen, C.A., Yang, X.-J., Kokubo, T., Brownell, J.E., Biocomputing: Proceedings of the 1996 Paci®c
Bannister, A.J., Owen-Hughes, T., Workman, J.L., Berger, Symposium, . World Scienti®c Publishing Co, Singapore.
S.L., Kouzarides, T., Nakatani, Y., Allis, C.D., 1996. The Richard-Foy, H., Hager, G.L., 1987. Sequence speci®c posi-
TAFII250 subunit of TFIID has histone acetyltransferase tioning of nucleosomes over the steroid-inducible MMTV
activity. Cell 87, 1261±1270. promoter. EMBO J 6, 2321±2328.
Nielsen, H., Engelbrecht, J., Brunak, S., von Heine, G., 1997. Richmond, T.J., Finch, J.T., Rushton, B., Rhodes, D., Klug,
Identi®cation of prokaryotic and eukaryotic signal pep- A., 1984. Structure of the nucleosome core particle at 7 AÊ
tides and prediction of their cleavage sites. Protein Eng 10, resolution. Nature 311, 532±537.
1±6. Rosenblueth, D.A., Thiery, D., Huerta, A.M., Salgado, H.,
Nielsen, H., Engelbrecht, J., von Heijne, G., Brunak, S., 1996. Colladio-Vides, J., 1996. Syntactic recognition of regulat-
De®ning a similarity threshold for a functional protein ory regions in Escherichia coli. CABIOS 12, 415±422.
sequence pattern: the signal peptide cleavage site. Proteins Saitoh, Y., Laemmli, U.K., 1993. From the chromosomal
26, 165±177. loops and the scaold to the classic bands of metaphase
Ogryzko, V., Schiltz, R.L., Russanova, V., Howard, B.H., chromosomes. Cold Spring Harbor Symp. Quant. Biol 58,
Nakatani, Y., 1996. The transcriptional coactivators p300 755±765.
and CBP are histone acetyltransferases. Cell 87, 953±959. Satchwell, S.C., Drew, H.R., Travers, A.A., 1986. Sequence
Orphanides, G., Lagrange, T., Reinberg, D., 1996. The gen- periodicities in chicken nucleosome core DNA. J. Mol.
eral transcription factors of RNA polymerase II. Genes Biol 191, 659±675.
Dev 10, 2657±2683. Schirm, S., Jiricny, J., Schaner, W., 1987. The SV40 enhan-
Paranjape, S.M., Kamakaka, R.T., Kadonaga, J.T., 1994. cer can be dissected into multiple segments, each with a
Role of chromatin structure in the regulation of transcrip- dierent cell type speci®city. Genes Dev 1, 65±74.
tion by RNA polymerase II. Annu. Rev. Biochem 63,
SchuÈbeler, D., Mielke, C., Maas, K., Bode, J., 1996. Scaold/
265±297.
matrix attached regions act upon transcription in a con-
Pazin, M.J., Kadonaga, J.T., 1997. SWI2/SNF2 and related
text-dependent manner. Biochem 35, 11160±11169.
proteins: ATP-driven motors that disrupt protein-DNA in-
Schug, J., Overton, G.C. 1977. Tess: transcription element
teractions? Cell 88, 737±740.
search software on the www. Technical Report CBIL-TR-
Pedersen, A.G., Baldi, P., Chauvin, Y., Brunak, S., 1998.
1997-1001-v0.0, of the Computational Biology and
DNA structure in human RNA polymerase II promoters.
Informatics Laboratory, School of Medicine, University of
J. Mol. Biol 281, 663±673.
Pennsylvania.
Pedersen, A.G., Nielsen, H., 1997. Neural network based pre-
Simpson, R.T., 1991. Nucleosome positioning: occurrence,
diction of translation initiation sites in eukaryotes: per-
mechanisms, and functional consequences. Prog. Nucleic
spectives for EST and genome analysis. ISMB 5, 226±233.
Acids Res. Mol. Biol 40, 143±184.
Pedersen, A.G., Nielsen, H., Brunak, S. Statistical analysis of
translational initiation sites in eukaryotes: patterns are Simpson, R.T., Roth, S.Y., Morse, R.H., Patterson, H.G.,
speci®c for systematic groups, (in prepararion). Cooper, J.P., Murphy, M., Kladde, M.P., Shimizu, M.,
1994. Nucleosome positioning and transcription. Cold
Perier, R.C., Junier, T., Bucher, P., 1998. The eukaryotic pro-
Spring Harbor Symp. Quant. Biol 58, 237±245.
moter database EPD. Nucleic Acids Res 26, 353±357.
Prestridge, D.S., 1995. Prediction of POL II promoter Singer, V.L., Wobbe, C.R., Struhl, K., 1990. A wide variety
sequences using transcription factor binding sites. J. Mol. of DNA sequences can functionally replace a yeast TATA
Biol 249, 923±932. element for transcriptional activation. Genes Dev 4, 636±
Prestridge, D.S., 1997. Signal scan: A computer program that 645.
scans DNA sequences for eukaryotic transcriptional el- Singh, G.B., 1997. Mathematical model to predict regions of
ements. CABIOS 7, 203±206. chromatin attachment to the nuclear matrix. Nucleic Acids
Prestridge, D.S., Burks, C., 1993. The density of transcrip- Res 25, 1419±1425.
tional elements in promoter and non-promoter sequences. Sippel, A.E., SchaÈfer, G., Faust, N., Hecht, A., Bonifer, C.,
Hum. Mol. Genet 2, 1449±1453. 1993. Chromatin domains constitute regulatory units for
Ptashne, M., Gann, A., 1997. Transcriptional activation by the control of eukaryotic genes. Cold Spring Harbor
recruitment. Nature 386, 569±577. Symp. Quant. Biol 58, 37±44.
Pugh, B.F., Tjian, R., 1990. Mechanism of transcriptional ac- Smale, S.T., 1994a. Core promoter architecture for eukaryotic
tivation by Sp1: evidence for coactivators. Cell 61, 1187± protein-coding genes. In: Conaway, R.C., Conaway, J.W.
1197. (Eds.), Transcription: Mechanisms and regulation. Raven
Quandt, K., Frech, K., Karas, H., Wingender, E., Werner, T., Press, New York, pp. 63±81.
1995. Matind and matinspectorÐnew fast and versatile Smale, S.T., 1994b. DNA sequence requirements for tran-
tools for detection of consensus matches in nucleotide scriptional initiator activity in mammalian cells. Mol. Cell.
sequence data. Nucl. Acids Res 23, 4878±4884. Biol 14, 116±127.
Ramakrishnan, V., 1997. Histone H1 and chromatin higher- Smale, S.T., 1997. Transcription initiation from TATA-less
order structure. Crit. Rev. Eukaryot. Gene Expr 7, 215± promoters within eukaryotic protein-coding genes. Bioch.
230. Biophys. Acta 1351, 73±88.
Reese, M.G., Harris, N.L., Eeckman, F.H., 1996. Large scale Small, S., Kraut, R., Hoey, T., Warrior, R., Levine, M., 1991.
A.Gorm Pedersen et al. / Computers & Chemistry 23 (1999) 191±207 207
Transcriptional regulation of a pair-rule stripe Drosophila. RNA polymerase B promoters of higher eukaryotes. CRC
Genes Dev 5, 827±839. Crit Rev Biochem 23, 77±120.
Solovyev, V., Salamov, A., 1997. The gene-®nder computer Widom, J., 1989. Toward a uni®ed model of chromatin fold-
tools for analysis of human and model organisms genome ing. Annu. Rev. Biophys. Biophys. Chem 18, 365±395.
sequences. In: Proceedings, Fifth International Conference Widom, J., 1996. Short-range order in two eukaryotic gen-
on Intelligent Systems for Molecular Biology (ISMB-97), omes: relation to chromosome structure. J. Mol. Biol 259,
294±302. 579±588.
Stargell, L.A., Struhl, K., 1996. Mechanisms of transcriptional Wieczorek, E., Brand, M., Jacq, X., Tora, L., 1998. Function
activation in vivo: two steps forward. Trends Genet 12, of TAF(II)-containing complex without TBP in transcrip-
311±315. tion by RNA polymerase II. Nature 393, 187±191.
Stoesser, G., Moseley, M.A., Sleep, J., McGowran, M., Wingender, T.H.E., Hermjakob, I.R.H., Kel, A.E., Kel, O.V.,
Garcia-Pastor, M., Sterk, P., 1998. The EMBL nucleotide Ignatieva, E.V., Ananko, E.A., Podkolodnaya, O.A.,
sequence database. Nucleic Acids Res 26, 8±15. Kolpakov, F.A., Kolchanov, N.L.P.N.A., 1998. Databases
Tanese, N., Tjian, R., 1993. Coactivators and TAFs: a new on transcriptional regulation: TRANSFAC, TRRD, and
class of eukaryotic transcription factors that connect acti- COMPEL. Nucl. Acids Res 26, 364±370.
vators to the basal machinery. Cold Spring Harbor Symp. Wobbe, C.R., Struhl, K., 1990. Yeast and human TATA-
Quant. Biol 58, 179±185. binding proteins have nearly identical DNA sequence
Tansey, W.P., Herr, W., 1997. TAFs: guilt by association. requirements for transcription in vitro. Mol. Cell. Biol 10,
Cell 88, 729±732. 3859±3867.
Tateno, Y., Fukami-Kobayashi, K., Miyazaki, S., Sugawara, Wole, A.P., 1994. Nucleosome positioning and modi®cation:
H., Gojobori, T., 1998. DNA data bank of Japan at work chromatin structures that potentiate transcription. TIBS
on genome sequence data. Nucleic Acids Res 26, 16±20. 19, 240±244.
Tolstrup, N., RouzeÂ, P., Brunak, S., 1997. A branch point Wole, A.P., Drew, H.R., 1995. DNA structure: implications
consensus from Arabidopsis found by non-circular analysis for chromatin structure and function. In: Elgin, S.C.R
allows for better prediction of acceptor sites. Nucleic (Ed.), Chromatin Structure and Gene Expression. IRL
Acids Res 25, 3159±3163. Press, Oxford, pp. 27±48.
Tsukiyama, T., Wu, C., 1997. Chromatin remodeling and Wole, A.P., Khochbin, S., Dimitrov, S., 1997. What do lin-
transcription. Curr. Opin. Gen. Dev 7, 182±191. ker histones do in chromatin? BioEssays 19, 249±255.
Turner, B.M., 1998. Histone acetylation as an epigenetic Workman, J.L., Kingston, R.E., 1998. Alteration of nucleo-
determinant of long-term transcriptional competence. Cell. some structure as a mechanism of transcriptional regu-
Mol. Life Sci 54, 21±31. lation. Annu. Rev. Biochem 67, 545±579.
Uberbacher, E.C., Xu, Y., Mural, R.J., 1996. Discovering and Zawel, L., Reinberg, D.R., 1993. Initiation of transcription by
understanding genes in human DNA sequence using RNA polymerase II: a multi-step process. Prog. Nucl.
GRAIL. Meth. Enz 266, 259±281. Acids Res. Mol. Biol 44, 67±108.
Vazquez, J., Farkas, G., Gaszner, M., Udvardy, A., Muller, Zhang, M.Q., 1998a. A discrimination study of human core-
M., Hagstrom, K., Gyurkovics, H., Sipos, L., Gausz, J., promoters. In: Altman, R., Donker, A.K., Hunter, L.,
Galloni, M., Hogga, I., Karch, F., Schedl, P., 1993. Klein, T.E. (Eds.), Proc. Paci®c Symp. Biocomputing.
Genetic and molecular analysis of chromatin domains. World Scienti®c, Singapore, pp. 240±251.
Cold Spring Harbor Symp. Quant. Biol 58, 45±54. Zhang, M.Q., 1998b. Identi®cation of human gene core-pro-
Wasserman, W.W., Fickett, J.W., 1998. Identi®cation of regu- moters in silico. Genome Res 8, 319±326.
latory regions which confer muscle-speci®c gene ex- Zhu, Z., Thiele, D.J., 1996. A specialized nucleosome modu-
pression. J. Mol. Biol 24, 167±181. lates transcription factor access to a C. glabrata metal re-
Wasylyk, B., 1988. Transcription elements and factors of sponsive promoter. Cell 87, 459±470.