Annu. Rev. Biochem. 1990. 59:837-72
Copyright © 1990 by Annual Reviews Inc. All rights reserved
CONTENTS
INTRODUCTION
0066-4154/90/0701-0837$02.00
838 VUORIO & DE CROMBRUGGHE
on the different collagens has been reviewed over the last few years (1, 2).
The present article focuses mainly on the relationships between the molecular
and supramolecular organization of specific collagens and the structure of
their genes. In a second part, we also briefly review some of the regulatory
elements of these genes.
Our discussion includes only molecules that have been assigned to the
collagen family based on their structural and functional features. The different
collagens are referred to as collagen types and are designated by Roman
numerals I-XIII. In all these molecules a major component of the protein is a
triple-helical structure of three polypeptide chains (α-chains) with a
Annu. Rev. Biochem. 1990.59:837-872. Downloaded from www.annualreviews.org
α4(IV)  COL4A4
α5(IV)  COL4A5  Xq22
V  α1(V)  COL5A1  [α1(V)]2α2(V), 67 nm banded fibril
α2(V)  COL5A2  2q24.3-q31  α1(V)α2(V)α3(V),
by Boston University on 05/09/13. For personal use only.
have also been reported for several invertebrates including sea urchins, fruit
flies, and slime molds.
Based on their supramolecular structures, the collagens are divided into two
main classes: fibril-forming (or fibrillar) collagens and non-fibril-forming
collagens. The former group contains molecules with long continuous triple
helices, which are the constituents of banded collagen fibrils. The
non-fibril-forming collagens are more heterogeneous and have been further classified
according to their molecular characteristics, supramolecular structures, and
the types of extracellular networks that they form.
Based on their protein and gene structures, types I, II, III, V, and XI
collagens have been assigned to the fibril-forming group. By forming highly
organized fibers and fibrils, these collagens provide the structural support for
the body in skeleton, skin, blood vessels, nerves, intestines, and in the fibrous
capsules of organs.
[Figure 1 diagram: staggered collagen molecules A, B, C, and D. Labels: gap = 40 nm; D = 67 nm; collagen molecule 4.4 D = 300 nm.]
Figure 1 Schematic presentation of the quarter-staggered assembly of collagen fibrils.
Individual 300-nm collagen molecules overlap each other by a distance D of 67 nm or multiples of
67 nm. Each molecule is 4.4 D long, leaving a gap of 0.6 D (40 nm) between the ends of
non-overlapping molecules. The four arrows at the bottom of the figure mark the locations of the
cross-linking residues: one in the N-telopeptide, one in the C-telopeptide, and two others 0.4 D
from each end of the triple-helix. Two different intermolecular cross-links are shown between
collagen molecules A and B, and C and D.
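The numbers in the Figure 1 caption can be checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are invented here, and it assumes that the next molecule in the same row starts at the next whole multiple of D (5 D), which is one reading of the caption rather than something the text states explicitly.

```python
import math

# Values taken from the Figure 1 caption.
D_NM = 67.0               # stagger period D, in nm
MOLECULE_LENGTH_D = 4.4   # molecule length, in units of D

# Length of one molecule in nm (the caption rounds this to 300 nm).
molecule_length_nm = MOLECULE_LENGTH_D * D_NM

# Assumption: the following molecule in the same row begins at the next
# whole multiple of D, so the gap is what is left of that last period.
gap_d = math.ceil(MOLECULE_LENGTH_D) - MOLECULE_LENGTH_D
gap_nm = gap_d * D_NM

# Within each D period, the remainder is the overlap zone.
overlap_d = 1.0 - gap_d

print(f"molecule length: {molecule_length_nm:.1f} nm")
print(f"gap: {gap_d:.1f} D = {gap_nm:.1f} nm")
print(f"overlap zone per period: {overlap_d:.1f} D")
```

The computed gap of about 0.6 D (roughly 40 nm) agrees with the caption; the molecule length comes out slightly under the rounded 300 nm.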
COLLAGEN GENES 841
sine residues in strictly conserved positions (Figure 1; for review see 2 4).
When cross-linking is inhibited, e.g. by lathyrogens, the tensile strength of
collagen fibrils is drastically reduced.
[Figure 2 diagram not recoverable. Surviving panel labels: α1(IV) and α2(IV) chains; Fibril-associated collagens, α1(IX), α2(IX), and α1(XII); Microfibrillar (Type VI) collagen, α1(VI) and α2(VI); Type X and Type VIII collagens, α1(X) and α1(VIII); Type XIII collagen, α1(XIII).]
Figure 2 Domain structures of the different collagen types. Open rectangles represent triple-helical domains with bars representing interruptions in the
Gly-X-Y repeat structure. Dark rectangles denote globular domains, and dotted areas signal peptides. The arrows and gaps mark the sites of posttranslational
cleavage. The sizes (in amino acid residues) are shown for each domain: SP, signal peptide; N-P, N-propeptide; N-TP, N-telopeptide; TH, triple-helix; C-TP,
C-telopeptide; CP, C-propeptide. For fibril-forming collagens the data are largely derived from cDNA sequences for human type I (24-29), II (30, 31), III
(32-36), α2(V) (37-40), and α1(XI) collagen (41) chains. The data for the other collagens are based on cDNA clones for chick α1(IX) and α2(IX) chains
(42-46), for chick α1(XII) collagen (47, 48), for chick α1(X) (49, 50) and rabbit α1(VIII) collagen (45, 51), for human α1(IV) and α2(IV) collagens
(52-56), for human type VI collagen chains (57-60), and for human α1(XIII) collagen (61, 62).
nonfibrillar collagen genes, examples of exon sizes abound that are multiples
of 9 bp but different from 54, 45, and 99 bp.
The overall organization and succession of exons coding for the triple-helical
domain of fibril-forming collagen genes is shown in Table 2. The
available sequence data (63-83) indicate that all fibrillar collagen genes
display the exact same pattern of exon sizes shown in Table 2 with three
minor exceptions: (a) in the α1(I) collagen gene a single 108-bp exon (exon
33/34) replaces two 54-bp exons (exons 33 and 34) (68, 84), (b) in the α2(XI)
collagen gene the last 108-bp exon (exon 48) has been replaced by exons of 54
bp, 36 bp, and 54 bp, with only 18 bp of triple-helical domain in the
C-terminal joining exon (11), and (c) in the cDNA sequence of human and
mouse α1(III) collagen an additional Gly-X-Y triplet appears near the
N-terminal end of the triple-helix (34, 36, 83). Since the complete exon structure
of the type III collagen gene is not known, the exact location of this additional
triplet remains unknown. Although the amino acid residues in the X and Y
positions show considerable divergence within a collagen chain and between
different chains, the extreme degree of conservation for both the exon
structures and the pattern of exon sizes implies a common origin. The pattern
appears to have been established before the invertebrate-vertebrate radiation,
since a sea urchin collagen gene has a very similar structure (85). This high
degree of conservation also suggests that once the triple-helical molecular
structure and the highly organized supramolecular assembly of fibrils were
established, no changes in these structures were tolerated. The dramatic
Exon no.  Size (bp)    Exon no.  Size (bp)    Exon no.  Size (bp)
    7        45*          21       108           35        54
    8        54           22        54           36        54
    9        54           23        99           37       108
   10        54           24        54           38        54
   11        54           25        99           39        54
   12        54           26        54           40       162
   13        45           27        54           41       108
   14        54           28        54           42       108
   15        45           29        54           43        54
   16        54           30        45           44       108
   17        99           31        99           45        54
   18        45           32       108           46       108
   19        99           33        54*          47        54
   20        54           34        54*          48       108*
a In addition to the 42 exons coding for 332 Gly-X-Y triplets, the N- and C-terminal
joining exons code for 1-3 and 5-7 triplets, respectively. The three minor deviations from
the conserved pattern are marked with asterisks as discussed in the text.
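The footnote's figures (42 exons, 332 Gly-X-Y triplets) can be verified directly from the exon sizes listed in the table. The following sketch transcribes the table into a dictionary and checks the totals; the variable names are illustrative only.

```python
from collections import Counter

# Exon number -> size in bp, transcribed from the table above (exons 7-48).
EXON_SIZES_BP = {
    7: 45, 8: 54, 9: 54, 10: 54, 11: 54, 12: 54, 13: 45,
    14: 54, 15: 45, 16: 54, 17: 99, 18: 45, 19: 99, 20: 54,
    21: 108, 22: 54, 23: 99, 24: 54, 25: 99, 26: 54, 27: 54,
    28: 54, 29: 54, 30: 45, 31: 99, 32: 108, 33: 54, 34: 54,
    35: 54, 36: 54, 37: 108, 38: 54, 39: 54, 40: 162, 41: 108,
    42: 108, 43: 54, 44: 108, 45: 54, 46: 108, 47: 54, 48: 108,
}

# Every exon should hold a whole number of 9-bp Gly-X-Y units.
all_multiples_of_9 = all(size % 9 == 0 for size in EXON_SIZES_BP.values())

total_bp = sum(EXON_SIZES_BP.values())
total_triplets = total_bp // 9            # one Gly-X-Y triplet = 9 bp
size_counts = Counter(EXON_SIZES_BP.values())

print(f"exons: {len(EXON_SIZES_BP)}, total: {total_bp} bp, "
      f"triplets: {total_triplets}, all multiples of 9: {all_multiples_of_9}")
for size, count in sorted(size_counts.items()):
    print(f"{size:>4} bp: {count} exons")
```

The tally also shows that only five distinct sizes occur (45, 54, 99, 108, and 162 bp), with 54-bp exons by far the most common.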
consequences of both Gly substitutions and exon deletions that are found in
mutant collagens illustrate this point.
the slower process of triple-helix formation would render the molecule more
susceptible to degradation. What is important is that the existence within a
procollagen molecule of one mutant chain is sufficient to cause degradation of
the entire molecule. Hence, a much smaller proportion of collagen molecules
accumulates in the matrix, leading to a severe deficiency of molecules
available for fibrillogenesis. This phenomenon has been called "protein
suicide" (102, 103).
A second possible mechanism to explain the dominant character of these
mutations is that the presence of mutant molecules in collagen fibrils would
cause defective phenotypes (87, 95). Because fibrils are made of many
However, not all mutations are equivalent in terms of the severity of their
phenotype (87, 88, 106, 107). Generally, mutations in the α1(I) chain have a
more severe phenotype than mutations in the α2(I) chain. This could be the
consequence of the 2:1 ratio of these polypeptides in type I collagen. A
mutation in the α1 chain will affect more molecules (75%) than a mutation in
the α2 chain (50%) (87, 89). In addition, mutations located closer to the
carboxy-terminal end of the triple-helical domain generally have more severe
effects than mutations that are further removed from the carboxy-terminal
end. The best indication of this "gradient" phenomenon came from a
comparison of Gly → Cys substitutions occurring at three different locations along the
triple-helical domain of the proα1(I) chain. Indeed, the abnormal phenotype
becomes less severe when the Cys substitution moves toward the N-terminus
of the triple-helix of the proα1(I) chain (101). It is clear that the nature of the
substitution and the local environment where the substitution occurs also play
a role.
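The 75% and 50% figures quoted above for α1 and α2 mutations follow from simple probability. The sketch below reproduces them under assumptions that are not spelled out in the text: a heterozygous individual, equal expression of the normal and mutant alleles, and random incorporation of chains into trimers. The function name is hypothetical.

```python
def fraction_of_molecules_affected(copies_per_molecule: int,
                                   mutant_chain_fraction: float = 0.5) -> float:
    """Probability that a trimer contains at least one mutant chain.

    copies_per_molecule: how many chains of this kind each molecule holds
        (2 for alpha1(I), 1 for alpha2(I) in type I collagen).
    mutant_chain_fraction: share of this chain's pool that is mutant
        (0.5 assumes a heterozygote with equal allele expression).
    """
    all_normal = (1.0 - mutant_chain_fraction) ** copies_per_molecule
    return 1.0 - all_normal


alpha1_affected = fraction_of_molecules_affected(copies_per_molecule=2)
alpha2_affected = fraction_of_molecules_affected(copies_per_molecule=1)

print(f"alpha1(I) mutation: {alpha1_affected:.0%} of molecules affected")
print(f"alpha2(I) mutation: {alpha2_affected:.0%} of molecules affected")
```

If the mutant allele were expressed at a different level, the `mutant_chain_fraction` argument would change and the two percentages would shift accordingly.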
In summary, the two major properties of mutations in the triple-helical
domain of fibril-forming collagens are entirely consistent with the concept
that the rigidly conserved features of these molecules, which are determined
by the structural properties of their genes, are essential for the correct function
of these collagen molecules in their supramolecular complexes.
C-Propeptide Domain
The C-propeptides of fibrillar collagens share the highest degree of sequence
similarity both between different types and between species. The
243-247-amino-acid C-propeptide is removed extracellularly by a specific C-protease,
leaving a short telopeptide of 11-27 amino acids attached to the triple-helix
(Figure 2). The C-propeptide has a globular structure that is stabilized by two
intra-chain disulfide bonds formed by the four carboxy-terminal Cys residues.
Another three or four Cys residues that are located closer to the telopeptide
participate in inter-chain disulfide bonding. The formation of these intra- and
inter-chain disulfide bridges precedes the assembly of the triple-helix and
allows the correct alignment of three α-chains that associate inside the cells
into a triple-helical molecule starting from the C-terminal end of the
molecule. Mutations altering the structure of the C-propeptide have confirmed that
this domain plays an important role in chain association (108-110). The
mechanism selecting the correct α-chains for each molecule within cells
synthesizing more than one collagen type at the same time is not understood.
It has been suggested that the number of Cys residues available for inter-chain
disulfide bonding determines whether the α-chains form homo- or
heterotrimers (39). The procollagen chains that form heterotrimers have only three Cys
residues for inter-chain disulfide bonding [α2(I), α2(V), and α1(XI)],
whereas those that are capable of forming homotrimers [α1(I), α1(II), and α1(III)]
have four.
The locations of the Cys residues in the C-propeptides are strictly
conserved. Their neighboring sequences are also conserved, as is the
sequence around the carbohydrate attachment site. The C-propeptide domain is
specified by four exons (exons 49 to 52). Since the last two exons are identical
in size, the small variations in the length of the propeptide result from small
changes in size for exons 49 and 50. The highest degree of sequence identity
is seen in exon 51 around the carbohydrate attachment domain; interestingly,
this homology is also evident for the nucleotide sequence (111). The joining
exon (exon 49) codes for the end of the triple-helix, the telopeptide, and the
beginning of the C-propeptide. The length of this triple-helical sequence
N-Propeptide Domain
The N-propeptides of fibrillar collagens exhibit a much higher degree of
divergence both in length and in domain structure than the rest of the
polypeptides. The exon organizations of the gene segments coding for this
domain also show much more divergence (64-67, 82, 115-119; Figure 3).
Size variation is also seen in the same chain from different species. The
amino-terminal propeptide consists of the following elements: a signal
peptide, a Cys-rich globular domain, a short triple-helical region, and a short
globular domain ending in the N-telopeptide.
A 65-71-amino-acid Cys-rich domain is coded for by exon 2 in the α1(I)
and α1(III) genes but is not present in the α2(I) chain. In the α2(I) gene two
vestigial exons of 11 and 15 bp code for a short globular domain. The function
of the Cys-rich domain remains unknown. A homologous domain has also
been detected in thrombospondin (120) and von Willebrand factor (121). A
sequence for a similar Cys-rich domain is found in the α1(II) collagen gene
but is only present in a fraction of the mRNAs (31, 117, 118). It is, therefore,
likely that this sequence undergoes alternative splicing in the α1(II) collagen
pre-mRNA.
The triple-helical subdomain of the N-propeptide varies in length from 39
amino acids in the α1(III) chain to 79 amino acids in the α1(II) and α2(V)
chains. This triple-helical sequence contains one interruption in the α1(II) and
two interruptions in the α2(V) chains (31, 40, 82, 116). The exons coding
[Figure 3 diagram not recoverable. Surviving column labels: signal peptide, Cys-rich domain, triple-helical domain, globular domain, telopeptide, main triple helix. Surviving rows: an unlabeled chain with exons 1-6; α2(I) with exons 1-6; α1(II) with exons 1, 2, 3, 4A, 4B, 5A, 5B, 6; an unlabeled chain with exons 1, 2, 3, 4/5, 6.]
for this domain show considerable divergence in size (see Figure 3). In the
α2(I) and α1(III) genes this triple-helix is encoded by two exons, whereas the
α1(I) chain contains an additional triple-helical 36-bp exon, and the α1(II)
correspondence in the strictly conserved properties of the exons coding for the
triple-helical domain of these proteins. The very high degree of conservation
of the pattern of triple-helical exon sizes is in contrast to the diversity of exon
sizes that code for the N-propeptide domain. Different parts of the gene
underwent, therefore, different types of selective pressure. Changes were
tolerated in the N-propeptide domain, but not in the triple-helical domain.
Grouped under the non-fibril-forming heading are all collagens that fall
outside the fibril-forming collagens. This group is very heterogeneous both
structurally and functionally. Several of these collagens constitute the
components of different extracellular matrix networks (types IV, VI, VII, VIII,
and possibly type X) or interact directly with the fibril-forming collagens
(type IX and possibly type XII).
The exon structures and organizations of the genes for the
non-fibril-forming collagens diverge from those of the fibril-forming collagen genes,
although the basic 9-bp unit coding for Gly-X-Y is clearly maintained.
Complete exon structures are known for two nonfibrillar collagen genes: those
for α2(IX) and α1(IV) collagens. The structures of these genes show different
degrees of divergence from the fibrillar gene model, less for the gene for type
IX, considerably more for the gene for type IV collagen. The divergence in
exon sizes and organization in these genes is very likely related to the
difference in structure and function between nonfibrillar and fibrillar
collagens.
Fibril-Associated Collagens
This subgroup, which has been named FACIT for Fibril-Associated Collagens
with Interrupted Triple-helices, contains the collagens IX and XII (45). Type
IX has been shown to be associated with type II collagen, whereas type XII,
which shares many structural features with type IX, is thought to be
associated with type I collagen.
with non-triple-helical domains designated NC3 and NC2 (Figure 2). The
amino- and carboxy-terminal noncollagenous domains (NC4 and NC1) share
no homology with the propeptides of the fibrillar collagens and do not appear
to be proteolytically processed. The NC3 domain of the α2(IX) chain is five
amino acids longer than in the α1(IX) chain and contains an attachment site
for a glycosaminoglycan side chain (44). This may account for the sharp kink
observed by electron microscopy in the NC3 domain (125).
Both COL1 and COL3 domains of α1(IX) and α2(IX) chains contain one
discontinuity in the Gly-X-Y sequence, which can be accounted for by a
deletion of the X or Y amino acid. The other discontinuity, seen in the COL1
domain of both chains, could similarly be explained by deletion of one Gly,
since the sequence can be written as Gly-X-Y-X-Y. No interruptions of the
Gly-X-Y repeat sequence are seen in the COL2 domain (45).
[Table body; column headings lost in extraction]
α2(I)^a      0 0 5 23 0 0 0 0 5 8 0 0 0 0 0 0 0
α2(IX)-1^b   0 3 1 10 0 0 0 0 0 0 0 0 0 0 0 0 0
      -2     1 0 1 4 0 1 0 0 0 0 0 0 0 0 0 0
α1(IV)-1     2 2 3 2 2 1 1 3 3 0 0 0 0 0 0 0 0 0
      -2     0 0 2 0 2 0 0 0 2
α1(VI)^c     1 5 1 .
α1(XIII)^c   2 2
a Representing fibril-forming collagen genes
b Line 1, exons that are exact multiples of 9; line 2, exons for which the size was adjusted owing to short interruptions and/or split codons
c Represents partial sequence data
variability in the exon sizes: (a) short interruptions of the triple-helix that
correspond to single amino acid deletions, and (b) split codons at some exon
junctions. If allowance is made for deletions of 3 bp (one amino acid), the
sizes of the three exons with interruptions become multiples of 9. In one
single case the splice occurs between the X and Y codons, resulting in
triple-helical exons of 33 and 147 bp (45).
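The size adjustment described above is simple modular arithmetic. In the sketch below, the 51-bp exon is a hypothetical example (the text does not list the actual sizes of the three interrupted exons); only the 33-bp and 147-bp values come from the text. The helper names are invented for illustration.

```python
CODON_BP = 3
TRIPLET_BP = 9   # one Gly-X-Y unit


def adjusted_size(exon_bp: int, deleted_residues: int) -> int:
    """Exon size after adding back 3 bp for each deleted amino acid."""
    return exon_bp + CODON_BP * deleted_residues


def is_multiple_of_9(size_bp: int) -> bool:
    return size_bp % TRIPLET_BP == 0


# Hypothetical exon with one single-residue interruption.
hypothetical_exon_bp = 51
print(is_multiple_of_9(hypothetical_exon_bp))                      # raw size
print(is_multiple_of_9(adjusted_size(hypothetical_exon_bp, 1)))    # adjusted

# The two exons produced by a splice between the X and Y codons (from the text).
split_pair_bp = (33, 147)
remainders = tuple(size % TRIPLET_BP for size in split_pair_bp)
combined_bp = sum(split_pair_bp)
print(remainders, combined_bp, is_multiple_of_9(combined_bp))
```

The remainders (6 bp and 3 bp) are what one would expect if the 33-bp exon ends with a Gly codon plus an X codon and the 147-bp exon begins with the matching Y codon; that ordering is an inference from the sizes, not something the text states. Together the two exons restore a whole number of triplets (180 bp).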
The other deviation from the fibrillar collagen gene exon model is the
occurrence of split codons at the 5'- and 3'-ends of several exons. The split
codons do not, however, occur randomly within the 9 bp which encode
Gly-X-Y, but, remarkably, always involve the first nucleotide of the Gly
codon (G). As in the fibrillar collagen genes, the structure of exons that
contain exact multiples of 9 bp is:
...agGGN-NNN-NNN-(GGN-NNN-NNN)n GGN-NNN-NNNgt...
(the small letters represent conserved intron sequences). Some exons contain
one additional G residue at their 3' -end, corresponding to the first base of the
following Gly codon:
...cagGGN-NNN-NNN-(GGN-NNN-NNN)n GGN-NNN-NNNGgt...
In order to maintain the Gly-X-Y pattern they are followed by exons lacking
one G residue at their 5' end:
...cagGN-NNN-NNN-(GGN-NNN-NNN)n GGN-NNN-NNNgt...
Some exons contain split codons at each end. Thus, exons that contain split
codons deviate from a multiple of 9 bp by only 1 bp and this 1 bp always
involves either the deletion of a single G residue at the 5' end of the exon or
the addition of a single G residue at the 3' end of the exon or both. In the latter
case, the exon size is also a multiple of 9 bp. In order to maintain the Gly-X-Y
pattern, the exons with split codons must occur in pairs or clusters. In the
COL2 domain of the α2(IX) gene, exons with split codons occur in pairs and
in the COL1 domain in clusters. The remarkable nonrandomness in the
location of the split codons, together with the existence of many exons with
lengths that are exact multiples of 9 bp and that start with a complete Gly
codon and end with a complete Y codon, suggests that in this gene the split
codons arose as secondary events in exons that initially contained a complete
Gly codon at their 5' end and a complete Y codon at their 3' ends and had
lengths that were exact multiples of 9 bp.
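The pairing rule for split codons can be made concrete with a small model. The sketch below is an illustration only: the `Exon` class, its field names, and the six-triplet exon sizes are hypothetical, chosen to show why an extra 3' G must always be matched by a missing 5' G in the next exon.

```python
from dataclasses import dataclass


@dataclass
class Exon:
    """Triple-helical exon described by the split-codon rule in the text."""
    triplets: int              # Gly-X-Y units the exon would hold if unsplit
    lacks_5p_g: bool = False   # first G of its leading Gly codon is upstream
    extra_3p_g: bool = False   # carries the first G of the next Gly codon

    @property
    def size_bp(self) -> int:
        return 9 * self.triplets - int(self.lacks_5p_g) + int(self.extra_3p_g)


def junctions_consistent(exons: list[Exon]) -> bool:
    """Gly-X-Y survives only if each extra 3' G meets a missing 5' G."""
    return all(up.extra_3p_g == down.lacks_5p_g
               for up, down in zip(exons, exons[1:]))


# A pair: 55 bp followed by 53 bp.
pair = [Exon(6, extra_3p_g=True), Exon(6, lacks_5p_g=True)]
# A cluster: the middle exon is split at both ends, so it is again 54 bp.
cluster = [Exon(6, extra_3p_g=True),
           Exon(6, lacks_5p_g=True, extra_3p_g=True),
           Exon(6, lacks_5p_g=True)]
# An unpaired split codon breaks the repeat.
unpaired = [Exon(6, extra_3p_g=True), Exon(6)]

for name, exons in [("pair", pair), ("cluster", cluster), ("unpaired", unpaired)]:
    sizes = [e.size_bp for e in exons]
    print(name, sizes, junctions_consistent(exons))
```

Note that in both the pair and the cluster the combined length is still a multiple of 9 bp, which is why the deviation is invisible at the level of the whole domain.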
The considerable number of 54-bp exons in the type IX collagen suggests
that like the ancestral gene for the fibrillar collagens, the type IX collagen
terminal end. By cDNA and genomic cloning, two different 5'-ends have
been discovered in the mRNA (cDNA) for the α1(IX) chain; the mRNA in
cornea is approximately 700 nucleotides shorter, lacking most of the
sequences coding for the NC4 domain (124). The two transcripts arise from the
same gene by using alternate promoters and transcription start sites; the
synthesis of the shorter mRNA starts from an alternate exon 1 in the sixth
intron of the gene (45, 129a). From the seventh exon onward, the mRNAs are
identical. This represents the first case within the collagen gene family where
alternate transcription start sites are used to create two forms of the protein. A
somewhat similar situation is found in the α2(I) collagen gene, which uses a
different start site for transcription in cartilage compared to other tissues in
which the gene is expressed. However, the α2(I) polypeptide chain is not
made in this tissue, probably as a result of a translational block (129b).
TYPE XII COLLAGEN This collagen type was initially discovered as a cDNA
clone homologous to type IX collagen cDNAs (47). The corresponding
protein has been purified and shown to be a homotrimer with a molecular
weight of approximately 220,000 (130). The structural similarities between
type IX and type XII collagens suggest an analogous function for type XII
collagen: lateral association with type I collagen on the surfaces of fibrils (45,
48). Although immunolocalization and RNAse protection studies have shown
coexpression of type I and type XII collagen in several tissues, no direct
evidence is available for the presence of type XII collagen on fibril surfaces
(131).
The homologies between α1(XII), α1(IX), and α2(IX) collagen sequences
are seen in the carboxy-terminal NC1, COL1, and NC2 domains, and
between the large amino-terminal NC3 domain of α1(XII) and the NC4 domain
of α1(IX) (48; Figure 2). While the conservation of exon sizes, location of
Cys residues, and the sequence homologies suggest a common ancestor for
domain of type X contains 460 amino acids with eight imperfections in the
Gly-X-Y repeat structure (49, 50). The nature of these imperfections
(deletions of single residues, either Gly or X or Y) is similar to those in type IX
and XII collagens. The structure of the rabbit α1(VIII) collagen chain is very
similar to that of α1(X), with similar imperfections in the same locations
within the triple-helix (51; Figure 2). Despite complete divergence of the NC3
domains, the overall sequence similarity between the α1(X) and α1(VIII)
chains is about 60%. Type VIII collagen is expressed in the Descemet's
membrane of the eye, by a number of vascular endothelial cells in tissue
culture, and by some tumor-derived cells (133).
The gene structure of type X collagen (and probably type VIII collagen) is
completely different from the multiexon structure of the other collagen genes.
The entire triple-helical domain is encoded by one single exon. In chick, the
entire gene contains only three exons and spans about 5 kb (45, 50, 134). One
possible hypothesis to account for this exon structure, which differs from the
organization of all other collagen genes, is that the single exon that
encodes the triple-helix arose by a homologous recombinational
event between a double-stranded cDNA and a previously existing gene that
contained introns.
Type IV Collagen
Expression of type IV collagen is restricted to basement membranes, where it
is the major component. The most common form of type IV collagen consists
of two α1 chains and one α2 chain, but other forms also exist (135). There is
chemical evidence for at least two other polypeptide chains, which have been
named α3(IV) and α4(IV) (136, 137), and recently a related α5(IV) gene was
identified that maps to the human X chromosome (7).
Type IV collagen molecules consist of three distinct domains: a central
triple-helix, a large C-terminal globular domain (NC1), and an N-terminal
globular domain (NC2) (Figure 2). The α-chains of type IV collagen are not
proteolytically processed as are those of the fibrillar collagens.
Type IV collagen molecules are assembled into a flexible
three-dimensional network. Based on electron microscopic and biochemical
studies, a model has been proposed whereby the 400-nm-long type IV collagen
molecules interact at both ends with other type IV collagen molecules (135).
At their amino-terminus, four molecules interact, two in parallel and two in
antiparallel orientation, through their 7S-domains. These domains consist of a
Cys-rich amino-terminal noncollagenous segment (NC2), a 30-nm
uninterrupted triple-helix, and a short non-triple-helical sequence. Both disulfide
less. The genes for type IV collagen show the same type of deviations from
the fibrillar pattern as the type IX genes, i.e. interruptions of the Gly-X-Y
repeat, exons with split codons, and some exons that, although their lengths
are multiples of 9 bp, represent sizes not found in fibrillar collagen genes
(Table 3). However, in the type IV collagen genes, these differences occur
much more frequently throughout the triple-helical domain. It is interesting to
note that in the gene segment specifying the triple-helical part of the
N-terminal 7S domain of α1(IV) collagen, which is the part of the molecule
undergoing lateral associations with other type IV collagen chains, the exons
contain no interruptions or split codons and adhere strictly to the 9-bp rule
(147, 149). The sizes of these exons (90 bp, 45 bp, 45 bp, and 63 bp) are also
somewhat reminiscent of the structure of fibrillar collagen exons. In contrast,
nearly all the exons in the C-terminal two thirds of the α-chain contain split
codons (147).
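The three kinds of exon size discussed here (sizes shared with fibrillar genes, other multiples of 9 bp, and sizes that break the 9-bp rule) can be told apart mechanically. The sketch below applies that classification to the four 7S-domain exon sizes quoted above; the set of fibrillar sizes is taken from Table 2, and the function name and category labels are illustrative.

```python
# Exon sizes that occur in the fibrillar pattern of Table 2.
FIBRILLAR_SIZES_BP = {45, 54, 99, 108, 162}


def classify_exon_size(size_bp: int) -> str:
    """Place an exon size into one of the three categories used in the text."""
    if size_bp % 9 != 0:
        return "not a multiple of 9 bp"
    if size_bp in FIBRILLAR_SIZES_BP:
        return "multiple of 9 bp, size found in fibrillar genes"
    return "multiple of 9 bp, size not found in fibrillar genes"


# Exons of the triple-helical part of the 7S domain of alpha1(IV), from the text.
seven_s_exons_bp = [90, 45, 45, 63]
classification = {size: classify_exon_size(size) for size in seven_s_exons_bp}

for size, label in classification.items():
    print(f"{size:>3} bp: {label}")
```

All four exons obey the 9-bp rule, but only the 45-bp size is shared with the fibrillar pattern, which fits the cautious wording "somewhat reminiscent".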
Characterization of the gene structure of the α2(IV) chain, which shares a
considerable degree of amino acid sequence homology with the α1(IV) chain,
has revealed a striking difference in the exon organization of the C-terminal
portion of these genes (150, 151). Currently there is no good explanation for
this conservation of amino acid sequence but divergence of the exon structure.
The gene for Drosophila α1(IV) collagen, which shows similarities in amino
acid sequence, has an exon structure that is quite different from the vertebrate
type IV collagen genes, as is described later (152).
Type VI Collagen
Protein chemical and electron microscopic characterization of this short-chain
collagen has been reviewed recently (153). Type VI collagen is a heterotrimer
of α1, α2, and α3 chains with a short (approximately 100 nm) triple-helix.
Dimerization occurs by antiparallel association of two monomers. Such
dimers associate to form tetramers by lateral aggregation. These tetramers, which
are stabilized by disulfide bonds into a structure with large globular ends and a
short helical segment, associate, in turn, end-to-end and laterally to form a
class of microfibrils with a 100-nm periodicity. These microfibrils have a
ubiquitous distribution in connective tissues, but are not directly associated
with banded collagen fibrils (1, 153).
The complete primary structures for the α1 and α2 chains, and for the
central part of the α3 chain, have been derived from amino acid and cDNA
sequencing (57-60, 154; Figure 2). The triple-helical domains of these
chains, which constitute less than one-third of their total mass (Figure 2), are
335-336 amino acids long (58). They contain two short interruptions with a
spacing which, according to one model, would allow coiling of two
antiparallel triple helices in dimeric molecules as indicated by electron
microscopy (155). Each helical domain contains one Cys residue, which probably par-
adherence to the 9-bp rule in the triple-helical domain, with exon sizes of
27-63 bp without split codons (B. Trüeb, personal communication; Table 3).
The structure of a novel short-chain collagen, which has been entirely
determined from cDNA clones (61, 62), shows three triple-helical and four
noncollagenous domains (Figure 2). The corresponding protein has not yet
been isolated, and both the chain composition and the supramolecular
structure of type XIII collagen remain unknown. Polyclonal antibodies to a
synthetic peptide for the carboxy-terminal (NC4) domain recognize
polypeptides of 67 kd and 62 kd in Western blots (61). The tissue distribution
for the mRNA determined by Northern and in situ hybridizations shows the
highest levels in skin (epidermis and hair follicles), intestine (mucosal layer),
bone (intertrabecular mesenchyme), striated muscle (endomysium), and
cartilage (156).
The most striking feature of type XIII collagen is the existence of at least
five alternatively spliced RNAs, which affect both collagenous (COL1 and
COL3) and noncollagenous (NC2 and NC4) domains (62). Each collagenous
domain contains one or two short interruptions of the Gly-X-Y repeat.
The gene for the α1(XIII) collagen is large since genomic clones spanning
25 kb cover only eleven 3'-terminal exons and one quarter of the coding
sequences (157). These exons contain no split codons and follow the 9-bp
rule. Exact adherence to these rules must be essential for maintaining the
correct reading frame and the Gly-X-Y sequence during the alternative
splicing events. Two of the exons are 54 bp, one 45 bp long (Table 3). In two cases
the differences in cDNA sequence result from alternative splicing of one
36-bp exon, and of the carboxy-terminal joining exon, respectively. The
central triple-helical domain (COL2) of the α1(XIII) collagen chains is not
affected by alternative splicing, but variation in the lengths of the other two
collagenous domains presents a challenging puzzle: how are such trimeric
molecules assembled, or is there a mechanism selecting α-chains of the same
length for each trimer?
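Why adherence to the 9-bp rule matters for alternative splicing can be shown with a toy model. In the sketch below, every sequence is hypothetical: a single made-up Gly-X-Y unit is repeated to build exons of 54, 36, and 45 bp (sizes mentioned in the text), and a second exon set carries a split Gly codon. Skipping an exon is then compared in the two cases.

```python
UNIT = "GGTCCTCCT"   # one hypothetical Gly-X-Y unit (GGN-NNN-NNN)


def keeps_gly_xy(seq: str) -> bool:
    """True if seq is a whole number of 9-bp units, each starting with GG."""
    if len(seq) % 9 != 0:
        return False
    return all(seq[i:i + 2] == "GG" for i in range(0, len(seq), 9))


# Exons that follow the 9-bp rule and have no split codons.
exon_54 = UNIT * 6
exon_36 = UNIT * 4
exon_45 = UNIT * 5

full_transcript = exon_54 + exon_36 + exon_45
skipped_transcript = exon_54 + exon_45          # 36-bp exon spliced out

# Same region, but with a split Gly codon across the first junction.
split_upstream = UNIT * 6 + "G"                 # 55 bp, extra 3' G
split_downstream = (UNIT * 4)[1:]               # 35 bp, lacks its 5' G

split_full = split_upstream + split_downstream + exon_45
split_skipped = split_upstream + exon_45        # downstream partner removed

print("rule-abiding, full:   ", keeps_gly_xy(full_transcript))
print("rule-abiding, skipped:", keeps_gly_xy(skipped_transcript))
print("split codon, full:    ", keeps_gly_xy(split_full))
print("split codon, skipped: ", keeps_gly_xy(split_skipped))
```

With rule-abiding exons the repeat survives exon skipping, whereas removing one member of a split-codon pair shifts the frame, which is consistent with the argument made in the text.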
short introns, and each codes for a protein with two triple-helical domains,
one of 27-33 amino acids, the other of 127-132 amino acids. The latter
triple-helical domain contains one to three short interruptions of the Gly-X-Y
pattern (162). The three noncollagenous domains contain several Cys
residues. Based on the amino acid sequence similarities, the genes have been
further divided into three groups. In only two genes, designated COL8 and
dpy-13, the intron is located within the triple-helical domain. In COL8 the
intron is found between the X and Y codons of a Gly-X-Y triplet (161),
whereas in dpy-13 the intron splits a Gly codon (160). Mutations in two
different collagen genes (dpy-13 and sqt-1) have been shown to affect the body
shape of C. elegans (160, 161).
The protein coded by the Drosophila α1(IV) collagen gene shares
considerable homology with the corresponding vertebrate α-chain in both domain
structure and sequence (152). For instance, the locations of 11 out of 21
discontinuities of the Gly-X-Y repeat are conserved. In contrast, the gene
structures are much more divergent. The gene for Drosophila α1(IV) collagen
contains eight exons and seven relatively short introns (67-484 bp), whereas
the human gene contains 52 exons and 51 introns (147, 152, 164). Only three
of the intron locations coincide. Four of the seven introns in the Drosophila
gene are within the triple-helical domain, and their location does not appear to
follow the 9-bp exon pattern observed in vertebrate genes. The
small number and size of introns in C. elegans and Drosophila genes may
simply reflect the smaller sizes of their genomes (approximately 3% of the
human genome) and the much lower number of introns throughout their
genomes.
Characterization of cDNA and genomic clones for a procollagen of the sea
urchin Paracentrotus lividus revealed features that are typical of fibrillar
collagen genes of vertebrates (85). A partial cDNA sequence codes for 478
amino acids of uninterrupted Gly-X-Y repeats followed by a globular domain
of 252 amino acids. The globular domain contains seven Cys residues, a
carbohydrate attachment site, and a putative C-proteinase cleavage site. In
addition, the cDNA shows conserved sequences in the putative cross-linking
sites in the telopeptide and the triple-helical domain. The gene structure of
this collagen chain shows a remarkable similarity with that of the vertebrate
fibril-forming collagen genes. The 14 exons coding for the triple-helical
domain that were characterized have sizes of 54 bp, 108 bp, and 162 bp and
always follow the Gly-X-Y pattern. Even the pattern of exon sizes follows
that of the fibril-forming collagen genes. It is clear that the vertebrate
fibril-forming collagen genes and this echinoid gene are closely related and derive
(CAT). In practically all the ensuing transgenic mouse strains, the expression of
the transgene paralleled the expression of the endogenous
α2(I) collagen gene, i.e. high expression in tail, a tissue that is very rich in
type I collagen, somewhat less expression in bone and skin, and very little or
no expression in many other tissues (166; B. de Crombrugghe, G. Karsenty,
L. A. Garrett, unpublished results). Hence, the 2000-bp segment 5' to the
transcription start site was sufficient to confer tissue specificity. A similar
experiment was performed with an α1(II) collagen promoter-CAT chimeric
gene, which showed selective expression in tissues in which the endogenous
the collagen DNA sequences present in the transgenes contain the necessary
cis-acting elements, which, by interacting with defined cellular factors,
determine the expression of these transgenes in specific cells and tissues of
intact animals. Some of the trans-acting factors that are responsible for the
activation or repression of these genes have begun to be identified. We briefly
review here some of the regulatory elements that have been identified in a
limited number of collagen genes.
methylation pattern of the promoter (172). One possible explanation for the
absence of transcription of the α1(I) collagen gene is that the inserted viral
sequence would disrupt an enhancer located in the first intron. Interestingly,
recent organ culture experiments have indicated that this block in transcription
was overcome in odontoblasts (173).
A DNA segment with some of the properties of enhancing elements was
identified in a 782-bp fragment of the first intron of the human α1(I) gene
between +820 and +1602 using a heterologous frog oocyte microinjection
assay system (174). For these experiments the α1(I) collagen intron DNA
segment was inserted in an α1(I) collagen promoter-α-globin chimeric gene,
either in the first intron of the α-globin gene or 5' to the α1(I) collagen
promoter sequence. After microinjection of the DNA in frog oocytes, the
levels of a-globin RNA were measured and found to be increased over those
seen with control chimeric genes lacking the intron sequences. Activation was
observed when intron sequences were placed in either orientation within the
α-globin intron but only in the inverted orientation when the intron segment
was placed 5' to the promoter. Similar results were obtained, again by
measuring the levels of globin RNA, after DNA transfection of NIH-3T3
fibroblasts using the construction in which the α1(I) collagen gene intron
segment was inserted in the first intron of the α-globin gene.
Upon further dissection of the first intron of the human α1(I) collagen gene,
both positive and negative cis-acting elements were identified in this gene
segment. Indeed, when α1(I) collagen chimeric genes were transfected into
chick embryo tendon fibroblasts, a negative element was located between 820
and 1093 (175). This element inhibited either the SV40 promoter or the α1(I)
collagen promoter when placed 5' of these promoters in either orientation.
When placed 3' of the α1(I) collagen promoter, the element was neutral in the
normal orientation but had a strong negative effect in the opposite orientation
(176). This strong negative effect when the element was located in the
opposite orientation 3' to the promoter appeared to require a sequence
between -625 and -161 in the α1(I) collagen promoter (177). Deletion of this
promoter sequence abolished the strong negative effect, but was apparently
without effect when the intronic element was placed in the correct orientation.
It is unclear why the negative effect of the 820-1093 intron segment was only
observed in the opposite orientation when placed 3' to the promoter.
Additional evidence has also been presented for the existence of positive
elements in the first intron of the α1(I) gene on each side of the "negative"
element. One element (+292 to +671) stimulated the α1(I) promoter when
placed 3' of this promoter in the correct orientation but was inactive in the
opposite orientation (177). However, another segment, which partially
overlaps with the previous segment, stimulated the promoter in both orientations
(177). It should be noted that the latter activating elements in the α1(I) intron
produced only a modest increase in the activity of the promoter, much less
than what is normally observed with viral enhancers such as the SV40
enhancer.
At this stage, our understanding of the regulatory elements in the first
introns of the α1(I) and α2(I) collagen genes remains very incomplete. The
choice of appropriate tissue culture cells used in DNA transfection is probably
of considerable importance and the overall interpretation of some experiments
using heterologous expression systems may be difficult. Furthermore, studies
with deletions of large DNA segments, although useful in an initial dissection
of potential regulatory elements, are often insufficient and need to be com
One factor that was shown to bind to a CCAAT motif in the mouse α2(I) gene
(-84 to -80) and in the mouse α1(I) gene (-96 to -100) is an interesting
heterodimer, which belongs to a group of CCAAT-binding proteins (184,
185). Both subunits of this protein have been purified to homogeneity; one
has an Mr of 40,000-42,000, the other of 34,000 (S. Maity, T. Vuorio, and B.
de Crombrugghe, unpublished results). Both components are needed for DNA
binding.
Furthermore, a highly purified preparation of this factor stimulated accurate
initiation of transcription of both the α2(I) and α1(I) collagen genes in a
reconstituted in vitro transcription system (185). The specificity of this
transcriptional stimulation was shown by the fact that an α2(I) collagen DNA
template containing a mutation in the binding site at -84, which essentially
abolished the binding of the factor, failed to be transcriptionally stimulated. In
DNA transfection experiments, the same mutation also strongly inhibited the
activity of both the α2(I) and α1(I) promoters (182; G. Karsenty, B. de
Crombrugghe, unpublished results). The similarity in the effects of mutations
in three different assays, i.e. DNA binding, in vitro transcription, and
expression in fibroblasts after DNA transfection, argues for a probable physiological
role of this factor in the control of the α2(I) and α1(I) collagen genes. In
chromatin, the segment around the binding site for the CCAAT-binding
protein in the α2(I) gene was also highly sensitive to DNase I and to
restriction enzymes in cells expressing the α2(I) gene (186, 187). A similar
DNase I hypersensitive site was found in the α1(I) promoter at approximately
the same location (188, 189).
Recent experiments have indicated that two additional factors bind to
sequences immediately 5' to the binding site for the CCAAT-binding protein
in both promoters (G. Karsenty, B. de Crombrugghe, unpublished results).
Based on the effects that mutations in the binding sites for these factors
displayed in DNA transfection experiments, the factors appeared to be
negative regulatory factors. These two negative factors and the positive
CCAAT-binding factor presumably participate in the coordinate control of the two type
I collagen genes. Hence, at least three different factors appear to interact at
approximately the same location of the promoters in these two coordinately
controlled genes. A number of additional binding sites for nuclear proteins
were shown to be present further upstream in these two genes (182, 183; R.
Ravazzolo, G. Karsenty, B. de Crombrugghe, unpublished results). A
comprehensive understanding of the control of these genes will require a
systematic study of these various factors and their binding sites. More work is also
needed to understand the mechanisms that determine the
tissue-specific expression of these genes.
One plausible hypothesis to account for the multiplicity of transcription
factors that bind to the promoter and enhancer segments of these genes is that
fluence the expression of these genes (190-194). One example of such
relationships between the cytokine TGF-β and a factor that binds to the -300
sequence in the mouse α2(I) collagen gene is discussed next. Indeed, a factor
identified as CTF/NF-1 was shown to bind around -300 in the mouse α2(I)
collagen gene (195). Although the binding site contained a CCAA motif, the
factor that was binding to the -300 site in the mouse α2(I) promoter did not
bind to the -80 CCAAT motif. The -300 site mediates the activation of an α2(I)
collagen promoter-CAT chimeric gene, which is produced by TGF-β
treatment of NIH-3T3 fibroblasts and rat osteosarcoma cells transfected with this
chimeric gene (196). A 3-bp substitution mutation in the binding site, which
abolished the binding of NF-1, also prevented the induction of the promoter
about 20-fold (201). The segment, which includes approximately 550 bp,
activates the α1(II) promoter in chondrocytes but not in fibroblasts and
myoblasts.
CONCLUSIONS
complete than information about the regulatory DNA elements that control the
expression of these genes.
Some general conclusions emerge from a comparison of the structural
properties of different collagen genes.
exons contain a complete codon for Gly at their 5' end and a complete codon
for Y at their 3' end. No exons with split codons are found. The pattern of
succession of triple-helical exon sizes shows practically no variations between
changes in structure that must have accompanied the occurrence
of split codons during evolution, also tolerated much more frequent variations in exon
sizes than other collagens.
8. The structures of many collagen genes still display the marks of the
exon assembly patterns that led to the successful acquisition of unique
molecular and supramolecular structures for the products of their genes. For
the collagens, in which tight triple-helices assemble in highly organized
supramolecular fibrils, the corresponding genes display the clearest marks of
these assembly pathways, as they show little variation in exon structure. In
contrast, for the collagens that form different supramolecular structures and
use less lateral aggregation between collagen molecules, the corresponding
genes presumably underwent many more changes in exon structure, erasing at
the same time to various degrees the traces of these evolutionary assembly
pathways.
9. Our understanding of the regulatory elements of the various collagen
genes is still at an early stage. Although only a small number of cis-acting
elements and their cognate trans-acting factors have been identified and
characterized, it is clear that a multiplicity of control elements exists in the
genes that have been examined. It is likely that this multiplicity corresponds
to a diversity of regulatory mechanisms which can influence the expression of
these genes. The important tasks in this area now are to identify the cellular
mechanisms that determine the tissue-specific expression of the different
collagen genes, the mechanisms whereby these genes respond to various
cytokines and hormones that influence their expression, and the mechanisms
that ensure the coordinate expression of two or more genes that specify the
different polypeptide chains of a single collagen molecule. In addition, a
systematic characterization of the different trans-acting factors that participate
in the control of individual genes will be needed to achieve a comprehensive
understanding of the regulation of these genes.
ACKNOWLEDGMENTS
We thank Martha Trinkle for editorial assistance. We are grateful to the many
colleagues who generously provided manuscripts prior to publication and
apologize that only a fraction of the valuable information could be cited.
Work in the authors' laboratory was supported by National Institutes of
Health grants R01-HL41264-02 and R01-CA49515-01 (BdC) and by the
Finnish Academy.
Literature Cited
42. Ninomiya, Y., van der Rest, M., Mayne, R., Lozano, G., Olsen, B. R. 1985. Biochemistry 24:4223-29
43. van der Rest, M., Mayne, R., Ninomiya, Y., Seidah, N. G., Chretien, M.,
Proc. Natl. Acad. Sci. USA 84:940-44
62. Pihlajaniemi, T., Tamminen, M., Sandberg, M., Hirvonen, H., Vuorio, E. 1990. Ann. NY Acad. Sci. 580:440-43
al. 1987. Proc. Natl. Acad. Sci. USA 84:2803-7
169. Rossi, P., de Crombrugghe, B. 1987. Proc. Natl. Acad. Sci. USA 84:5590-94
J., Jimenez,
Biol. Chem. 261:4337-45
191. Roberts, A. B., Sporn, M. B., Assoian, R. K., Smith, J. M., Roche, L. M., et al. 1986. Proc. Natl. Acad. Sci. USA 83:4167-71