A Regulatory SNP Causes a Human Genetic Disease by Creating a New Transcriptional Promoter
Marco De Gobbi,1* Vip Viprakasit,2* Jim R. Hughes,1 Chris Fisher,1 Veronica J. Buckle,1 Helena Ayyub,1 Richard J. Gibbons,1 Douglas Vernimmen,1 Yuko Yoshinaga,3 Pieter de Jong,3 Jan-Fang Cheng,4 Edward M. Rubin,4 William G. Wood,1 Don Bowden,5 Douglas R. Higgs1‡

We describe a pathogenetic mechanism underlying a variant form of the inherited blood disorder α thalassemia. Association studies of affected individuals from Melanesia localized the disease trait to the telomeric region of human chromosome 16, which includes the α-globin gene cluster, but no molecular defects were detected by conventional approaches. After resequencing and using a combination of chromatin immunoprecipitation and expression analysis on a tiled oligonucleotide array, we identified a gain-of-function regulatory single-nucleotide polymorphism (rSNP) in a nongenic region between the α-globin genes and their upstream regulatory elements. The rSNP creates a new promoterlike element that interferes with normal activation of all downstream α-like globin genes. Thus, our work illustrates a strategy for distinguishing between neutral and functionally important rSNPs, and it also identifies a pathogenetic mechanism that could potentially underlie other genetic diseases.

The human α-globin cluster, located at the telomeric region of chromosome 16 (16p13.3), includes an embryonic gene (ζ), two minor α-like genes [αD (also called μ) and θ], two α genes (α2 and α1), and two pseudogenes (ψα1 and ψζ) (1, 2). Upstream of these genes are four highly conserved cis elements (MCS-R1 to MCS-R4) of which MCS-R2 (also known as HS-40) plays the major role in regulating expression of the cluster (2, 3) (Fig. 1). Previous analyses of mutations that down-regulate globin gene expression and cause thalassemia have elucidated many of the general mechanisms underlying human molecular disease (4).
Down-regulation of one or two of the four α-globin genes (αα/αα) causes anemia with mild red blood cell changes, the so-called α thalassemia trait. However, when α-globin gene expression is reduced to less than ~50% of normal, excess β-globin chains form tetramers (β4, called HbH), which precipitate in the red blood cell, causing a more severe form of anemia called HbH disease (5). In nearly all cases of α thalassemia, the molecular basis for the reduced levels of α-globin expression can be readily identified (4, 5). We have studied 148 individuals from Melanesia with α thalassemia, including 5 with HbH disease, in whom none of the


1 Medical Research Council Molecular Haematology Unit, Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, Oxford, OX3 9DS, UK. 2Department of Pediatrics, Siriraj Hospital, Mahidol University, Bangkok, Thailand. 3BACPAC Resources, Oakland Research Institute Children’s Hospital, Oakland, CA, USA. 4Genome Science, Genomic Division, Lawrence Berkeley National Laboratory, CA, USA. 5Department of Anatomy and Cell Biology, Monash University, Melbourne, Australia.

*These authors contributed equally to this work. ‡To whom correspondence should be addressed. E-mail:

previously described molecular defects could be found. The pattern of inheritance suggested that individuals with HbH disease are homozygotes for a codominant defect, referred to here as (αα)T, causing α thalassemia with a predicted genotype of (αα)T/(αα)T (table S1).

To determine which process in gene expression had been affected, we analyzed a Melanesian individual (patient L, table S1) with a well-defined phenotype of HbH disease. In situ RNA hybridization to detect primary transcripts in erythroid cells from patient L detected substantially fewer nuclear transcripts from the α-globin genes than from the β-globin genes (Fig. 2), which is consistent with a mutation reducing α-globin RNA transcription. DNA fluorescence in situ hybridization studies in two affected individuals showed that the α-globin cluster was present at its normal location at the tip of chromosome 16. Extensive analysis of the α-globin cluster and the surrounding 300 kb revealed no evidence for any deletions or chromosomal rearrangements in the patients with α thalassemia. Where tested, the pattern of DNA methylation appeared normal. Sequence analysis of the major (α2 and α1) and minor (αD and θ) α-like genes and their regulatory elements revealed only the wild-type sequences or known neutral single-nucleotide polymorphisms (SNPs).

Having excluded all currently known α thalassemia mutations, we reasoned that the Melanesian form was either due to a cis-acting mutation in a previously unrecognized regulatory element or resulted from a gain-of-function mutation that negatively regulates α-globin expression. Alternatively, it was possible that α thalassemia in these individuals was due to a trans-acting mutation. By analyzing linkage to a variable number of tandem repeats (VNTR) (6) located ~8.5 kb from the α-globin genes (Fig. 1), we found that all individuals with the (αα)T mutation shared a common VNTR allele (fig. S1), demonstrating that this is a cis-linked defect. Further association studies, using known SNPs, showed that the (αα)T haplotype extends from the 16p telomere, with loss of association immediately downstream of the α-globin cluster (coordinate 168,467 in Fig. 1) defining the centromeric border of the region containing the cis-acting mutation. We estimated that the frequency of the (αα)T defect in the island population is ~0.04 (fig. S1).

We therefore resequenced the (αα)T haplotype by isolating bacterial artificial chromosomes (BACs) from a library constructed from the peripheral blood DNA of patient L with the Melanesian type of HbH disease [(αα)T/(αα)T]. BACs spanning the α-globin cluster and the surrounding ~213 kb of DNA (coordinates 21,059 to 234,236) were sequenced (DQ431198), and we identified 283 SNPs and/or sequence differences (Fig. 1) by comparison with the current wild-type sequence (National Center for Biotechnology Information database build 35, coordinates 1 to 223478), consistent with estimates of the frequency of SNPs throughout the genome (7).

This now presented a situation analogous to a common, largely unsolved problem in human genetics: how to identify a functionally important single nucleotide change from all other SNPs within a relatively large (~213 kb) genomic interval (8, 9). To search for functional changes associated with these SNPs, we constructed a tiled array representing all regions of nonrepetitive DNA throughout the terminal 223.5 kb of chromosome 16. RNA expression profiles obtained with the use of complementary DNA from normal (αα/αα) or mutant [(αα)T/(αα)T] erythroblasts were compared. Two prominent differences were observed in the mutant erythroblasts (Fig. 1).
First, a major new peak of RNA transcription (beyond the quantitative range of the array) from the same DNA strand as α-globin (fig. S2) was observed between coordinates 149,682 and 153,390 (Fig. 1, A and B). Quantitative reverse transcription polymerase chain reaction (RT-PCR) showed that expression from this region was >1000-fold higher in the mutant than in the wild-type chromosome (Fig. 1C). Second, by RT-PCR we observed an ~80-fold decrease in expression of the αD gene immediately downstream of this peak (Fig. 1, A and B). The decreased level of α2 and α1 gene expression detected by quantitative RT-PCR (table S2) was not detected on the array, again because globin expression lies beyond the quantitative range. No other substantial differences in the pattern of RNA expression were seen across the 223.5-kb region (Fig. 1, A and B).
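The figures quoted above lend themselves to a quick arithmetic check. The following Python sketch computes the density of sequence differences in the resequenced interval (283 differences between coordinates 21,059 and 234,236) and the genotype frequencies expected for an allele frequency of about 0.04. The genotype calculation assumes Hardy-Weinberg proportions, which is an assumption made here for illustration and is not stated in the text; the function names are hypothetical.

```python
def differences_per_kb(n_differences, start, end):
    """Sequence differences per kilobase over an inclusive coordinate range."""
    length_bp = end - start + 1
    return n_differences / (length_bp / 1000.0)


def hardy_weinberg(q):
    """Expected genotype frequencies for a biallelic locus with variant
    allele frequency q. Hardy-Weinberg proportions are an illustrative
    assumption, not something the paper states."""
    p = 1.0 - q
    return {"normal": p * p, "heterozygote": 2 * p * q, "homozygote": q * q}


# Values taken from the text: 283 differences, coordinates 21,059 to 234,236,
# and an estimated allele frequency of about 0.04.
density = differences_per_kb(283, 21_059, 234_236)
genotypes = hardy_weinberg(0.04)

print(f"{density:.2f} differences per kb")
for name, freq in genotypes.items():
    print(f"{name}: {freq:.4f}")
```

With these inputs the density comes out a little above one difference per kilobase, so a candidate list of this size is what the interval would be expected to yield. This is why sequence comparison alone could not single out the causal change.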

SCIENCE VOL 312 26 MAY 2006


Fig. 1. Overview of the α-globin cluster and identification of a rSNP. The genes located in the telomeric region of chromosome 16 are numbered as in (1), and the globin genes are labeled. The VNTR (3′ hypervariable region) is shown as a red zigzag line. A deletion (15) removing the region containing the rSNP is shown as a black line. Below this, all DNAse1 hypersensitive sites (DHSs) and erythroid-specific sites (eDHSs) are shown (3, 10, 16). MCS-P/R summarizes all evolutionarily conserved promoter and regulatory sequences across this region (2). Probes used to profile ChIP products are shown in pink, and repeats are shown in green. Below this, all sequence differences between the (αα)T and wild-type αα chromosome are shown. “New Diffs” refers to newly identified sequence differences that are not known to be polymorphic SNPs. SNPs analyzed in genetic linkage studies described in this paper are shown in purple. The rSNP described here is shown as a black diamond in “All Diffs” and “New Diffs.” A dashed vertical line runs from these diamonds through the array data. Below, the patterns of gene expression recorded on a custom-tiled Affymetrix array spanning this telomeric region in primary erythroid cells from (A) a normal individual (αα/αα) and (B) patient L with the (αα)T/(αα)T genotype are shown. The peak of ζ-globin expression in the (αα)T chromosome results from cross-hybridization to the highly expressed abnormal transcripts across the homologous ψζ gene. (C) Estimates of the differences in RNA expression between normal and abnormal chromosomes, based on independent quantitative PCR (QPCR), are shown below (on a logarithmic scale). (D) Representation of how one or more of the conserved regulatory elements (contained within the region spanned by the horizontal black bar) normally interact with the α-globin promoters [αα] and how they are proposed to interact less effectively (dashed lines) in the abnormal (αα)T chromosome. The direction of transcription of the globin genes and the new promoter, created by the C allele of SNP 195, are indicated by the arrows.
[Figure graphic: map of 16p13.3 (scale in kb) with tracks Deletion, DHSs, eDHSs, MCS-P/R, ChIP Probes, Repeats, All Diffs, and New Diffs; gene labels ζ, ψζ, αD, α2, α1, θ; regulatory elements R2 and R3; array panels A and B; QPCR panel C; schematic D comparing αα and (αα)T.]

The region underlying this new peak of expression is unremarkable, containing 3.7 kb of poorly conserved, predominantly noncoding sequence, although the tail of the peak extends into the ψζ-globin gene. This region contains 17 SNPs, 10 of which have been previously characterized in nonthalassemic individuals. We therefore analyzed the segregation of the remaining seven SNPs and, as controls, six additional SNPs from nonrepetitive regions of the α cluster (Fig. 1), within affected families. In addition, we performed genetic linkage studies in 15 nonthalassemic Melanesian individuals (αα/αα), 22 with α thalassemia trait [αα/(αα)T], and 5 with HbH disease [(αα)T/(αα)T]. Six of the seven SNPs underlying the new peak of transcription were found on both the normal αα and abnormal (αα)T chromosomes. Only the C allele of SNP 195 (C or T, located at coordinate 149709) segregated with thalassemia in the affected families and showed complete association with the (αα)T haplotype (table S2). This allele was not found in a separate analysis of 131 nonthalassemic, Melanesian individuals. SNP 195 changes the sequence

Fig. 2. In situ RNA analysis demonstrating reduced primary α-globin transcripts in patient L. Nascent α-globin (red) and β-globin (green) transcripts in intermediate erythroblasts from a normal control and from patient L [with the (αα)T/(αα)T genotype] are shown. (Left) Representative nuclei show β-globin transcripts in both patient and control, but α-globin transcripts are present only in the normal control. (Right) The proportions of nuclei containing none, one, or two signals were recorded from the analysis of 100 cells.




Fig. 3. Chromatin immunoprecipitation demonstrating the acquisition of a new transcription factor binding site (arrowed). The new binding site is located at coordinate 149709. Names of transcription factors and chromatin modifications are shown at left. Chromatin immunoprecipitation was performed as previously described (10) using primers and antibodies described in the supporting online material (17). The degree of enrichment in a normal individual (black columns) and in an individual with the (αα)T/(αα)T genotype (white columns) is shown on the y axis, and coordinates of the regions sampled by QPCR are shown on the x axis. Asterisks indicate where insufficient primary cells were available for analysis.

ments, out-competing the endogenous α-globin promoters (12–14). SNP 195 creates a new promoterlike element between the upstream regulatory elements and their cognate promoters. This element, when activated, causes significant down-regulation of the αD, α2, and α1 genes that lie downstream (Fig. 1D), thereby causing α thalassemia.

These findings not only demonstrate an additional mechanism causing human genetic disease but also illustrate two important points when searching for SNPs that may influence gene expression (9). First, to distinguish functional from nonfunctional SNPs, it has been suggested that searches should be concentrated in areas of the genome likely to contain cis-regulatory elements (8) (such as multispecies conserved elements). The gain-of-function regulatory SNP (rSNP) identified here, located in a region of the α-globin cluster that we know may be deleted with no discernible effect on α-globin expression (Fig. 1) (15), demonstrates that SNPs in such areas should not be dismissed as of no potential importance.
Second, the use of densely tiled arrays for analysis of transcription and ChIP profiles provides a rapid and efficient in vivo strategy to distinguish nonfunctional from functional rSNPs that may underlie the altered patterns of expression responsible for a wide range of human genetic diseases.
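The sequence change reported for SNP 195 (TAATAA on the T allele, TGATAA on the C allele) can be checked against a GATA motif in a few lines. The following Python sketch tests whether a variant allele gains a match to the commonly cited GATA-1 consensus (A/T)GATA(A/G) on either strand. The consensus expressed as a regular expression and all function names are illustrative assumptions, not part of the authors' methods. A sequence match of this kind only flags a candidate site; binding itself was established in the paper by gel shift and ChIP experiments.

```python
import re

# Commonly cited GATA-1 consensus (A/T)GATA(A/G), as a regular expression.
# This pattern is an illustrative assumption, not taken from the paper.
GATA_MOTIF = re.compile(r"[AT]GATA[AG]")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_motif(seq):
    """True if the motif occurs on either strand of seq."""
    seq = seq.upper()
    return bool(
        GATA_MOTIF.search(seq) or GATA_MOTIF.search(reverse_complement(seq))
    )


def creates_motif(ref_seq, alt_seq):
    """True if alt_seq contains the motif and ref_seq does not."""
    return has_motif(alt_seq) and not has_motif(ref_seq)


# Hexamers as quoted in the text for the two alleles of SNP 195.
t_allele = "TAATAA"
c_allele = "TGATAA"
print(creates_motif(t_allele, c_allele))
```

Applied to each candidate SNP with its flanking sequence, the same test would give a first-pass list of variants that gain a motif. Consensus matches are frequent in genomic DNA, which is why the in vivo ChIP and expression profiles remain the deciding evidence.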
References and Notes
1. J. Flint et al., Nat. Genet. 15, 252 (1997).
2. J. R. Hughes et al., Proc. Natl. Acad. Sci. U.S.A. 102, 9830 (2005).
3. D. R. Higgs et al., Genes Dev. 4, 1588 (1990).
4. M. H. Steinberg et al., Eds., Disorders of Hemoglobin (Cambridge Univ. Press, Cambridge, 2001).
5. D. R. Higgs et al., Blood 73, 1081 (1989).
6. A. P. Jarman, R. D. Nicholls, D. J. Weatherall, J. B. Clegg, D. R. Higgs, EMBO J. 5, 1857 (1986).
7. R. Sachidanandam et al., Nature 409, 928 (2001).
8. J. C. Knight, J. Mol. Med. 83, 97 (2005).
9. T. Pastinen, T. J. Hudson, Science 306, 647 (2004).
10. E. Anguita et al., EMBO J. 23, 2841 (2004).
11. I. A. Wadman et al., EMBO J. 16, 3145 (1997).
12. C. Esperet et al., J. Biol. Chem. 275, 25831 (2000).
13. A. Leder, C. Daugherty, B. Whitney, P. Leder, Blood 90, 1275 (1997).
14. E. Anguita et al., Blood 100, 3450 (2002).
15. P. Winichagoon et al., Nucleic Acids Res. 10, 5853 (1982).
16. P. Vyas et al., Cell 69, 781 (1992).
17. Materials and methods are available as supporting material on Science Online.
18. M.D.G. is a Ph.D. student in Pharmacology and Experimental and Clinical Therapy at the University of Turin, Italy. We thank J. Sloane Stanley, J. Sharpe, J. Green, J. Brown, N. Ventress, K. Clark, and the Oxford Computational Biology Research Group for technical support. V.V. is supported by the Thailand Research Fund.
[Fig. 3 graphic: ChIP enrichment (y axis) at QPCR-sampled coordinates 103142, 110025, 147771, 149779, 150916, 152787, 162908, 163130, 163558, and 165877 (x axis); map landmarks HS -40, HS -33, αD, α2, α1, θ, and LUC7L; panels include Pol II.]



5′-TAATAA-3′ (T allele) to 5′-TGATAA-3′ (C allele), potentially creating a new binding site for the key erythroid transcription factor GATA-1. Conventional in vitro electromobility gel shift assays and supershifts, using an antibody to GATA-1, demonstrated that this SNP creates a potential GATA-1 binding site (fig. S3). A chromatin immunoprecipitation (ChIP) profile using quantitative real-time PCR across the α-globin cluster (coordinates 53195 to 185030) showed that in addition to binding the known regulatory elements, GATA-1 also binds at the C allele of SNP 195 in vivo (Fig. 3). The C allele also nucleates the binding of a pentameric erythroid complex including the transcription factors SCL, E2A, LMO2, and Ldb-1 (Fig. 3), which are frequently found with GATA-1 at erythroid regulatory elements (10, 11). ChIP profiles using antibodies that recognize modified histones [H4Ac, H3Ac, and H3K4me2 (Fig. 3 and fig. S4)] demonstrated that binding of GATA-1 at the C allele is associated with a new peak of active chromatin in the α-globin cluster. Finally, we showed that the C allele, unlike the T allele, binds RNA polymerase II (Fig. 3).

Expression of the α-globin genes normally occurs late in erythropoiesis after what appears to be a well-defined order of transcription factor binding to the upstream regulatory elements (MCS-R1 to MCS-R4), followed by recruitment of the pre-initiation complex and RNA polymerase II. These events are thought to result in the formation of a DNA/protein complex including one or more of the regulatory elements and the α-globin promoter(s) (10). We and others have shown that the insertion of active heterologous promoters (such as PGK Neo) in some regions of the α-globin cluster can disrupt α-globin expression, probably as a result of preferential interaction of the heterologous promoter with the upstream ele-

Supporting Online Material
Materials and Methods
Figs. S1 to S4
Tables S1 and S2
References

21 February 2006; accepted 25 April 2006
10.1126/science.1126431
