
METHODS FOR QUANTITATIVE DISSECTION

OF GENE REGULATION

A DISSERTATION

SUBMITTED TO THE GENETICS DEPARTMENT

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Jason Buenrostro

December 2015

© 2016 by Jason Daniel Buenrostro. All Rights Reserved.
Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-
Noncommercial 3.0 United States License.
http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/mn616fx6627


I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

William Greenleaf, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Howard Chang, Co-Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Gerald Crabtree

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Jin Li

I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Michael Snyder, PhD

Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.


Acknowledgements

I have had an incredible experience at Stanford University. Throughout my time

here I have had the immense privilege of working with many talented and amazing

people. That list begins with Will Greenleaf, my primary mentor and friend for the last 5

years in graduate school. Will has consistently provided thoughtful advice in science and

in life, and an incredible amount of patience throughout my graduate experience.

Throughout this time I have also had the privilege of mentorship from Howard Chang. I

thank Howard for his creative and thoughtful ideas, but more importantly his unwavering

support of all my scientific endeavors.

I thank Stanford and the Stanford Genetics department for the opportunity to do

my research here and for providing such a wonderful environment to start my academic

career. I also thank my faculty committee Mike Snyder, Lars Steinmetz, Jerry Crabtree,

Billy Li and Ravi Majeti for their service, insights and support. In addition, I thank Mike

Snyder for the opportunity to collaborate and for the opportunity to work with Beijing

Wu, experiences that have significantly enriched my graduate experience.

I thank past mentors Hanlee Ji and Georges Natsoulis for their mentorship and an amazing

start to my scientific life. Specifically, I thank Hanlee Ji for my start in science, providing

me an amazing opportunity to join the Stanford community as a research assistant in

2009. I also thank Samuel Myllykangas for his mentorship and friendship throughout

this time.

In addition to the long list of mentors, I thank many close collaborators for

making this work possible. I thank Lauren Chircus and Carlos Araya for their

determination, hard work and creativity, an essential component for the development of


the RNA array. I thank Paul Giresi for teaching me everything I know about chromatin.

I thank Ulli Litzenberger, Dave Ruff and the Fluidigm team for their wonderful insight and

dedication to the development of single-cell ATAC-seq. I also thank new friends, Ryan

Corces and Ansu Satpathy, who have brought new and fresh perspectives to my scientific

thinking. I also deeply thank Beijing Wu for her unwavering dedication to

our work, personal support, and general thoughtfulness; without her, much of the work

presented here would not have been possible.

Importantly, I’d like to thank my family who has provided life-long support.

Specifically, I thank my parents Miguel and Martha Buenrostro, who have made tremendous

sacrifices throughout our lives to provide me with the opportunity and preparation to

pursue my dreams. I also thank my brother, sisters, and niece, Michael, Michelle, Erika

and Sam, for their love, support and patience. I also thank my roommate and partner Sara

Prescott, who has been there for me at my best and my worst, and continues to be my

closest ally in science and in life. Lastly, I thank all of my friends, family, past mentors

and other collaborators whom I regret not having enough space to mention here.


TABLE OF CONTENTS

CHAPTER ONE - Introduction .......................................................................................... 1
Cellular regulation in cis and trans .................................................................................. 1
Genome-wide methods .................................................................................................... 2
A quantitative and high-throughput approach to binding ............................................... 2
Measuring chromatin accessibility in rare cells .............................................................. 3
CHAPTER TWO – Quantitative dissection of millions of sequence variants ................... 5
Introduction ..................................................................................................................... 5
A high-throughput RNA array platform for quantitative binding measurements ........... 7
The RNA-array enables quantitative measurement of both binding and dissociation .... 8
Binding affinity can be partitioned between primary and secondary structure............... 9
Changes in association rate substantially contribute to changes in binding energies ... 11
Discussion ..................................................................................................................... 12
Chapter 2 - Figures and Figure Legends ....................................................................... 15
References ..................................................................................................................... 21
CHAPTER THREE – Measuring accessibility in rare cellular populations..................... 24
Introduction ................................................................................................................... 24
ATAC-seq measures chromatin accessibility using Tn5 transposase ........................... 25
Insert size yields information regarding nucleosome packing and positioning ............ 26
ATAC-seq reveals distinct classes of factor-nucleosome spacing ................................ 28
Footprints can be used to infer factor occupancy genome-wide ................................... 29
Discussion ..................................................................................................................... 30
Chapter 3 - Figures and Figure Legends ....................................................................... 32
References ..................................................................................................................... 37
CHAPTER FOUR – Single-cell accessibility reveals principles of regulatory variation 39
Introduction ................................................................................................................... 39
Single-cell ATAC-seq a measure of chromatin accessibility genome-wide ................. 39
Cell-cell variability in trans ........................................................................................... 41
Trans-factors synergize to induce cell-cell variability .................................................. 42
Cell-state and chemical perturbation effects on cell-cell variability ............................. 43
Single-cells vary in cis .................................................................................................. 45
Discussion ..................................................................................................................... 46
Chapter 4 - Figures and Figure Legends ....................................................................... 48
References ..................................................................................................................... 60
CHAPTER FIVE – The epigenomic determinants of human hematopoiesis ................... 62
Introduction ................................................................................................................... 62
Identification of chromatin accessibility landscape in primary blood cells .................. 64
Chromatin accessibility at distal elements delineates the hematopoietic hierarchy...... 65
Enhancer cytometry prospectively deconvolves complex cell populations .................. 67
Regulatory networks of normal hematopoiesis ............................................................. 68


Accessibility profiles of purified cell populations identify the ontogeny of human diseases ... 70
Discussion ..................................................................................................................... 72
Chapter 5 - Figures and Figure Legends ....................................................................... 75
References ..................................................................................................................... 86
CHAPTER SIX – The regulatory landscape of acute myeloid leukemia .......................... 88
Introduction ................................................................................................................... 88
Leukemogenesis and cancer evolution in AML ............................................................ 89
AML represents a cooption of normal myelopoiesis ..................................................... 90
AML cell types exhibit lineage infidelity with regulatory contributions from multiple normal blood cell types ... 92
Generation of synthetic normal analogs for assessment of AML-specific biology ....... 94
Mechanism and clinical consequences of pre-leukemic HSC clonal advantage ............ 95
Discussion ..................................................................................................................... 98
Chapter 6 - Figures and Figure Legends ..................................................................... 100
References ................................................................................................................... 110
CHAPTER SEVEN – Conclusion .................................................................................. 112
Methods for gene regulation ....................................................................................... 112
Future work ................................................................................................................. 112

CHAPTER ONE - Introduction

Cellular regulation in cis and trans

The human body is composed of a large collection of highly diverse cell types, each providing a specialized and context-specific function. The establishment and maintenance of a cell’s identity is largely determined by defined regulatory programs affecting diverse cellular processes such as chromatin accessibility, RNA localization and degradation, or protein modifications. The expression of transcription factors (TFs) and chromatin remodelers drives chromatin accessibility, which spans a continuum from nucleosome-free and nucleosome-associated chromatin to higher-order chromatin compaction. Highly compacted chromatin is sequestered from regulatory machinery, whereas nucleosome-free chromatin demarcates regions of active regulation in cells. Distal nucleosome-free regulatory elements can have highly divergent interactions with gene promoters in cis, acting as i) activators, or enhancers, ii) repressors or iii) insulators, which together determine the expression of nearby genes.

Analogous principles hold true for RNA regulation, wherein RNA structure defines the binding landscape of microRNAs (miRNAs) and RNA-binding proteins (RBPs), which have diverse effects on post-transcriptional processes. Here, eukaryotic RNAs can fold into simple 2D or complex 3D structures, which define permissive or occluded binding substrates for trans-acting regulators. A quantitative and genome-wide understanding of these dynamic cellular structures would provide unique insight into the binding determinants of trans-acting regulators, drivers of cellular function and cellular potential.

Genome-wide methods

The advent of high-throughput sequencing [1] methodologies has enabled unbiased and genome-wide characterization of these diverse cellular processes. For example, high-throughput assays measuring chromatin-bound proteins (ChIP-seq) [2] or RNA-bound proteins (RIP-seq and CLIP-seq) [3] have been shown to be sensitive methods for identifying the binding locations of trans-acting proteins. In addition, assays for measuring chromatin accessibility (DNase-seq) [4,5] or RNA structure (PARS) [6] enable a genome-wide analysis of the structural determinants of this binding landscape. However, these methods are limited in several ways, as described in the following sections. In this thesis I will discuss the development of new methods, which focus on the structural determinants of trans-factor binding to chromatin or RNA.

A quantitative and high-throughput approach to binding

Carefully controlled in vitro assays can be used to determine the biophysical parameters defining a binding interaction. However, current methods are low throughput or not quantitative. In this thesis, I will describe the development of a high-throughput and generalizable platform for performing biochemical assays of RNA called RNA-MaP [5]. Here, we repurpose a high-throughput sequencing instrument to serve as a massively parallel biochemistry platform. We use this platform to describe the kinetic parameters of an RNA-binding protein binding to >10^7 mutants of an RNA stem loop.

Measuring chromatin accessibility in rare cells

Current methods to profile chromatin accessibility require millions of cells, limiting their application to either cell lines or whole tissues. Applying these methods to complex cellular populations derived from tissues averages over the rich diversity of chromatin structures within different cell states, leading to an incorrect understanding of regulatory processes within these tissues. In response, we have developed genome-wide methods for measuring chromatin accessibility (ATAC-seq) (Fig. 1a) within rare cellular populations [6] or within single cells [7]. With such methods, defined cellular populations within complex tissues can be isolated using flow cytometry and profiled using ATAC-seq. However, this approach is also limited in that it requires established protocols for cell-type isolation; in contrast, single-cell ATAC-seq (scATAC-seq) may be used to partition cells into relevant subtypes de novo. Together, these assays offer an unprecedented view of chromatin structure in vivo.

Of particular importance to human health and disease, and an excellent model for understanding dynamic cellular behavior, is the hematopoietic hierarchy (Fig. 1b), wherein a single hematopoietic stem cell (HSC) can give rise to a multitude of distinct cellular populations ranging from enucleated red blood cells (RBCs) to specialized immune cells (CD4 and CD8 T cells, B cells and more). Importantly, dysregulation of these intricate regulatory networks leads to a multitude of hematologic malignancies [11]. In this work we also apply ATAC-seq and scATAC-seq to normal human hematopoiesis and acute myeloid leukemia (AML) in an effort to elucidate governing biochemical principles defining normal human development and disease.

References

1. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
2. Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502 (2007).
3. Zhao, J. et al. Genome-wide Identification of Polycomb-Associated RNAs by RIP-seq. Molecular Cell 40, 939–953 (2010).
4. Boyle, A. P. et al. High-Resolution Mapping and Characterization of Open Chromatin across the Genome. Cell 132, 311–322 (2008).
5. Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
6. Wan, Y., Kertesz, M., Spitale, R. C., Segal, E. & Chang, H. Y. Understanding the transcriptome through RNA structure. Nature Reviews Genetics 12, 641–655 (2011).
7. Buenrostro, J. D. et al. Quantitative analysis of RNA-protein interactions on a massively parallel array reveals biophysical and evolutionary landscapes. Nat. Biotechnol. 32, 562–568 (2014).
8. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Meth 10, 1213–1218 (2013).
9. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).

CHAPTER TWO – Quantitative dissection of millions of sequence variants*

* Portions of this chapter were taken from Buenrostro et al. Quantitative analysis of RNA-protein interactions on a massively parallel array reveals biophysical and evolutionary landscapes. Nature Biotechnology, 2014, doi:10.1038/nbt.2880.

Introduction

RNA-protein interactions drive a wide variety of critical biological processes from gene expression [1] to viral assembly [2]. Up to 10% of the eukaryotic proteome is estimated to bind RNA [3], and recent work has begun to uncover a web of RNA-protein interactions [4-6] that can control gene expression through splicing, RNA localization, and other post-transcriptional processes. Protein interactions with long noncoding RNAs also play a role in epigenetic state changes during differentiation [7], perhaps through “scaffolding” chromatin remodelers [8,9]. Furthermore, RNA-protein interactions have proven powerful tools in synthetic biology, allowing gene expression control through post-transcriptional regulation [10,11].

A biophysical understanding of the nucleic-acid sequence determinants of RNA-protein interactions lags behind our growing realization of their biological importance. Unlike double-stranded DNA, RNA substrates demonstrate diverse intramolecular interactions (including mismatched base bulges, stem loops, pseudoknots, G-quartets, divalent cation interactions, and non-canonical base pairs) that determine three-dimensional RNA structure [12-14] and set the landscape for interactions with RNA-binding proteins (RBPs) [15]. The combinatorial nature of RNA sequence and intramolecular interactions, coupled with the relative paucity of data produced from current biophysical methods, has precluded a high-resolution, predictive understanding of both the sequence dependence of affinity and the resulting evolutionary constraints imposed by these requirements. Because the relationship between sequence and binding is often opaque, little is understood regarding the evolutionary constraints on these RNA structures, making bioinformatic identification of functional RNAs difficult [16].

Current methods for investigating the sequence dependence of RNA-protein interactions include medium-throughput microfluidic methods [17] and high-throughput methods coupling affinity-based selection with high-throughput DNA sequencing or array hybridization [18], which have recently been used to generate a catalogue of RNA binding motifs [19]. While powerful, selection and sequencing methods bias results towards high-activity variants and do not directly and quantitatively measure the biophysical parameters that underlie biological function [20]. Recently, methods have been developed to quantitatively measure catalysis [21,22]; however, no such high-throughput method exists for determining the binding parameters kon, koff, and Kd for RNA-protein interactions.

The technological innovations that have propelled the high-throughput sequencing revolution provide the foundations for massively parallel, fluorescence-based observations over a large variety of nucleic acid structures immobilized on a surface [23-26]. Recent work characterizing DNA-protein interactions [26] has demonstrated the utility of these instruments for high-throughput binding affinity assays across large DNA sequence space. In this work, we have leveraged the Illumina DNA sequencing platform, an instrument that integrates solid-phase molecular biology, fluidics, and high-throughput TIRF imaging for massively parallel DNA sequencing [27], to create a platform for direct, ultra-high-throughput measurement of RNA-protein interactions. In addition, we have developed quantitative image analysis tools for large-scale analysis of these data, and demonstrate measurement of both equilibrium binding constants and dissociation kinetics. We apply these methods to the MS2 coat protein [2,28-31], a system with widespread applications in affinity purification [32], RNA imaging [33] and synthetic biology [10,11]. This approach enables quantitative measurement of binding and dissociation of a protein to >10^7 RNA targets generated directly on the flow cell surface, providing massive biophysical datasets enabling predictive models for affinity tuning, decomposition of binding energies between primary and secondary structures, and quantitative analysis of evolutionary trajectories across sequence space.

A high-throughput RNA array platform for quantitative binding measurements

To generate a library of RNA targets, we first generated an Illumina sequencing library containing an E. coli RNA polymerase (RNAP) initiation and stall sequence, and a region coding for diverse sequence variants of the MS2 RNA hairpin synthesized using doped oligonucleotides (Fig. 1a, b). To ensure multiple measurements of each RNA variant and reduce sequencing error [34], we introduced single-molecule barcodes 5’ of the RNAP initiation sequence. The barcoding strategy serves to identify individual molecules within a population by uniquely tagging each molecule using a barcode library. We then “bottlenecked” the DNA variant population by diluting ~8 x 10^5 molecules into a subsequent PCR amplification reaction. Bottlenecking allowed each barcoded molecular species to be sequenced a median of 15 times per sequencing lane, allowing for multiple redundant measurements across the flow cell. The sequencing process converted individual molecules within the library to ~1 µm diameter clusters of ~1,000 clonal DNA molecules on the flow cell surface [27], and provided the sequence and position of the DNA templates across the 2D array.

Following sequencing, we removed the sequenced DNA strand, and regenerated double-stranded DNA (dsDNA) using DNA polymerase to extend a biotinylated primer. We then saturated the flow cell with streptavidin to create a terminal biotin-streptavidin roadblock on these dsDNA fragments. To synthesize RNA we adapted methods from single-molecule investigations [35] designed to generate a single RNA per DNA template. First, we initiated E. coli RNA polymerase holoenzyme (RNAP) in CTP-starved conditions, which allows RNAP to generate 26 bases of RNA (the footprint of RNAP) before stalling at the first guanine on the DNA template strand. Second, we washed excess RNA polymerase from solution and introduced all 4 nucleotides, allowing RNAP to transcribe the variable region and stall at the biotin-streptavidin roadblock. This procedure results in transcribed RNA tethered to its parent DNA via RNA polymerase (Fig. 1a). The resulting RNA array contained 1.2 x 10^7 distinct RNA features comprising 1.48 x 10^5 unique sequences in a single sequencing lane.

The RNA-array enables quantitative measurement of both binding and dissociation

To measure binding energies, we flowed fluorescent SNAP-Surface 549-MS2 over the RNA array, and imaged bound MS2 at equilibrium using total internal reflection fluorescence (TIRF) at 10 increasing concentrations. After the final measurement, we perfused 1.8 µM unlabeled MS2 and recorded the fluorescence decay caused by dissociation (Fig. 1c). The high concentration of unlabeled MS2 protein blocks other binding sites on the array, preventing re-binding of fluorescently labeled MS2. To quantify bound MS2 we developed image analysis tools that cross-correlate cluster centers from sequencing data to acquired images and fit the observed binding in each cluster to a 2D Gaussian (Fig. S1 and S2). Using this approach, we quantified the fluorescence signal for each cluster in 6,240 images representing 120 tiles imaged in two fluorescence color channels across 11 equilibrium MS2 concentrations and 15 dissociation time points. Fluorescence signals from single clusters fit canonical dissociation (Fig. 1d) and binding curves (Fig. 1e, f), yielding binding energy estimates in excellent agreement with published measurements (R = 0.92, slope = 1.08, Fig. 1g) and in vitro binding assays (R = 0.94, slope = 0.76). We calculated off rates (koff) for 3,029 sequences and dissociation constants (Kds) for 129,248 sequences, encompassing 57 single (100%), 1,539 double (100%), and 24,181 triple (92.4%) mutants (Fig. 2a, b).

To investigate how sequence variation in the RNA hairpin impacts MS2 binding, we examined differential binding energies for all single mutants compared to the consensus sequence (–∆∆G_consensus = 0 kBT). The average binding energy change from all possible single-base changes at each position reveals a sensitivity to mutation throughout the hairpin that complements the effects of mutating individual residues on the binding surface of MS2 to alanine [36] (Fig. 2c). Specifically, we observe high mutation sensitivity at base-paired positions near the loop and at specific single-stranded positions, suggesting significant primary sequence and secondary structure requirements for RNA recognition.

Binding affinity can be partitioned between primary and secondary structure

To comprehensively examine these primary and secondary structure effects on binding, we calculated the –∆∆G of all double mutants (Fig. 2d). We observed high positive epistasis in a population of “compensating mutants”, suggesting that these pairs of mutations preserve hairpin structure and maintain high binding affinities (Fig. 2e). We also observed negative epistasis in non-compensating mutants near the base of the stem, potentially due to cooperative effects on hairpin destabilization in these regions.
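As an illustration of the epistasis analysis described above, the following minimal sketch computes –∆∆G values and the deviation of a double mutant from the additive expectation of its two single mutants. It assumes that binding energies are expressed in units of kBT as the natural logarithm of a Kd ratio; the function names and the Kd values are hypothetical and are not taken from the measured dataset.

```python
import math


def neg_ddg(kd_variant, kd_consensus):
    """Return -ddG in units of kBT (assumed natural-log convention).

    The consensus sequence gives 0; weaker binders give negative values.
    """
    return -math.log(kd_variant / kd_consensus)


def epistasis(kd_double, kd_single_a, kd_single_b, kd_consensus):
    """Deviation of a double mutant from the sum of its single-mutant effects."""
    expected = neg_ddg(kd_single_a, kd_consensus) + neg_ddg(kd_single_b, kd_consensus)
    return neg_ddg(kd_double, kd_consensus) - expected


# Hypothetical Kd values (nM), chosen only to illustrate the sign convention.
kd_consensus = 2.0
kd_single_a = 200.0     # first mutation breaks a base pair
kd_single_b = 100.0     # second mutation breaks the same base pair
kd_compensating = 4.0   # both mutations together restore pairing

value = epistasis(kd_compensating, kd_single_a, kd_single_b, kd_consensus)
print(f"epistasis = {value:.2f} kBT")
```

Under this sign convention, a compensating double mutant that binds better than the additive expectation yields a positive value, whereas a double mutant that binds more weakly than expected yields a negative value.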

To quantify the sensitivity for non-canonical base-pairing at positions in the hairpin stem. identifying base-paired. G:U base pairs caused substantially 10 . This model provides two free parameters for each unpaired base accounting for primary sequence changes in the form of transitions or transversions. loop.) allowed de novo reconstruction of the bound hairpin structure.e.Reciprocal mapping of positive epistasis signatures (≥1 s.d. the model provides a total of 6 free parameters – one for transition and transversion of each base in the pair (4 parameters) as well as one parameter to account for disruption due to the loss of base-pairing and one parameter representing possible non-canonical base-pairing interactions. This re-fitting analysis allowed the model to incorporate a different energetic penalty for having non-canonical base pairs at a specific position instead of the energetic penalty for a full loss of base-pairing. we trained the model 8 separate times (once for each possible non-canonical pairing) with one free parameter representing the energetic cost of the respective non-canonical pairing. transitions or transversions that occur while holding secondary structure constant) and secondary structure changes (i. inferred energetic consequences of secondary structure disruptions or formation of non-canonical bases in isolation from primary sequence perturbations). We modeled the contributions of base-specificity (primary structure) and base- pairing (secondary structure) to binding energy at each position in the hairpin with a linear regression model from a set of 121 training sequences. in order to identify (via regression) the energetic contributions of primary sequence changes (i.e. and bulge positions demonstrating the feasibility of reconstructing molecular RNA structures from large-scale sequence-function data. In this analysis. For each pair of interacting bases. These parameters were optimized jointly.

less disruption to the binding energy than other non-canonical base pairs (Fig. 3a). Our final model, which incorporated a free parameter for G:U non-canonical base pairs, captured 92% of the variance in binding energy of the training set and predicted the binding energy of second and third mutations for variants with mutations in both paired and unpaired positions with correlation coefficients R=0.94 and R=0.83, respectively (Fig. 3b).

The model fit parameters allowed quantitative decomposition of primary and secondary determinants of affinity across the RNA structure (Fig. 3c, d). Altering the primary sequence at -10A (bulge) and -4A (loop), residues that interact with the Lys61 binding pocket on alternate halves of the dimer29, confers energetic costs that exceed disrupting the hairpin structure at any single base pair. We also observed important roles for the -7A and -5C residues, consistent with stacking interactions at these positions38. Altering the primary sequence on the 5’ side of the hairpin confers a greater energetic penalty compared with altering the 3’ side, which we speculate results from direct interactions with MS2 on the 5’ side36. Energetic penalties for disrupting base-pairing increase with proximity to the loop, while non-canonical G:U base pairs cause substantially less energetic disruption at the -8:-3 and -11:-1 positions, consistent with the formation of a wobble base pair at G:U positions that allows partial rescue of the secondary structure12,37.

Changes in association rate substantially contribute to changes in binding energies

We sought to quantify how changes in association and dissociation rates contribute to measured –∆∆G values for all mutants with measurable kinetic data. We calculated the energetic contributions to –∆∆G from changes in dissociation rates [–log(koff^mutant / koff^consensus) ≝ ∆log(koff^mutant)] and inferred the contribution from changes in association rates [log(kon^mutant / kon^consensus) ≝ ∆log(kon^mutant)]. Because ∆log(koff) + ∆log(kon) = –∆∆G, we treated these parameters as pseudo-energies. Using this decomposition, we examined the fractional contribution of change in dissociation rates to –∆∆G across single and double mutants (Fig. 4a). At the base of the hairpin, only a small fraction of –∆∆G measurements are explained by dissociation rate changes. This small effect suggests that mutations at these positions modulate association rates, likely by changing hairpin stability, thereby reducing the per-collision probability of productive binding. This interpretation is reinforced by examining ∆log(koff) and ∆log(kon) in this region (Fig. 4b, c). Dissociation rates change little while inferred association rates remain similar to that of the consensus sequence only for structures that maintain base-pairing through compensating mutations. Across all measured variants, we observe a significant population of structures with –∆∆G driven by association rates (Fig. 4d, Wilcoxon signed rank test, µ = 0.5, P < 2.2 × 10^-16), possibly by causing fraying of the hairpin and/or allowing competition with alternate RNA structures. These results suggest the kinetic drivers of observed affinity changes are position-specific and often operate through modulating association rates.

Discussion

Using in situ transcription and inter-molecular tethering of RNA to DNA, we have converted a high-throughput DNA sequencing flow cell into an RNA array for quantitatively measuring both binding kinetics and thermodynamics at an unprecedented scale. Using this quantitative deep mutational profiling approach we report, to our knowledge, the largest collection of binding affinities and kinetic constants for an intermolecular interaction. Using this dataset, we addressed long-standing biophysical questions, including i) the relative contributions of primary and secondary structure elements to binding energy, ii) the sequence-dependent kinetic contributions to observed affinities, iii) the context-dependence of preference for G:U intermediates in secondary structure.

Our predictive model for RNA-protein affinity across thousands of point mutations provides a map for quantitative tuning of both the association rate and the equilibrium constants of this RNA-protein interaction. We anticipate this resource of sequence variants will enable affinity tuning of MS2-based RNA sensors enabling new applications in synthetic biology. In addition, these data provide a useful framework for understanding the effect of primary sequence, secondary structure and non-canonical base-pairing, creating a valuable framework for understanding the design and evolution of new RNA aptamers.

We hypothesize that inferred changes in on-rates are due to destabilization of the RNA hairpin formation or competition with alternate secondary structure, reducing the number of productive binding collisions39. These observations suggest the data provided here may also provide a rich resource for modeling the RNA hairpin stability and alternate structure formation. While this is an area of inquiry beyond the focus of this work, the potential for formation of alternate structures and the effects of local sequence on native folding of RNA are well suited for study using this platform, as the RNA transcripts are synthesized by E. coli RNAP and folded co-transcriptionally, closely approximating synthesis conditions in vivo.

We anticipate this RNA-MaP methodology will be a powerful addition to select and sequence methods. Additionally, the technique might provide quantitative information on RNA libraries generated by systematic evolution of ligands by exponential enrichment (SELEX). While SELEX methods often begin with large libraries (~10^14) and produce a small number of selected molecules, this RNA array methodology allows characterization of a much larger library subset (~10^5), opening the door to a detailed understanding of the sequence-specific rules driving acquisition of affinity in the selection process. Alternatively, this platform might be coupled to sequenced in vivo RNA immunoprecipitation libraries40,41 and used to directly quantify molecular affinities on in vitro generated RNA, providing measurements of interactions in well-defined conditions. In addition, the sequencing platform is capable of generating DNA clusters >1kb42, enabling transcription of long RNAs and allowing investigations of long non-coding RNAs and catalytic ribozymes. The multicolor imaging capabilities of the sequencer enable measurement of more complex biological interactions such as cooperativity between differentially labeled binding partners or RNA structure inference via fluorescence resonance energy transfer (FRET). In short, we believe future application of RNA-MaP to diverse RNA-protein and RNA-RNA interactions promises to enable quantitative prediction and engineering of binding affinities and functional RNA molecules, allowing affinity tuning for the design of biological parts, as well as the identification and understanding of evolutionary sequence constraints based on underlying biophysical parameters.
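
The pseudo-energy decomposition used in the kinetic analysis above can be illustrated with a minimal sketch. It assumes that "log" denotes the natural logarithm, so that all terms are in units of kBT, and that Kd = koff/kon; the function name and the numerical values are hypothetical and are not taken from the dataset.

```python
import math

def kinetic_decomposition(kd_mut, kd_cons, koff_mut, koff_cons):
    """Split -ddG (in units of kBT) into dissociation and association parts."""
    # -ddG relative to the consensus; with Kd = koff / kon, ddG = ln(Kd_mut / Kd_cons).
    neg_ddg = -math.log(kd_mut / kd_cons)
    # Contribution from the measured dissociation rates.
    dlog_koff = -math.log(koff_mut / koff_cons)
    # Association-rate contribution, inferred rather than measured directly.
    dlog_kon = neg_ddg - dlog_koff
    # Fraction of -ddG explained by dissociation; undefined when -ddG is zero.
    frac_off = dlog_koff / neg_ddg if neg_ddg != 0 else float("nan")
    return neg_ddg, dlog_koff, dlog_kon, frac_off

# Hypothetical variant: 10-fold weaker affinity, 2-fold faster dissociation.
neg_ddg, dlog_koff, dlog_kon, frac_off = kinetic_decomposition(
    kd_mut=28.0, kd_cons=2.8, koff_mut=0.2, koff_cons=0.1)
```

In this made-up case roughly 30% of –∆∆G comes from the dissociation rate, so the remaining ~70% would be attributed to a slower association rate, which is the kind of variant the text describes as association-driven.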

Chapter 2 - Figures and Figure Legends

Figure 1. A massively parallel RNA array for quantitative, high-throughput biochemistry. (a) Steps for generating RNA tethered to DNA clusters on a high-throughput DNA sequencing flow cell. (b) Structure of the MS2 coat protein homodimer bound to the 19 nt hairpin RNA (PDB ID: 2BU1)31. (c) Images of fluorescently labeled MS2 bound to RNA clusters at increasing concentrations of protein and at time points following perfusion of unlabeled MS2 competitor. (d) Fluorescence decay of MS2 dissociating from clusters containing the consensus sequence (-5C) (t1/2=8.39 minutes). (e) Fit binding curves to clusters labeled in panel (c), mean Kd = 2.8 nM, 36.57 nM, and 415 nM for the -5C, -5U, and -5A variants, respectively. (f) The probability distribution of binding energies from all clusters with labeled variants. Below, fitted sum of Gaussians used to assign fluorescence to clusters. (g) Correlation between binding energies reported in the literature and measured on the RNA array (squares, Carey et al.28; circles, Romaniuk et al.30). (Grey bar indicates our affinity measurement cutoff.)

Figure 2. A quantitative map of MS2 binding across RNA sequence variants. (a) Distribution of observed RNA variants by number of mutations. (b) Clusters measured per molecular variant as a function of mutation number. Affinities for the consensus sequence come from NC=909,385 clusters. A median of ~11 clusters are observed for sequences with ≥4 mutations. (c) Average –∆∆G of point mutations per position. The –∆∆G of alanine substitutions36 to the MS2 binding surface are shown in parentheses (kBT). Solid and dashed lines represent base and phosphate interactions, respectively. (d) Matrix of –∆∆G for single and double mutants of the consensus sequence. All energies are calculated relative to the consensus (-5C) sequence (arrow, –∆∆G=0), and the number of quality-filtered double mutants in each matrix is indicated (M2). Inset contains the matrix of –∆∆G for single and double mutants of the +1G variant. (e) Epistasis matrix derived from (d) allows de novo reconstruction of the hairpin structure.

Figure 3. Binding affinity is dependent on primary sequence and secondary RNA structure. (a) Fit parameters for linear regression model showing position-specific contributions. Energetic components for all possible base pair combinations are shown below. (b) Predicted binding energies of variants with second (M2) and third mutations (M3) in both single- and double-stranded regions. Primary (i.e., mean energetic contributions of transitions and transversions) (c) and secondary (d) structure contributions to affinity, derived from (a), were mapped onto the hairpin (PDB ID: 1ZDH)38.

Figure 4. Sequence-specific contributions of association and dissociation rates to binding affinity. (a) Fractional contribution of dissociation rates for 31 single and 289 double mutants with measurable affinities and dissociation rates. M2 = number of quality-filtered double mutants. Positions at the base of the hairpin are highlighted. (b) ∆log(koff) and (c) ∆log(kon) at the base of the hairpin. (d) Distribution of fractional contributions of association (blue, µ=0.57) and dissociation (red, µ=0.43) rates to –∆∆G for all measured mutants (N=3,029).

Supplementary Figure 1. Data Analysis Workflow. (a) Sequencing cluster centers were derived from the fastq files from the sequencing run. X/Y and tile positions were extracted from the fastq header lines. Data were cross-correlated with the observed images to define a global offset. Images were then cleaned to mask any saturated pixels. Images were broken into smaller sub regions (24x24 pixels) and the fluorescence was fitted to a sum of overlapping 2D Gaussians. This process was repeated for all 120 tiles of the GAIIx sequencing lane and across the 26 image series (3,120 images). (b) Binding images were normalized for RNA content using the all RNA image (Alexa647 oligo hybridized to the stall sequence). Data were aggregated across the image series by cluster ID, and the fluorescence values for each cluster across concentrations were fit to a binding curve. The fit binding energies were grouped by hairpin sequence, and median binding energies for each sequence were reported.
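
The per-cluster fitting and per-sequence aggregation in panel (b) can be sketched as below. This is only an illustration under stated assumptions: a single-site binding isotherm f = fmax·c/(c + Kd) is assumed (the legend does not specify the functional form), the fit uses a simple grid search rather than whatever optimizer was actually used, and all function names, concentrations and simulated values are hypothetical.

```python
import math
from statistics import median

def fit_kd(concs, fluor):
    """Grid-search Kd for f = fmax * c / (c + Kd); fmax is solved in closed form."""
    best = None
    for i in range(-2000, 4001):            # log10(Kd) from -2 to 4 in 0.001 steps
        kd = 10 ** (i / 1000)
        occ = [c / (c + kd) for c in concs]                 # fractional occupancy
        fmax = sum(o * f for o, f in zip(occ, fluor)) / sum(o * o for o in occ)
        sse = sum((f - fmax * o) ** 2 for o, f in zip(occ, fluor))
        if best is None or sse < best[0]:
            best = (sse, kd, fmax)
    return best[1], best[2]

def median_energy_by_sequence(clusters, concs):
    """clusters: iterable of (sequence, fluorescence at each concentration)."""
    energies = {}
    for seq, fluor in clusters:
        kd, _ = fit_kd(concs, fluor)
        # ln(Kd) is the binding free energy in kBT, up to an additive constant.
        energies.setdefault(seq, []).append(math.log(kd))
    return {seq: median(vals) for seq, vals in energies.items()}

# Hypothetical, noise-free clusters (concentrations in nM).
concs = [1.0, 3.0, 10.0, 30.0, 100.0, 300.0, 1000.0]

def simulate(kd, fmax):
    return [fmax * c / (c + kd) for c in concs]

clusters = [("consensus", simulate(2.8, 1000.0)),
            ("consensus", simulate(2.8, 800.0)),
            ("variant", simulate(415.0, 900.0))]
energies = median_energy_by_sequence(clusters, concs)
ddg = energies["variant"] - energies["consensus"]   # relative energy in kBT
```

Taking the median over clusters, as the legend describes, makes the per-sequence estimate robust to the occasional badly fit cluster; the additive constant in ln(Kd) cancels when energies are reported relative to the consensus.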

Supplementary Figure 2: Correlating sequencing data and fitting 2D Gaussians to acquired images. We found that a simple cross-correlation was sufficient to map x/y positions from the sequencing data to both the (a) all RNA image and the (b) MS2 binding images (cluster centers shown in green). Shown are unaligned images and cluster centers (left), the cross-correlation value (middle), and the resulting mapped cluster centers (right). The plotted cluster centers were adjusted using the least squares image fit. Images were fit to 2D Gaussians and generated the following distribution for the relevant parameters: (c) the fit amplitude and (d) the fit standard deviation from a representative tile. Integrating these values generated (e) the distribution of the integrated fluorescence.
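
The global-offset step described in this legend can be sketched as a brute-force cross-correlation between a map of cluster centers and an image. This is a minimal illustration, not the pipeline's implementation: integer pixel shifts, the `max_shift` search window, the function name and the synthetic image are all assumptions.

```python
def global_offset(image, centers, max_shift=5):
    """Return the (dx, dy) shift that maximizes summed intensity at cluster centers."""
    height, width = len(image), len(image[0])
    best_score, best_shift = float("-inf"), (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = 0.0
            for x, y in centers:
                xs, ys = x + dx, y + dy
                if 0 <= xs < width and 0 <= ys < height:
                    score += image[ys][xs]      # image is indexed [row][column]
            if score > best_score:
                best_score, best_shift = score, (dx, dy)
    return best_shift

# Synthetic check: spots drawn at the sequencer coordinates shifted by (+2, -1).
centers = [(5, 5), (12, 7), (8, 15), (16, 13)]
image = [[0.0] * 24 for _ in range(24)]
for x, y in centers:
    image[y - 1][x + 2] = 1.0
offset = global_offset(image, centers)
```

Because the cluster map is a set of points, the cross-correlation at a given shift reduces to summing image intensities at the shifted centers, which keeps the search cheap for a small window.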

Kertesz. K. Systematic reconstruction of RNA functional motifs with high- throughput microfluidics. 1393–1406 (2012).. D. L. M. Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Cell 149. Proceedings of the National Academy of Sciences 106. S. O. et al. B.. H. N. C. Understanding the transcriptome through RNA structure. N. D.. 5. D. 8. P. Klass. Sequence-specific interaction of R17 coat protein with its ribonucleic acid binding site. Science 329. T. L. 17. Programmable single-cell mammalian biocomputers. et al. N. Keene... P. 309–319 (1997). Butter. F. Wieland. & Mann. Janga. Nature Reviews Genetics 12. 4. et al. Tsai. In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features. et al..4 Å Resolution. Scherrer. e15499 (2010). Carey. A long noncoding RNA maintains active chromatin to coordinate homeotic gene expression. J. 10626–10631 (2009).. Tsvetanova. & Steitz. e12671 (2010). Moore. 3. Measuring the thermodynamics of RNA secondary structure formation. 9. 7. PLoS ONE 5. J. D.1038/nature11149 12.1038/nature12894 15. 905– 920 (2000). & Weissman. Ray. Nature Reviews Genetics 8. J. Y. Y. De Haseth. & Fussenegger. M. 1192–1194 (2012). Scheibe. Nat Meth 9. Kellis.. Science 289. C. Rapid and systematic analysis of the RNA recognition specificities 21 . Nature (2013). Segal. Washietl. M. M. Biochemistry 22. Ban. Wan. Zubradt. doi:10. Insights into RNA Biology from an Atlas of Mammalian mRNA-Binding Proteins. T. J. 13. Nature 472. Rouskin. D. Proteome-Wide Search Reveals Unexpected RNA-Binding Proteins in Saccharomyces cerevisiae. M. Mörl. S. A. doi:10. et al. M. Y. & Chang. 533–543 (2007). C. & Uhlenbeck. Science 330. RNA regulons: coordination of post-transcriptional events. 641–655 (2011).. M. & Turner. Hansen. Nissen. 2601–2610 (1983).. Spitale. Modular regulatory principles of large non-coding RNAs. S. & Smolke. Castello. J. 
A Screen for RNA-Binding Proteins in Yeast Indicates Dual Functions for Many Enzymes. C.. M. & Rinn.. & Gerber. SantaLucia. Unbiased RNA–protein interaction screen by quantitative proteomics. P.References 1. O. K. doi:10. P. 11.. 2. Culler.. J. Ausländer. S. 689–693 (2010). 10. 18. 339–346 (2012). D.. H. L. R. 1251–1255 (2010). C. Hoff. M... et al. Reprogramming Cellular Behavior with RNA Controllers Responsive to Endogenous Proteins. A. G. M. V. Nature 482. P. J.1038/nature12756 14. & Brown. Ding.. C. G. M.. PLoS ONE 5. Cameron. 6.. Martin. J. Nature (2012). Long Noncoding RNA as Modular Scaffold of Histone Modification Complexes. A. Müller. Salzman. The Complete Atomic Structure of the Large Ribosomal Subunit at 2.. Mittal. 120–124 (2011). Nature (2013). M. Guttman. Wang. 16. Biopolymers 44. E. S. Ausländer.

Efficient targeted resequencing of human germline and cancer genomes by oligonucleotide- selective sequencing. Science 330. Biotechnol. H. M. C.. 24. M. W. Nature 456. Biochemistry 22. E. 16858–16863 (2012). 1291– 1294 (2010). D. O. M. 53–59 (2008). G. Grahn. Uemura. Nat Meth 9. Bell. 22 . Nature 371. Carey. & Uhlenbeck. M. R. et al. 1024–1027 (2011).. Natsoulis. 36.. U. 18. Myllykangas. Wu. 1012–1017 (2010). Buenrostro. 437– 445 (1998).. 21. Rapid Construction of Empirical RNA Fitness Landscapes. 172–177 (2013). Murray. & Uhlenbeck. 19. Foster. & Liljas. A compendium of RNA-binding motifs for decoding gene regulation. P. 28. N. Nat. Proceedings of the National Academy of Sciences 109. D. 35. et al. et al. 32. 33. & Block. J. H. Greenleaf.1038/nature12543 23. Bardwell. 27. Real-time tRNA transit on single translating ribosomes at codon resolution. Nucleic Acids Res. Ray. J. Hidden specificity in an apparently nonspecific RNA- binding protein. RNA binding site of R17 coat protein. Lowary. Frieda. Alanine Scanning of MS2 Coat Protein Reveals Protein–Phosphate Contacts Involved in Thermodynamic Hot Spots. Nature 499. Romaniuk. D. Interaction of R17 coat protein with synthetic variants of its ribonucleic acid binding site. Structural basis of pyrimidine specificity in the MS2 RNA hairpin- coat-protein complex. et al. 4723–4730 (1983). Direct observation of hierarchical folding in single riboswitch aptamers. 34. C. Nature 464. K. M. R. Nat. L. 376–379 (2010). K. et al. M. 25. 20. & Wickens.. Stockley. N. T. Gabriele Varani. 27. of RNA-binding proteins. 37. J. J. 29. et al. 22. E. & Ji. doi:10. Bentley. The G·U wobble base pair: A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. V. Matzas. Araya.-P. Guenther. Nutiu. Crystal structure of an RNA bacteriophage coat protein�operator complex. 26. Science 319. D. H. & Ferre-D'Amare. 613–624 (2006). & Uhlenbeck. A fundamental protein property. P. T. 1563–1568 (1987). P. N. 
Hobson.. L. 29. Journal of Molecular Biology 356. Pitt. 623–626 (1994). 630–633 (2008). High-fidelity gene synthesis by retrieval of sequence-verified DNA identified using high-throughput pyrosequencing. 72–74 (2011). A. C. J. thermodynamic stability. Stonehouse. P. 29. revealed solely from large-scale measurements of protein function.. Counting absolute numbers of molecules using unique molecular identifiers. L. J. Nat.. RNA 7. et al. Kivioja. P.. S. Purification of RNA and RNA-protein complexes by an R17 coat protein affinity method. Stormo. B. Biochemistry 26. Biotechnol. Valegård. S. 6587–6594 (1990). R. Direct measurement of DNA affinity landscapes on a high- throughput sequencing instrument. Localization of ASH1 mRNA particles in living yeast. 667–670 (2009). Woodside. C.. 31. 30. 659–664 (2011). W. 1616–1627 (2001). D. O. 2. Accurate whole human genome sequencing using reversible terminator chemistry. Biotechnol. O. Lowary. J. Nat. S.. 28. J. T. G.. et al. J. Biotechnol. Nature – (2013). A. G. Bertrand.. et al. et al.

Licatalosi. DNA-binding proteins and nucleosome position. Y. 939–953 (2010). P. 464–469 (2008). et al. Genome-wide Identification of Polycomb-Associated RNAs by RIP- seq. K. & Greenleaf. Zaba. 38. D. 42. 23 . et al. Valegârd. HITS-CLIP yields genome-wide insights into brain alternative RNA processing. C. Molecular Cell 40... 264–278 (2008). Nat Meth 10. 40. Single-Molecule Fluorescence Resonance Energy Transfer Assays Reveal Heterogeneous Folding Ensembles in a Simple RNA Stem–Loop. et al. Buenrostro. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin. L. J. Gell. J. G. Journal of Molecular Biology 384. The three-dimensional structures of two complexes between recombinant MS2 capsids and RNA operator fragments reveal sequence-specific protein-RNA interactions. 41. 1213–1218 (2013). J. Chang. Journal of Molecular Biology 270. 18–23 (2000). C. W. Zhao.. D. H. Giresi. et al. D. EMBO Reports 1. Nature 456. 724–738 (1997). 39.

CHAPTER THREE – Measuring accessibility in rare cellular populations2

Introduction

Eukaryotic genomes are hierarchically packaged into chromatin1, and the nature of this packaging plays a central role in gene regulation2,3. Major insights into the epigenetic information encoded within the nucleoprotein structure of chromatin have come from high-throughput, genome-wide methods for separately assaying the chromatin accessibility (“open chromatin”)4,5, nucleosome positioning6-8, and transcription factor occupancy9. While powerful, existing methods require millions of cells as starting material, complex and time-consuming sample preparations, and cannot simultaneously probe the interplay of nucleosome positioning, chromatin accessibility, and transcription factor binding. These limitations are problematic in three major ways: First, current methods can average over and “drown out” heterogeneity in cellular populations. Second, cells must often be grown ex vivo to obtain sufficient biomaterials, perturbing the in vivo context and modulating the epigenetic state in unknown ways. Third, input requirements often prevent application of these assays to well-defined clinical samples, precluding generation of “personal epigenomes” in diagnostic timescales. Here we report a robust and sensitive method for epigenomic profiling that can provide a comprehensive portrait of gene regulatory processes.

2 Portions of this chapter were taken from Buenrostro et al. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods, 2013; 10(12):1213–1218. doi:10.1038/nmeth.2688

ATAC-seq measures chromatin accessibility using Tn5 transposase

Hyperactive Tn5 transposase10,11, loaded in vitro with adapters for high-throughput DNA sequencing, can simultaneously fragment and tag a genome with sequencing adapters (previously described as “tagmentation”11). Because transposons have been shown to integrate into active regulatory elements in vivo12, we hypothesized that transposition by purified Tn5, a prokaryotic transposase, on small numbers of unfixed nuclei would interrogate regions of accessible chromatin. Here we describe Assay for Transposase Accessible Chromatin (ATAC-seq). ATAC-seq uses Tn5 transposase to integrate its adapter payload into regions of accessible chromatin, whereas compaction and sequestration of chromatin make transposition improbable. Therefore, amplifiable DNA fragments suitable for high-throughput sequencing are generated at locations of open chromatin (Fig 1A). The entire assay and library construction can be carried out in a simple two-step process involving Tn5 insertion and PCR. In contrast, published DNase- and FAIRE-seq protocols for assaying chromatin accessibility involve complex multi-step protocols and many loss-prone steps, such as adapter ligation, gel purification and reversal of crosslinks. Specifically, DNase-seq consists of 44 steps and two overnight incubations, while published FAIRE-seq protocols require two overnight incubations carried out over at least 3 days13,14. Furthermore, these protocols require 1-50 million cells (FAIRE) or 50 million cells (DNase-seq), likely due to their complex workflows13,14 (Fig 1B). In comparison to pre-existing methods, ATAC-seq enables rapid and efficient library generation while radically reducing sample input requirements.

Extensive analyses show that ATAC-seq provides an accurate and sensitive measure of chromatin accessibility genome-wide. We carried out ATAC-seq on 50,000 and 500 unfixed nuclei isolated from the GM12878 lymphoblastoid cell line (ENCODE Tier 1; ref. 15) for comparison and validation with chromatin accessibility data sets, including DNase-seq13 and FAIRE-seq16. At a locus previously highlighted by others5 (Fig. 1C), ATAC-seq has a signal-to-noise ratio similar to DNase-seq, which was generated from approximately 3 to 5 orders-of-magnitude more cells13,14. Peak intensities were highly reproducible between technical replicates (R=0.98), and highly correlated between ATAC-seq and DNase-seq (R=0.79 and R=0.83). Highly sensitive open chromatin detection is maintained even when using 5,000 or 500 human nuclei as starting material, although sensitivity is diminished for smaller numbers of input material, as can be seen in Fig 1C.

Insert size yields information regarding nucleosome packing and positioning

Unlike pre-existing assays that measure chromatin accessibility, ATAC-seq paired-end reads produce detailed information about nucleosome packing and positioning. The insert size distribution of sequenced fragments from human chromatin has clear periodicity of approximately 200 base pairs, suggesting many fragments are protected by integer multiples of nucleosomes (Fig 2A). This fragment size distribution also shows clear periodicity equal to the helical pitch of DNA11. By partitioning insert size distribution according to functional classes of chromatin as defined by previous models17, and normalizing to the global insert distribution (Methods) we observe clear class-specific enrichments across this insert size distribution (Fig. 2B), demonstrating that these functional states of chromatin have an accessibility “fingerprint” that can be read out with ATAC-seq. These differential fragmentation patterns are consistent with the putative functional state of these classes, as insulator regions are enriched for short

fragments of DNA, while transcription start sites are differentially depleted for mono-, di-

and tri-nucleosome associated fragments. Transcribed and promoter flanking regions are

enriched for longer multi-nucleosomal fragments, suggesting they are more compacted

than other states that require access to DNA by regulatory factors. Finally, repressed

regions are differentially depleted for short fragments, consistent with their expected

compacted state. These data suggest that ATAC-seq reveals differentially compacted

forms of chromatin, which have been long hypothesized to exist in vivo2,18,19.

To explore nucleosome positioning within accessible chromatin in the GM12878

cell line, we partitioned our data into reads generated from putative nucleosome free

regions of DNA, and reads likely derived from nucleosome associated DNA. Using a

simple heuristic that positively weights nucleosome associated fragments and negatively

weights nucleosome free fragments, we calculated a data track used to call nucleosome

positions within regions of accessible chromatin20. An example locus (Fig. 3A) contains a

putative bidirectional promoter with CAGE data showing two transcription start sites

(TSSs) separated by ~700 bp. ATAC-seq in fact reveals two distinct nucleosome free

regions, separated by a single well-positioned mononucleosome (Fig. 3A). Compared to

MNase-seq21, ATAC-seq data is more amenable to detecting nucleosomes within putative

regulatory regions, as the majority of reads are concentrated within accessible regions of

chromatin (Fig. 3B). By averaging signal across all active TSSs, we note nucleosome free

fragments are enriched at a canonical nucleosome free promoter region overlapping the

TSS, while our nucleosome signal is enriched both upstream and downstream of the

active TSS, and displays characteristic phasing of upstream and downstream

nucleosomes6,7 (Fig. 3C). Because ATAC-seq reads are concentrated at regions of open

chromatin, we see strong nucleosome signal at the +1 nucleosome, which decreases at the

+2, +3 and +4 nucleosomes. In contrast, MNase-seq nucleosome signal increases at larger

distances from the TSS, likely due to over-digestion of more accessible nucleosomes.

Additionally, MNase-seq (4 billion reads) assays all nucleosomes, requiring orders of

magnitude more sequencing than ATAC-seq (198 million paired reads) to reach similar

resolution at regulatory nucleosomes (Fig. 3B,C). Using our nucleosome calls, we further

partitioned putative distal regulatory regions and TSSs into regions that were nucleosome

free and regions that were predicted to be nucleosome bound. We note that TSSs were

enriched for nucleosome free regions when compared to distal elements, which tend to

remain nucleosome rich (Fig. 3D). These data suggest ATAC-seq can provide high-

resolution readout of nucleosome associated and nucleosome free regions in regulatory

elements genome wide.
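
The weighting heuristic described above can be sketched as follows. This is a minimal illustration rather than the method used in the chapter: the fragment-length cutoffs (`nfr_max`, `mono_min`, `mono_max`), the unit weights and the function name are assumptions chosen for the example, and only fragments in those two size classes are scored.

```python
def nucleosome_track(fragments, region_start, region_end,
                     nfr_max=100, mono_min=180, mono_max=247):
    """Per-base score: +1 under nucleosome-sized fragments, -1 under short ones."""
    track = [0] * (region_end - region_start)
    for start, end in fragments:                  # half-open fragment coordinates
        length = end - start
        if length < nfr_max:
            weight = -1                           # putative nucleosome-free fragment
        elif mono_min <= length <= mono_max:
            weight = 1                            # putative mononucleosome fragment
        else:
            continue                              # ambiguous sizes are ignored
        lo = max(start, region_start) - region_start
        hi = min(end, region_end) - region_start
        for i in range(lo, hi):
            track[i] += weight
    return track

# Hypothetical fragments: two nucleosome-sized, two short, one nucleosome-sized.
fragments = [(100, 300), (120, 310), (320, 380), (330, 390), (400, 600)]
track = nucleosome_track(fragments, 0, 700)
```

Positive stretches of the resulting track mark candidate nucleosome positions, while negative stretches mark candidate nucleosome-free regions; a real implementation would typically smooth the track before calling positions.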

ATAC-seq reveals distinct classes of factor-nucleosome spacing

ATAC-seq high-resolution regulatory nucleosome maps can be used to

understand the relationship between nucleosomes and DNA binding factors. Using ChIP-

seq data, we plotted the position of a variety of DNA binding factors with respect to the

dyad of the nearest nucleosome. Unsupervised hierarchical clustering (Figure 3E)

revealed major classes of binding with respect to the proximal nucleosome, including 1) a

strongly nucleosome avoiding group of factors with binding events stereotyped at ~180

bases from the nearest nucleosome dyad (comprising C-FOS, NFYA and IRF3), 2) a

class of factors that “nestle up” precisely to the expected end of nucleosome DNA

contacts, which notably includes chromatin looping factors CTCF and cohesin complex

subunits RAD21 and SMC3; 3) a large class of primarily transcription factors that have

gradations of nucleosome avoiding or nucleosome-overlapping binding behavior, and 4)

a class whose binding sites tend to overlap nucleosome associated DNA. Interestingly,

this final class includes chromatin remodeling factors such as CHD1 and SIN3A as well

as RNA polymerase II, which appears to be enriched at the nucleosome boundary8. The

interplay between precise nucleosome positioning and locations of DNA binding factors

immediately suggests specific hypotheses for mechanistic studies, a potential advantage

of ATAC-seq.
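
The factor-to-dyad positioning described above reduces to finding, for each binding event, the signed distance to the closest nucleosome dyad. A minimal sketch is given below; the function name and coordinates are hypothetical, and the dyad list is assumed to be sorted, non-empty and restricted to one chromosome.

```python
from bisect import bisect_left

def distance_to_nearest_dyad(site, dyads):
    """Signed distance (bp) from a binding site to the closest nucleosome dyad.

    dyads must be a sorted, non-empty list of positions on the same chromosome.
    """
    i = bisect_left(dyads, site)
    candidates = dyads[max(i - 1, 0):i + 1]       # nearest neighbor on either side
    nearest = min(candidates, key=lambda d: abs(site - d))
    return site - nearest

# Hypothetical dyad calls and binding sites.
dyads = [1000, 1400, 1800]
distances = [distance_to_nearest_dyad(s, dyads) for s in (1180, 1390, 900, 2000)]
```

Histograms of these distances, one per factor, are the kind of profile that could then be passed to hierarchical clustering to separate nucleosome-avoiding from nucleosome-overlapping factors.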

Footprints can be used to infer factor occupancy genome-wide

ATAC-seq enables accurate inference of DNA binding factor occupancy genome-

wide. We reasoned that DNA sequences directly occupied by DNA-binding proteins are

protected from transposition; the resulting sequence “footprint” reveals the presence of

the DNA-binding protein at each site, analogous to DNase digestion footprints22. At a

specific CTCF binding site on chromosome 1, we observed a clear footprint (a deep

notch of ATAC-seq signal), similar to footprints seen by DNase-seq23,24, at the precise

location of the CTCF motif that coincides with the summit of the CTCF ChIP-seq signal

in GM12878 cells (Fig 4A). We averaged ATAC-seq signal over all expected locations of

CTCF within the genome and observed a well-stereotyped “footprint” (Fig. 4B). Similar

results were obtained for a variety of common TFs. We inferred the CTCF binding

probability from motif consensus score, evolutionary conservation, and ATAC-seq

footprinting data to generate a posterior probability of CTCF binding at all loci (Fig.

4C)25. Results using ATAC-seq closely recapitulate ChIP-seq binding data in this cell line

and compare favorably to DNase-based factor occupancy inference, suggesting that factor occupancy data can be extracted from these ATAC-seq datasets, allowing reconstruction of regulatory networks.

Using ATAC-seq footprints we generated the occupancy profiles of 89 transcription factors in proband T-cells, enabling systematic reconstruction of regulatory networks. With this personalized regulatory map, we compared the genomic distribution of the same 89 transcription factors between GM12878 and proband CD4+ T-cells. Transcription factors that exhibit large variation in distribution between T-cells and B-cells are enriched for T-cell specific factors (Fig. 4D). This analysis shows NFAT is differentially regulated, while canonical CTCF occupancy is highly correlated within these two cell types (Fig. 4D).

Discussion

Epigenomic studies of chromatin accessibility have yielded tremendous biological insights, but are currently limited in application by their complex workflows and large cell number requirements. ATAC-seq offers potentially unique advantages over pre-existing ChIP-, MNase-, and DNase-seq methods. ATAC-seq is an information rich assay, allowing simultaneous interrogation of factor occupancy, nucleosome positions in regulatory sites, and chromatin compaction genome-wide. These insights are derived from both the position of insertion and the distribution of insert lengths captured during the transposition reaction. While extant methods such as DNase- and MNase-seq can provide some subsets of the information in ATAC-seq, they each require separate assays with large cell numbers, which increases time and cost and limits applicability to many systems. ATAC-seq also provides insert size “fingerprints” of biologically relevant genomic regions, suggesting that it captures information on chromatin compaction.

We expect ATAC-seq to have broad applicability, significantly add to the genomics toolkit, and improve our understanding of gene regulation, particularly when integrated with other powerful rare cell techniques, such as FACS, laser capture microdissection (LCM) and recent advancements in RNA-seq26,27. In summary, we believe that the attractive combination of speed, simplicity and low input requirements of ATAC-seq will enable new gene regulatory insights into biology and medicine.

Chapter 3 - Figures and Figure Legends

Figure 1. ATAC-seq is a sensitive, accurate probe of open chromatin state. A) ATAC-seq reaction schematic. Transposase (green), loaded with sequencing adapters (red and blue), inserts only in regions of open chromatin (nucleosomes in grey) and generates sequencing library fragments that can be PCR amplified. B) Approximate input material and sample preparation time requirements for genome-wide methods of open chromatin analysis. C) A comparison of ATAC-seq to other open chromatin assays at a locus in GM12878 lymphoblastoid cells displaying high concordance. Lower ATAC-seq track was generated from 500 FACS-sorted cells.

Figure 2. ATAC-seq provides genome-wide information on chromatin compaction. A) ATAC-seq fragment sizes generated from GM12878 nuclei (red) indicate chromatin-dependent periodicity with a spatial frequency consistent with nucleosomes, as well as a high frequency periodicity consistent with the pitch of the DNA helix for fragments less than 200 bp. (Inset) log-transformed histogram shows clear periodicity persists to 6 nucleosomes. B) Normalized read enrichments for 7 classes of chromatin state previously defined17.
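
The normalization behind panel B, dividing each chromatin class's fragment-length distribution by the global distribution, can be sketched as below. The bin width, the function name and the toy fragment lengths are assumptions for illustration; the actual procedure is the one described in the Methods.

```python
from collections import Counter

def size_enrichment(class_lengths, all_lengths, bin_width=10):
    """Ratio of a class's fragment-length distribution to the global one, per bin."""
    def binned_fractions(lengths):
        counts = Counter(length // bin_width for length in lengths)
        total = sum(counts.values())
        return {b: n / total for b, n in counts.items()}

    class_frac = binned_fractions(class_lengths)
    global_frac = binned_fractions(all_lengths)
    # Keys are the lower edge of each length bin; only bins seen globally are reported.
    return {b * bin_width: class_frac.get(b, 0.0) / frac
            for b, frac in sorted(global_frac.items())}

# Hypothetical data: a class enriched for short fragments relative to the genome.
all_lengths = [60] * 50 + [200] * 50
class_lengths = [60] * 40 + [200] * 10
enrichment = size_enrichment(class_lengths, all_lengths)
```

Values above 1 indicate fragment lengths over-represented in the class and values below 1 indicate depletion, which is how a class-specific "fingerprint" would show up in this representation.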

Figure 3. ATAC-seq provides genome-wide information on nucleosome positioning in regulatory regions. A) An example locus containing two transcription start sites (TSSs) showing nucleosome free read track, calculated nucleosome track (Methods), as well as DNase, MNase, and H3K27ac, H3K4me3, and H2A.Z tracks for comparison. B) ATAC-seq (198 million paired reads) and MNase-seq (4 billion single-end reads) nucleosome signal shown for all active TSSs (n=64,836). TSSs are sorted by CAGE expression. C) TSSs are enriched for nucleosome free fragments, and show phased nucleosomes similar to those seen by MNase-seq at the -2, -1, +1, +2, +3 and +4 positions. D) Relative fraction of nucleosome associated vs. nucleosome free (NFR) bases in TSS and distal sites (see Methods). E) Hierarchical clustering of DNA binding factor position with respect to the nearest nucleosome dyad within accessible chromatin reveals distinct classes of DNA binding factors. Factors strongly associated with nucleosomes are enriched for chromatin remodelers.

Figure 4: ATAC-seq assays genome-wide factor occupancy. A) CTCF footprints

observed in ATAC-seq and DNase-seq data, at a specific locus on chr1. B) Aggregate

ATAC-seq footprint for CTCF (motif shown) generated over binding sites within the

genome. C) CTCF predicted binding probability inferred from ATAC-seq data, position

weight matrix (PWM) scores for the CTCF motif, and evolutionary conservation

(PhyloP). Right-most column is the CTCF ChIP-seq data (ENCODE) for this GM12878

cell line, demonstrating high concordance with predicted binding probability. D) Cell

type-specific regulatory network from proband T cells compared with GM12878 B-cell

line. Each row or column is the footprint profile of a TF versus that of all other TFs in the

same cell type. Color indicates relative similarity (yellow) or distinctiveness (blue) in T

versus B cells. NFAT is one of the most highly differentially regulated TFs (red box)

whereas canonical CTCF binding is essentially similar in T and B cells.
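
The aggregate footprint in panel B amounts to averaging the transposase insertion signal in a fixed window around every motif site. A minimal sketch follows; it ignores motif strand, represents insertions as a hypothetical position-to-count dictionary, and uses an arbitrary ±50 bp window, none of which is specified in the legend.

```python
def aggregate_footprint(insertions, motif_centers, flank=50):
    """Mean insertion count at each offset in [-flank, flank] around motif sites."""
    profile = [0.0] * (2 * flank + 1)
    for center in motif_centers:
        for offset in range(-flank, flank + 1):
            profile[offset + flank] += insertions.get(center + offset, 0)
    return [total / len(motif_centers) for total in profile]

# Synthetic example: uniform signal of 5 except within 10 bp of each motif center.
motif_centers = [1000, 2000]
insertions = {}
for center in motif_centers:
    for offset in range(-50, 51):
        if abs(offset) > 10:
            insertions[center + offset] = 5
profile = aggregate_footprint(insertions, motif_centers)
```

In this synthetic profile the central dip plays the role of the footprint: positions protected by a bound factor receive few insertions, while the flanks keep the background level.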

References

1. Kornberg, R. D. Chromatin structure: a repeating unit of histones and DNA.
Science 184, 868–871 (1974).
2. Kornberg, R. D. & Lorch, Y. Chromatin Structure and Transcription. Annu. Rev.
Cell. Biol. 8, 563–587 (1992).
3. Mellor, J. The Dynamics of Chromatin Remodeling at Promoters. Molecular Cell
19, 147–157 (2005).
4. Boyle, A. P. et al. High-Resolution Mapping and Characterization of Open
Chromatin across the Genome. Cell 132, 311–322 (2008).
5. Thurman, R. E. et al. The accessible chromatin landscape of the human genome.
Nature 489, 75–82 (2012).
6. Schones, D. E. et al. Dynamic Regulation of Nucleosome Positioning in the
Human Genome. Cell 132, 887–898 (2008).
7. Valouev, A. A. et al. Determinants of nucleosome organization in primary human
cells. Nature 474, 516–520 (2011).
8. Barski, A. et al. High-Resolution Profiling of Histone Methylations in the Human
Genome. Cell 129, 823–837 (2007).
9. Gerstein, M. B. et al. Architecture of the human regulatory network derived from
ENCODE data. Nature 489, 91–100 (2012).
10. Goryshin, I. Y. & Reznikoff, W. S. Tn5 in vitro transposition. J. Biol. Chem. 273,
7367–7374 (1998).
11. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment
libraries by high-density in vitro transposition. Genome Biol 11, R119 (2010).
12. Gangadharan, S., Mularoni, L., Fain-Thornton, J., Wheelan, S. J. & Craig, N. L.
DNA transposon Hermes inserts into DNA in nucleosome-free regions in vivo.
Proceedings of the National Academy of Sciences 107, 21966–21972 (2010).
13. Song, L. & Crawford, G. E. DNase-seq: a high-resolution technique for mapping
active gene regulatory elements across the genome from mammalian cells. Cold
Spring Harb Protoc 2010, (2010).
14. Simon, J. M., Giresi, P. G., Davis, I. J. & Lieb, J. D. Using formaldehyde-assisted
isolation of regulatory elements (FAIRE) to isolate active regulatory DNA. Nature
Protocols 7, 256–267 (2012).
15. Consortium, T. E. P. A User's Guide to the Encyclopedia of DNA Elements
(ENCODE). PLoS Biol 9, e1001046 (2011).
16. Giresi, P. G. & Lieb, J. D. Isolation of active regulatory elements from eukaryotic
chromatin using FAIRE (Formaldehyde Assisted Isolation of Regulatory
Elements). Methods 48, 233–239 (2009).
17. Hoffman, M. M. et al. Integrative annotation of chromatin elements from
ENCODE data. Nucleic Acids Res. 41, 827–841 (2013).
18. Kornberg, R. D. & Lorch, Y. Chromatin and transcription: where do we go from
here. Current Opinion in Genetics & Development 12, 249–251 (2002).
19. Zhou, J., Fan, J. Y., Rangasamy, D. & Tremethick, D. J. The nucleosome surface
regulates chromatin compaction and couples it with transcriptional repression. Nat
Struct Mol Biol 14, 1070–1076 (2007).
20. Chen, K. et al. DANPOS: Dynamic analysis of nucleosome position and
occupancy by sequencing. Genome Research 23, 341–351 (2013).
21. Kundaje, A. et al. Ubiquitous heterogeneity and asymmetry of the chromatin
environment at regulatory elements. Genome Research 22, 1735 (2012).
22. Hesselberth, J. R. et al. Global mapping of protein-DNA interactions in vivo by
digital genomic footprinting. Nat Meth 6, 283–289 (2009).
23. Boyle, A. P. et al. High-resolution genome-wide in vivo footprinting of diverse
transcription factors in human cells. Genome Research 21, 456–464 (2011).
24. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription
factor footprints. Nature 489, 83–90 (2012).
25. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA
sequence and chromatin accessibility data. Genome Research 21, 447–455 (2011).
26. Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Meth
6, 377–382 (2009).
27. Shalek, A. K. et al. Single-cell transcriptomics reveals bimodality in expression
and splicing in immune cells. Nature 498, 236–240 (2013).


CHAPTER FOUR – Single-cell accessibility reveals principles of regulatory variation3

3 Portions of this chapter were taken from Buenrostro et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature, 2015, doi:10.1038/nature14590.

Introduction

Heterogeneity within cellular populations has been evident since the first microscopic observations of individual cells. Recent proliferation of powerful methods for interrogating single cells1-5 has allowed detailed characterization of this molecular variation, and provided deep insight into characteristics underlying developmental plasticity6,7, cancer heterogeneity8, and drug resistance9. In parallel, genome-wide mapping of regulatory elements in large ensembles of cells has unveiled tremendous variation in chromatin structure across cell-types, particularly at distal regulatory regions10. Methods for probing genome-wide DNA accessibility, in particular, have proven extremely effective in identifying regulatory elements across a variety of cell types11 – quantifying changes that lead to both activation and repression of gene expression. Given this broad diversity of activity within regulatory elements when comparing phenotypically distinct cell populations, it is reasonable to hypothesize that heterogeneity at the single cell level extends to accessibility variability within cell types at regulatory elements. However, the lack of methods to probe DNA accessibility within individual cells has prevented quantitative dissection of this hypothesized regulatory variation.

Single-cell ATAC-seq provides a measure of chromatin accessibility genome-wide

We have developed a single-cell Assay for Transposase-Accessible Chromatin (scATAC-seq), improving on the state-of-the-art12 sensitivity by >500-fold. ATAC-seq uses the prokaryotic Tn5 transposase13,14 to tag regulatory regions by inserting sequencing adapters into accessible regions of the genome. In scATAC-seq individual cells are captured and assayed using a programmable microfluidics platform (C1 single-cell Auto Prep System, Fluidigm) with methods optimized for this task (Fig. 1a). After transposition and PCR on the Integrated Fluidics Circuit (IFC), libraries are collected and PCR amplified with cell-identifying barcoded primers. Single-cell libraries are then pooled and sequenced on a high-throughput sequencing instrument.

Using single-cell ATAC-seq we generated DNA accessibility maps from 254 individual GM12878 lymphoblastoid cells. Aggregate profiles of scATAC-seq data closely reproduce ensemble measures of accessibility profiled by DNase-seq and ATAC-seq generated from 10^7 or 10^4 cells respectively (Fig. 1b,c). Data from single cells recapitulate several characteristics of bulk ATAC-seq data, including fragment size periodicity corresponding to integer multiples of nucleosomes, and a strong enrichment of fragments within regions of accessible chromatin (Fig. S1a,b). Microfluidic chambers generating low library diversity or poor measures of accessibility, which correlate with empty chambers or dead cells, were excluded from further analysis (Fig. 1d). Chambers passing filter yielded an average of 7.3x10^4 fragments mapping to the nuclear genome. We further validated the approach by measuring chromatin accessibility from a total of 1,632 IFC chambers representing 3 tier 1 ENCODE cell lines15 (H1 human embryonic stem cells [ESCs], K562 chronic myelogenous leukemia and GM12878 lymphoblastoid cells) as well as from V6.5 mouse ESCs, EML6 (mouse hematopoietic progenitor), TF-1 (human erythroblast), HL-60 (human promyeloblast) and BJ fibroblasts (human foreskin fibroblast).

Cell-cell variability in trans

Because regulatory elements are generally present at two copies in a diploid genome, we observe a near digital (0 or 1) measurement of accessibility at individual elements within individual cells. For example, we estimate that a total of 9.4% of promoters are represented in a typical scATAC-seq library (Fig. S1c-f). The sparse nature of scATAC-seq data makes analysis of cellular variation at individual regulatory elements impractical. We therefore developed an analysis infrastructure to measure regulatory variation using changes of accessibility across sets of genomic features (Fig. 2a,b and Fig. S2a-f). To quantify this variation we first choose a set of open chromatin peaks, identified using the aggregate accessibility track, which share a common characteristic (such as transcription factor binding motif, ChIP-seq peaks, cell cycle replication timing domains, etc.). We then calculate the observed fragments in these regions minus the expected fragments, down sampled from the aggregate profile, within individual cells. To correct for bias, we divide this by the root mean square of fragments expected from a background signal (BS) constructed to estimate technical and sampling error within single-cell data sets. Herein, we refer to this metric as “deviation”. Finally, for any set of features, we aggregate the deviation measurements across cells (Fig 2b) to obtain an overall “variability” score, a metric of excess variance over the background signal.

We first focused our analysis on K562 myeloid leukemia cells, a cell type with extensive epigenomic data sets16,17. To comprehensively characterize variability associated with trans-factors within individual K562 cells, we computed variability across all available ENCODE ChIP-seq, transcription factor motifs and regions that differed in replication timing (as determined from Repli-Seq data sets18) (Fig. 2c,d). We found measures of cell-to-cell variability were highly reproducible across biological replicates (Fig. S2g-i). As expected from proliferating cells, we find increased variability within different replication timing domains, representing variable ATAC-seq signal associated with changes in DNA content across the cell cycle. In addition, we discover a set of trans-factors associated with high variability. These factors include sequence-specific transcription factors (TFs), such as GATA1/2, JUN, and STAT2, and chromatin effectors, such as BRG1 and P300. Immunostaining followed by microscopy or flow cytometry (Fig. 2e and Fig. S3a-d) confirmed heterogeneous expression of GATA1 and GATA2. Principal component (PC) analysis of single-cell deviations across all trans-factors shows seven significant PCs, with PC 5 describing changes in DNA abundance throughout the cell cycle. The remaining PCs show contributions from several TFs, suggesting that variance across sets of trans-factors represents distinct regulatory states in individual cells. This analysis suggests that high-variance trans-factors are variable independent of the cell-cycle (Fig. 2f and Fig. S3e-g).

Trans-factors synergize to induce cell-cell variability

We hypothesized that variation associated with different trans-factors can synergize, either through cooperative or competitive binding, to induce or suppress site-to-site variability in chromatin accessibility. For example, the most variant factors in K562 cells – GATA1 and GATA2 – display expression heterogeneity and also bind an identical consensus sequence “GATA,” suggesting these factors may compete for access to DNA sequences. In support of this hypothesis, we find regulatory elements with both GATA1 and GATA2 ChIP-seq signals show increased variability in accessibility, whereas sites with only GATA1 or GATA2 show substantially less variability (Fig. 2g). We also find peaks unique to GATA1 binding are significantly more accessible than peaks unique to GATA2 (Fig. S3j-k) supporting the hypothesis that GATA1, an activator of accessibility, competes with GATA2 to induce single-cell variability.

Extending this analysis to all TF ChIP-seq data sets revealed a trans-factor synergy landscape for accessibility variation (Fig. 2g and Fig. S3h-i). For example, chromatin accessibility variance associated with GATA2 binding is significantly enhanced when the same region could also be bound by GATA1, TAL1 or P300. In contrast, we find no substantial change in variability of GATA1 binding sites that co-occur with JUN or CEBPB. CTCF, SUZ12, and ZNF143, in contrast, appear to act as general suppressors of accessibility variance, unless associated with proximal binding of ZNF143 or SMC3, the latter a cohesin subunit involved in chromosome looping17,19. Thus, single cell accessibility profiles nominate distinct trans-factors that, in combination, induce or suppress cell-to-cell regulatory variation.

Cell-state and chemical perturbation effects on cell-cell variability

To validate our ability to detect changes in accessibility variance, we used chemical inhibitors to modulate potential sources of cell-cell variability. Inhibition of cyclin-dependent kinases 4 and 6 (CDK4/6), essential components of the cell cycle, caused a marked reduction of variability within peaks associated with DNA replication timing domains (Repli-seq) (Fig. 3a), which was validated with flow cytometry (Fig. S4). The addition of inhibitors of JUN or BCR-ABL kinases (JNKi and Imatinib, respectively) increased G1/S-associated variability suggesting an increase in the subpopulation of G1/S cells. JUN variability was one of the top changes caused by JNKi but not Imatinib, suggesting that high-variance trans-factors can also be specifically and pharmacologically modulated. Tumor necrosis factor (TNF) treatment of GM12878 cells specifically modulated accessibility variability at NF-kB sites (Fig. 3b), consistent with the known stochastic and oscillatory property of nuclear shuttling in this system20. Together, these results show that variability can be experimentally modulated and further demonstrate that variability is not solely dependent on the cell-cycle.

We observe that trans-factors associated with high variability are generally cell type specific. Hierarchical bi-clustering of single-cell deviations generated from three cell lines reveals cell-type specific sets of transcription factor motifs associated with high variability (Fig. 3c). This analysis also shows cells from different biological replicates cluster with their cell type of origin (with a single exception), suggesting scATAC-seq can also be used to deconvolve heterogeneous cellular mixtures. Systematic analysis of all assayed cell types identified high-variance trans-factor motifs that are generally unique to specific cell types (Fig. 3d). For example, regions associated with GATA TFs are most variant in K562s while regions associated with master pluripotency TFs Nanog and Sox2 are most variant in mouse embryonic stem cells (ESCs), consistent with previous observations of expression variation of these factors21,22. Importantly we also find high variability of GATA1 and PU.1 (SPI1) binding accessibility in EML cells, a cell type previously shown to have >200x GATA1 and >15x PU.1 expression differences within clonal cellular subpopulations6. Interestingly, the complete set of identified high-variance trans-factors contains a number of TFs previously reported to dynamically localize into the nucleus, including NF-kB, JUN, and ETS/ERG20,23,24, suggesting that temporal fluctuations in TF concentration may be driving observed chromatin accessibility heterogeneity. Finally, we find BJ fibroblasts and HL-60s exhibit less variance among this set of annotated trans-factor motifs, suggesting differences in the global levels of trans-factor variability across cell lines. Overall these findings suggest that trans-factors promote cell-type specific chromatin accessibility variation genome-wide.

Single-cells vary in cis

Patterns of variation in accessibility along the linear genome in individual cells reveal an unexpected connection to higher order chromosome folding. We calculated single cell deviations within sliding windows across the genome, each encompassing a fixed number of peaks (N=25) (Fig. 4a). We then determined which windows co-varied within individual cells by calculating the co-correlation of each window across all others within the same chromosome within individual cells. We then further enhanced this co-correlation matrix using a secondary correlation analysis using methods similar to those employed in chromosome conformation studies25. The resulting matrix, which identifies pairs of positions in the genome where accessibility co-varies within individual cells, yields Mb-scale correlation domains highly concordant with previously observed chromatin domains26 (Fig. 4b-d) (R=0.61 for chromosome 1). These data provide independent biological validation of large-scale compartmentalization of higher-order chromatin structure25,26, and suggest that ensemble chromosome conformation data may arise in part from the statistical properties of single cell variation in co-regulated accessibility. Finally, these results suggest that higher-order chromatin interactions may drive regulatory variability in cis (elements that are close together tend to be open together), a hypothesis also supported by single-cell FISH measurements of interactions between DNA loci27.

Discussion

Using scATAC-seq we dissected single-cell epigenomic heterogeneity and linked cis- and trans-effectors to variability in accessibility profiles within individual epigenomes. We identify trans-factors associated with increased accessibility variance, which we call high-variance trans-factors. Moreover, co-occurrence with other factors such as P300 appears to amplify variability, perhaps due to synergistic interactions. Conversely, other trans-factors such as CTCF appear to buffer variability, perhaps by providing a stable anchor of chromatin accessibility or insulator function that dampens potential fluctuations. Lineage-specific master regulators are associated with cell-type specific single-cell epigenomic variability across several cell types, suggesting that control of single-cell variance is a fundamental characteristic of different biological states. Additionally, variation of chromatin accessibility in cis is highly correlated with previously reported chromosome compartments, opening the intriguing possibility that this component of epigenomic noise has its roots in higher-order chromatin organization. All together these data provide exciting new hypotheses of regulatory mechanisms that give rise to single-cell heterogeneity.

We envision that future studies will enhance the utility of scATAC-seq by further improving the recovery of DNA fragments, increasing throughput, and refining methods of data analysis. Improvements to throughput and new statistical tools will enable single-cells to be partitioned by cell-state and analyzed in aggregate to find the individual peaks that drive variability (Fig. S5). In addition, we anticipate scATAC-seq may be paired with existing approaches in microscopy and single-cell RNA-seq to provide opportunities for systems analysis of individual cells. Such an approach will link regulatory variation to details of phenotypic variation, promising new insight into the molecular underpinnings of cellular heterogeneity. We believe scATAC-seq will likewise enable the interrogation of the epigenomic landscape of small or rare biological samples allowing for detailed, and potentially de novo, reconstruction of cellular differentiation or disease at the fundamental unit of investigation – the single cell.

Chapter 4 - Figures and Figure Legends

Figure 1. Single-cell ATAC-seq provides an accurate measure of chromatin accessibility genome-wide. (a) Workflow for measuring single epigenomes using scATAC-seq on a microfluidic device (Fluidigm). (b) Aggregate single-cell accessibility profiles closely recapitulate profiles of DNase-seq and ATAC-seq. (c) Genome-wide accessibility patterns observed by scATAC-seq are correlated with DNase-seq data (R = 0.80). (d) Library size versus percentage of fragments in open chromatin peaks (filtered as described in methods) within K562 cells (N=288). Dotted lines (15% and 10,000) represent cutoffs used for downstream analysis.
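The two cutoffs in panel (d) amount to a simple two-threshold filter on each chamber. The following is a minimal Python sketch of such a filter, assuming per-chamber library sizes and fractions of fragments in peaks have already been computed. The function and argument names are hypothetical, and treating the cutoffs as inclusive is an assumption; the filtering actually used is described in the Methods.

```python
import numpy as np

def passing_chambers(library_size, frac_in_peaks,
                     min_library_size=10_000, min_frac_in_peaks=0.15):
    """Return a boolean mask of chambers that pass both cutoffs.

    library_size  : estimated library size per chamber
    frac_in_peaks : fraction (0-1) of fragments in open chromatin peaks
    The default thresholds mirror the dotted lines in Figure 1d; using
    ">=" rather than ">" is an assumption of this sketch.
    """
    library_size = np.asarray(library_size, dtype=float)
    frac_in_peaks = np.asarray(frac_in_peaks, dtype=float)
    return (library_size >= min_library_size) & (frac_in_peaks >= min_frac_in_peaks)
```

Chambers failing either threshold (for example, empty chambers or dead cells) would be dropped before any downstream deviation analysis.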

Figure 2. Trans-factors are associated with single-cell epigenomic variability. (a) Schematic showing two cellular states (TF high and TF low) leading to differential chromatin accessibility. (b) Analysis infrastructure, which uses a calculated background signal (BS, see Supplemental Methods section 3.2) to calculate TF deviations and variability from scATAC-seq data. The TF value is calculated by subtracting the number of expected fragments from the observed fragments per cell (see Supplemental Methods section 3.1). (c) Observed cell-to-cell variability within sets of genomic features associated with ChIP-seq peaks, transcription factor motifs, and replication timing (error estimates shown in grey, see Methods for details). Variability measured from permuted background (see Methods) is shown in grey dots. (d) Distribution of normalized deviations from expected accessibility signal for GATA1 sites in individual cells, histogram of cells shown in grey, density profile shown in purple (see Methods). (e) Immunostaining of GATA1 (green) and GATA2 (red) shows protein expression in K562s. (f) Principal components ranked by fraction of variance explained from observed data (purple) and permuted data (orange). Bar plot of observed data shown in grey. (g) Calculated changes in associated variability of factors when present together versus independently, depicting a context-specific trans-factor variability landscape (see Methods). Venn-diagrams show variability associated with GATA1 and/or GATA2 and CTCF and/or SMC3 (co-) occurring ChIP-seq sites.
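The deviation and variability calculation summarized in panel (b) can be illustrated with a short Python sketch. This is not the pipeline from the Supplemental Methods: it assumes a cells-by-peaks count matrix, takes the expected fragments for a feature set as each cell's total scaled by the aggregate fraction of fragments in that set, uses a caller-supplied list of background peak sets as a stand-in for the background signal (BS), and summarizes variability as the root mean square of the normalized deviations. All function and argument names are hypothetical.

```python
import numpy as np

def raw_deviation(counts, feature_mask):
    """Observed minus expected fragments in a peak set, per cell."""
    counts = np.asarray(counts, dtype=float)        # cells x peaks
    feature_mask = np.asarray(feature_mask, dtype=bool)
    per_cell_total = counts.sum(axis=1)
    aggregate = counts.sum(axis=0)                  # aggregate accessibility profile
    expected_fraction = aggregate[feature_mask].sum() / aggregate.sum()
    observed = counts[:, feature_mask].sum(axis=1)
    expected = per_cell_total * expected_fraction   # down-sampled aggregate
    return observed - expected

def deviation_and_variability(counts, feature_mask, background_masks):
    """Normalized per-cell deviations and one variability score.

    Assumption: the normalizer is a single root mean square taken over
    all cells and all background peak sets. It is zero (and the result
    undefined) if the background sets show no deviation at all.
    """
    raw = raw_deviation(counts, feature_mask)
    background = np.stack([raw_deviation(counts, m) for m in background_masks])
    rms = np.sqrt(np.mean(background ** 2))
    deviation = raw / rms
    variability = np.sqrt(np.mean(deviation ** 2))
    return deviation, variability
```

Under these assumptions a feature set whose accessibility fluctuates no more than the background sets scores close to 1, and larger values indicate excess cell-to-cell variance.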

Figure 3. Cell type specific epigenomic variability. Change of cellular variability due to chemical perturbations using (a) CDK4/6 cell-cycle inhibitor (K562) or (b) TNF-alpha stimulation (GM12878), error bars (shown in grey) represent 1 standard deviation of bootstrapped cells across the two conditions. (c) Heat map of deviations from expected accessibility signal across trans-factors (rows) and of single cells (columns) from 3 cell types. Bottom color map represents assignment classification from hierarchical clustering. (d) Variability associated with trans-factor motifs across 7 cell types. Each row is normalized to the maximum variability for that motif across cell types (shown left).

Figure 4. Structured cis variability across single epigenomes. (a) Per-cell deviations of expected fragments across a region within chromosome 1 (see Methods). For display, only large deviation cells are shown (N=186 cells). (b) Pearson correlation coefficient representing topological domain signal (see Methods) of interaction frequency from a chromatin conformation capture assay (left, data from Kalhor et al.26) or doubly correlated normalized deviations of scATAC-seq (right) from chromosome 1 (see Methods). Data in white represents masked regions due to highly repetitive regions. (c) Permuted cis-correlation map for chromosome 1 (analyzed identically to (b)). (d) Box highlights a representative region depicting long-range covariability.
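The "doubly correlated" map in panel (b) can be sketched as two rounds of Pearson correlation over windowed deviations. The Python below is an illustration only, under several assumptions that are not taken from the Methods: windows hold a fixed number of consecutive peaks (25, as in the text) and advance by a hypothetical `step`, deviations are observed minus expected fragments without background normalization, and masking of repetitive regions is omitted. The function names are hypothetical.

```python
import numpy as np

def window_sums(counts, peaks_per_window=25, step=1):
    """Sum fragments per cell within sliding windows of consecutive peaks."""
    counts = np.asarray(counts, dtype=float)        # cells x peaks (one chromosome)
    n_peaks = counts.shape[1]
    starts = range(0, n_peaks - peaks_per_window + 1, step)
    return np.stack(
        [counts[:, s:s + peaks_per_window].sum(axis=1) for s in starts], axis=1
    )                                               # cells x windows

def doubly_correlated(counts, peaks_per_window=25, step=1):
    """Window-by-window map of co-varying accessibility across cells."""
    counts = np.asarray(counts, dtype=float)
    observed = window_sums(counts, peaks_per_window, step)
    per_cell_total = counts.sum(axis=1, keepdims=True)
    window_fraction = observed.sum(axis=0, keepdims=True) / counts.sum()
    deviation = observed - per_cell_total * window_fraction
    # First pass: correlate windows with each other across cells.
    first = np.corrcoef(deviation, rowvar=False)
    # Second pass: correlate the rows of the first correlation matrix,
    # as is commonly done for chromosome conformation contact maps.
    # Windows with zero variance across cells would produce NaN entries.
    return np.corrcoef(first)
```

The second pass compares each window's whole correlation profile rather than a single pairwise value, which tends to sharpen block-like domain structure in sparse data.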

Supplemental Figure 1. scATAC-seq data recapitulate bulk assays. (a) Histogram of aggregated read starts around all TSSs (in K562 cells) comparing ensemble approaches to scATAC-seq shows high enrichment above background level of reads. (b) DNA fragment size distribution of ATAC-seq fragments from single cells (grey) and the average of all single cells (red) display characteristic nucleosome-associated periodicity. (c) Accessibility across all peaks (n=50,000) in GM12878 cells. (d) Accessibility across all annotated promoters in GM12878 cells. Typical promoters used for subsequent analysis are boxed with dotted lines. Recovery of typical promoters shown in (a) within single-cells within (e) observed data and (f) extrapolated data using measures of predicted library complexity.

Supplemental Figure 2. scATAC-seq data analysis pipeline and validation of bias normalization. Standard deviation of log fold change in reads across cells within peaks binned by deciles of (a) peak intensity, (b) Tn5 bias and (c) GC bias. Variability scores (incorporating bias normalization) within the same peaks shown in (a-c), peaks are binned by deciles of (d) peak intensity, (e) Tn5 bias and (f) GC bias. (g-i) Observed changes in variability comparing the merged set of replicates (K562) to each individual biological replicate. Error bars represent 1 standard deviation of the variability scores after bootstrapping cells from each replicate.

Supplemental Figure 3. Characterization of high-variance trans-factors in K562 cells. (a-d) Distribution of (a) GATA1, (b) GATA2, (c) actin and (d) CTCF fluorescence observed by flow cytometry. Distributions in grey depict isotype controls. (e) Bi-clustered heat map of single cell deviations as observed within K562 cells (N=239). Labels on right identify co-clustering of related factors. (f) Bi-clustered heat map of single-cell deviations observed from permuted data. (g) Projection of factor loadings onto principal component 1 versus 5 from principal component (PC) analysis of heatmap from Fig. 2d. Factor loadings do not vary along PC5, while peaks associated with regions with different replication timings (RepliSeq) have strong variation along this axis. Venn-diagrams showing variability of (h) GATA1 and/or GATA2, (i) CJUN and/or GATA2 and CEBPB and/or GATA2 (co-) occurring ChIP-seq sites. (h) Distribution of accessibility among GATA1 only, GATA2 only, and shared sites. (i) Mean accessibility from GATA1 only, GATA2 only, and shared sites in (k), error bars represent 1 standard deviation generated by bootstrapping ChIP-seq peaks.

Supplemental Figure 4. Drug treatments modulate factor variability. (a-b) Change in variability of untreated K562 cells versus cells treated with (a) Imatinib and (b) JUN inhibitor show increase of variability in factors associated with the cell cycle or s-phase and JUN factors respectively. (c-f) Flow cytometry data depicting DNA content, using DAPI or PI, in (c) control K562 cells or cells showing altered cell-cycle status after treatment with (d) cell-cycle inhibitor, (e) Imatinib and (f) JUN inhibitor.

Supplemental Figure 5. Measurements of individual peaks within single-cells. (a) The distribution of GATA1 deviation scores for single K562 cells. Volcano plots of (b) non-GATA1 peaks and (c) GATA1 peaks in K562 cells, p-values were calculated using a binomial test. (d) The distribution of NF-kB deviation scores for single GM12878 cells. Volcano plots of (e) non-NFKB peaks and (f) NF-kB peaks in GM12878 cells, p-values were calculated using a binomial test. Inset numbers show the number of points in upper left or upper right quadrants of the panel. (g) Accessibility at a genomic locus, showing (top) aggregate NFKB low (blue) and NFKB high (red) profiles, (middle) single GM12878 cells ranked by NFKB deviations scores and (bottom) unranked single-cells.

References

1. Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332, 687–696 (2011).
2. Zong, C., Lu, S., Chapman, A. R. & Xie, X. S. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338, 1622–1626 (2012).
3. Jaitin, D. A. et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779 (2014).
4. Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Meth 11, 817–820 (2014).
5. Raj, A., Rifkin, S. A., Andersen, E. & van Oudenaarden, A. Variability in gene expression underlies incomplete penetrance. Nature 463, 913–918 (2010).
6. Chang, H. H., Hemberg, M., Barahona, M., Ingber, D. E. & Huang, S. Transcriptome-wide noise controls lineage choice in mammalian progenitor cells. Nature 453, 544–547 (2008).
7. Imayoshi, I. et al. Oscillatory control of factors determining multipotency and fate in mouse neural progenitors. Science 342, 1203–1208 (2013).
8. Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).
9. Michor, F. et al. Dynamics of chronic myeloid leukaemia. Nature 435, 1267–1270 (2005).
10. Consortium, T. E. P. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
11. Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
12. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Meth 10, 1213–1218 (2013).
13. Goryshin, I. Y. & Reznikoff, W. S. Tn5 in vitro transposition. J. Biol. Chem. 273, 7367–7374 (1998).
14. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome Biol 11, R119 (2010).
15. Consortium, T. E. P. A User's Guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biol 9, e1001046 (2011).
16. Gerstein, M. B. et al. Architecture of the human regulatory network derived from ENCODE data. Nature 489, 91–100 (2012).
17. Xie, D. et al. Dynamic trans-Acting Factor Colocalization in Human Cells. Cell 155, 713–724 (2013).
18. Hansen, R. S. et al. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proceedings of the National Academy of Sciences 107, 139–144 (2010).
19. Parelho, V. et al. Cohesins Functionally Associate with CTCF on Mammalian Chromosome Arms. Cell 132, 422–433 (2008).
20. Tay, S. et al. Single-cell NF-kB dynamics reveal digital activation and analogue information processing. Nature 466, 267–271 (2010).
21. Singer, Z. S. et al. Dynamic Heterogeneity and DNA Methylation in Embryonic Stem Cells. Molecular Cell 55, 319–331 (2014).
22. Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nat Meth 11, 637–640 (2014).
23. Cai, L., Dalal, C. K. & Elowitz, M. B. Frequency-modulated nuclear localization bursts coordinate gene regulation. Nature 455, 485–490 (2008).
24. Levine, J. H., Lin, Y. & Elowitz, M. B. Functional roles of pulsing in genetic circuits. Science 342, 1193–1200 (2013).
25. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
26. Kalhor, R., Tjong, H., Jayathilaka, N., Alber, F. & Chen, L. Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nat. Biotechnol. 30, 90–98 (2012).
27. Giorgetti, L. et al. Predictive Polymer Modeling Reveals Coupled Fluctuations in Chromosome Conformation and Transcription. Cell 157, 950–963 (2014).

These cells are long-lived and retain the ability to give rise to multiple distinct lineages of blood cells.CHAPTER FIVE – The epigenomic determinants of human hematopoiesis4 Introduction The entire human hematopoietic system is maintained by the activity of a small number of self-renewing hematopoietic stem cells (HSCs).7 hematopoiesis providing a rich resource for characterizing these cellular states. The dynamic expression of key transcription factors (TFs) can dramatically alter the regulatory landscape. However. Previous studies have profiled gene expression patterns in mouse3-5 and human6. and forms the molecular basis of specialized regulatory programs. Genome-wide sequencing methods are exquisite sensors for assessing the molecular determinants governing these distinct regulatory programs. which defines the expression of nearby genes. Despite its functional complexity. more than 200 billion blood cells are produced1. Lineage-specific and single cell chromatin accessibility charts human hematopoiesis and leukemia evolution. highlighting the need for tightly controlled regulatory programs that balance self-renewal of the apex stem cells with downstream production of differentiated effector cells. Submitted. 62 . Genome-wide chromatin-based assays measuring chromatin accessibility or chromatin bound proteins are sensitive 4 Portions of this chapter were taken from Corces R & Buenrostro et al. This enables interrogation of the precise transcriptional dynamics that govern the cell state transitions associated with differentiation and lineage commitment. measuring gene expression alone provides limited information regarding the causative regulators of cell identity. the hematopoietic system is the most extensively characterized adult stem cell hierarchy whereby many diverse cell types can be isolated through the use of multi- parameter fluorescence activated cell sorting (FACS)2. During the course of a single day.

methods for assaying cellular regulation. In addition, chromatin accessibility measures nucleosome-free or nucleosome-depleted sites throughout the genome, which demarcate active regulatory elements and hotspots for TF binding. However, these assays require millions of cells, hindering efforts to comprehensively catalogue hematopoietic regulation and limiting their application to either cell lines8 or whole tissues9 which do not accurately represent individual primary cell types. Recent developments have enabled genome-wide chromatin immunoprecipitation5 or chromatin accessibility10 profiling in rare cellular populations, which has successfully led to the identification of regulatory elements within mouse hematopoiesis5.

We previously described the Assay for Transposase Accessible Chromatin using sequencing (ATAC-seq), a method capable of measuring chromatin accessibility in rare cellular populations10. Here, we report the development of an improved ATAC-seq protocol, optimized for human blood cells, that allows for more rapid high-quality measurements with 10-fold fewer cells. These remarkable improvements in speed and efficiency have afforded the unique opportunity to provide such comprehensive and data-rich measurements in human hematopoietic cells and provide a platform for understanding the molecular underpinnings of human blood development and disease. We apply this optimized protocol to cells isolated from 9 healthy human donors, studying 13 of the major cell types of normal hematopoiesis. Importantly, we measure the transcriptomes of the same healthy donors to derive paired expression data. This atlas of normal human hematopoiesis provides a data rich resource for discovering the molecular determinants of human hematopoiesis and allows for the deconvolution of complex biological data.

Identification of chromatin accessibility landscape in primary blood cells

To better understand the regulatory networks controlling human hematopoietic differentiation and leukemogenesis, we sought to create a reference regulome and transcriptome map of the normal hematopoietic hierarchy (Fig. 1a,b). Although ATAC-seq is highly efficacious for a variety of cell sources, further optimizations were required to profile rare primary human blood cells from cryopreserved specimens. This protocol, termed Fast-ATAC, was optimized for use on primary blood cells and relies on a 1-step membrane permeabilization and transposition using the lysis reagent digitonin. We found that this simplified protocol provides extremely high quality data (Fig. S1a-c), requires just 5,000 cells, offers an approximately 10-fold improvement in sensitivity, and reduces the frequency of mitochondrial reads by ~5 fold (Fig. S1d). However, we note that digitonin is a gentle detergent and may not be ideal for cell lines and other cell types that are more resistant to lysis. Overall, Fast-ATAC provided hematopoietic epigenomes with i) increased speed, ii) fewer cells and iii) lower cost, making it readily adaptable for large-scale studies of rare cellular populations.

Using Fast-ATAC and RNA-seq, we profiled the chromatin accessibility landscape (“regulomes”) and transcriptomes from 13 distinct cellular populations from the human hematopoietic hierarchy via fluorescence activated cell sorting (FACS) (Fig. 1a and Fig. S2). Cells were taken directly from donor bone marrow or peripheral blood without further in vitro manipulation or treating donors with agents such as granulocyte colony-stimulating factor (G-CSF). These analyses excluded mature granulocytes due to high endogenous RNases and proteases as well as mature megakaryocytes, which proved difficult to isolate in adequate cell numbers. The isolated cell populations included 7

unique stem and progenitor and 6 differentiated cell types spanning the myeloid, erythroid, and lymphoid lineages2,11-13. All together, we performed ATAC-seq and RNA-seq on 3-4 adult donors for each cell population totaling 49 transcriptomes and 77 regulomes (Fig. 1c and Fig. S1e). With this dataset we identified a total of 590,650 hematopoietic accessible peaks. We found Fast-ATAC profiles to be highly reproducible between technical (R=0.93, Fig. 1d) and biological (R=0.93, Fig. 1e) replicates. We also observed a significant correlation (R=0.73) between Fast-ATAC and DNase-seq (data from the Epigenomic Roadmap Consortium) of bone-marrow derived CD34+ HSPCs (Fig. 1f). Importantly, we find that hematopoietic stem cells (HSCs), a CD34+ subpopulation, can have significantly different chromatin profiles than the bulk CD34+ HSPC pool (R=0.77, Fig. 1g), highlighting the value of analysis of highly purified stem and progenitor cell subpopulations. Each individual cell type of the hematopoietic hierarchy displayed a set of uniquely expressed genes and uniquely open peaks mapping to genes known to be involved in cellular functions important for the given cell type (Fig. S1f,g). Additionally, the sets of uniquely open peaks were enriched for motifs of transcription factors known to be involved in the biological processes of the cell type of interest (Fig. S1h).

Chromatin accessibility at distal elements delineates the hematopoietic hierarchy

Paired regulome and transcriptome data provide a unique opportunity to understand the regulatory networks of human hematopoiesis. Unsupervised hierarchical clustering of our RNA-seq and ATAC-seq data shows robust classification of cell types

among technical and biological replicates (Fig. 2a,b). Notably, ATAC-seq appears to be more adept at classifying cell types as quantified by the cluster purity14, suggesting that chromatin accessibility is more cell type-specific and better captures cell identity. Intriguingly, epigenomes and transcriptomes also provide different conclusions about the relationship between various cell types. By RNA-seq, the common myeloid progenitor (CMP) clusters with the megakaryocyte erythroid progenitor (MEP) (Fig. 2c), whereas by ATAC-seq the CMP clusters more closely with the HSC and the multipotent progenitor cell (MPP) (Fig. 2d), which is more consistent with its role in the hematopoietic hierarchy.

We reasoned that inaccurate lineage classification by RNA-seq may be resolved by ATAC-seq, with the latter revealing the cell-type specific and combinatorial logic of regulatory elements that control the expression of nearby genes. When regulatory elements were subdivided into gene promoters versus putative distal enhancers (>1000 bp away from the closest TSS), we find that distal enhancers provide significantly improved cell-type classification compared to promoters and transcription profiles (Fig. 2e,f). In this analysis, and in line with previous studies of murine cell subsets5, we find that promoter elements are largely invariant within CD34+ stem and progenitor cells, clearly distinguishing HSPCs, NK cells, and T cells (Fig. 2g), suggesting that chromatin remodeling associated with these linked developmental lineage decisions occurs predominantly within distal regulatory elements. This observation is clearly illustrated by the region surrounding the TET2 gene, a gene expressed within a 2-fold range in all cell types throughout the hematopoietic hierarchy. Despite the invariant expression of TET2 and ubiquitous accessibility of the TET2 promoter, we find highly diverse accessibility profiles within nearby distal regulatory elements.
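The promoter versus distal split described above (distal meaning more than 1000 bp from the closest TSS) can be illustrated with a minimal sketch. The coordinates, peak names, and function names below are hypothetical and are not taken from the dissertation's pipeline; the sketch assumes peak centers and TSS positions on a single chromosome, with a non-empty sorted TSS list.

```python
from bisect import bisect_left


def nearest_tss_distance(peak_center, sorted_tss):
    """Distance in bp from a peak center to the closest TSS (same chromosome).

    Assumes sorted_tss is a non-empty, ascending list of TSS coordinates.
    """
    i = bisect_left(sorted_tss, peak_center)
    candidates = []
    if i < len(sorted_tss):
        candidates.append(abs(sorted_tss[i] - peak_center))
    if i > 0:
        candidates.append(abs(peak_center - sorted_tss[i - 1]))
    return min(candidates)


def classify_peak(peak_center, sorted_tss, cutoff_bp=1000):
    """Label a peak 'distal' if it lies more than cutoff_bp from the closest TSS."""
    distance = nearest_tss_distance(peak_center, sorted_tss)
    return "distal" if distance > cutoff_bp else "promoter"


# Hypothetical TSS coordinates and peak centers, for illustration only.
tss_positions = [10_000, 50_000, 120_000]
peaks = {"peak_a": 10_400, "peak_b": 30_000, "peak_c": 121_000, "peak_d": 121_001}
labels = {name: classify_peak(center, tss_positions) for name, center in peaks.items()}
```

A peak exactly 1000 bp from a TSS is labeled a promoter here because the text specifies a strict ">1000 bp" threshold for distal elements.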

Enhancer cytometry prospectively deconvolves complex cell populations

Given the accuracy with which regulatory landscapes delineate cell types, we hypothesized that Fast-ATAC can be used to deconvolve highly complex cellular populations into their constituent subsets. For example, the Epigenomic Roadmap Consortium has provided multiple datasets on heterogeneous tissues, and in particular, mixtures of CD34+ HSPCs. These data are very useful for understanding the biology of these cells; however, these tissues represent an ensemble average of multiple distinct cell types. While some regulatory elements are ubiquitous among all HSPCs, others show high cell type specificity (Fig. 3a). In fact, regulatory elements that are highly cell-type specific are averaged out and difficult to detect in this bulk CD34+ data (Fig. 3a). For instance, the accessible site near micro-RNA 1915 shows a robust peak exclusively in CMP cells but shows almost no accessibility in the CD34+ DNase-seq data.

The highly cell type-specific nature of our ATAC-seq data enabled the development of a strategy we term “enhancer cytometry”, wherein we enumerate the frequency of cell types in complex cellular mixtures based on chromatin accessibility data. To do this, we employ the deconvolution algorithm CIBERSORT14 to quantify the contribution of each individual cell type to the ensemble profile. Analogous to flow cytometry of cell surface markers, enhancer cytometry with CIBERSORT uses the presence or absence of accessibility at tens of thousands of elements to match pre-defined patterns of cell identity. To do this, we filtered for high-quality distal regulatory elements and removed promoter signal (see methods) and applied CIBERSORT to define an array of cell-type specific regulatory elements (Fig. 3b). CIBERSORT employs support vector

regression (SVR) for deconvolution, a method shown to be robust to noise, unknown mixture content, and multicollinearity14. We validated this approach using a leave-one-out cross validation and found that enhancer cytometry proved to be highly robust for classification of all normal hematopoietic cell types (Fig. 3c,d). One exception is the MPP that showed reasonable but lower accuracy than other cell types. However, we note that when MPP cells are misclassified, they are most frequently misclassified as HSCs, their closest normal cell type. Next, we prospectively tested enhancer cytometry on bulk CD34+ HSPCs and performed flow cytometry in parallel. We found that enhancer cytometry yielded highly accurate enumeration of the constituent cell types when compared to flow cytometry (R^2 = 0.95, Fig. 3e,f). Notably, this cell type deconvolution was not as accurate without restriction to distal regulatory elements (R^2 = 0.91). In addition, we found that enhancer cytometry can also be used to deconvolve CD34+ DNase-seq data (p < 0.001), suggesting that ATAC-seq with enhancer cytometry may be a general strategy for identifying and counting cells within complex cellular mixtures.

Regulatory networks of normal hematopoiesis

To better understand the mechanisms governing these diverse regulatory landscapes, we sought to quantify the effect of specific trans-factors at each developmental transition. To do this we adapted a computational framework we previously developed to measure accessibility across regulatory elements sharing a common feature, i.e. TF motif15. In brief, we classified hematopoietic regulatory elements by their underlying transcription factor motifs and calculated a bias corrected “deviation” score, which represents a differential gain or loss of accessibility across peaks sharing a

given motif for each transition in the hematopoietic hierarchy. Notably, unlike current methods for TF footprinting16, this measure of TF accessibility is highly robust to the number of sequenced reads, DNA sequence bias, and signal-to-noise bias. We therefore chose this approach to measure the effect that a given TF motif enacts on the accessible genome at each stage of hematopoiesis. We note that, for subsequent visualization, we condensed similar motifs to create a non-overlapping list (see methods). We find TF motifs such as “GATA”, “RUNX”, and “SPI1” to be dominant regulators of chromatin accessibility (Fig. 4a and Fig. S3a); these factors have also been previously shown to be governing master regulators of hematopoiesis17-19. We find that activation of these TFs is highly cell-type specific, often displaying step-wise gains across developmental lineages. This is exemplified by the “GATA” and “PAX” motifs which are strongly enriched in erythroid and lymphoid lineages respectively (Fig. 4b,c). To validate this approach for determining global TF motif regulators of cell identity, we compared GATA TF footprints20 between MEPs (GATA high) and common lymphoid progenitors (CLPs) (GATA low) and found that CLPs had no detectable binding at GATA sites when compared to MEPs (Fig. 4d). For further validation, we employed PIQ21, a TF footprinting algorithm, and found drastically fewer GATA footprints in CLPs compared to MEPs (N=173 and N=27,292 respectively), thus confirming our analytical strategy for measuring TF binding.

We reasoned that the accessibility of a given motif should correlate with the expression of the associated transcription factor. However, the underlying motif sequence does not identify the precise causative regulator of accessibility at those motif instances. This is a common issue in epigenomic studies and particularly important for cases in

which many factors share identical or near-identical TF motifs. For example, the GATA motif is shared among 6 TFs (GATA1-6), while the PAX motif is shared among 9 TFs. In an effort to assign motifs to transcription factors, we integrated our ATAC-seq and RNA-seq data to predict causative regulators of motif accessibility. To do this, we employed CIS-BP22, a comprehensive database of in vitro and in silico derived motifs, to create an association table linking hematopoietic TF motifs to 806 genes by motif similarity (Fig. S3b-e). Next, we calculated correlation coefficients for the expression of all known TFs23 to deviation scores across hematopoiesis. Using this approach we find a striking correlation of motif usage with the expression of known master regulators of hematopoiesis (Fig. 4e-g and Fig. S3f). For example, the expression of GATA1 and PAX5 is highly correlated with accessibility at GATA and PAX motifs, respectively (R = 0.88, P = 10^-18 and R = 0.75, P = 10^-230, Fig. 4e). Interestingly, for some motifs, such as the HOX motif, we find many putative regulators with weak correlations (N = 11, Fig. S3g,h), suggesting that regulation of HOX accessibility is more complex. Together, these results highlight the utility of a systems-level analysis of epigenome and transcriptome data.

Accessibility profiles of purified cell populations identify the ontogeny of human diseases

In addition to enhancing our understanding of developmental gene regulation, the hematopoietic regulome can trace the ontogeny of activity in the noncoding genome that impacts human disease. Many genome-wide association studies (GWAS) have linked diseases to polymorphisms, but have not been able to pinpoint the cells responsible for

those phenotypes. By measuring the activity of regulatory elements that overlap regions with predicted sites of functional variation from GWAS, it is now possible to more accurately predict the specific cell types impacted by genetic variants linked to diverse human diseases24-26. To do this we first filtered for GWAS that were significantly enriched in hematopoietic cells (Fig. S4a,b, see methods), then calculated “deviation” scores for each GWAS across the hematopoietic hierarchy as described above. We found that each of these associations can be traced through the hematopoietic lineage to predict the developmental point at which each variant may first exert its effects, thus enriching our understanding of developmental origins of human disease (Fig. 4h-k and Fig. S4c).

As a positive control example, polymorphisms linked to mean corpuscular volume (MCV), a measure of the average volume of an erythrocyte cell, are most strongly enriched in erythroblasts (Fig. 4h). Intriguingly, many regions associated with MCV polymorphisms first become accessible at the CMP stage and increase in accessibility in MEP cells. These non-coding polymorphisms are predicted to affect transcription factor binding and would, therefore, lead to closure of sites that would otherwise be accessible. From this, MCV-associated polymorphisms found in the accessible regions of CMPs and MEPs suggest that these polymorphisms exert their effects prior to full erythroid lineage commitment. As a second example, polymorphisms associated with rheumatoid arthritis (RA) show a strong enrichment in B cells (Fig. 4i). This association is consistent with the known role of autoantibodies and pathogenic B cells in the pathogenesis of RA, as well as the documented success of B cell depletion therapy in the treatment of RA27,28. We find a more complex pattern in the disease alopecia areata, an autoimmune disease in which hair is lost from some or all areas of the body. The autoimmunity

driving this disease has recently been associated with both innate and adaptive immune responses29, a result consistent with the enrichment of polymorphisms for alopecia areata in both CD4+ and CD8+ T cells and monocytes (Fig. 4j). B cells also harbor many active elements associated with alopecia areata but have not been studied in this disease, suggesting a new direction of investigation. Importantly, the disease associations that are highlighted by our data are not limited to diseases canonically associated with hematopoietic cells: polymorphisms linked to Alzheimer’s disease show a strong enrichment in B cells and monocytes, two cell types that have predicted roles in the pathogenesis of the disease24,30,31 (Fig. 4k).

Discussion

Here we report a rich resource charting the epigenomic and transcriptomic landscape of 13 unique blood cell types. This resource relies on the accurate and precise determination of the epigenomic landscapes in primary human blood cells, made possible by Fast-ATAC. The chromatin accessibility profiles of blood cells are highly cell type specific and allow for a much more robust classification system than more frequently used transcriptional profiles. Unsupervised clustering of accessible chromatin regions, specifically distal enhancers, groups individual cell types with extremely high cluster purity, demonstrating that these distal regulatory elements more precisely define cell identity and developmental trajectory. Enhancer cytometry harnesses this specificity and proves to be a useful strategy to navigate regulome data. By matching patterns of distal element accessibility to known profiles of pure cell types, enhancer cytometry enumerates the frequencies of pure cell types in complex cell mixtures. This technique

enabled the accurate deconvolution of data derived from CD34+ bone marrow cells into the constituent highly-similar HSPC cell types. Flow cytometry has become a standard technique, but it is typically limited to a handful of cell surface markers, each requiring a different antibody that may have off-target binding and gating idiosyncrasies. In contrast, enhancer cytometry employs a universal probe system (a transposase) to simultaneously interrogate hundreds of thousands of regulatory elements, empowering an extremely robust classification system. An important limitation of enhancer cytometry is that the method destroys the cell as the measurements are made, and thus does not permit prospective cell purification at present. We note that while we have used well-characterized cell types with known cell surface immunophenotypes to generate pure cell type reference maps, single cell ATAC-seq with enhancer cytometry may be used as an unbiased measure of cell type identity within a population providing archetypal cell profiles within complex cellular populations. In principle, this general approach may be used to resolve cellular heterogeneity in any tissue or organism.

This atlas of human hematopoiesis enriches the interpretation of GWAS results in several ways. First, we identify strong associations of disease-linked polymorphisms with the open chromatin landscapes of specific hematopoietic cell types, notably the developmental contexts in which the disease-relevant elements first become active. In the case of mean corpuscular volume, a measurement of the size of red blood cells, the strongest association occurs in erythroblast cells, but a significant association can be seen as early as the common myeloid progenitor stage (CMP). These results are consistent with the concept that many enhancers are developmentally primed prior to their activation following cell differentiation5. Given our in-depth characterization of known

human HSPC subtypes, we are able to identify the earliest progenitor cells that may be relevant in the pathogenesis of specific diseases and elucidate putative targets for corrective action. It is now well accepted that effective genetic correction of coding mutations needs to take place in the stem cell compartment (e.g. the HSC in blood or basal cells in epithelia) in order to achieve long lasting phenotypic correction in the tissue. The same logic applies to genetic variants in the noncoding genome and suggests the need to map the developmental ontogeny of regulatory elements. Comprehensive and cell type-specific regulome maps will help to nominate hypotheses of relevant cell types in diseases.

Lastly, this resource provides a platform to identify specific trans-acting regulators that drive blood cell identity and function. Integration of ATAC-seq and RNA-seq data improves motif-transcription factor pairing and enables the accurate determination of causative regulators of chromatin accessibility throughout hematopoietic differentiation. We anticipate this combined data set, which represents a dynamic developmental process, to be a rich resource for continued efforts to build computational tools that model both cis32 and trans33 determinants of chromatin accessibility and gene expression.
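The motif-transcription factor pairing discussed in this chapter rests on correlating each candidate TF's expression with the accessibility deviation of its motif across cell types. The sketch below illustrates that idea with a plain Pearson correlation. The TF names (TF_A, TF_B, TF_C) and all numbers are hypothetical placeholders, not data from this work, and the function names are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def rank_candidate_tfs(motif_deviation, tf_expression):
    """Rank TFs that share a motif by how well expression tracks motif deviation."""
    scores = {tf: pearson(expr, motif_deviation) for tf, expr in tf_expression.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical motif deviation across five cell types, and the expression of
# three hypothetical TFs annotated to the same motif.
motif_deviation = [0.0, 0.5, 1.0, 2.0, 3.0]
tf_expression = {
    "TF_A": [1.0, 2.0, 3.0, 5.0, 7.0],   # tracks the motif deviation
    "TF_B": [5.0, 3.0, 4.0, 3.5, 4.5],   # unrelated to the motif deviation
    "TF_C": [7.0, 6.0, 5.0, 3.0, 1.0],   # anti-correlated with the deviation
}
ranked = rank_candidate_tfs(motif_deviation, tf_expression)
```

In this toy example the top-ranked TF is the most plausible positive regulator of the motif, and a strongly negative correlation would nominate a putative negative regulator.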

Chapter 5 - Figures and Figure Legends

Figure 1. Interrogation of chromatin landscapes in primary blood cells. (a) Schematic of the human hematopoietic hierarchy shows the 13 primary cell types analyzed in this work. Granulocytes and megakaryocytes were excluded. (b) Diagram of analyses performed using paired ATAC-seq and RNA-seq data in both primary human blood cells and primary patient AML cells. (c) Normalized ATAC-seq profiles at developmentally important genes. Profiles represent the union of all technical and biological replicates for each cell type. See Supplementary Table 1 for the exact number

of technical and biological replicates for each cell type. (d-g) Scatter plot showing correlation of (d) technical replicates, (e) different human donors, (f) ATAC-seq and DNase-seq data derived from CD34+ HSPCs, and (g) ATAC-seq HSCs with bulk CD34+ HSPCs.

Figure 2. Distal regulatory elements enable accurate classification of the hematopoietic hierarchy. (a,b) Hierarchical clustering of (a) RNA-seq (N=49) and (b) ATAC-seq (N=77) data from all biological replicates of 13 normal hematopoietic cell types. Values shown are Pearson correlation coefficients. (c,d) Phylogenetic dendrograms of (c) RNA-seq and (d) ATAC-seq data showing inter-cell type correlations derived from aggregate averages of all biological and technical replicates. Length of tree branches represents Euclidean distance. Data represents the union of all technical and biological replicates for each cell type. (e,f) Hierarchical clustering of ATAC-seq profiles (N=77) mapping to (e) promoters and (f) distal regulatory elements. Cluster purity quantifies the degree that cells of the same lineage (color coded in the key) are clustered together. (g) ATAC-seq peaks in the TET2 locus show highly variable distal regulatory landscapes (left) and relatively constitutive expression of TET2 (right). Data represents the union of all technical and biological replicates for each cell type.
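Cluster purity, as used in Figure 2, measures how consistently samples of one lineage fall into the same cluster. The sketch below uses the common textbook definition (sum of the majority-label count in each cluster, divided by the number of samples); this is an assumption, since the legend does not state the exact formula used, and the cluster assignments and lineage labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, lineage_labels):
    """Fraction of samples that carry the majority lineage label of their cluster."""
    members = defaultdict(list)
    for cluster, lineage in zip(cluster_ids, lineage_labels):
        members[cluster].append(lineage)
    # For each cluster, count how many samples carry its most common label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(lineage_labels)


# Hypothetical clustering of eight samples into three clusters.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
lineages = ["HSPC", "HSPC", "myeloid", "myeloid", "myeloid", "myeloid",
            "lymphoid", "lymphoid"]
purity = cluster_purity(clusters, lineages)
```

A purity of 1.0 would mean every cluster contains samples from a single lineage; here one myeloid sample sits in the HSPC cluster, which lowers the score.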

Figure 3. Enhancer cytometry allows for deconvolution of the hematopoietic hierarchy. (a) Normalized ATAC-seq profiles of HSPC subsets and ensemble CD34+ HSPC DNase-seq profiles illustrating heterogeneity amongst CD34+ HSPC subpopulations. (b) Schematic of enhancer cytometry, including methods to define a signature matrix of highly cell-type specific enhancers (right panel, N=735). Predicted cell fractions are shown on the left and nearest annotated genes are shown on the bottom. (c,d) Benchmarking of enhancer cytometry using randomly permuted synthetic mixtures to test robustness to (c) sequential subtraction and (d) randomized mixture content. Error bars in (c) represent the standard deviation of 100 random permutations. Test data and training data are non-overlapping. (e) Enhancer cytometry of ATAC-seq data derived from FACS sorted bulk CD34+ HSPCs identifies fractional contribution from all expected cell types. (f) Correlation of predicted fractional contribution of each HSPC cell type by enhancer cytometry versus flow cytometric “ground truth” data of input CD34+ cells.
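The deconvolution step behind enhancer cytometry can be illustrated on a synthetic mixture of the kind used for benchmarking in Figure 3. Note the substitution: CIBERSORT fits the mixture with support vector regression, whereas this sketch swaps in a simpler non-negative least squares fit by projected gradient descent so that it runs with the standard library alone. The signature matrix, the mixture, and the function name are hypothetical and chosen only to show the idea of recovering cell-type fractions from a signature of cell-type specific elements.

```python
def deconvolve(signature, mixture, n_iter=2000):
    """Estimate cell-type fractions f >= 0 minimizing ||signature @ f - mixture||^2.

    signature: rows are regulatory elements, columns are cell types.
    mixture:   accessibility of the same elements in the bulk sample.
    Uses projected gradient descent (a stand-in for CIBERSORT's SVR).
    """
    n_elements = len(signature)
    n_types = len(signature[0])
    # Step size from the squared Frobenius norm, which bounds the largest
    # eigenvalue of signature^T signature, so the descent is stable.
    step = 1.0 / sum(value * value for row in signature for value in row)
    fractions = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(row[k] * fractions[k] for k in range(n_types)) - mixture[i]
            for i, row in enumerate(signature)
        ]
        gradient = [
            sum(signature[i][k] * residual[i] for i in range(n_elements))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        fractions = [max(0.0, f - step * g) for f, g in zip(fractions, gradient)]
    total = sum(fractions)
    return [f / total for f in fractions] if total > 0 else fractions


# Hypothetical signature: six elements, three cell types (columns).
signature = [
    [1.0, 0.0, 0.0],
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0],
]
true_fractions = [0.5, 0.3, 0.2]
# Synthetic bulk profile: a noise-free weighted sum of the pure profiles.
mixture = [sum(s * f for s, f in zip(row, true_fractions)) for row in signature]
estimated = deconvolve(signature, mixture)
```

On this noise-free mixture the estimate recovers the input fractions; real data would add noise and unknown content, which is the situation the SVR formulation cited in the text is designed to tolerate.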

Figure 4. Integrative analysis of the hematopoietic regulome refines transcriptional circuitry driving cell specification and enriches the understanding of human disease. (a) Transcription factor dynamics showing major TFs driving hematopoietic regulomes. The size of the circle represents the effect of that motif in driving accessibility in human blood cells. The relative distance between circles represents the co-occurrence of motifs throughout hematopoietic differentiation (see methods). (b,c) Usage of the (b) GATA and (c) PAX motif throughout hematopoietic differentiation. Values represent the relative deviation of the motif accessibility, a measure of motif usage, compared to that in HSCs. (d) Footprint analysis of the GATA motif in MEP and CLP cells. (e) Correlation (Pearson) of motif accessibility and significance of gene expression for GATA (top) and PAX (bottom). Red dots represent DNA-binding factors annotated to bind the given

motif, gray dots represent all other DNA-binding factors. (f,g) Expression of (f) GATA1 and (g) PAX5 phenocopies the usage of the GATA motif throughout hematopoietic differentiation. (h-k) Relative deviation scores of chromatin accessibility within hematopoietic regulatory elements with GWAS SNPs for (h) mean corpuscular volume, (i) rheumatoid arthritis, (j) alopecia areata, and (k) Alzheimer’s disease.
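The "relative deviation" values plotted in Figure 4 compare accessibility at a set of peaks (those sharing a motif, or those overlapping GWAS SNPs) in each cell type against HSCs. The sketch below is a simplified, uncorrected version of such a score: it assumes the deviation is the observed read count in the peak set minus the count expected from the pooled data, divided by that expectation. The bias correction described in the text is deliberately omitted, and the read counts are hypothetical.

```python
def raw_deviation(set_reads, total_reads, expected_fraction):
    """(observed - expected) / expected reads in a peak set for one sample."""
    expected = expected_fraction * total_reads
    return (set_reads - expected) / expected


# Hypothetical counts per cell type:
# (reads in peaks containing the motif, total reads in all peaks).
counts = {
    "HSC": (1000, 100_000),
    "MEP": (3000, 100_000),
    "CLP": (500, 100_000),
}

# Expected fraction of reads in the peak set, estimated from the pooled samples.
pooled_set_reads = sum(set_reads for set_reads, _ in counts.values())
pooled_total_reads = sum(total for _, total in counts.values())
expected_fraction = pooled_set_reads / pooled_total_reads

deviations = {
    cell_type: raw_deviation(set_reads, total, expected_fraction)
    for cell_type, (set_reads, total) in counts.items()
}
# Express each score relative to HSCs, as in the figure legend.
relative_to_hsc = {
    cell_type: score - deviations["HSC"] for cell_type, score in deviations.items()
}
```

In this toy example the peak set gains accessibility in MEPs and loses it in CLPs relative to HSCs, which mirrors the kind of lineage-specific gain or loss the deviation score is meant to capture.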

Supplementary Figure 1. Data processing pipelines. (a) ATAC-seq insert size distribution for three biological replicates of HSCs. (b,c) Enrichment of signal at annotated transcription start sites (TSS) from Fast-ATAC data compared to (b) DNase-seq and (c) previously published ATAC-seq data using the original ATAC-seq protocol10. (d) Fraction of total mitochondrial reads derived from the original ATAC-seq protocol and the Fast-ATAC protocol. (e) Accessible chromatin landscapes surrounding a constitutively accessible region of the genome. Profiles represent the union of all technical and biological replicates for each cell type. (f,g) GO Term analyses from unique

(f) gene expression and (g) accessible peaks from normal hematopoietic cells. (h) Enrichment of developmentally relevant motifs in accessible peaks.

Supplementary Figure 2. Cell sorting strategies. (a) Representative examples of sorting strategies for the seven CD34+ HSPC populations isolated in this study.

Supplementary Figure 3. Trans regulators of hematopoiesis. (a) Summary of motif deviations across hematopoiesis normalized by maximum and minimum signal. Scale is represented above each column. (b) Clustering of hematopoiesis TF motifs (N=46) with CIS-BP motifs (N=806) using Pearson correlation (see methods). Motifs are listed on the left and genes are listed on the right. (c,d) Example of clustered motifs for (c) GATA4 and (d) MEIS1. (e) Histogram of all correlation values shown in (b) with lists of putative hematopoietic regulators highlighted (N=255). (f) Correlation of motif deviations to gene expression changes in hematopoiesis for two developmentally important TFs, GATA1 and PAX5. Values represent correlation coefficients (Pearson). (g,h) Summary list of putative TF (g) positive and (h) negative regulators of hematopoiesis.

Supplementary Figure 4. GWAS enrichments across hematopoiesis. (a) Representative example of GWAS enrichment across tissues (see methods). (b) Hierarchical clustering of all GWAS (N=235) across diverse tissues. (c) Summary of GWAS deviations across hematopoiesis normalized by maximum and minimum signal. Colors as shown in (b).

16. 635–45 (2007). & Weissman. & Greenleaf. et al. A. G. 14. Chromatin state dynamics during blood formation. 15. J. 1251033–1251033 (2014). Mayhall. & Weissman. He. Nature Immunology 13. Identification of a hierarchy of multipotent hematopoietic progenitors in human cord blood. 57–74 (2012). I. J. Y.. et al. H. A. 3. E. Integrative analysis of 111 reference human epigenomes. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 2. 338–342 (2010). Manz. 18. 1–10 (2014). R. Densely interconnected transcriptional circuits control cell states in human hematopoiesis. & Colvin. 1–10 (2015). 4. Giresi. E. J. & Orkin. et al. 640–653 (2010). 2005). Buenrostro. 12. In Williams Hematology. Nature 518. 11. 6. Ji. 2331–42 (2005). Progenitor Cells. Single-cell chromatin accessibility reveals principles of regulatory variation. H. E. W. Comprehensive methylome map of lineage commitment from haematopoietic progenitors. et al. Science 345. M. I. Consortium. A. I. 1213–1218 (2013). Nat Meth 12. DNA-binding proteins and nucleosome position. Robust enumeration of cell subsets from tissue expression profiles... Nat Meth 10. Zaba. et al. Hematopoietic stem cell: self-renewal versus differentiation. 8. Akashi. & Zon. G. 963– 971 (2012). Novershtern. G. D. C. Hematopoietic Stem Cells. L. Burns. Proceedings of the National Academy of Sciences of the United States of America 99. Shepard. An integrated encyclopedia of DNA elements in the human genome. Transcriptional diversity during lineage commitment of human blood progenitors. 86 . Nature 523. J. and Cytokines. Park. Nature 467. Refined DNase-seq protocol and data analysis reveals intrinsic bias in transcription factor footprint identification. L. 486–490 (2015). 153–174 (McGraw-Hill.. L. Chang. L. P.. H. S. et al. Genes and Development 23. et al. J. Bernstein. Seita. Weiss. D.. Hematopoietic stem cell fate is established by the Notch-Runx pathway. Experimental Hematology 23. Cell 144. et al. 

CHAPTER SIX – The regulatory landscape of acute myeloid leukemia5

Introduction

Dysregulation of the intricate regulatory networks of the hematopoietic system has been shown to play a critical role in the development of hematologic malignancies1. Despite a low overall mutation rate2 and prolonged periods between cell divisions3, the long lifespan of HSCs makes them susceptible to the accumulation of mutations over time. Recent work4-8 has demonstrated that HSCs constitute a cellular reservoir for mutation acquisition that plays a causative role in multiple hematopoietic malignancies. Importantly, in the case of acute myeloid leukemia (AML), HSCs isolated from leukemia patients have been shown to harbor some but not all of the genetic alterations found in the frankly leukemic cells and have, therefore, been termed pre-leukemic HSCs. Notably, many of the genes found to be recurrently mutated during the pre-leukemic phase of AML have been shown to regulate the epigenome5,6 – such as DNA methyltransferase 3A (DNMT3A)9, ten-eleven translocation 2 (TET2)10, and isocitrate dehydrogenase 1 and 2 (IDH1/2)11,12. However, the role of these epigenetic mutations during the evolutionary process of leukemogenesis and their effects on the regulatory networks that govern normal hematopoiesis remain poorly understood. In particular, longstanding debates have centered on how cell fate choices are corrupted in human leukemias13 – whether leukemic cells truly harbor multiple lineage-specific regulatory programs at once (termed “lineage infidelity”) or merely maintain bipotential progenitor states that normally exist in development (termed “lineage promiscuity”) – fundamental issues that may be resolved by modern epigenomic technologies.

5 Portions of this chapter were taken from Corces R & Buenrostro et al. Lineage-specific and single cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Submitted.

Through direct comparison of hematopoietic cells isolated from normal human bone marrow and patient-matched pre-leukemic HSCs, leukemia stem cells, and leukemic blast cells, we chart the genetic and epigenetic progression from normal to malignant in AML, providing the first characterization of the full evolutionary process of leukemogenesis. With hematopoiesis as a reference of normal development, we measure the effects on the leukemogenic process of both early mutations in epigenetic modifiers and late mutations in proliferative oncogenes. We demonstrate that the vast majority of epigenetic and transcriptomic change that occurs during leukemogenesis is derived from normal hematopoietic differentiation. Moreover, diverse genetic mutations can lead to similar epigenetic alterations, suggesting a common path for the leukemogenic process. Our results provide key insights into the evolutionary process of leukemogenesis and identify important transcriptional programs that could be targeted to disrupt this process during its earliest stages. In summary, this work serves as a rich resource for the study of regulatory dynamics in normal and malignant hematopoiesis.

Leukemogenesis and cancer evolution in AML

We sought to characterize the evolution of AML, one of the most aggressive hematopoietic malignancies14, in the context of normal hematopoiesis. To this end, we first identified 3 distinct stages of AML evolution: pre-leukemic HSCs (pHSCs), leukemia stem cells (LSCs), and leukemic blast cells (blasts). Each of these leukemic cell populations can be enriched based on immunophenotype via FACS purification. Unmutated HSCs serve as the reservoir for mutation acquisition during the early phases of leukemogenesis (Fig. 1a). Acquisition of mutations, typically in genes that regulate the

epigenome, creates pHSCs that expand to create a pre-leukemic clone. Subsequent acquisition of progressor mutations, typically in genes that lead to increased proliferation, generates LSCs that are capable of self-renewal and the production of AML blasts (Fig. 1a).

We profiled the mutation frequency of known leukemogenic driver mutations in HSCs, T cells, and blast cells from 39 AML patients. The pre-leukemic mutations found in this large cohort recapitulate previous findings5,6 showing that early mutations tend to occur in genes that modify the epigenome while later mutations occur in genes involved in activated signal transduction. Importantly, the population of HSCs isolated from leukemia patients by FACS represents a heterogeneous mixture of healthy unmutated HSCs and pre-leukemic HSCs. To quantify this heterogeneity, we define the “pre-leukemic burden” as the percentage of HSCs isolated from a leukemia patient that harbor at least the first mutation. Pre-leukemic burden is highly variable in this cohort with some patients exhibiting a complete repopulation of the HSC compartment with pre-leukemic cells and others exhibiting undetectable levels of pre-leukemic mutations (Fig. 1b).

AML represents a cooption of normal myelopoiesis

The AML leukemogenic process provides a novel system to study the genesis and evolution of cancer at the level of the epigenome through the lens of normal hematopoiesis. We performed Fast-ATAC and compared the chromatin accessibility landscapes of patient-matched pHSCs, LSCs, and blasts. The optimized Fast-ATAC protocol produced robust accessibility profiles from cryopreserved primary patient AML cells (Fig. 1c). This allowed us to quantify the heterogeneity exhibited among the

different stages in leukemia evolution. We find that the level of epigenetic variance between all samples of the same cell type increases through progressive stages of leukemia evolution (Fig. 1d, see methods). As expected, all AML cell types exhibit more inter-donor variance than normal hematopoietic cells. This may be the consequence of the epigenetic mutations present in the leukemic cell types or a manifestation of the point along the normal hematopoietic hierarchy at which the particular AML cell types exist. Indeed, key developmentally-associated genes such as GATA2 and CEBPB show variation amongst the AML cell types consistent with different developmental stages (Fig. 1e). When overlaid across the principal components derived from normal hematopoiesis, we find that the first four principal components from normal hematopoietic differentiation account for 60% of the variation observed in our leukemia samples (Fig. 1f). These results indicate that the majority of inter-patient variation in AML is derived from the developmental position along the normal myeloid differentiation trajectory where each leukemia has arrested. Assigning a score to the myeloid differentiation component of our data, we find that the various stages of AML spread across the trajectory from HSC to monocyte, indicating that the process of leukemogenesis largely mirrors the process of normal myelopoiesis (Fig. 1g). Consistent with their functional ability to produce both lymphoid and myeloid cells in xenotransplantation assays6,15,16, pHSCs are most closely related to HSCs and MPPs (Fig. 1g). As shown previously17, LSCs show strong similarity to GMP and LMPP cells and leukemic blast cells show a wider distribution with less differentiated blasts clustering with GMP cells and more differentiated blasts clustering with monocyte cells18,19 (Fig. 1g).
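
The variance-explained comparison described above (Fig. 1f) can be illustrated with a minimal sketch. It assumes that the principal components learned from normal hematopoiesis are already available as orthonormal vectors and that the AML accessibility profiles have been centered beforehand; the function name `cumulative_variance_explained` and the toy numbers are hypothetical and are not taken from the methods of this work.

```python
from typing import List, Sequence


def cumulative_variance_explained(
    samples: Sequence[Sequence[float]],
    components: Sequence[Sequence[float]],
) -> List[float]:
    """Cumulative fraction of variation in `samples` captured by the first N components.

    `samples`: centered profiles, one row per sample, one column per peak (assumed).
    `components`: orthonormal vectors, e.g. PCs learned on normal cells (assumed).
    """
    total = sum(x * x for row in samples for x in row)
    if total == 0:
        raise ValueError("samples carry no variation")
    fractions: List[float] = []
    captured = 0.0
    for pc in components:
        # Add the squared projection of every sample onto this component.
        captured += sum(
            sum(x * w for x, w in zip(row, pc)) ** 2 for row in samples
        )
        fractions.append(captured / total)
    return fractions


# Toy data: two samples measured over three peaks, two hypothetical components.
toy_samples = [[3.0, 4.0, 0.0], [0.0, 0.0, 5.0]]
toy_components = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
fractions = cumulative_variance_explained(toy_samples, toy_components)
```

In this toy case the two components capture half of the total variation; in the analysis above, the analogous quantity for the first four normal components is reported as 60%.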

AML cell types exhibit lineage infidelity with regulatory contributions from multiple normal blood cell types

These intermediate positions across myelopoiesis suggest that each patient-specific AML might harbor a unique collection of multiple distinct normal regulatory programs. Using enhancer cytometry, we quantified the contribution of each normal cell type for each leukemic sample assayed (Fig. 2a). We found that each patient, at each stage of leukemogenesis, harbors multiple distinct regulatory networks contributing to the epigenetic diversity of leukemic cell types. Importantly, we find that the majority of the patient donors have AML blasts that are clonally derived and harbor all the leukemic mutations at comparable allele frequencies. Together, these findings raise the intriguing possibility that AML cell types may either i) exist in stable intermediate cell states that are not normally maintained during normal hematopoiesis, or ii) show developmental heterogeneity within individual clonally derived cells. Traditional ensemble genome-wide approaches for measuring regulatory elements average over cellular states and cannot distinguish between these two hypotheses; however, we recently developed single-cell ATAC-seq (scATAC-seq)20 and reasoned that scATAC-seq with enhancer cytometry would be able to resolve these two pressing hypotheses (Fig. 2b). To discriminate between these two possibilities, we performed scATAC-seq on purified LSCs and blast cells from patient SU070. Although CIBERSORT could accurately deconvolve bulk populations, we found that individual regulatory elements within single cells often contained 0, 1 or 2 fragments, consistent with our previous work20, and that the single-cell data were simply too sparse for existing deconvolution methods such as CIBERSORT. Rather than relying on individual regulatory elements, we reasoned that

principal component analysis (PCA) of the regulome, learned from normal bulk hematopoiesis, could be used to assign chromatin accessibility at all enhancers to developmental lineages and enable enhancer cytometry in single-cells (Fig. 2b). Indeed, we found that with this approach, single cell accessibility profiles could be projected onto hematopoietic principal components with high accuracy (Fig. 2c,d and Fig. S1b,c, see methods). To better visualize and quantify heterogeneity within these cell subsets we flattened these components onto a one-dimensional myelopoietic developmental progression (Fig. 2e). Using these projections, we find primary patient derived LSCs and blast cells are remarkably homogenous and indeed exist at intermediate cell states. This observation is corroborated by enhancer cytometry of a widely used clonal AML cell line HL60, which also shows mixed normal cell contributions using ensemble (Fig. S1a) and single-cell (Fig. 2e) enhancer cytometry. To further test our ability to project single-cells onto hematopoietic components, we performed scATAC-seq on FACS-purified MEP cells. Intriguingly, we find single MEPs show a predominant peak centered at the MEP position with a prominent tail towards CMP along erythropoietic differentiation (Fig. 2f and Fig. S1c). This observation is consistent with post-sort analysis of MEPs suggesting a low level of contribution of CMP or CMP-to-MEP transitional cell-states (Fig. S1d). Importantly, we also find that biological replicates of scATAC-seq from the erythroleukemia cell line (K562) show highly reproducible measures of erythroid differentiation (Fig. 2f). Together, these results corroborate a lineage infidelity model wherein primary human AML cells and AML-derived cell lines can simultaneously access two normally independent regulatory programs within the same cell.
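
One simple way to picture the flattening of projected cells onto a one-dimensional developmental progression (Fig. 2e,f) is sketched below. This is an illustration under stated assumptions, not the procedure from the methods: it assumes each cell already has coordinates in the space of hematopoietic principal components and places it along a straight axis between two cell-type centroids. The function name `progression_position` and the centroid values are hypothetical.

```python
from typing import Sequence


def progression_position(
    cell: Sequence[float],
    start: Sequence[float],
    end: Sequence[float],
) -> float:
    """Position of `cell` along the start-to-end axis, clamped to [0, 1].

    0 corresponds to the `start` centroid (e.g. a stem-like state) and
    1 to the `end` centroid (e.g. a differentiated state).
    """
    axis = [e - s for s, e in zip(start, end)]
    length_sq = sum(a * a for a in axis)
    if length_sq == 0:
        raise ValueError("start and end centroids coincide")
    offset = [c - s for c, s in zip(cell, start)]
    # Scalar projection of the cell onto the axis, as a fraction of its length.
    t = sum(o * a for o, a in zip(offset, axis)) / length_sq
    return min(1.0, max(0.0, t))


# Hypothetical centroids in a two-component space.
hsc_centroid = [0.0, 0.0]
mono_centroid = [4.0, 0.0]
midway = progression_position([2.0, 1.5], hsc_centroid, mono_centroid)
beyond = progression_position([6.0, 0.0], hsc_centroid, mono_centroid)
```

Under this sketch, a tight distribution of positions across single cells would correspond to a homogeneous population, while a value between the two centroids would correspond to an intermediate state.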

Generation of synthetic normal analogs for assessment of AML-specific biology

The ability to accurately quantify the contribution of each normal cell regulome to the epigenetic profile of a leukemic cell type enables a more robust identification of AML-specific regulatory elements. In particular, analyses of leukemic cell types in the past have relied on comparing the malignant cells to a carefully chosen normal cell type. Our data (Fig. 2a) show that this may not be sufficient, and that multiple distinct normal regulatory patterns are contributing to the biology of AML cells. Due to these mixed lineages, we suspect that past epigenomic and transcriptomic cancer studies may be highly biased towards the rediscovery of normal and developmentally dynamic genes rather than bona fide cancer-specific genes. We reasoned that effective removal of this “normal” contribution is possible through the generation of “synthetic normal” analogs which represent admixtures of various normal cells defined by enhancer cytometry (see methods). While comparison of AML cell types to their closest normal cell analogs yields a high correlation (R = 0.86, Fig. 2g), comparison of AML cell types to their synthetic normal analogs yields an even higher correlation (R = 0.91, Fig. 2h) and, more importantly, leads to a reduction in the number of predicted AML-specific peaks (N = 10,954 to N = 8,003). Notably, we found that comparison of AML epigenomes to synthetic normal analogs consistently resulted in higher Pearson correlation values (Fig. S1e) and provided fewer cancer-specific peaks than comparison to the closest normal analog (Fig. 2i and Fig. S1f). By examining co-association of AML-specific peaks, we identified 6 regulatory modules that are utilized by AML cells (Fig. 3a and Fig. S2a). We can track the usage of

these modules through leukemogenesis and identify patterns related to specific AML cell types (Fig. 3b). Additionally, each module shows enrichment for peaks associated with different key transcription factors (Fig. 3c). For example, modules 1 and 2 show strong enrichment for JUN and FOS activity, indicating the activation of AP-1-dependent stress response pathways in these cells. This increase in accessibility of JUN/FOS motifs is echoed by an increase in expression of these factors by RNA-seq (Fig. S2b) and is maintained through the stages of leukemogenesis, identifying inhibition of these pathways as a potential therapeutic strategy in AML. Indeed, JNK inhibition showed a moderate but consistent selective targeting of AML blasts (Fig. S2c-e). This observation is consistent with previous publications that identify JNK as a therapeutic target in AML21 and indicates that similar strategies may prove efficacious in targeting pre-leukemic HSC.

Mechanism and clinical consequences of pre-leukemic HSC clonal advantage

Despite previous work on the acquisitions of mutations during the pre-leukemic phase of AML evolution5, it remains unclear whether pre-leukemic HSC represent a unique functional state or merely serve as long-lived reservoirs for mutation accumulation. Moreover, functional epigenetic consequences of pre-leukemic mutations in primary AML samples have not been characterized. Using ATAC-seq and enhancer cytometry we show that pHSCs share many regulatory programs with HSCs and MPPs (Fig. 6a). Nevertheless, comparison to synthetic normal analogs identifies a distinct regulatory module (module 6) that shows decreased accessibility in pHSCs, representing the earliest known event of AML evolution (Fig. 3b). This repressed regulatory module is

enriched for motifs associated with HSPCs (i.e. HOX and GATA) and provides direct evidence to support a model where pHSCs maintain a unique epigenetic and functional state. In order to better understand the consequences of a loss in accessibility at motifs associated with HSPCs, we probed pHSCs for phenotypic changes related to self-renewal and differentiation. When pushed to differentiate down the myeloid and erythroid lineages (Fig. S2f), pHSCs showed a strong resistance towards differentiation, a hallmark of AML, instead favoring maintenance of the stem cell state (Fig. 3d,e). Given the decreased accessibility of module 6, this suggests that accessibility at certain stem cell-related motifs may confer the ability to properly differentiate rather than properly self-renew. We have previously assessed the effect of depletion of GATA1 and GATA2 on HSPC differentiation and self-renewal (Mazumdar et al., 2015 in press), finding that knockdown of GATA2 led to a decrease in self-renewal of HSPCs while knockdown of GATA1 had no effect. This observation excludes these GATA factors from mediating the defects in differentiation associated with repression of module 6. Given the well-studied role of HOX factors in stem cells22, in particular the role of HOXA9 in HSCs, we hypothesized that HOXA9 might mediate the observed stemness phenotype. In fact, previous studies have shown an increase in the number of HSCs in mice deficient for HOXA923. From this, we reasoned that loss of accessibility at HOXA9 target sites may confer an increase in stemness and prevent proper differentiation. Indeed, we found depletion of HOXA9 by short hairpin RNA (shRNA) knockdown (Fig. S2g) in umbilical cord blood CD34+ HSPCs led to a retention of stemness in the context of both myeloid (Fig. 3f) and erythroid (Fig. 3g) differentiation. Moreover, a concomitant decrease in differentiated

granulocytes and erythroid cells was also observed (Fig. S2h,i). In addition, we note that this retention of stemness is also observed in the absence of a differentiation stimulus (Fig. S2j), consistent with results from mouse models of HOXA9 deficiency23. Together, these results suggest that decreased HOX accessibility in pHSCs may promote retention of stemness and prevent differentiation of these cells.

The retention of stemness in pHSCs caused by loss of accessibility at HOXA9 motifs helps to explain the observation that pHSCs outcompete their normal HSC counterparts in vivo (Fig. S7k). Retention of stemness provides pHSCs with an evolutionary advantage in that resisting differentiation maintains cells in an HSC-like state, which increases the likelihood of acquiring additional leukemogenic mutations. One implication of this model is that pre-leukemic burden may have adverse effects on patient survival, despite the fact that pHSCs do not confer disease in xenograft transplant assays4,6,16,24. Characterization of our patient cohort shows that pre-leukemic burden inversely correlates with overall survival and relapse-free survival (Fig. 3h,i). High pre-leukemic burden is associated with an approximately threefold higher hazard of death or leukemia relapse (hazard ratio = 3.30 for overall survival and 2.99 for relapse free survival, p < 0.05). These results further implicate pHSCs in AML pathology and suggest a mechanism wherein AML arises from the presence of a pre-leukemic clone that is capable of outcompeting its normal HSC counterparts (Fig. S7k) and predisposes patients to more aggressive or refractory leukemia. In sum, detailed analysis of AML-specific regulomes enables the identification of novel features of pHSC biology that have important prognostic implications.

Discussion

The study of acute myeloid leukemia sheds light on the biology and step-wise progression of leukemia evolution. We measured regulomes in patient-matched pre-leukemic HSC, LSC, and blast cells representing three distinct time points in AML evolution. Examination of the average epigenetic variance across the genome shows that variance increases through the stages of leukemia evolution with the majority of this variance being explained by differences observed during normal hematopoietic differentiation. The epigenetic landscapes of AML blast cells isolated from various patients are extremely divergent, highlighting the need for personalized approaches to adequately target each patient’s unique cancer cells.

A longstanding debate in cancer biology is how cancer cells violate cell lineage rules, and we build upon both classical models to address this challenge. Cancer cells with markers or morphologies of one cell type have been shown to also express markers of a different cell type25, which raises diagnostic challenges and treatment conundrums. Two classic but competing models posited (i) lineage infidelity – a single cancer cell simultaneously accesses two normally distinct regulatory programs, or alternatively (ii) lineage promiscuity – a normally bipotential progenitor cell exists, and the cancer cell is simply an expansion of this rare but physiologic bipotential state. By using our comprehensive map of hematopoiesis, patient-matched AML cell subsets, and single-cell ATAC-seq of hundreds of individual leukemic cells, we show direct evidence of lineage infidelity – a single cell accessing a mixed regulatory program. This result has potentially important diagnostic and mechanistic implications. Comparison of cancer to matched normal cells is one of the most basic and commonplace experiments in cancer biology,

but lineage infidelity demonstrates that there may be no appropriate “normal” for comparison in epigenomic and transcriptomic studies. Instead, we use enhancer cytometry to construct “synthetic normals” – proportionally matching the fractional contribution of cell type-specific regulomes – in order to pinpoint cancer-specific aberrations. This approach streamlined the discovery of candidate drivers and led us to discover the loss of HOXA9-mediated accessibility as the most consistent defect in pre-leukemic HSCs. We found that HOXA9 loss can, in fact, cause defects in differentiation as observed in these pre-leukemic HSC and confer an evolutionary advantage. Importantly, higher pre-leukemic burden is predictive of poor overall and relapse-free survival in AML, indicating an important role for pre-leukemic HSC in disease pathogenesis. These results provide potential avenues for therapeutic intervention during the earliest stages of leukemogenesis. Moreover, we anticipate that lineage infidelity is a widespread phenomenon in many types of cancer, and that our integrative approach using enhancer cytometry to construct synthetic normal analogs should be broadly applicable to many disease pathologies.

Chapter 6 - Figures and Figure Legends

Figure 1. Acute myeloid leukemia regulomes reveal a cooption of normal myelopoiesis. (a) Schematic of the leukemogenic process. HSCs serve as a reservoir of mutation acquisition. Early mutations in epigenetic modifiers such as DNMT3A, TET2, and IDH1/2 generate pre-leukemic HSCs. Downstream acquisition of mutations in genes involved in activated signal transduction such as FLT3 and RAS leads to generation of leukemia stem cells which both self-renew and produce leukemic blast cells. (b) Genotype and mutation frequencies of HSCs isolated from AML patients (N=39). Color indicates the percent of cells mutated as estimated from the variant allele frequency. If a mutation is bi-allelic, the representative bar is divided in half. Gray color indicates a mutation known to be present in leukemic cells but not observed during the pre-leukemic phase of AML evolution (i.e. a late mutation event). Asterisks indicate the predicted first mutation. Patients with more than 20% of HSCs harboring a pre-leukemic mutation were classified as “high

burden” and those patients with less than 20% of HSCs harboring a pre-leukemic mutation were classified as “low burden”. (c) Normalized sequencing track of control loci on chromosome 19 from FACS-purified AML cell types. Profiles represent the union of all biological replicates for each cell type – pHSC (N=12), LSC (N=8), Blasts (N=12). (d) Mean variance of chromatin accessibility across the genome as calculated by a moving average across each leukemic cell stage (see methods). (e) Normalized sequencing tracks of developmentally-associated genes GATA2 (left) and CEBPB (right). Profiles represent the union of all biological replicates for each cell type. (f) Cumulative variance of AML ATAC-seq data explained by the first N principal components derived from normal hematopoiesis. (g) Myeloid development score in normal blood cell types (N=4 biological replicates) and AML cell types. The myeloid score is calculated from the first principal component which encompasses the majority of variation observed in myelopoiesis.
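
The burden classification in panel (b) can be sketched as follows. The conversion from variant allele frequency (VAF) to percent of cells mutated is an assumption made here for illustration (a heterozygous mutation is carried by roughly twice the VAF, a bi-allelic one by roughly the VAF itself) and is not quoted from the methods; the function names are hypothetical. The 20% threshold follows the legend, with the boundary case treated as high burden as in the Figure 3 legend.

```python
def percent_cells_mutated(vaf: float, biallelic: bool = False) -> float:
    """Estimate the percent of cells carrying a mutation from its VAF.

    Assumption for this sketch: heterozygous mutations are present in
    about 2 x VAF of cells (capped at 100%), bi-allelic ones in about VAF.
    """
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("VAF must lie between 0 and 1")
    fraction = vaf if biallelic else min(1.0, 2.0 * vaf)
    return 100.0 * fraction


def classify_burden(percent_mutated_hscs: float, threshold: float = 20.0) -> str:
    """Label a patient by the percent of HSCs carrying the first mutation."""
    return "high burden" if percent_mutated_hscs >= threshold else "low burden"


# Hypothetical patients: VAF of the predicted first mutation in sorted HSCs.
patient_a = classify_burden(percent_cells_mutated(0.25))   # about 50% of HSCs
patient_b = classify_burden(percent_cells_mutated(0.025))  # about 5% of HSCs
```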

Figure 2. Enhancer cytometry and single-cell regulomes support a model of lineage infidelity and allow for deconvolution of AML-specific biology. (a) Enhancer cytometry deconvolution showing contribution of various normal cell types to the epigenetic landscape of different AML cell types. (b) Schematic of single-cell ATAC-seq protocol and analysis. (c,d) Projection of ATAC-seq data derived from (c) single SU070 LSCs and (d) single SU070 blast cells onto the principal components derived from the normal hematopoietic hierarchy. (e,f) Relative density of (e) single SU070 LSCs, SU070 blasts, and HL60 and (f) single MEP and K562 cells projected onto a one-dimensional representation of the myeloid and erythroid progression, respectively. Two biological replicates of K562 cells are marked as “K562-1” and “K562-2”. (g) Scatter plot showing the correlation of ATAC-seq data derived from SU353 blast cells with the closest normal

cell type (GMP) (R=0.86). Using a log2(fold change) cutoff of 4 identifies 5,209 peaks depleted and 10,954 peaks enriched in SU353 blast cells. (h) Scatter plot, as shown in (g), showing the correlation between SU353 blast cells with the enhancer cytometry-defined synthetic normal analog (R=0.91). Using a log2(fold change) cutoff of 4 we identify 8,887 peaks enriched in the synthetic normal analog and 8,003 peaks enriched in SU353 blast cells. (i) Comparison of AML cell types to synthetic normal analogs. The percent of the total significant peaks that are removed by comparison to synthetic normal analogs is plotted for each sample. The closest normal is shown in color.
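
The comparison in panels (g) and (h) can be pictured with a small sketch: a synthetic normal is built as a weighted admixture of normal cell-type profiles, and AML-specific peaks are those exceeding a log2(fold change) cutoff against the chosen reference. The function names, the pseudocount, and the toy accessibility values are assumptions introduced for illustration; only the idea of an admixture weighted by enhancer cytometry fractions and the cutoff of 4 come from the text.

```python
import math
from typing import Dict, List, Sequence


def synthetic_normal(
    normal_profiles: Dict[str, Sequence[float]],
    fractions: Dict[str, float],
) -> List[float]:
    """Weighted mixture of normal profiles; weights are renormalized to sum to 1."""
    weight_sum = sum(fractions.values())
    if weight_sum <= 0:
        raise ValueError("fractions must sum to a positive value")
    n_peaks = len(next(iter(normal_profiles.values())))
    mixture = [0.0] * n_peaks
    for cell_type, weight in fractions.items():
        for i, value in enumerate(normal_profiles[cell_type]):
            mixture[i] += (weight / weight_sum) * value
    return mixture


def enriched_peaks(
    aml: Sequence[float],
    reference: Sequence[float],
    cutoff: float = 4.0,
    pseudocount: float = 1.0,  # assumed here to avoid division by zero
) -> List[int]:
    """Indices of peaks with log2(AML / reference) above `cutoff`."""
    hits = []
    for i, (a, r) in enumerate(zip(aml, reference)):
        if math.log2((a + pseudocount) / (r + pseudocount)) > cutoff:
            hits.append(i)
    return hits


# Toy accessibility values over three peaks (hypothetical).
normals = {"GMP": [10.0, 0.0, 2.0], "Mono": [0.0, 200.0, 2.0]}
aml_blast = [10.0, 170.0, 500.0]
mixture = synthetic_normal(normals, {"GMP": 0.5, "Mono": 0.5})
vs_closest = enriched_peaks(aml_blast, normals["GMP"])
vs_synthetic = enriched_peaks(aml_blast, mixture)
```

In the toy data, the second peak looks AML-specific against the closest single normal but is accounted for once the monocyte contribution is mixed in, which mirrors the reduction in predicted AML-specific peaks reported above.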

Figure 3. Early chromatin accessibility alterations within pHSCs promote stemness which predicts adverse patient outcomes. (a) K-means clustering of cancer-specific peaks identifies 6 distinct regulatory modules. (b) Enrichment of each module identified in Figure 7a identifies activated and repressed patterns in leukemogenic progression. Gray bars shown represent 1 S.D. across all samples of that given cell type. (c) Enrichment and hierarchical clustering of motifs in AML-specific regulatory modules. (d,e) Retention of stemness as measured by flow cytometric analysis of CD34 protein expression after 6 days of enforced differentiation down the (d) myeloid lineage and (e) erythroid lineage. Error bars represent 1 S.D. Experiments done in triplicate. (f,g) Fold change in the percent of cells expressing CD34 as measured by flow cytometric analysis of human umbilical cord blood-derived HSCs transduced with shRNAs targeting HOXA9

or a non-targeting control. Percent CD34+ cells measured after 6 days of enforced differentiation down the (f) myeloid lineage and (g) erythroid lineage. Only GFP+ transduced cells analyzed. Error bars represent 1 S.D. Experiments done in triplicate. **p<0.01, ***p<0.001, ****p<0.0001 derived from two-tailed t-test. (h) Overall and (i) relapse-free survival of patients stratified by pre-leukemic burden as described in Figure 5b (High pre-leukemic burden, N=24, Low pre-leukemic burden N=15). High pre-leukemic burden defined as greater than or equal to 20% of HSCs harboring at least the first pre-leukemic mutation. All patients were included for the analysis regardless of their treatment. Survival analysis was performed using the Kaplan-Meier estimate method. P values comparing two Kaplan-Meier survival curves were calculated using the log-rank (Mantel-Cox) test. Hazard ratios were determined using the Mantel-Haenszel approach.
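
For readers unfamiliar with the Kaplan-Meier estimate used in panels (h) and (i), a minimal sketch of the estimator is given below. It is a textbook product-limit calculation written for illustration, not the software used for the analysis; the function name and the follow-up times are hypothetical, and the log-rank test and hazard ratio are not implemented here.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(
    times: Sequence[float], events: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    `events[i]` is True when the event (e.g. death or relapse) was observed
    at `times[i]`, and False when follow-up was censored at that time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    at_risk = len(times)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    for t in sorted(set(times)):
        observed = sum(1 for ti, ev in zip(times, events) if ti == t and ev)
        leaving = sum(1 for ti in times if ti == t)
        if observed:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        # Everyone with follow-up ending at t leaves the risk set afterwards.
        at_risk -= leaving
    return curve


# Hypothetical follow-up (months) for five patients; False marks censoring.
toy_times = [2, 3, 3, 5, 8]
toy_events = [True, True, False, True, False]
toy_curve = kaplan_meier(toy_times, toy_events)
```

Applying the same function separately to the high-burden and low-burden groups would give the two curves that a log-rank test then compares.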

Supplementary Figure 1. Validation of enhancer cytometry in AML cell lines and primary cells by single-cell ATAC-seq. (a) Enhancer cytometry of ATAC-seq data derived from various blood cell lines demonstrates mixed regulatory contribution from various normal hematopoietic cell types. (b) Projection of down sampled bulk hematopoiesis data onto myeloid (left) and erythroid (right) progression. (c) Projection of single MEPs onto hematopoiesis principal components 2 and 3. (d) Post-sort analysis of MEPs used in scATAC-seq analyses presented in Figure 6f and Supplementary Figure 6c gated for CMP (2.54%), MEP (97.5%) and GMP (0%). (e) Pearson correlations of AML cell types with the closest normal analog (color) and the enhancer cytometry-derived

synthetic normal (gray). (f) Total significant peaks observed after comparison of AML cell types to synthetic normal analogs. Significance measured as log2(fold change) > 3.

Supplementary Figure 2. Validation of regulatory network analysis in AML cell types. (a) Principal component analysis of the log2(fold change) values of each AML cell type compared to its synthetic normal. (b) Expression of JUN in various normal hematopoietic cells, pHSCs, and blasts. (c-e) The effect of JNK/ERK inhibition by (c) JNK-IN-8, (d) SP600125, and (e) SCH772984 was determined by IC50 of sorted primary AML blast cells in comparison to CD34+ HSPCs derived from umbilical cord blood. Viability determined by flow cytometric assessment of Annexin V and DAPI. *p<0.05, two-tailed t-test. (f) Strategy for in vitro differentiation of HSPCs down the myeloid and erythroid lineages. HSPCs are grown in defined culture media for 6 days

and then analyzed for cell surface markers of stemness or differentiation. Immature cells at day 6 express CD34 and have not yet upregulated CD33. (g) Quantitative reverse-transcriptase PCR validation of HOXA9 knockdown via shRNA. Knockdown performed in THP1 cells for 72 hours and validated with two separate primer sets. ***p<0.001, ****p<0.0001 by two-tailed t-test. (h,i) Fold change in the percent of (h) CD15+ granulocytes or (i) CD71+GPA+ erythroblasts between cord blood-derived CD34+ HSPCs infected with lentiviruses with non-targeting or HOXA9 shRNAs after 6 days of differentiation down the (h) myeloid or (i) erythroid lineage. (j) Fold change in the percent of CD34+ HSPCs after 6 days of culture in stemness retention media (see methods) between cord blood-derived CD34+ HSPCs infected with lentiviruses with non-targeting or HOXA9 shRNAs. (k) Burden of mutations in DNMT3A, TET2, IDH1/2, or other genes when detected in pre-leukemic HSC. *p < 0.05, **p < 0.01 by two-tailed t-test


CHAPTER SEVEN – Conclusion

Methods for gene regulation

The thesis work presented leverages high-throughput methodologies in an effort to provide a quantitative understanding of cellular regulation. Using an in vitro approach, we make >10^7 quantitative measurements of the biophysical parameters defining an RNA-protein interaction across sequence mutants. This platform may be extended to profile a diversity of RNA-protein interactions and may form the basis of methods that quantify protein-protein interactions or chromatin-TF interactions. Such efforts promise to provide a principled biochemical understanding of the sequence and structure determinants of trans-factor binding.

Measuring these regulatory processes in vivo provides unique insight into the characteristics and potential of cellular behavior. Preceding this work, methods for measuring chromatin structure genome-wide often required tens of millions of cells and included complex experimental workflows. We have developed ATAC-seq and scATAC-seq for profiling chromatin accessibility within rare cellular populations and/or from single cells. Together, these methods enable genome-wide chromatin accessibility measurements of carefully isolated or de novo defined cellular populations, and the inference of the trans-acting regulatory proteins that define them. In addition, these methods can measure chromatin accessibility in in vivo derived human tissues, as demonstrated by our efforts to understand human hematopoiesis and leukemogenesis.

Future work

Regulatory rules describing promoter-enhancer interactions and their effect on the expression of nearby genes would greatly enhance our ability to causally link the epigenome to gene expression and, subsequently, disease mutations to phenotypes. Such a lofty endeavor will require new experimental and computational methodologies. Specifically, combining genome-wide assays within single cells provides a unique opportunity to develop regulatory models, wherein natural variation within single cells can be used to infer causal changes of expression at nearby genes. Integrating ATAC-seq, RNA-seq and protein measurements in the same single cell at high throughput may serve to quantify trans-acting regulators, their binding to cis-regulatory elements and the effect on expression of nearby genes. To this effort, computational models that integrate these data sets and infer causality, for example for the expression of a gene or the cellular response to a stimulus, are required. Furthermore, TF-TF interactions, TF-remodeler or other TF-protein interactions are critical for understanding TF binding landscapes and gene expression in vivo. Further development of in vitro methods or high-throughput in vivo reporter assays may be used to further elucidate these mechanisms. Together, these approaches provide deep insight into individual regulatory patterns; however, only a combined or systems approach to these data would yield a complete understanding of cellular regulation within single cells.

In summary, a multidisciplinary and collaborative approach promises to enrich our understanding of cellular regulation and form the basis of our understanding of human disease.