a dissertation presented
by
Kushagra Sharma
to
The Department of Computer Science
Harvard University
Cambridge, Massachusetts
December 2021
©2021 – Kushagra Sharma
all rights reserved.
Thesis advisor: Professor Sahand Hormoz
Kushagra Sharma
Abstract
Information about biological phenotype can be gleaned from a variety of sources. We’ve made
rapid progress in the last decade toward ever more accurate measurements of one central source:
gene expression levels inside cells. We’re now able to rapidly and cheaply sequence the content
and abundance of RNA transcripts down to single-cell resolution. However, in the process we lose
information regarding the spatial context of the cell: where in the tissue it originated from. Tech-
niques have been developed in the last few years to remedy this problem, by incorporating spatial
information into gene expression measurements. However, these techniques tend to be restricted to
the lab of origin due to their high degree of technical complexity. We aim to alleviate this problem
by exploiting low-dimensional structure in gene expression profiles: we use widely accessible,
low-dimensional experimental measurements to impute the full, high-dimensional spatial transcriptome.
Contents
0 Introduction 1
0.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
0.3 Our contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1 Related Work 6
1.1 Experimental Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Computational Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Methods 12
2.1 Experimental Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Computational Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Results 36
4.1 High Level Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Diving Deeper into Distribution Reconstructions . . . . . . . . . . . . . . . . . 40
5 Discussion 51
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Further Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
References 56
Listing of figures
4.17 A Sample Reconstruction from the Best 10% of Reconstructions . . . . . . . . . 47
4.18 A Sample Reconstruction from the Worst 10% of Reconstructions . . . . . . . . 47
4.19 A Random Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.20 Erythroid Marker Gene (CA1) Reconstruction . . . . . . . . . . . . . . . . . . 48
4.21 Megakaryocyte Marker Gene (ITGA2B) Reconstruction . . . . . . . . . . . . . 49
4.22 HSC/MPP Marker Gene (CRHBP) Reconstruction . . . . . . . . . . . . . . . 49
4.23 UMAP Displaying Reconstruction Differences . . . . . . . . . . . . . . . . . . 50
Acknowledgments
There are so many people in my life to whom credit for this accomplishment is due. Thinking
back and recalling everyone who in some way contributed to my life and success was a pleasure.
Starting from the beginning, my heartfelt thanks to my mom for being one of the most independently-
minded people I’ve ever met. You’re never satisfied by thinking with the crowd, and I’ve come to re-
alize how unique this is. You’ve taught me the same, and have always encouraged me to pursue what
I find interesting, independent of what the world may think. To Papa - thanks for buying me my
first computer, pushing me towards learning programming (even if it may not have worked at first),
and the support throughout my life. You’ve never in my life failed to be there when I’ve needed it
the most. Thanks for being a great father. Thanks to my sister for being a lifelong companion, and
having one of the warmest hearts of anyone I’ve ever met. Your intelligence and maturity continues
to surprise me and warm my heart.
Thanks to all my friends who’ve ever been by my side, exploring ideas, life, and this world to-
gether. To list but a few, Dominic Tanzillo, Kubilay Agi, Keaton Gibbs, Tanay Tandon, Evan Hart,
Ari Hatzimemos, Casey Carter, Soumil Singh, Oscar Avatare, Kendall Zhu, Isabel Haro, Snigdha
Roy, and Alex K. Chen. I’m grateful to have had the chance to share my life with you all.
Thanks to Tim Jaconette and Pai-Ling Yin for bringing a 15-year-old into their research group
and giving me my first exposure to the academic world. Enormous gratitude to Carlos Garay, who
was my first real mentor, and without whom I’d be a significantly worse software engineer. This
thesis and its 10,000 lines of code would not have been possible without your early guidance. Re-
reading through some of our exchanges recently reminded me how brilliant a mentor you were for
me.
A heartfelt thanks to Terri Bittner, who is the most courageous educator I’ve ever had the privi-
lege to learn from. Without her selfless attempts to open the doors of higher mathematics to high
school students, this thesis and my entire career path wouldn’t have been possible. She resisted our
educational system’s drive to mediocrity and refused to accept ’good-enough.’
Thanks to Kenneth Blum for taking me seriously as a scientist, and making me realize that there
was a ’there’ there when it came to scientific exploration.
Thanks to Laura Deming for inspiring me with an example of personal integrity, agency, and
authenticity. You were critical for my development at a key fork in the road, and I’m deeply grateful.
Thanks to Lada Nuzhna for the deep support, encouragement, and nourishing conversations
that are ever-reinvigorating. We share ambitions and value-systems to such an enormous degree, and
I’m excited to keep building with you. I also can’t forget your very instrumental contributions to
this thesis - your machine learning expertise was key to the success of this research.
A deep thanks to Fei Chen for supporting my first explorations in the field of biology. You took
an unproven student in and helped me build my abilities, interests, and confidence in research.
Thanks to Sahand Hormoz for guiding this research project, and David Jacobowitz for providing
just the right degree of involvement - helping me succeed when I could do it by myself, and support-
ing me when I couldn’t.
A final thanks to all who supported me over the last year of transitioning into biology with
conversations, advice, and guidance, and who endowed me with the confidence that I could do this (Tony Kulesa,
Sam Rodriques, Patrick Hsu, Adam Marblestone, Ankit Ranjan, Martin Borch Jensen, Ed Boyden,
Daniel Goodwin, Rob Phillips, Tom Knight). Not all of you know it, but our conversations meant
an enormous deal to me and played a part in putting me where I am now.
Onwards!
Introduction
0
0.1 Background
Science progresses through mutually reinforcing feedback loops between higher quality measure-
ments, higher quality predictions, and higher quality deductions. As a field, biology has mostly
aimed to make better measurements, particularly since the era of molecular biology began. A po-
tential reason for this is that it has been clear for a long time that there are better measurements to
be made, and what those measurements are. It’s been similarly clear in what ways those measure-
ments would be useful if we were able to make them. We can trace back the origins of this focus on
measurement to the transforming factor, a substance that was able to transform a nonvirulent strain of Streptococcus pneumo-
niae, the bacterial cause of pneumonia, into a virulent strain of the same bacteria, thereby carrying
otherwise hereditary information from one strain to another. Once it became clear that there was a
hereditary material, we were off to the races. We began to identify what that material was, to deter-
mine the structure of that material, and to determine how that material came to create phenotype.
In the process of doing so, we uncovered the central dogma of molecular biology: that informa-
tion is transformed from DNA to RNA to protein, with the latter two (RNA and protein) creating
most of the visible phenotypes of the cell. This dogma was established in 1958, 6 and we’re still on
the quest to measure each of these components in their full context. Genome (DNA) sequencing
was the first to mature, although epigenetic sequencing is still largely beyond our grasp.
The product of DNA is RNA, also known as a transcript. Each RNA molecule encodes the
information necessary to produce a single protein (in eukaryotes).* RNA is encoded in (roughly)
the same four-character language as DNA, and each RNA corresponds to a specific segment of
DNA known as the coding region, from which the RNA is synthesized (transcribed). Just as the
full set of genes is known as the genome, the full set of transcripts is known as the transcriptome.
Transcriptome (RNA) sequencing (transcriptomics) has more recently matured.
The same fundamental technology that was developed for genome sequencing was leveraged to
sequence the transcriptome, since RNA can be enzymatically converted to corresponding DNA
in a process known as reverse transcription, and then can be sequenced using genome sequencing
technology.
* With some exceptions, such as RNAs that do not code for proteins but are useful molecules in their own
right.
The key development in genome sequencing technology that enabled RNA sequencing (RNA-
seq) was next-generation sequencing (NGS). 18 NGS dramatically increased the throughput of
genome sequencing, and made it possible to sequence large numbers of transcripts from different
cells.
RNA-seq tells you how many RNA transcripts are present inside of a cell, and tells you what the
sequence of each RNA transcript is. For RNA transcripts that are used to make proteins (mRNA),
the sequences of the transcripts tell you what protein is going to be made, and the quantity of a
specific transcript gives you information about the abundance of the corresponding protein. Since
proteins determine a large portion of the phenotype (observable characteristics and behavior) of a
cell, RNA-seq provides extremely valuable information about the ’state’ of a cell.
Despite the incredible power of RNA-seq, it has a few major weaknesses. To sequence the RNA
from single cells in a tissue (scRNA-seq), we dissociate the cells from the tissue, thereby removing
the individual cells from their spatial context. In the most popular scRNA-seq method, we then
suspend each cell inside an oil droplet, which serves as a reaction chamber for all non-RNA material
to be degraded, and for the RNA inside a cell to be converted into DNA. However, once the cell is
inside this droplet, we lose all knowledge of where the cell came from inside its tissue of origin.
These spatial locations inside the tissue are key to understanding the biological function of tis-
sues, pathological and healthy. Cells tend to be localized, and they tend to interact with and in-
fluence the cells in their local neighborhood. This can be seen from the key importance of spatial
patterns in embryonic development, tumor formation, the development of Alzheimer’s disease, and many other processes.
Spatial transcriptomics refers to a broad class of techniques that share a common goal: to measure
the transcriptome of individual cells while maintaining knowledge of their spatial context within a
tissue.†
There have been many methods developed to solve this problem, which are covered in detail
in the Related Work chapter. These techniques have one thing in common: they are technically
complex to perform, and require a degree of implicit knowledge that is difficult to transfer out of
the originating laboratory. Only 4/11 of the developed transcriptome-wide spatial transcriptomics
methods have been reproduced outside of the originator’s lab, and none are in widespread usage.
This is a flaw common to much of experimental biology, and is concerning for a method that has
the potential to produce otherwise inaccessible information to aid our understanding of basic and
translational biology. 1
0.2 Motivation
Right now, spatial transcriptomics data is accessible only via the technically challenging and
centralized methods previously mentioned. What if we could use insights about the underlying
structure of transcriptomic data to broaden access to methods to generate this data? This is the central question motivating our work.
To be more specific about the difficulty of spatial transcriptomics methods, it is specifically full
transcriptome data that is difficult to generate. Eukaryotes have on the order of 10^5 - 10^6 transcripts
per cell, with the number of unique transcripts on the order of 10^5. 14 Measuring the entire range of
possible genes is difficult. However, there are widely used methods that allow on the order of 10^2 genes to be measured with spatial context.
Our work seeks to make full spatial transcriptomic data more abundant by using data from
widely accessible techniques that measure 10^2 genes to reconstruct the measurements for 10^5 genes.
† Transcripts also have associated temporal information. Transcripts are produced at a particular time,
and have a half-life on the order of minutes. Recent efforts are aiming at resolving the age of transcripts in
scRNA-seq to measure this information.
0.3 Our contributions
Our work seeks to use these ”low-dimensional” (10^2 genes) measurements to reconstruct / impute
the full high-dimensional transcriptome. We simulate a realistic low-dimensional dataset and see if we can faithfully reconstruct the true high-dimensional
data. We use a variety of methods from statistics, machine learning, and mathematics that are elabo-
rated on in Methods and Experiment Goals and Design. If successful, these computational tech-
niques provide a framework for designing less technically challenging experimental paradigms for spatial transcriptomics.
We construct methods for imputation of data, for measurement of imputation error, for repre-
sentation of data, and for evaluation of performance. We then critically evaluate the performance of
our methods.
Further directions for research are discussed in the Discussion section of the paper. All code for
this research accompanies the thesis.
It is not the critic who counts; not the man who points out
how the strong man stumbles, or where the doer of deeds
could have done them better. The credit belongs to the
man who is actually in the arena, whose face is marred
by dust and sweat and blood; who strives valiantly; who
errs, who comes short again and again, because there is no
effort without error and shortcoming; but who does actu-
ally strive to do the deeds; who knows great enthusiasms,
the great devotions; who spends himself in a worthy cause;
who at the best knows in the end the triumph of high
achievement, and who at the worst, if he fails, at least
fails while daring greatly, so that his place shall never be
with those cold and timid souls who neither know victory
nor defeat.
Theodore Roosevelt
Related Work
To provide context for the current state of spatial transcriptomics, the following sections de-
scribe the most popular transcriptome-wide and targeted spatial transcriptomics methods. We also describe
some computational techniques that served as inspiration for the methods used in this research.
1.1 Experimental Methods
To provide context for the state of spatial transcriptomics, we sketch out methods for both
transcriptome-wide and targeted measurement. We first describe techniques that are able to measure all gene expression profiles with spatial context.
Spatial Transcriptomics
The original spatial transcriptomics method that bears the name is Spatial Transcriptomics, from
2016. This method reverse-transcribes mRNA from a tissue sample in place into cDNA, and uses
spatially specific oligonucleotide primers to perform the reverse-transcription. Thus, the cDNA
formed from the mRNA encodes the spatial position of the mRNA in the original tissue sample.
When the cDNA is sequenced, the original location of the mRNA can be resolved. 16
Slide-seq
One of the more recent spatial transcriptomics techniques that has the potential to become accessi-
ble and widespread is Slide-seq. Slide-seq involves the following process. First, you make a ”puck” -
a surface of 10μm DNA-barcoded beads with a known sequence to location map. Then, you place
a tissue slice on top of the puck, and allow RNA from the tissue to be captured by the individual
beads. When you sequence the beads, you read out both the RNA and the DNA barcode. The
DNA barcode allows you to resolve the spatial location of the RNA on the puck, creating a spatial
transcriptome. 15
One of the main problems with the technique is the problem of ’doublets,’ where more than one
cell’s RNA is captured in a bead, and it is thus difficult to resolve which transcripts belong to which
cell. This is caused by the fact that the resolution of individual beads is 10μm which is around the
average diameter of a cell, meaning that multiple cells can be included in a single bead. Computa-
tional approaches are currently being developed to resolve individual cells using benchmark measurements.
HDST
HDST is very similar in technique to Slide-seq. The main difference is the size of the beads and the
resulting resolution - HDST uses beads with a 2μm spatial resolution, allowing for higher spatial resolution than Slide-seq.
Compressed Sensing
Compressed sensing for imaging transcriptomics is the method most similar to our own. This
method leverages compressed sensing, a technique from signal processing, to infer gene expression
levels from sparse linear combinations of particular genes. The sparse measurements are collected
with probes that are added in the stochiometric combinations corresponding to the linear combi-
nations needed. Then, the framework of compressed sensing is used to impute the overall gene ex-
pression profile. The main differences between this work and ours are the differences in imputation methodology.
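As a rough illustration of the compressed-sensing recovery step (this is a generic sketch using iterative soft-thresholding, not necessarily the cited method's actual algorithm; the matrix and penalty here are placeholders):

```python
import numpy as np

def ista(A, y, lam=0.1, iters=500):
    """Iterative soft-thresholding (ISTA): minimizes
    0.5 * ||Ax - y||^2 + lam * ||x||_1,
    the standard convex relaxation used in compressed sensing."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = x - A.T @ (A @ x - y) / L          # gradient step on the smooth term
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrink toward zero
    return x
```

Here A plays the role of the probe-combination measurement matrix: each row describes one composite measurement, and the L1 penalty drives most recovered expression coefficients to exactly zero.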
Targeted Methods
There are a variety of methods for targeted measurement of specific transcripts in situ, with the
earliest method going back to 1982. All of the methods surveyed here share a core technique in
common: fluorescent in situ hybridization (FISH).
FISH
The original FISH method is quite simple. It aimed to localize particular DNA sequences on Drosophila
chromosomes. Biotin molecules were enzymatically incorporated onto DNA probes. The DNA
probes were then bound to the target sequences in situ, since two complementary DNA sequences will hybridize to each other.
The probe is then fluorescently tagged using any of a variety of methods, most commonly using
antibodies that bind to the biotin molecule. The fluorescence can then be read out using fluores-
cence microscopy, allowing for a readout of localization of specific DNA sequences in situ. 11 The
method is more generally applicable to identifying particular DNA sequences in any context, not just on chromosomes.
smFISH
FISH was then adapted to detect RNA molecules, since the basic methodology for targeting differ-
ent nucleic acids is quite extensible. smFISH uses multiple probes to bind to different regions of the
same mRNA molecule, creating a bright spot indicative of a transcript upon binding. This reduces
the false positive rate since unbound probes are unlikely to localize to the same region and create a comparable bright spot.
MERFISH
smFISH was quite a powerful technique, but its main drawback was the lack of throughput. Even
using multiple fluorophores that could be read out on unique wavelengths, there’s a cap on the
number of transcripts measurable per round of imaging. Additionally,
each round of hybridization degrades the quality of the sample, so you can’t just run thousands of
rounds to measure all the genes. Finally, each round takes a substantial amount of time, due to the
need for high-resolution imaging of a large slide (relative to the resolution of the necessary imaging).
Instead of making binary 0/1 measurements of transcripts, MERFISH assigns each transcript of
interest a K-bit barcode. There are K rounds of hybridization. During each round (i) of hybridiza-
tion, all RNAs with 1s in the ith position have fluorophores bound to them. The combined output
of all rounds is a read-out of binary barcodes at each spatial location where a transcript is located.
This allows the transcript at each location to be reconstructed. The Hamming distance (a distance
measure between two binary strings) between each pair of transcript barcodes is maximized to minimize
decoding errors.
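A toy version of this barcoding scheme can be sketched as follows (a greedy codebook construction rather than the optimized Hamming codes used in practice, with illustrative parameters):

```python
import itertools
import numpy as np

def barcode_set(k_bits, min_dist, n_codes):
    """Greedily collect K-bit binary barcodes whose pairwise
    Hamming distance is at least min_dist."""
    chosen = []
    for bits in itertools.product((0, 1), repeat=k_bits):
        c = np.array(bits)
        if all(np.sum(c != prev) >= min_dist for prev in chosen):
            chosen.append(c)
            if len(chosen) == n_codes:
                break
    return np.array(chosen)

def decode(readout, codebook, max_errors=1):
    """Assign a measured bit-vector to the nearest barcode, tolerating up to
    max_errors flipped bits; returns None if nothing is close enough."""
    dists = np.sum(codebook != readout, axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] <= max_errors else None
```

With a minimum pairwise distance of 4, a single flipped bit in the readout still decodes to the correct transcript, which is exactly the error tolerance the text describes.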
This method allows several orders of magnitude higher scale than smFISH due to the efficiency
of information gathering in each round of imaging, since the rounds of imaging are the scarce quan-
tity in each of these methods, as they require both time and degrade the sample quality. 4
1.2 Computational Methods
The main computational technique that influenced this work was the application of diffusion maps
to scRNA-seq data.
Diffusion maps is a non-linear dimensionality reduction technique that uses the distance be-
tween two points and the corresponding transition probabilities between two points (with points
treated as nodes of a graph).
This idea was applied to scRNA-seq data to define differentiation trajectories. Cells develop by
travelling along a trajectory in transcriptome space from their undifferentiated state to a differentiated
state, and the notions of distance and transition probabilities between states are clearly present
in the fundamental biology. 8 We used this idea of non-linear dimensionality reduction based on graph structure throughout our own methods.
You keep on learning and learning, and pretty soon you
learn something no one has learned before.
Richard Feynman
Methods
2
In this chapter, we describe the experimental methods used to generate the data we work with,
as well as the computational methods we use to analyze it.
2.1 Experimental Methods
2.1.1 scRNA-seq
Our data was gathered by members of the Hormoz Lab using droplet-based scRNA-seq on patient
bone marrow samples. The dataset was enriched for CD34+ cells (precursor cells) and processed prior to our analysis.
In droplet-based scRNA-seq, individual cells are dissociated from the tissue of origin and en-
capsulated into nanoliter oil droplets. Once inside the droplets, the mRNA is reverse-transcribed
into cDNA with a unique barcode sequence attached. The cDNA is then pooled together and se-
quenced, and the unique barcodes allow the sequencing results to be resolved into the single cells of origin.
One of the main weaknesses of scRNA-seq is that it suffers from large amounts of noise. In partic-
ular, because it only captures at most 50% of total transcripts inside a cell, it’s prone to not measuring
lowly expressed transcripts. To address this, we apply a
method that is used to impute more accurate transcriptome data from the raw data collected from
scRNA-seq. It uses stochastic optimization and deep neural networks to approximate the underly-
ing distribution that generates scRNA-seq measurements, i.e. the biological ground truth. We use it
as a pre-processing step on our scRNA-seq data for all analyses in this research. 12
2.2.2 Random Binary Matrices
We construct random binary matrices which we use to construct a dataset from the full transcriptome measurements
we get from scRNA-seq measurements. We find, surprisingly, that the variance of the performance
of our randomized matrices is low, implying that using biological domain knowledge to select genes may not be necessary.
We use random binary matrices with 50 genes measured across 10 distinct measurements, i.e.
each feature in our vector is a binary combination of 50 genes, and we have 10 such features. We ran-
domly select the genes from the set of all genes in the transcriptome, without replacement for any
particular feature but with replacement across features. We chose this setting due to its experimental
UMAP is a non-linear dimensionality reduction technique commonly used to visualize scRNA-seq
data. It creates a high-dimensional graph of the data and then tries to construct a low-dimensional
graph that maintains as many properties of the high-dimensional graph as possible. We use it mainly for visualization.
Neural networks are a commonly used computational technique in computer science, used to pre-
dict an output (labels) from an input (features). They’re ’trained’ on a training set that is represen-
tative of the overall distribution the data is drawn from. The training set contains a large amount of
feature:label mappings.
The parameters of the neural network are optimized to minimize a loss function, which is com-
puted by comparing the outputs of the neural network to the ground-truth labels for the training
data. The loss function is a function of the parameters of the neural network and can be differen-
tiated with respect to those parameters, allowing for simple techniques like gradient descent to be used for optimization.
There are various hyper-parameters of the neural network such as the number of parameters,
the orientation of parameters, etc. We hold out a small number of feature:label maps known as the
validation set to optimize the hyper-parameters by examining values of the loss function on the held-out data.
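The train/validate loop described above can be sketched with a linear model standing in for the thesis's actual networks (the data, learning rate, and epoch count here are purely illustrative):

```python
import numpy as np

def train(X, y, X_val, y_val, lr=0.1, epochs=200):
    """Minimal gradient-descent fit of a linear model, reporting validation
    loss. The same loop structure applies to a neural network, with the
    linear model swapped for a deeper parameterization."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE loss
        w -= lr * grad                           # gradient-descent update
    val_loss = float(np.mean((X_val @ w - y_val) ** 2))
    return w, val_loss
```

Hyper-parameters such as `lr` and `epochs` would be chosen by comparing `val_loss` across settings, exactly as the validation-set procedure above describes.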
Our single cell data sits in an N_genes-dimensional transcriptomic space, where N_genes is the number
of genes measured. We have reason to believe that Euclidean distance between cells in this space
has biological meaning. Cells in a small neighborhood may be of the same subtype. For example,
CD8+ T-cells and microglia have distinct neighborhoods that can be identified via cluster analy-
sis. Cells that are close together in transcriptomic space may also be a part of the same cell lineage,
since cell fate transitions occur via changes in the transcriptome of a cell. For example, multipotent
hematopoietic stem cells (hemocytoblasts) are close in transcriptomic space to their direct descendants.
We can formalize the importance of Euclidean distance on this space by constructing a graph,
where the nodes are cells and edges are some representation of distance. We use the k-nearest neighbor graph (KNNG).
The KNNG construction method is quite simple. We compute Euclidean distances between
all cells, and for each cell, we construct edges between that cell and each of its K nearest neighbors
in Euclidean space. Note that as the K nearest neighbor relation is not symmetric, the number of
edges can vary between nodes. We choose to make the graph symmetric manually for ease of further analysis.
Let A be the adjacency matrix for our KNNG and D be a diagonal matrix with the degrees of the
vertices down the diagonal (i.e. the degree matrix). Then the Laplacian matrix is defined as L ≡ D − A.
Since L is a real symmetric matrix, it’s diagonalizable and has eigenvalues λ_1, ..., λ_N and
orthonormal eigenvectors v_1, ..., v_N. Let us consider this list sorted by eigenvalue, such that
λ_1 ≤ λ_2 ≤ ... ≤ λ_N.
The eigenvectors of L form a basis set for integrable functions on our graph. This is because the
eigenvectors of the Laplacian roughly converge to the eigenfunctions of the Laplace-Beltrami op-
erator on the underlying manifold, and these eigenfunctions form a basis set for functions on the
manifold. This is itself a generalization of the Fourier basis functions from functions on R to com-
pact Riemannian manifolds. Importantly, because our eigenvalues are monotonically increasing, the
eigenvectors are ordered from low to high frequency in the function space, just as with Fourier basis functions. Thus, using the first N eigenvectors to represent a function captures its lowest-frequency structure.
We use these Laplacian eigenvectors to represent, embed, and reconstruct functions on our cell
graph. We can project a function onto the Laplacian eigenvectors, and use the resulting coefficients
to capture the function. We can use the first N coefficients to reconstruct the lower-frequency struc-
ture of the function. We can try to impute those coefficients for some data that we know has a cor-
responding function on the graph, and use the imputation to compute an approximation of the
16
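The pipeline just described — KNNG construction, Laplacian eigendecomposition, and projection onto the first N eigenvectors — can be sketched as follows (the brute-force distance computation and toy parameters are illustrative, not the pipeline's actual implementation):

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric k-nearest-neighbor adjacency matrix from points X (cells x genes)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a cell is not its own neighbor
    A = np.zeros_like(d)
    nbrs = np.argsort(d, axis=1)[:, :k]          # k nearest neighbors per cell
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nbrs.ravel()] = 1
    return np.maximum(A, A.T)                    # symmetrize manually, as in the text

def laplacian_basis(A):
    """Eigenvalues/eigenvectors of L = D - A, sorted by increasing eigenvalue."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.eigh(L)                     # eigh: symmetric, ascending order

def project_and_reconstruct(f, evecs, n):
    """Keep only the n lowest-frequency Laplacian coefficients of a function f."""
    coeffs = evecs[:, :n].T @ f                  # project onto the basis
    return evecs[:, :n] @ coeffs                 # low-frequency reconstruction
```

Using all N eigenvectors recovers the function exactly; truncating to small n keeps only its smooth, global structure on the graph.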
2.2.7 Distribution Functions in Cell Space
The specific class of functions on graphs that we’re interested in are distributions over cells. We use
the aforementioned binary combinations to reconstruct a cell state, and we specifically reconstruct
a discrete Gaussian analogue, centered at the ground truth cell. We considered using an indicator
function, but the sharp peak on a specific cell is unforgiving for statistical imputation methods,
and a somewhat-diffuse distribution over cell states may be closer to the biological ground truth,
assuming that our unseen cell can be represented as some linear combination of seen cells, which is a
less strict requirement than explicitly assuming that our unseen cell is in the training set. We present our parameter choices for this distribution in the next chapter.
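A minimal sketch of this ground-truth construction (the distance measure and the normalization scheme are our assumptions, not necessarily those of the thesis pipeline):

```python
import numpy as np

def discrete_gaussian(dists, sigma):
    """Discrete Gaussian analogue over cells, centered at the target cell.

    dists[i] is the distance from cell i to the target (the target itself has
    distance 0); whether this distance is Euclidean or graph-based is an
    assumption of this sketch."""
    logits = -np.asarray(dists, dtype=float) ** 2 / (2.0 * sigma ** 2)
    p = np.exp(logits - logits.max())   # subtract max for numerical stability
    return p / p.sum()                  # normalize to a probability distribution
```

Small σ concentrates mass on the target cell (approaching the indicator function), while large σ diffuses mass over its neighborhood.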
We use two error measures, one to measure the accuracy of distribution reconstruction and one to
measure the accuracy of direct transcriptome reconstruction.
Jensen Shannon Divergence
The Jensen Shannon divergence (JS divergence) is a symmetric measure of distance between two
distributions. It’s based on the Kullback–Leibler (KL) divergence, which for two distributions P, Q is defined as:

$$\mathrm{KL}(P \parallel Q) = \sum_{i=1}^{N_{\text{genes}}} p_i(x) \log \frac{p_i(x)}{q_i(x)}$$

The JS divergence is then defined as JS(P, Q) = ½ KL(P ∥ M) + ½ KL(Q ∥ M), where M = ½(P + Q).
Mean Squared Error
Mean squared error is a standard error measure in Euclidean space. It’s defined as:

$$\mathrm{MSE} = \frac{1}{N_{\text{genes}}} \sum_{i=1}^{N_{\text{genes}}} \left( X^{(i)} - \hat{X}^{(i)} \right)^2$$
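Both error measures can be written directly from their definitions:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(P||Q) for discrete distributions;
    assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: symmetric in P and Q, and always finite."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mse(x, x_hat):
    """Mean squared error between a transcriptome and its reconstruction."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    return float(np.mean((x - x_hat) ** 2))
```

Note that JS stays finite even where one distribution has zero mass, because the mixture M is positive wherever either P or Q is; this is exactly why it is preferred over raw KL here.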
But, look, I don’t think of myself, I guess for me, people tell
me, “Oh, you’re so smart about this or that thing,” but to
me, that’s not what it feels like to me. I’m always trying
to figure out things that are difficult for me to figure
out. Now, maybe some of those things are really, really
difficult for some other people to figure out. But for me,
I’m always kind of–I’m always struggling to figure stuff
out so it doesn’t…The, kind of, the internal perception is
not one of kind of—I mean, the fact that I’m, I’m always
trying to figure stuff out.
Stephen Wolfram
Experiment Goals and Design
3
We’re trying to understand whether or not it’s feasible to reconstruct the whole transcriptome
from low-dimensional measurements by taking binary combinations of genes from scRNA-seq data. We begin by describing our overall approach.
3.1 Overview
Our goal is to reconstruct the expression levels of every gene in a cell from binary combinations of a small subset of genes. We decompose this into two subproblems:
1. Reconstructing a distribution over cell states (transcriptomes) from our training set, in KNN
graph space
2. Reconstructing gene expression levels directly
The first is our primary problem in this research. We choose to approach the problem in this way
for a few reasons. We believe that the KNNG structure encodes important information about the
biological ground truth that isn’t captured by direct inference of expression levels. First, it makes
explicit the relationships between cells and their nearest neighbors. This has biological meaning be-
cause cells transition through transcriptome states continuously, for example, during differentiation
or stimulus-response. Therefore, cells with nearby transcriptome states are biologically related to
each other.
Additionally, the Laplacian on this graph incorporates additional information with biological
meaning. The Laplacian Matrix can be interpreted as a discrete analog of the Laplace operator ∇2 ,
representing the average rate of change at a vertex (cell). We can think of this as the local transition
probabilities around any cell, or the rate of diffusion for a cell starting at a particular location on the
graph.
Since we want to reconstruct a distribution on the KNNG, we need a natural coordinate system
to reconstruct to. The eigenvectors of the Laplacian form such a coordinate system, since imputing
the first N coordinates in this coordinate system is equivalent to reconstructing the most important
basis coefficients of the distribution function, in the sense of capturing the most information about
the underlying distribution. Additionally, we posit that using the Laplacian coordinate system as an intermediate representation makes the imputation problem more tractable.
Thus, we use a variety of methods to reconstruct the coordinates of our distribution in the Lapla-
cian coordinate system (the Laplacian coefficients) as well as to reconstruct the distribution directly.
We also reconstruct the gene expression levels directly as a secondary line of effort.
3.2 Data
We split the data into a training, validation, and test set, as is typical for statistical inference prob-
lems. We use the training set to learn directly learnable parameters such as neural network weights,
and we use the validation set to determine hyper-parameters, such as the number of layers in the
neural network. Our models deliberately have no exposure to the test set until their evaluation.
3.3 Error Measures
We needed error measures to measure reconstruction error both between two distributions and between two transcriptome reconstructions.
To measure error between two distributions, we chose to use the Jensen Shannon Divergence.
As previously mentioned, the Jensen–Shannon divergence (JS divergence) is based on the Kullback–Leibler divergence, but is symmetric and finite. We used the JS divergence for a variety of use cases.
To train our neural networks that aimed to predict the distribution directly, we used the Kullback-
Leibler divergence.
To measure errors between two transcriptome reconstructions, we used the mean squared error (MSE).
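Both error measures can be sketched in a few lines; this is an illustrative implementation (the base-2 convention and the use of scipy's `jensenshannon`, which returns the square root of the divergence, are assumptions), not necessarily the thesis's exact code:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(p, q):
    """Symmetric, finite divergence between two distributions.
    scipy's jensenshannon returns the square root (a metric), so square it."""
    return jensenshannon(p, q, base=2) ** 2

def mse(x, y):
    """Mean squared error between two transcriptome vectors."""
    return float(np.mean((np.asarray(x) - np.asarray(y)) ** 2))

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
```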
3.4 Parameter Selection
The primary target of our reconstruction is a function over transcriptome space. In particular, we
aim to reconstruct a discrete Gaussian distribution centered on the target cell. The "ground truth" for each cell is this Gaussian.
This raises the question of how to select the standard deviation, σ. The purpose of using a Gaussian distribution, as opposed to, say, an indicator function, is to provide some degree of smoothness
around our target cell, but the aim is still for our reconstruction to be highly similar to our target
cell. In other words, we want some small number of cells around the target cell to be in our ground
truth distribution. To select σ, we chose N = 10 as the number of cells, not including the target cell, that we would like to include in our ground truth distribution:
N(p(i)) ≥ 10.
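A sketch of this criterion on toy data; the inclusion threshold below is an illustrative assumption (the thesis does not specify its value here), as are the distances:

```python
import numpy as np

def gaussian_on_cells(distances, sigma):
    """Discrete Gaussian over cells: p_i proportional to exp(-d_i^2 / (2 sigma^2))."""
    w = np.exp(-np.asarray(distances, dtype=float) ** 2 / (2 * sigma ** 2))
    return w / w.sum()

def n_included(p, target, threshold=1e-3):
    """N(p(i)): cells (excluding the target) whose probability exceeds a threshold.
    The threshold value is a hypothetical choice for illustration."""
    mask = np.asarray(p) > threshold
    mask[target] = False
    return int(mask.sum())

# Toy example: 100 cells on a line, target cell in the middle.
distances = np.abs(np.arange(100) - 50.0)
p = gaussian_on_cells(distances, sigma=5.0)
```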
The appropriate σ for each distribution will vary by cell, since the local density of cells in transcriptome space varies, both in our dataset and biologically. Thus, we
wanted to investigate the degree of variance in this respect for our dataset. We begin by considering
the distribution of N(p(i)) for σ = 5 over the cells in our training set in Figure 3.1.
The above results indicate that there are a large number of cells that require a significantly larger
value of σ to have an acceptable N(p(i)). There is a large amount of variance in N(p(i)) in general
for σ = 5. This indicates that we'll need to choose custom values of σ for each cell that meet our criterion.
One way we could do this, as previously mentioned, is to run a binary search for the minimum value of σ such that our criterion, N(p(i)) ≥ 10, is met. Unfortunately, N(p(i)) is not monotonic in σ. We can see this by noting that as σ → ∞, we get a uniform distribution, with all probabilities lower than our threshold, and N(p(i)) = 0. This can also be seen in Figure 3.2.
Thus, N(p(i)) is not an appropriate measure to search over. One could alternatively search using
a function that is monotonic in σ, for example, the entropy, or the number of cells inside the 2σ
interval. We instead choose an elegant exact method: setting σ_j = ∥x^(j) − x_10^(j)∥, where x_10^(j) is the 10th nearest neighbor to x^(j). This functionally creates a distribution with 10 cells in the ≈ 95% confidence interval.
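This σ selection can be sketched with scikit-learn's `NearestNeighbors`; the toy data here stands in for the real expression matrix:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))     # toy stand-in for cells in transcriptome space

# Query 11 neighbors because each point is its own nearest neighbor at distance 0;
# column 10 is then the distance to the 10th true neighbor.
nn = NearestNeighbors(n_neighbors=11).fit(X)
dists, _ = nn.kneighbors(X)
sigma = dists[:, 10]               # sigma_j = ||x^(j) - x_10^(j)|| per cell
```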
Figure 3.3: Summary of the Properties of the Gaussian with Systematically Selected σ. (c) Examining the Number of Cells in the 95% Interval on a UMAP Visualization. (d) Examining the Distribution of the Entropy for Systematic σ.
Table 3.1: Summary Statistics for Outlier Cell
Some cells include far more than 10 cells in their 95% interval because our distribution is not precisely Gaussian, since it's discrete. We can examine why this is the case by looking in more detail at, for example, the cell at the rightmost end of the distribution, with N = 3415 cells included in its 95% confidence interval. We'll call this the "Outlier Cell" for the remainder of this section. We can first examine the summary statistics of the distribution in Table 3.1.
It appears that we have a highly uniform distribution. To understand why, we look at the dis-
tances of the 10 nearest neighbors to the outlier cell relative to the median cell with respect to the
confidence interval distribution, which will give us insight into σ for the underlying distribution.
Figure 3.4: The Euclidean Distance to the kth Neighbor, Outlier and Median
We can examine the resulting distributions by plotting probabilities on a UMAP visualization for a few cells. In Figure 3.5 we examine the
outlier cell, the median cell, and two randomly selected cells.
As we can see, the resulting distributions are not perfect but are decently localized and peaked
on the target cell. Note that UMAP visualizations themselves make assumptions about cell-cell
distances and thus distance on a UMAP plot is not necessarily equivalent to biologically-relevant
distance in transcriptome space. We can see this by plotting key marker genes and observing non-perfect localization.
Finally, we plot a few distributions for randomly selected cells side by side with the UMAP in Figure 3.6 so we can get a sense for the shape of the distribution.
A common rule of thumb for KNN graphs is K = √N, where N is the number of data points. We naively began with this setting, but found that this was likely too high considering the structure of the data. Below, we draw the UMAP representation of the data in two settings: the first connects each cell and its nearest neighbor, and the second connects each cell and the farthest neighbor with which it shares an edge in the symmetric KNNG with K = √N.
Figure 3.7: Visualizing Neighbor Distances with K = √N. (a) UMAP with the Farthest Neighbor Connected. (b) UMAP with the Nearest Neighbor Connected.
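The symmetric KNNG with K = √N, and the per-cell nearest/farthest neighbor distances examined here, can be sketched as follows (the toy data is an assumption standing in for the expression matrix):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 10))          # toy stand-in for the expression matrix
K = int(np.sqrt(len(X)))                # the K = sqrt(N) rule of thumb: K = 20 here

# Directed KNN graph with Euclidean-distance edge weights, then symmetrized:
# keep an edge if either cell lists the other among its K neighbors.
A = kneighbors_graph(X, n_neighbors=K, mode="distance")
A = A.maximum(A.T).tocsr()

# Per cell: distance to the nearest and to the farthest connected neighbor.
nearest = np.array([A.data[A.indptr[i]:A.indptr[i + 1]].min() for i in range(A.shape[0])])
farthest = np.array([A.data[A.indptr[i]:A.indptr[i + 1]].max() for i in range(A.shape[0])])
```

Comparing `nearest` against `farthest` is the quantitative version of the visual comparison in Figure 3.7.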
We see in Figure 3.7 that the farthest neighbor is quite far in UMAP space, indicating that we're connecting cells that are not closely related in transcriptome space.
The UMAP dimensionality reduction method itself constructs such a graph, and uses K = 15 as its default. Since UMAP is a standard dimensionality reduction method for single-cell sequencing data, we follow its convention.
Since the UMAP method uses a graph representation of the data as an intermediate computa-
tion, and is a well accepted visualization method for single cell data, we can compare the distribution
of certain marker genes (genes that have a spatial pattern in UMAP-reduced data as well as biological significance in differentiating between cell types in the bone marrow) between the original UMAP
representation and UMAP computed using our KNNG. These are plotted in Figure 3.8.
Table 3.2 lists the marker genes we chose to plot, by cell type. We plot these same marker genes for all marker gene analyses in this thesis. Note that we plot log rather than absolute expression levels.
Figure 3.8 (a): Erythroid Marker Gene
We see in Figure 3.8 that the graph maintains the expected spatial distributions for these marker genes, so we proceed with this KNNG.
We want to understand how effective the Laplacian formulation is at allowing for a reconstruction
of the Gaussian distribution. We also want to get a sense for how the Laplacian reconstruction func-
tions.
We first address the first question by plotting a few Laplacian eigenvectors on the UMAP of the
training cells in Figure 3.9. The eigenvectors are bases for functions on the graph, and by plotting them we can see what each captures.
As we proceed from the first eigenvector to the last, each captures less and less variance of func-
tions on graph space. We can get a sense for what patterns each eigenvector is capturing from the
above visualization; some are capturing spatial patterns, some are more diffuse over the entire distri-
bution of cells.
Next, we examine the effectiveness of the Laplacian reconstruction and the number of coeffi-
cients needed to effectively reconstruct a distribution. To probe into this, we first look at the mean
JS Divergence when using different numbers of (true) Laplacian coefficients to reconstruct a distribution.
Figure 3.9: Visualizing Various Laplacian Eigenvectors
Secondarily, when we perform our reconstruction using Laplacian coefficients, we need to turn our reconstruction into a probability distribution. Since we're not using all the Laplacian coefficients for many reconstructions, the resulting function is not constrained to be a distribution, so we must make the function positive and normalized. There are a few ways we can make entries positive: exponentiation, squaring, and taking the absolute value. We compare the JS Divergence
with respect to the number of coefficients used to make the reconstruction for all three of these
methods.
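The truncated-coefficient reconstruction and the three positivity fixes can be sketched as follows; the path-graph Laplacian and Gaussian-like target here are toy assumptions, not the thesis's cell graph:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def reconstruct(coeffs, eigvecs, n, method="abs"):
    """Rebuild a distribution from its first n Laplacian coefficients. The truncated
    reconstruction need not be positive, so force positivity (abs, square, or exp,
    the three options compared in the text) and renormalize."""
    f = eigvecs[:, :n] @ coeffs[:n]
    if method == "abs":
        g = np.abs(f)
    elif method == "square":
        g = f ** 2
    else:  # "exp"
        g = np.exp(f)
    return g / g.sum()

# Toy example: a Gaussian-like distribution on a path graph, in the Laplacian eigenbasis.
m = 60
adj = np.eye(m, k=1) + np.eye(m, k=-1)
lap = np.diag(adj.sum(axis=1)) - adj
_, V = np.linalg.eigh(lap)               # columns = Laplacian eigenvectors, low modes first

p = np.exp(-(np.arange(m) - 30.0) ** 2 / 20.0)
p /= p.sum()
coeffs = V.T @ p                         # coefficients of p in the eigenbasis

p_hat = reconstruct(coeffs, V, n=20, method="abs")
err = jensenshannon(p, p_hat, base=2) ** 2
```

Because the low eigenvectors capture smooth functions, 20 of 60 coefficients already yield a small JS divergence on this toy graph.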
We see from Figure 3.10 that the absolute norm becomes better than the squared norm for a very
large number of coefficients. This makes sense, because the problem of having negative entries in
the reconstructed distribution would no longer be a problem for large numbers of coefficients, be-
cause the reconstruction would be naturally a probability distribution without any post-processing.
However, for the numbers of coefficients that we use in our reconstructions (≪ 1000), the squared norm performs better.
Figure 3.10: Examining the Reconstruction Performance with Varying Normalization Methods, Coefficient Numbers
We now want to see how many Laplacian coefficients we should train to reconstruct. We trained neural networks to predict varying numbers of coefficients, reconstructed the distribution with the Laplacian eigenvectors, and computed the JS divergence. The results are displayed in Figure 3.11.
Figure 3.11: Examining the Neural Network Reconstruction Performance by Laplacian Coefficient Count. (a) Reconstruction Performance by Laplacian Coefficient Count. (b) Same as (a), but Zoomed in on the Minimum Region.
We see a minimum at N = 50 coefficients used, so we choose to use that for our reconstruction.
3.5 Reconstruction Methods
We previously discussed the shape of the technical methods we used in our reconstruction; we now discuss how we used those methods to ask the questions we're interested in.
We begin with our primary task, the distribution reconstruction. We’re testing the following meth-
ods’ reconstruction accuracy (all beginning with the binary linear combination of genes):
1. Using a neural network to reconstruct the first 50 Laplacian coefficients, and then using the Laplacian eigenvectors to reconstruct the distribution
2. Using a neural network to reconstruct the whole transcriptome, then using the transcriptome to construct the distribution
3. Using a neural network to reconstruct the whole transcriptome, then using a neural network to reconstruct the first 50 Laplacian coefficients, and then using the Laplacian eigenvectors to reconstruct the distribution
For our positive control, we used the true first 3000 Laplacian coefficients, and then used the Laplacian eigenvectors to reconstruct the distribution.
To be clear, the positive control does not involve binary measurements, but instead uses the full
transcriptome for test-set cells to construct a distribution over the KNNG by finding the nearest
neighbor and constructing a distribution over that cell. We then take the first 3000 Laplacian coefficients of that distribution.
For our negative controls, we used:
1. A uniform distribution over cells
2. A random distribution over cells, p(i) ∝ R where R is a random real number in [0, 1]
Next is our secondary evaluation, the transcriptome reconstruction. We’re testing the following
methods’ reconstruction accuracy (all beginning with the binary linear combination of genes):
1. Using a neural network to reconstruct the first 50 Laplacian coefficients, then using the Laplacian eigenvectors to reconstruct the distribution, and taking the distribution-weighted average of training-set transcriptomes
2. Using a neural network to reconstruct the first 50 Laplacian coefficients, then using a neural network to reconstruct the transcriptome from those coefficients
For our positive controls, we used:
1. The ground truth transcriptome with multivariate Gaussian noise, with standard deviation calculated from each gene's expression levels
2. Using the true first 50 Laplacian coefficients, followed by a neural network reconstruction of
the transcriptome
2. A random transcriptome drawn from a multivariate Gaussian centered on the average tran-
scriptome with standard deviation calculated from each gene’s expression levels
To evaluate methods against each other, we first examined the average values of their error measures, evaluated on the test set. We then examined the CDF of the error measures over the cells in the test set.
3.6.2 Understanding
Then, to understand the performance of the most promising methods, we conducted an in-depth analysis of their reconstructions. We examined the distribution of their error measures. We then compared these distributions to the error distribution of a negative control. We examined the CDF of their error measures, and examined illuminating properties of these CDFs, such as skew and cumulative density below the negative control's error. We then visually compared the reconstructed distributions to the ground truth.
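The CDF-style comparison described above can be sketched as follows; the per-cell error values, the gamma distribution generating them, and the control level are all toy assumptions:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(5)
method_err = rng.gamma(2.0, 0.01, size=1000)   # toy per-cell JS divergences for a method
control_err = 0.15                              # toy negative-control error level

# Empirical CDF evaluated at the control's error: the fraction of cells whose
# reconstruction beats the negative control.
frac_below = float(np.mean(method_err < control_err))

# Skew of the error distribution, one of the CDF properties examined in the text.
err_skew = float(skew(method_err))
```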
4
Results
We now present the results of our reconstruction analyses. We first present high level results,
examining the results for our distribution reconstructions and our transcriptome reconstructions.
We then dive deeper into the two most promising methods for distribution reconstruction and analyze them in detail.
The goal is to determine the best method for reconstructing cell state / transcriptome data from a low-dimensional, binary combination of genes that is experimentally feasible to measure. Previous
analyses have found that the specific genes and combinations measured do not have a large impact on reconstruction accuracy; thus we use a pre-selected random matrix for our computations, with the experimentally feasible 50 genes per combination and 10 total combinations.
Our primary evaluation metric is the Jensen–Shannon divergence between the reconstructed distribution and the ground truth distribution (the Gaussian analogue we previously constructed). Our secondary evaluation metric is the mean squared error between the ground truth transcriptome and the reconstructed transcriptome.
For cells not in the training set, and thereby lacking a ground truth measurement (i.e. the validation and test sets), we use the closest cell in the training set to the given cell to construct a ground truth distribution, expressed in the basis of Laplacian eigenvectors.
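The nearest-training-cell assignment for held-out cells can be sketched as follows, with toy data standing in for the real expression matrices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
X_train = rng.normal(size=(300, 10))   # toy training transcriptomes
X_test = rng.normal(size=(50, 10))     # toy held-out transcriptomes

# A held-out cell has no node on the training KNNG, so we anchor its
# ground-truth distribution at the closest training cell.
nn = NearestNeighbors(n_neighbors=1).fit(X_train)
_, idx = nn.kneighbors(X_test)
anchor = idx[:, 0]                     # index of the closest training cell per test cell
```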
The JS divergence in the figure is scaled by the JS divergence of the negative control.
Figure 4.1: Mean JS Divergence Comparison by Method
Table 4.1: Description of Dist. Recon. Methods
We see that M1 and M5 are the best performing methods, so we choose to examine them in more
detail.
Figure 4.2: CDFs for our best methods, compared to the CDFs for true Laplacian coefficient reconstructions
We examine the CDF of the JS Divergence for M1 and M5 against the CDF of the JS Divergence
using different numbers of true Laplacian coefficients. The results are displayed in Figure 4.2. We
Table 4.2: Description of Transcriptome Recon. Methods
see that our methods are comparable to using 50–100 of the true Laplacian coefficients.
The transcriptome reconstruction results are displayed in Figure 4.3.
4.2 Diving Deeper into Distribution Reconstructions
The best-performing methods for distribution reconstruction were M1 (using a neural network to impute 50 Laplacian coefficients, then reconstructing with the Laplacian eigenvectors) and M5 (using a neural network to directly reconstruct the distribution). We now dive deeper into each of these methods. We examine the distribution of the (centered / scaled) divergence, so we can compare each method against the negative control on a per-cell basis.
Figure 4.5: Scatter-plot of JS Divergence, Centered on Negative Control and Rescaled
Figure 4.6: PDF and CDF of Centered and Scaled JS Divergence for M1
We see in Figure 4.6 that our method performs quite well. Nearly the entire JS divergence distribution is below the negative
control, and we have a large negative skew. We now want to gain some intuition for what the distri-
bution reconstruction looks like visually. We proceed to plot one sample from the top and bottom
10% performing cells, as well as a random reconstruction (Figure 4.7, Figure 4.8, and Figure 4.9
respectively).
Figure 4.8: A Sample Reconstruction from the Worst 10% of Reconstructions
We see that the good reconstructions capture the spatial localization in UMAP space, but tend to
be somewhat more diffuse than the ground truth distribution. The worst distribution reconstructions tend to barely capture the spatial localization, and are always significantly more diffuse than the ground truth.
To gain some more insight, we plot the reconstruction of marker genes against the ground truth
in Figure 4.10, Figure 4.11, and Figure 4.12. We reconstruct individual gene measurements by tak-
ing a weighted average according to the reconstructed distribution of the expression levels in our
training set over which our distribution is specified. The ground truth is the true expression levels of
the marker genes in the test set cells. Note that both are log values to allow for better visualization.
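The distribution-weighted gene reconstruction described here can be sketched as follows; the toy counts and the `log1p` transform are illustrative assumptions:

```python
import numpy as np

def reconstruct_expression(p, X_train):
    """Expected expression under the reconstructed distribution:
    a p-weighted average of the training-cell expression profiles."""
    return p @ X_train

rng = np.random.default_rng(4)
X_train = rng.poisson(2.0, size=(100, 5)).astype(float)   # toy counts: 100 cells x 5 genes
p = rng.random(100)
p /= p.sum()                                              # a reconstructed distribution

expr = reconstruct_expression(p, X_train)
log_expr = np.log1p(expr)        # log values for visualization, as in the text
```

Because it is a weighted average, each reconstructed gene value is bounded by the minimum and maximum expression of that gene across training cells.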
Figure 4.11: Megakaryocyte Marker Gene (ITGA2B) Reconstruction
Finally, we compute the joint UMAP of the ground truth and reconstructed distributions. We now dive deeper into M5, the direct reconstruction of the distribution. Again, we first examine the scatter-plot of the centered and scaled JS divergence.
We plot the PDF and CDF of the centered and scaled JS divergence in Figure 4.16.
As expected, this method performs very well compared to the control. It also slightly outperforms M1. Even our worst reconstructions have decent performance; some closely match the original distributions. We see that our random cell reconstructions maintain the problem we saw with M1 of being more diffuse than the ground truth.
Figure 4.15: Scatter-plot of JS Divergence, Centered on Negative Control and Rescaled
Figure 4.16: PDF and CDF of Centered and Scaled JS Divergence for M5
Figure 4.17: A Sample Reconstruction from the Best 10% of Reconstructions
Figure 4.19: A Random Reconstruction
Again, we plot the log marker gene reconstruction expression levels against the ground truth.
Figure 4.21: Megakaryocyte Marker Gene (ITGA2B) Reconstruction
Finally, we compute the joint UMAP of the ground truth transcriptomes with the reconstructed
transcriptomes, and then draw lines connecting each ground truth coordinate with the corresponding reconstructed coordinate. Our final examinations show that M5 performs similarly to M1, but slightly better on some evaluation metrics.
Figure 4.23: UMAP Displaying Reconstruction Differences
Remembering that I’ll be dead soon is the most important
tool I’ve ever encountered to help me make the big choices
in life.
Almost everything–all external expectations, all pride,
all fear of embarrassment or failure–these things just
fall away in the face of death, leaving only what is truly
important.
Remembering that you are going to die is the best way I
know to avoid the trap of thinking you have something
to lose. You are already naked. There is no reason not to
follow your heart.
Steve Jobs
5
Discussion
We first discuss the conclusions we gather from the results presented in the previous section.
5.1 Conclusions
Both our closely-examined methods for distribution reconstruction perform well. Indeed, they per-
form well enough to support our hypothesis that we can reconstruct the full transcriptome / state
of a cell from binary combinations of relatively few genes. The next steps are to take these methods forward by validating them and bringing them into practice.
There are a variety of further directions to pursue this research, both to validate the results presented here and to extend them.
From the results for M1, we can see that this method tends to predict more diffuse distributions
than the ground-truth. This may be a result of using the Laplacian coefficients as an intermediate representation, which smooths the reconstruction. If this is true, we're getting a result that we desired, but we'd also like to make the resulting distribution
tighter to have it fit our ground-truth distributions better. We could simply truncate the resulting distribution and renormalize.
We mentioned in the Methods section that we use scVI, a technique used to clean up raw scRNA-
seq data by imputing missing measurements and otherwise accounting for noise. One open prob-
lem in the field is that different imputation techniques are in use, and these techniques tend to give
different results. We’d like to see if our computational methods are robust to imputation technique,
as, if they are, it indicates that they're learning more biologically relevant information.
One long-standing open problem in the single-cell sequencing space is understanding how to
determine the accuracy of transcriptome reconstruction. This is one reason why we chose to re-
construct distributions as our primary endpoint as opposed to gene expression levels. It’s unclear
what a particular MSE in transcriptome reconstruction means biologically. This is a direction that
may prove fruitful for further research. A corollary of this problem is having some way to directly evaluate reconstructions in biologically meaningful terms.
One interesting approach to this problem could be to learn biologically relevant labels, rather
than the difficult-to-interpret transcriptome. For example, we could learn manually-labelled cell types as a classification problem, and it would thus be easier to gauge the performance of our methods.
We also would like to dig deeper into the assumption that our training data represents all pos-
sible transcriptomes we might see. One way to do this would be to simulate a variety of training-
validation-test set splits, then examine the distributions of distance of test set cells to the closest
training set cell. If we found that test set cells were consistently located quite close to their closest
training set neighbor, this would validate our assumption. Another way to bypass this assump-
tion entirely is simply to collect an enormous amount of training data, perhaps by figuring out how
to use the large amounts of scRNA-seq data that have already been generated for other purposes.
There are technical challenges involved with this, such as correcting for the effects of different experimental batches (so-called "batch effects"), but it appears to be a promising direction. One of
the core insights behind AlphaFold’s success in the protein folding domain was that there’s an enor-
mous amount of unlabelled biological data in the form of protein sequences, and that data was
leveraged to make headway on the protein folding problem. We posit that a similar situation holds
with the glut of single-cell data that's being produced on a regular basis at this moment, and an
interesting question is understanding how to leverage this data to solve interesting biological ques-
tions.
Next, we’d like to come up with a method to represent a test cell as a distribution over the train-
ing cells in a more accurate way, rather than simply assigning it to the closest cell in the training set.
One example could be to consider the training cells as bases for cell state space, and then represent
new cells using their coefficients in this space. The problem with this is that training becomes more difficult, as all the training cells are then just indicator functions, as opposed to the linear combinations used to represent new cells.
The final, and most important direction to take this research would be to use these computa-
tional techniques in concert with an experimental workflow to validate the full-stack thesis that binary combinations of genes can be used to impute the full spatial transcriptome. The experimental
workflow would need to be carefully designed, since it’s unclear how to gather both ground-truth
binary combination data from a targeted FISH method as well as spatial transcriptome data from
the same sample, as would be needed to truly validate that this method works. Nevertheless, it’s the
next problem that would need to be solved to bring the results of this thesis to fruition.
References
[1] Asp, M., Bergenstråhle, J., & Lundeberg, J. (2020). Spatially Resolved Transcriptomes—
Next Generation Tools for Tissue Exploration. BioEssays, 42(10), 1–16.
[3] Cable, D. M., Murray, E., Zou, L. S., Goeva, A., Macosko, E. Z., Chen, F., & Irizarry, R. A.
(2021). Robust decomposition of cell type mixtures in spatial transcriptomics. Nature
Biotechnology, (pp. 1–25).
[4] Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S., & Zhuang, X. (2015). Spatially re-
solved, highly multiplexed RNA profiling in single cells. Science, 348(6233), 1360–1363.
[5] Cleary, B., Murray, E., Alam, S., Sinha, A., Habibi, E., Simonton, B., Bezney, J., Marshall,
J., Lander, E. S., Chen, F., & Regev, A. (2019). Compressed sensing for imaging transcrip-
tomics. bioRxiv.
[6] Crick F. H. (1958). On protein synthesis. Symposia of the Society for Experimental Biology,
12, 138–163.
[7] Femino, A. M., Fay, F. S., Fogarty, K., & Singer, R. H. (1998). Visualization of single RNA
transcripts in situ. Science, 280(5363), 585–590.
[8] Haghverdi, L., Buettner, F., & Theis, F. J. (2015). Diffusion maps for high-dimensional
single-cell analysis of differentiation data. Bioinformatics, 31(18), 2989–2998.
[9] Harvey, W. (2015). Intuition for laplacian matrix of a graph’s eigenvectors and eigenvalues.
[10] Klein, A. M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., Peshkin, L., Weitz,
D. A., & Kirschner, M. W. (2015). Droplet barcoding for single-cell transcriptomics applied
to embryonic stem cells. Cell, 161(5), 1187–1201.
[11] Langer-Safer, P. R., Levine, M., & Ward, D. C. (1982). Immunological methods for mapping
genes on Drosophila polytene chromosomes. Proceedings of the National Academy of Sciences
of the United States of America, 79(14 I), 4381–4385.
[12] Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., & Yosef, N. (2018). Deep generative model-
ing for single-cell transcriptomics. Nature Methods, 15(12), 1053–1058.
[13] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and
Projection for Dimension Reduction. arXiv.
[14] Milo, R., Jorgensen, P., Moran, U., Weber, G., & Springer, M. (2009). BioNumbers The
database of key numbers in molecular and cell biology. Nucleic Acids Research, 38(SUPPL.1),
750–753.
[15] Rodriques, S. G., Stickels, R. R., Goeva, A., Martin, C. A., Murray, E., Vanderburg, C. R.,
Welch, J., Chen, L. M., Chen, F., & Macosko, E. Z. (2019). Slide-seq: A scalable technology
for measuring genome-wide expression at high spatial resolution. Science, 363(6434), 1463–
1467.
[16] Stahl, P. L., Salmén, F., Vickovic, S., Lundmark, A., Navarro, J. F., Magnusson, J., Gia-
comello, S., Asp, M., Westholm, J. O., Huss, M., Mollbrink, A., Linnarsson, S., Codeluppi,
S., Borg, Å., Pontén, F., Costea, P. I., Sahlén, P., Mulder, J., Bergmann, O., Lundeberg, J.,
& Frisén, J. (2016). Visualization and analysis of gene expression in tissue sections by spatial
transcriptomics. Science (New York, N.Y.), 353(6294), 78–82.
[17] Vickovic, S., Eraslan, G., Salmén, F., Klughammer, J., Stenbeck, L., Schapiro, D., Äijö, T.,
Bonneau, R., Bergenstråhle, L., Navarro, J. F., Gould, J., Griffin, G. K., Borg, Å., Ronaghi,
M., Frisén, J., Lundeberg, J., Regev, A., & Ståhl, P. L. (2019). High-definition spatial tran-
scriptomics for in situ tissue profiling. Nature Methods, 16(10), 987–990.
[18] Weber, A. P. (2015). Discovering new biology through sequencing of RNA. Plant Physiol-
ogy, 169(3), 1524–1531.