a dissertation presented
by
Kushagra Sharma
to
The Department of Computer Science
Harvard University
Cambridge, Massachusetts
December 2021
©2021 – Kushagra Sharma
all rights reserved.
Thesis advisor: Professor Sahand Hormoz
Kushagra Sharma
Abstract
Information about biological phenotype can be gleaned from a variety of sources. We’ve made
rapid progress in the last decade toward ever more accurate measurements of one central source:
gene expression levels inside cells. We’re now able to rapidly and cheaply sequence the content
and abundance of RNA transcripts down to single-cell resolution. However, in the process we lose
information regarding the spatial context of the cell: where in the tissue it originated from. Tech-
niques have been developed in the last few years to remedy this problem, by incorporating spatial
information into gene expression measurements. However, these techniques tend to be restricted to
the lab of origin due to their high degree of technical complexity. We aim to alleviate this problem
by exploiting low-dimensional structure in gene expression profiles: we use widely accessible,
low-dimensional experimental measurements to impute the full, high-dimensional spatial transcriptome.
Contents
0 Introduction 1
0.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
0.3 Our contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1 Related Work 6
1.1 Experimental Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Computational Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Methods 12
2.1 Experimental Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Computational Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Results 36
4.1 High Level Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Diving Deeper into Distribution Reconstructions . . . . . . . . . . . . . . . . . 40
5 Discussion 51
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Further Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
References 56
Listing of figures
4.17 A Sample Reconstruction from the Best 10% of Reconstructions . . . . . . . . . 47
4.18 A Sample Reconstruction from the Worst 10% of Reconstructions . . . . . . . . 47
4.19 A Random Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.20 Erythroid Marker Gene (CA1) Reconstruction . . . . . . . . . . . . . . . . . . 48
4.21 Megakaryocyte Marker Gene (ITGA2B) Reconstruction . . . . . . . . . . . . . 49
4.22 HSC/MPP Marker Gene (CRHBP) Reconstruction . . . . . . . . . . . . . . . 49
4.23 UMAP Displaying Reconstruction Differences . . . . . . . . . . . . . . . . . . 50
Acknowledgments
There are so many people in my life to whom credit for this accomplishment is due. Thinking
back and recalling everyone who in some way contributed to my life and success was a pleasure.
Starting from the beginning, my heartfelt thanks to my mom for being one of the most independently-
minded people I’ve ever met. You’re never satisfied by thinking with the crowd, and I’ve come to re-
alize how unique this is. You’ve taught me the same, and have always encouraged me to pursue what
I find interesting, independent of what the world may think. To Papa - thanks for buying me my
first computer, pushing me towards learning programming (even if it may not have worked at first),
and the support throughout my life. You’ve never in my life failed to be there when I’ve needed it
the most. Thanks for being a great father. Thanks to my sister for being a lifelong companion, and
having one of the warmest hearts of anyone I’ve ever met. Your intelligence and maturity continues
to surprise me and warm my heart.
Thanks to all my friends who’ve ever been by my side, exploring ideas, life, and this world to-
gether. To list but a few, Dominic Tanzillo, Kubilay Agi, Keaton Gibbs, Tanay Tandon, Evan Hart,
Ari Hatzimemos, Casey Carter, Soumil Singh, Oscar Avatare, Kendall Zhu, Isabel Haro, Snigdha
Roy, and Alex K. Chen. I’m grateful to have had the chance to share my life with you all.
Thanks to Tim Jaconette and Pai-Ling Yin for bringing a 15-year-old into their research group
and giving me my first exposure to the academic world. Enormous gratitude to Carlos Garay, who
was my first real mentor, and without whom I’d be a significantly worse software engineer. This
thesis and its 10,000 lines of code would not have been possible without your early guidance. Re-
reading through some of our exchanges recently reminded me how brilliant a mentor you were for
me.
A heartfelt thanks to Terri Bittner, who is the most courageous educator I’ve ever had the privi-
lege to learn from. Without her selfless attempts to open the doors of higher mathematics to high
school students, this thesis and my entire career path wouldn’t have been possible. She resisted our
educational system’s drive to mediocrity and refused to accept ’good-enough.’
Thanks to Kenneth Blum for taking me seriously as a scientist, and making me realize that there
was a ’there’ there when it came to scientific exploration.
Thanks to Laura Deming for inspiring me with an example of personal integrity, agency, and
authenticity. You were critical for my development at a key fork in the road, and I’m deeply grateful.
Thanks to Lada Nuzhna for the deep support, encouragement, and nourishing conversations
that are ever-reinvigorating. We share ambitions and value-systems to such an enormous degree, and
I’m excited to keep building with you. I also can’t forget your very instrumental contributions to
this thesis - your machine learning expertise was key to the success of this research.
A deep thanks to Fei Chen for supporting my first explorations in the field of biology. You took
an unproven student in and helped me build my abilities, interests, and confidence in research.
Thanks to Sahand Hormoz for guiding this research project, and David Jacobowitz for providing
just the right degree of involvement - helping me succeed when I could do it by myself, and support-
ing me when I couldn’t.
A final thanks to all who supported me over the last year of transitioning into biology with
conversations, advice, and guidance, and who endowed me with the confidence that I could do this (Tony Kulesa,
Sam Rodriques, Patrick Hsu, Adam Marblestone, Ankit Ranjan, Martin Borch Jensen, Ed Boyden,
Daniel Goodwin, Rob Phillips, Tom Knight). Not all of you know it, but our conversations meant
an enormous deal to me and played a part in putting me where I am now.
Onwards!
Introduction
0
0.1 Background
Science progresses through mutually reinforcing feedback loops between higher quality measure-
ments, higher quality predictions, and higher quality deductions. As a field, biology has mostly
aimed to make better measurements, particularly since the era of molecular biology began. A po-
tential reason for this is that it has been clear for a long time that there are better measurements to
be made, and what those measurements are. It’s been similarly clear in what ways those measure-
ments would be useful if we were able to make them. We can trace back the origins of this focus on
measurement to the transforming factor, a substance that was able to transform a nonvirulent strain of Streptococcus pneumo-
niae, the bacterial cause of pneumonia, into a virulent strain of the same bacteria, thereby carrying
otherwise hereditary information from one strain to another. Once it became clear that there was a
hereditary material, we were off to the races. We began to identify what that material was, to deter-
mine the structure of that material, and to determine how that material came to create phenotype.
In the process of doing so, we uncovered the central dogma of molecular biology: that informa-
tion is transformed from DNA to RNA to protein, with the latter two (RNA and protein) creating
most of the visible phenotypes of the cell. This dogma was established in 1958, 6 and we’re still on
the quest to measure each of these components in their full context. Genome (DNA) sequencing
was the first to mature, although epigenetic sequencing is still largely beyond our grasp.
The product of DNA is RNA, also known as a transcript. Each RNA molecule encodes the
information necessary to produce a single protein (in eukaryotes).* RNA is encoded in (roughly)
the same four-character language as DNA, and each RNA corresponds to a specific segment of
DNA known as the coding region, from which the RNA is synthesized (transcribed). Just as the
full set of genes is known as the genome, the full set of transcripts is known as the transcriptome.
Transcriptome (RNA) sequencing (transcriptomics) has more recently matured.
The same fundamental technology that was developed for genome sequencing was leveraged to
sequence the transcriptome, since RNA can be enzymatically converted to corresponding DNA
in a process known as reverse transcription, and then can be sequenced using genome sequencing
technology.
* With some exceptions, such as RNAs that do not code for proteins but are useful molecules in their own
right.
The key development in genome sequencing technology that enabled RNA sequencing (RNA-
seq) was next-generation sequencing (NGS). 18 NGS dramatically increased the throughput of
genome sequencing, and made it possible to sequence large numbers of transcripts from different
cells.
RNA-seq tells you how many RNA transcripts are present inside of a cell, and tells you what the
sequence of each RNA transcript is. For RNA transcripts that are used to make proteins (mRNA),
the sequences of the transcripts tell you what protein is going to be made, and the quantity of a
specific transcript gives you information about the abundance of the corresponding protein. Since
proteins determine a large portion of the phenotype (observable characteristics and behavior) of a
cell, RNA-seq provides extremely valuable information about the ’state’ of a cell.
Despite the incredible power of RNA-seq, it has a few major weaknesses. To sequence the RNA
from single cells in a tissue (scRNA-seq), we dissociate the cells from the tissue, thereby removing
the individual cells from their spatial context. In the most popular scRNA-seq method, we then
suspend each cell inside an oil droplet, which serves as a reaction chamber for all non-RNA material
to be degraded, and for the RNA inside a cell to be converted into DNA. However, once the cell is
inside this droplet, we lose all knowledge of where the cell came from inside its tissue of origin.
These spatial locations inside the tissue are key to understanding the biological function of tis-
sues, pathological and healthy. Cells tend to be localized, and they tend to interact with and in-
fluence the cells in their local neighborhood. This can be seen from the key importance of spatial
patterns in embryonic development, tumor formation, the development of Alzheimer’s disease, and many other processes.
Spatial transcriptomics refers to a broad class of techniques that share a common goal: to measure
the transcriptome of individual cells while maintaining knowledge of their spatial context within a
tissue.†
There have been many methods developed to solve this problem, which are covered in detail
in the Related Work chapter. These techniques have one thing in common: they are technically
complex to perform, and require a degree of implicit knowledge that is difficult to transfer out of
the originating laboratory. Only 4/11 of the developed transcriptome-wide spatial transcriptomics
methods have been reproduced outside of the originator’s lab, and none are in widespread usage.
This is a flaw common to much of experimental biology, and is concerning for a method that has
the potential to produce otherwise inaccessible information to aid our understanding of basic and
translational biology. 1
0.2 Motivation
Right now, spatial transcriptomics data is accessible only via the technically challenging and
centralized methods previously mentioned. What if we could use insights about the underlying
structure of transcriptomic data to broaden access to methods to generate this data? This is the central question motivating our work.
To be more specific about the difficulty of spatial transcriptomics methods, it is specifically full
transcriptome data that is difficult to generate. Eukaryotes have on the order of 10^5 - 10^6 transcripts
per cell, with the number of unique transcripts on the order of 10^5. 14 Measuring the entire range of
possible genes is difficult. However, there are widely used methods that allow on the order of 10^2 genes to be measured with spatial context.
Our work seeks to make full spatial transcriptomic data more abundant by using data from
widely accessible techniques that measure 10^2 genes to reconstruct the measurements for 10^5 genes.
† Transcripts also have associated temporal information. Transcripts are produced at a particular time,
and have a half-life on the order of minutes. Recent efforts are aiming at resolving the age of transcripts in
scRNA-seq to measure this information.
0.3 Our contributions
Our work seeks to use these ”low-dimensional” (10^2 genes) measurements to reconstruct / impute
the full high-dimensional transcriptome. We simulate a realistic low-dimensional dataset and see if we can faithfully reconstruct the true high-dimensional
data. We use a variety of methods from statistics, machine learning, and mathematics that are elabo-
rated on in Methods and Experiment Goals and Design. If successful, these computational tech-
niques provide a framework for designing less technically challenging experimental paradigms for spatial transcriptomics.
We construct methods for imputation of data, for measurement of imputation error, for repre-
sentation of data, and for evaluation of performance. We then critically evaluate the performance of
our methods.
Further directions for research are discussed in the Discussion section of the paper. All code for
this research accompanies the thesis.
It is not the critic who counts; not the man who points out
how the strong man stumbles, or where the doer of deeds
could have done them better. The credit belongs to the
man who is actually in the arena, whose face is marred
by dust and sweat and blood; who strives valiantly; who
errs, who comes short again and again, because there is no
effort without error and shortcoming; but who does actu-
ally strive to do the deeds; who knows great enthusiasms,
the great devotions; who spends himself in a worthy cause;
who at the best knows in the end the triumph of high
achievement, and who at the worst, if he fails, at least
fails while daring greatly, so that his place shall never be
with those cold and timid souls who neither know victory
nor defeat.
Theodore Roosevelt
Related Work
To provide context for the current state of spatial transcriptomics, the following sections de-
scribe the most popular transcriptome-wide and targeted spatial transcriptomics methods. We also describe
some computational techniques that served as inspiration for the methods used in this research.
1.1 Experimental Methods
To provide context for the state of spatial transcriptomics, we sketch out methods for both
transcriptome-wide and targeted measurement. We first describe techniques that are able to measure all gene expression profiles with spatial context.
Spatial Transcriptomics
The original spatial transcriptomics method that bears the name is Spatial Transcriptomics, from
2016. This method reverse-transcribes mRNA from a tissue sample in place into cDNA, and uses
spatially specific oligonucleotide primers to perform the reverse-transcription. Thus, the cDNA
formed from the mRNA encodes the spatial position of the mRNA in the original tissue sample.
When the cDNA is sequenced, the original location of the mRNA can be resolved. 16
Slide-seq
One of the more recent spatial transcriptomics techniques that has the potential to become accessi-
ble and widespread is Slide-seq. Slide-seq involves the following process. First, you make a ”puck” -
a surface of 10μm DNA-barcoded beads with a known sequence to location map. Then, you place
a tissue slice on top of the puck, and allow RNA from the tissue to be captured by the individual
beads. When you sequence the beads, you read out both the RNA and the DNA barcode. The
DNA barcode allows you to resolve the spatial location of the RNA on the puck, creating a spatial
transcriptome. 15
One of the main problems with the technique is the problem of ’doublets,’ where more than one
cell’s RNA is captured in a bead, and it is thus difficult to resolve which transcripts belong to which
cell. This is caused by the fact that the resolution of individual beads is 10μm which is around the
average diameter of a cell, meaning that multiple cells can be included in a single bead. Computa-
tional approaches are currently being developed to resolve individual cells using benchmark measurements.
HDST
HDST is very similar in technique to Slide-seq. The main difference is the size of the beads and the
resulting resolution - HDST uses beads with a 2μm spatial resolution, allowing for higher spatial resolution than Slide-seq.
Compressed Sensing
Compressed sensing for imaging transcriptomics is the method most similar to our own. This
method leverages compressed sensing, a technique from signal processing, to infer gene expression
levels from sparse linear combinations of particular genes. The sparse measurements are collected
with probes that are added in the stochiometric combinations corresponding to the linear combi-
nations needed. Then, the framework of compressed sensing is used to impute the overall gene ex-
pression profile. The main differences between this work and ours are the differences in imputation methodology.
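As a rough illustration of the compressed-sensing recovery step (this is a generic sketch using iterative soft-thresholding, not necessarily the cited method's actual algorithm; the matrix and penalty here are placeholders):

```python
import numpy as np

def ista(A, y, lam=0.1, iters=500):
    """Iterative soft-thresholding (ISTA): minimizes
    0.5 * ||Ax - y||^2 + lam * ||x||_1,
    the standard convex relaxation used in compressed sensing."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = x - A.T @ (A @ x - y) / L          # gradient step on the smooth term
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrink toward zero
    return x
```

Here A plays the role of the probe-combination measurement matrix: each row describes one composite measurement, and the L1 penalty drives most recovered expression coefficients to exactly zero.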
Targeted Methods
There are a variety of methods for targeted measurement of specific transcripts in situ, with the
earliest method going back to 1982. All of the methods surveyed here share a core technique in
common: fluorescent in situ hybridization (FISH).
FISH
The original FISH method is quite simple. It aimed to localize particular DNA sequences on Drosophila
chromosomes. Biotin molecules were enzymatically incorporated onto DNA probes. The DNA
probes were then bound to the target sequences in situ, since two complementary DNA sequences will hybridize to each other.
The probe is then fluorescently tagged using any of a variety of methods, most commonly using
antibodies that bind to the biotin molecule. The fluorescence can then be read out using fluores-
cence microscopy, allowing for a readout of localization of specific DNA sequences in situ. 11 The
method is more generally applicable to identifying particular DNA sequences in any context, not just on chromosomes.
smFISH
FISH was then adapted to detect RNA molecules, since the basic methodology for targeting differ-
ent nucleic acids is quite extensible. smFISH uses multiple probes to bind to different regions of the
same mRNA molecule, creating a bright spot indicative of a transcript upon binding. This reduces
the false positive rate since unbound probes are unlikely to localize to the same region and create a comparable bright spot.
MERFISH
smFISH was quite a powerful technique, but its main drawback was the lack of throughput. Even
using multiple fluorophores that could be read out on unique wavelengths, there’s a cap on the
number of transcripts measurable per round of imaging. Additionally,
each round of hybridization degrades the quality of the sample, so you can’t just run thousands of
rounds to measure all the genes. Finally, each round takes a substantial amount of time, due to the
need for high-resolution imaging of a large slide (relative to the resolution of the necessary imaging).
Instead of making binary 0/1 measurements of transcripts, MERFISH assigns each transcript of
interest a K-bit barcode. There are K rounds of hybridization. During each round (i) of hybridiza-
tion, all RNAs with 1s in the ith position have fluorophores bound to them. The combined output
of all rounds is a read-out of binary barcodes at each spatial location where a transcript is located.
This allows the transcript at each location to be reconstructed. The Hamming distance (a distance
measure between two binary strings) between each pair of transcript barcodes is maximized to minimize
decoding errors.
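A toy version of this barcoding scheme can be sketched as follows (a greedy codebook construction rather than the optimized Hamming codes used in practice, with illustrative parameters):

```python
import itertools
import numpy as np

def barcode_set(k_bits, min_dist, n_codes):
    """Greedily collect K-bit binary barcodes whose pairwise
    Hamming distance is at least min_dist."""
    chosen = []
    for bits in itertools.product((0, 1), repeat=k_bits):
        c = np.array(bits)
        if all(np.sum(c != prev) >= min_dist for prev in chosen):
            chosen.append(c)
            if len(chosen) == n_codes:
                break
    return np.array(chosen)

def decode(readout, codebook, max_errors=1):
    """Assign a measured bit-vector to the nearest barcode, tolerating up to
    max_errors flipped bits; returns None if nothing is close enough."""
    dists = np.sum(codebook != readout, axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] <= max_errors else None
```

With a minimum pairwise distance of 4, a single flipped bit in the readout still decodes to the correct transcript, which is exactly the error tolerance the text describes.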
This method allows several orders of magnitude higher scale than smFISH due to the efficiency
of information gathering in each round of imaging, since the rounds of imaging are the scarce quan-
tity in each of these methods, as they require both time and degrade the sample quality. 4
1.2 Computational Methods
The main computational technique that influenced this work was the application of diffusion maps
to scRNA-seq data.
Diffusion maps is a non-linear dimensionality reduction technique that uses the distance be-
tween two points and the corresponding transition probabilities between two points (with points
treated as nodes of a graph).
This idea was applied to scRNA-seq data to define differentiation trajectories. Cells develop by
travelling along a trajectory in transcriptome space from their undifferentiated state to a differentiated
state, and the notions of distance and transition probabilities between states are clearly present
in the fundamental biology. 8 We used this idea of non-linear dimensionality reduction based on graph structure throughout our own methods.
You keep on learning and learning, and pretty soon you
learn something no one has learned before.
Richard Feynman
Methods
2
In this chapter, we describe the experimental methods used to generate the data we work with,
as well as the computational methods we use to analyze it.
2.1 Experimental Methods
2.1.1 scRNA-seq
Our data was gathered by members of the Hormoz Lab using droplet-based scRNA-seq on patient
bone marrow samples. The dataset was enriched for CD34+ cells (precursor cells) and processed prior to our analysis.
In droplet-based scRNA-seq, individual cells are dissociated from the tissue of origin and en-
capsulated into nanoliter oil droplets. Once inside the droplets, the mRNA is reverse-transcribed
into cDNA with a unique barcode sequence attached. The cDNA is then pooled together and se-
quenced, and the unique barcodes allow the sequencing results to be resolved into the single cells of origin.
One of the main weaknesses of scRNA-seq is that it suffers from large amounts of noise. In partic-
ular, because it only captures at most 50% of total transcripts inside a cell, it’s prone to not measuring
lowly expressed transcripts. To address this, we apply a
method that is used to impute more accurate transcriptome data from the raw data collected from
scRNA-seq. It uses stochastic optimization and deep neural networks to approximate the underly-
ing distribution that generates scRNA-seq measurements, i.e. the biological ground truth. We use it
as a pre-processing step on our scRNA-seq data for all analyses in this research. 12
2.2.2 Random Binary Matrices
We construct random binary matrices which we use to construct a dataset from the full transcriptome measurements
we get from scRNA-seq measurements. We find, surprisingly, that the variance of the performance
of our randomized matrices is low, implying that using biological domain knowledge to select genes may not be necessary.
We use random binary matrices with 50 genes measured across 10 distinct measurements, i.e.
each feature in our vector is a binary combination of 50 genes, and we have 10 such features. We ran-
domly select the genes from the set of all genes in the transcriptome, without replacement for any
particular feature but with replacement across features. We chose this setting due to its experimental
UMAP is a non-linear dimensionality reduction technique commonly used to visualize scRNA-seq
data. It creates a high-dimensional graph of the data and then tries to construct a low-dimensional
graph that maintains as many properties of the high-dimensional graph as possible. We use it mainly for visualization.
Neural networks are a commonly used computational technique in computer science, used to pre-
dict an output (labels) from an input (features). They’re ’trained’ on a training set that is represen-
tative of the overall distribution the data is drawn from. The training set contains a large amount of
feature:label mappings.
The parameters of the neural network are optimized to minimize a loss function, which is com-
puted by comparing the outputs of the neural network to the ground-truth labels for the training
data. The loss function is a function of the parameters of the neural network and can be differen-
tiated with respect to those parameters, allowing for simple techniques like gradient descent to be used for optimization.
There are various hyper-parameters of the neural network such as the number of parameters,
the orientation of parameters, etc. We hold out a small number of feature:label maps known as the
validation set to optimize the hyper-parameters by examining values of the loss function on the held-out data.
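The train/validate loop described above can be sketched with a linear model standing in for the thesis's actual networks (the data, learning rate, and epoch count here are purely illustrative):

```python
import numpy as np

def train(X, y, X_val, y_val, lr=0.1, epochs=200):
    """Minimal gradient-descent fit of a linear model, reporting validation
    loss. The same loop structure applies to a neural network, with the
    linear model swapped for a deeper parameterization."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE loss
        w -= lr * grad                           # gradient-descent update
    val_loss = float(np.mean((X_val @ w - y_val) ** 2))
    return w, val_loss
```

Hyper-parameters such as `lr` and `epochs` would be chosen by comparing `val_loss` across settings, exactly as the validation-set procedure above describes.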
Our single cell data sits in an N_genes-dimensional transcriptomic space, where N_genes is the number
of genes measured. We have reason to believe that Euclidean distance between cells in this space
has biological meaning. Cells in a small neighborhood may be of the same subtype. For example,
CD8+ T-cells and microglia have distinct neighborhoods that can be identified via cluster analy-
sis. Cells that are close together in transcriptomic space may also be a part of the same cell lineage,
since cell fate transitions occur via changes in the transcriptome of a cell. For example, multipotent
hematopoietic stem cells (hemocytoblasts) are close in transcriptomic space to their direct descendants.
We can formalize the importance of Euclidean distance on this space by constructing a graph,
where the nodes are cells and edges are some representation of distance. We use the k-nearest neighbor graph (KNNG).
The KNNG construction method is quite simple. We compute Euclidean distances between
all cells, and for each cell, we construct edges between that cell and each of its K nearest neighbors
in Euclidean space. Note that as the K nearest neighbor relation is not symmetric, the number of
edges can vary between nodes. We choose to make the graph symmetric manually for ease of further analysis.
Let A be the adjacency matrix for our KNNG and D be a diagonal matrix with the degrees of the
vertices down the diagonal (i.e. the degree matrix). Then the Laplacian matrix is defined as L ≡ D − A.
Since L is a real symmetric matrix, it’s diagonalizable and has eigenvalues λ_1, ..., λ_N and
orthonormal eigenvectors v_1, ..., v_N. Let us consider this list sorted by eigenvalue, such that
λ_1 ≤ λ_2 ≤ ... ≤ λ_N.
The eigenvectors of L form a basis set for integrable functions on our graph. This is because the
eigenvectors of the Laplacian roughly converge to the eigenfunctions of the Laplace-Beltrami op-
erator on the underlying manifold, and these eigenfunctions form a basis set for functions on the
manifold. This is itself a generalization of the Fourier basis functions from functions on R to com-
pact Riemannian manifolds. Importantly, because our eigenvalues are monotonically increasing, the
eigenvectors are ordered from low to high frequency in the function space, just as with Fourier basis functions. Thus, using the first N eigenvectors to represent a function captures its lowest-frequency structure.
We use these Laplacian eigenvectors to represent, embed, and reconstruct functions on our cell
graph. We can project a function onto the Laplacian eigenvectors, and use the resulting coefficients
to capture the function. We can use the first N coefficients to reconstruct the lower-frequency struc-
ture of the function. We can try to impute those coefficients for some data that we know has a cor-
responding function on the graph, and use the imputation to compute an approximation of the
16
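The pipeline just described — KNNG construction, Laplacian eigendecomposition, and projection onto the first N eigenvectors — can be sketched as follows (the brute-force distance computation and toy parameters are illustrative, not the pipeline's actual implementation):

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric k-nearest-neighbor adjacency matrix from points X (cells x genes)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a cell is not its own neighbor
    A = np.zeros_like(d)
    nbrs = np.argsort(d, axis=1)[:, :k]          # k nearest neighbors per cell
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nbrs.ravel()] = 1
    return np.maximum(A, A.T)                    # symmetrize manually, as in the text

def laplacian_basis(A):
    """Eigenvalues/eigenvectors of L = D - A, sorted by increasing eigenvalue."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.eigh(L)                     # eigh: symmetric, ascending order

def project_and_reconstruct(f, evecs, n):
    """Keep only the n lowest-frequency Laplacian coefficients of a function f."""
    coeffs = evecs[:, :n].T @ f                  # project onto the basis
    return evecs[:, :n] @ coeffs                 # low-frequency reconstruction
```

Using all N eigenvectors recovers the function exactly; truncating to small n keeps only its smooth, global structure on the graph.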
2.2.7 Distribution Functions in Cell Space
The specific class of functions on graphs that we’re interested in are distributions over cells. We use
the aforementioned binary combinations to reconstruct a cell state, and we specifically reconstruct
a discrete Gaussian analogue, centered at the ground truth cell. We considered using an indicator
function, but the sharp peak on a specific cell is unforgiving for statistical imputation methods,
and a somewhat-diffuse distribution over cell states may be closer to the biological ground truth,
assuming that our unseen cell can be represented as some linear combination of seen cells, which is a
less strict requirement than explicitly assuming that our unseen cell is in the training set. We present our parameter choices for this distribution in the next chapter.
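A minimal sketch of this ground-truth construction (the distance measure and the normalization scheme are our assumptions, not necessarily those of the thesis pipeline):

```python
import numpy as np

def discrete_gaussian(dists, sigma):
    """Discrete Gaussian analogue over cells, centered at the target cell.

    dists[i] is the distance from cell i to the target (the target itself has
    distance 0); whether this distance is Euclidean or graph-based is an
    assumption of this sketch."""
    logits = -np.asarray(dists, dtype=float) ** 2 / (2.0 * sigma ** 2)
    p = np.exp(logits - logits.max())   # subtract max for numerical stability
    return p / p.sum()                  # normalize to a probability distribution
```

Small σ concentrates mass on the target cell (approaching the indicator function), while large σ diffuses mass over its neighborhood.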
We use two error measures, one to measure the accuracy of distribution reconstruction and one to
measure the accuracy of direct transcriptome reconstruction.
Jensen Shannon Divergence
The Jensen Shannon divergence (JS divergence) is a symmetric measure of distance between two
distributions. It’s based on the Kullback–Leibler (KL) divergence, which for two distributions P, Q is defined as:

$$\mathrm{KL}(P \parallel Q) = \sum_{i=1}^{N_{\text{genes}}} p_i(x) \log \frac{p_i(x)}{q_i(x)}$$

The JS divergence is then defined as JS(P, Q) = ½ KL(P ∥ M) + ½ KL(Q ∥ M), where M = ½(P + Q).
Mean Squared Error
Mean squared error is a standard error measure in Euclidean space. It’s defined as:

$$\mathrm{MSE} = \frac{1}{N_{\text{genes}}} \sum_{i=1}^{N_{\text{genes}}} \left( X^{(i)} - \hat{X}^{(i)} \right)^2$$
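Both error measures can be written directly from their definitions:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(P||Q) for discrete distributions;
    assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: symmetric in P and Q, and always finite."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mse(x, x_hat):
    """Mean squared error between a transcriptome and its reconstruction."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    return float(np.mean((x - x_hat) ** 2))
```

Note that JS stays finite even where one distribution has zero mass, because the mixture M is positive wherever either P or Q is; this is exactly why it is preferred over raw KL here.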
But, look, I don’t think of myself, I guess for me, people tell
me, “Oh, you’re so smart about this or that thing,” but to
me, that’s not what it feels like to me. I’m always trying
to figure out things that are difficult for me to figure
out. Now, maybe some of those things are really, really
difficult for some other people to figure out. But for me,
I’m always kind of–I’m always struggling to figure stuff
out so it doesn’t…The, kind of, the internal perception is
not one of kind of—I mean, the fact that I’m, I’m always
trying to figure stuff out.
Stephen Wolfram
Experiment Goals and Design
3
We’re trying to understand whether or not it’s feasible to reconstruct the whole transcriptome
from low-dimensional measurements by taking binary combinations of genes from scRNA-seq data. We begin by describing our overall approach.
3.1 Overview
Our goal is to reconstruct the expression levels of every gene in a cell from binary combinations of a small subset of genes. We decompose this into two subproblems:
1. Reconstructing a distribution over cell states (transcriptomes) from our training set, in KNN
graph space
2. Reconstructing gene expression levels directly
The first is our primary problem in this research. We choose to approach the problem in this way
for a few reasons. We believe that the KNNG structure encodes important information about the
biological ground truth that isn’t captured by direct inference of expression levels. First, it makes
explicit the relationships between cells and their nearest neighbors. This has biological meaning be-
cause cells transition through transcriptome states continuously, for example, during differentiation
or stimulus-response. Therefore, cells with nearby transcriptome states are biologically related to
each other.
Additionally, the Laplacian on this graph incorporates additional information with biological
meaning. The Laplacian Matrix can be interpreted as a discrete analog of the Laplace operator ∇2 ,
representing the average rate of change at a vertex (cell). We can think of this as the local transition
probabilities around any cell, or the rate of diffusion for a cell starting at a particular location on the
graph.
Since we want to reconstruct a distribution on the KNNG, we need a natural coordinate system
to reconstruct to. The eigenvectors of the Laplacian form such a coordinate system, since imputing
the first N coordinates in this coordinate system is equivalent to reconstructing the most important
basis coefficients of the distribution function, in the sense of capturing the most information about
the underlying distribution. Additionally, we posit that using the Laplacian coordinate system as an intermediate representation makes the imputation problem more tractable.
Thus, we use a variety of methods to reconstruct the coordinates of our distribution in the Lapla-
cian coordinate system (the Laplacian coefficients) as well as to reconstruct the distribution directly.
We also reconstruct the gene expression levels directly as a secondary line of effort.
3.2 Data
We split the data into a training, validation, and test set, as is typical for statistical inference prob-
lems. We use the training set to learn directly learnable parameters such as neural network weights,
and we use the validation set to determine hyper-parameters, such as the number of layers in the
neural network. Our models deliberately have no exposure to the test set until their evaluation.
3.3 Error Measures
We needed error measures to measure reconstruction error both between two distributions and between two transcriptome reconstructions.
To measure error between two distributions, we chose to use the Jensen Shannon Divergence.
As previously mentioned, the Jensen–Shannon divergence (JS divergence) is based on the Kullback–Leibler divergence, but is symmetric and finite. We used the JS divergence for a variety of use cases.
To train our neural networks that aimed to predict the distribution directly, we used the Kullback-
Leibler divergence.
To measure errors between two transcriptome reconstructions, we used the mean squared error (MSE).
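Both error measures can be sketched in a few lines; this is an illustrative implementation (the base-2 convention and the use of scipy's `jensenshannon`, which returns the square root of the divergence, are assumptions), not necessarily the thesis's exact code:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(p, q):
    """Symmetric, finite divergence between two distributions.
    scipy's jensenshannon returns the square root (a metric), so square it."""
    return jensenshannon(p, q, base=2) ** 2

def mse(x, y):
    """Mean squared error between two transcriptome vectors."""
    return float(np.mean((np.asarray(x) - np.asarray(y)) ** 2))

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
```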
3.4 Parameter Selection
The primary target of our reconstruction is a function over transcriptome space. In particular, we
aim to reconstruct a discrete Gaussian distribution centered on the target cell. The "ground truth" for each cell is this Gaussian.
This raises the question of how to select the standard deviation, σ. The purpose of using a Gaussian distribution, as opposed to, say, an indicator function, is to provide some degree of smoothness
around our target cell, but the aim is still for our reconstruction to be highly similar to our target
cell. In other words, we want some small number of cells around the target cell to be in our ground
truth distribution. To select σ, we chose N = 10 as the number of cells, not including the target cell, that we would like to include in our ground truth distribution:
N(p(i)) ≥ 10.
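A sketch of this criterion on toy data; the inclusion threshold below is an illustrative assumption (the thesis does not specify its value here), as are the distances:

```python
import numpy as np

def gaussian_on_cells(distances, sigma):
    """Discrete Gaussian over cells: p_i proportional to exp(-d_i^2 / (2 sigma^2))."""
    w = np.exp(-np.asarray(distances, dtype=float) ** 2 / (2 * sigma ** 2))
    return w / w.sum()

def n_included(p, target, threshold=1e-3):
    """N(p(i)): cells (excluding the target) whose probability exceeds a threshold.
    The threshold value is a hypothetical choice for illustration."""
    mask = np.asarray(p) > threshold
    mask[target] = False
    return int(mask.sum())

# Toy example: 100 cells on a line, target cell in the middle.
distances = np.abs(np.arange(100) - 50.0)
p = gaussian_on_cells(distances, sigma=5.0)
```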
The appropriate σ for each distribution will vary by cell, since the local density of cells in transcriptome space varies, both in our dataset and biologically. Thus, we
wanted to investigate the degree of variance in this respect for our dataset. We begin by considering
the distribution of N(p(i)) for σ = 5 over the cells in our training set in Figure 3.1.
The above results indicate that there are a large number of cells that require a significantly larger
value of σ to have an acceptable N(p(i)). There is a large amount of variance in N(p(i)) in general
for σ = 5. This indicates that we'll need to choose custom values of σ for each cell that meet our criterion.
One way we could do this, as previously mentioned, is to run a binary search for the minimum value of σ such that our criterion, N(p(i)) ≥ 10, is met. Unfortunately, N(p(i)) is not monotonic in σ. We can see this by noting that as σ → ∞, we get a uniform distribution, with all probabilities lower than our threshold, and N(p(i)) = 0. This can also be seen in Figure 3.2.
Thus, N(p(i)) is not an appropriate measure to search over. One could alternatively search using
a function that is monotonic in σ, for example, the entropy, or the number of cells inside the 2σ
interval. We instead choose an elegant exact method: setting σ_j = ∥x^(j) − x_10^(j)∥, where x_10^(j) is the 10th nearest neighbor to x^(j). This functionally creates a distribution with 10 cells in the ≈ 95% confidence interval.
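This σ selection can be sketched with scikit-learn's `NearestNeighbors`; the toy data here stands in for the real expression matrix:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))     # toy stand-in for cells in transcriptome space

# Query 11 neighbors because each point is its own nearest neighbor at distance 0;
# column 10 is then the distance to the 10th true neighbor.
nn = NearestNeighbors(n_neighbors=11).fit(X)
dists, _ = nn.kneighbors(X)
sigma = dists[:, 10]               # sigma_j = ||x^(j) - x_10^(j)|| per cell
```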
Figure 3.3: Summary of the Properties of the Gaussian with Systematically Selected σ. (c) Examining the Number of Cells in the 95% Interval on a UMAP Visualization. (d) Examining the Distribution of the Entropy for Systematic σ.
Table 3.1: Summary Statistics for Outlier Cell
Some cells include far more than 10 cells in their 95% interval because our distribution is not precisely Gaussian, since it's discrete. We can examine why this is the case by looking in more detail at, for example, the cell at the rightmost end of the distribution, with N = 3415 cells included in its 95% confidence interval. We'll call this the "Outlier Cell" for the remainder of this section. We can first examine the summary statistics of the distribution in Table 3.1.
It appears that we have a highly uniform distribution. To understand why, we look at the dis-
tances of the 10 nearest neighbors to the outlier cell relative to the median cell with respect to the
confidence interval distribution, which will give us insight into σ for the underlying distribution.
Figure 3.4: The Euclidean Distance to the kth Neighbor, Outlier and Median
We can examine the resulting distributions by plotting probabilities on a UMAP visualization for a few cells. In Figure 3.5 we examine the
outlier cell, the median cell, and two randomly selected cells.
As we can see, the resulting distributions are not perfect but are decently localized and peaked
on the target cell. Note that UMAP visualizations themselves make assumptions about cell-cell
distances and thus distance on a UMAP plot is not necessarily equivalent to biologically-relevant
distance in transcriptome space. We can see this by plotting key marker genes and observing non-perfect localization.
Finally, we plot a few distributions for randomly selected cells side by side with the UMAP in Figure 3.6 so we can get a sense for the shape of the distribution.
A common rule of thumb for KNN graphs is K = √N, where N is the number of data points. We naively began with this setting, but found that this was likely too high considering the structure of the data. Below, we draw the UMAP representation of the data in two settings: the first connects each cell and its nearest neighbor, and the second connects each cell and the farthest neighbor with which it shares an edge in the symmetric KNNG with K = √N.
Figure 3.7: Visualizing Neighbor Distances with K = √N. (a) UMAP with the Farthest Neighbor Connected. (b) UMAP with the Nearest Neighbor Connected.
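The symmetric KNNG with K = √N, and the per-cell nearest/farthest neighbor distances examined here, can be sketched as follows (the toy data is an assumption standing in for the expression matrix):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 10))          # toy stand-in for the expression matrix
K = int(np.sqrt(len(X)))                # the K = sqrt(N) rule of thumb: K = 20 here

# Directed KNN graph with Euclidean-distance edge weights, then symmetrized:
# keep an edge if either cell lists the other among its K neighbors.
A = kneighbors_graph(X, n_neighbors=K, mode="distance")
A = A.maximum(A.T).tocsr()

# Per cell: distance to the nearest and to the farthest connected neighbor.
nearest = np.array([A.data[A.indptr[i]:A.indptr[i + 1]].min() for i in range(A.shape[0])])
farthest = np.array([A.data[A.indptr[i]:A.indptr[i + 1]].max() for i in range(A.shape[0])])
```

Comparing `nearest` against `farthest` is the quantitative version of the visual comparison in Figure 3.7.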
We see in Figure 3.7 that the farthest neighbor is quite far in UMAP space, indicating that we're connecting cells that are not closely related in transcriptome space.
The UMAP dimensionality reduction method itself constructs such a graph, and uses K = 15 as its default. Since UMAP is a standard dimensionality reduction method for single-cell sequencing data, we follow its convention.
Since the UMAP method uses a graph representation of the data as an intermediate computa-
tion, and is a well accepted visualization method for single cell data, we can compare the distribution
of certain marker genes (genes that have a spatial pattern in UMAP-reduced data as well as biological significance in differentiating between cell types in the bone marrow) between the original UMAP
representation and UMAP computed using our KNNG. These are plotted in Figure 3.8.
Table 3.2 lists the marker genes we chose to plot, by cell type. We plot these same marker genes for all marker gene analyses in this thesis. Note that we plot log rather than absolute expression levels.
Figure 3.8 (a): Erythroid Marker Gene
We see in Figure 3.8 that the graph maintains the expected spatial distributions for these marker genes, so we proceed with this KNNG.
We want to understand how effective the Laplacian formulation is at allowing for a reconstruction
of the Gaussian distribution. We also want to get a sense for how the Laplacian reconstruction func-
tions.
We first address the first question by plotting a few Laplacian eigenvectors on the UMAP of the
training cells in Figure 3.9. The eigenvectors are bases for functions on the graph, and by plotting them we can see what each captures.
As we proceed from the first eigenvector to the last, each captures less and less variance of func-
tions on graph space. We can get a sense for what patterns each eigenvector is capturing from the
above visualization; some are capturing spatial patterns, some are more diffuse over the entire distri-
bution of cells.
Next, we examine the effectiveness of the Laplacian reconstruction and the number of coeffi-
cients needed to effectively reconstruct a distribution. To probe into this, we first look at the mean
JS Divergence when using different numbers of (true) Laplacian coefficients to reconstruct a distribution.
Figure 3.9: Visualizing Various Laplacian Eigenvectors
Secondarily, when we perform our reconstruction using Laplacian coefficients, we need to turn our reconstruction into a probability distribution. Since we're not using all the Laplacian coefficients for many reconstructions, the resulting function is not constrained to be a distribution, so we must make the function positive and normalized. There are a few ways we can make entries positive: exponentiation, squaring, and taking the absolute value. We compare the JS Divergence
with respect to the number of coefficients used to make the reconstruction for all three of these
methods.
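The truncated-coefficient reconstruction and the three positivity fixes can be sketched as follows; the path-graph Laplacian and Gaussian-like target here are toy assumptions, not the thesis's cell graph:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def reconstruct(coeffs, eigvecs, n, method="abs"):
    """Rebuild a distribution from its first n Laplacian coefficients. The truncated
    reconstruction need not be positive, so force positivity (abs, square, or exp,
    the three options compared in the text) and renormalize."""
    f = eigvecs[:, :n] @ coeffs[:n]
    if method == "abs":
        g = np.abs(f)
    elif method == "square":
        g = f ** 2
    else:  # "exp"
        g = np.exp(f)
    return g / g.sum()

# Toy example: a Gaussian-like distribution on a path graph, in the Laplacian eigenbasis.
m = 60
adj = np.eye(m, k=1) + np.eye(m, k=-1)
lap = np.diag(adj.sum(axis=1)) - adj
_, V = np.linalg.eigh(lap)               # columns = Laplacian eigenvectors, low modes first

p = np.exp(-(np.arange(m) - 30.0) ** 2 / 20.0)
p /= p.sum()
coeffs = V.T @ p                         # coefficients of p in the eigenbasis

p_hat = reconstruct(coeffs, V, n=20, method="abs")
err = jensenshannon(p, p_hat, base=2) ** 2
```

Because the low eigenvectors capture smooth functions, 20 of 60 coefficients already yield a small JS divergence on this toy graph.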
We see from Figure 3.10 that the absolute norm becomes better than the squared norm for a very
large number of coefficients. This makes sense, because the problem of having negative entries in
the reconstructed distribution would no longer be a problem for large numbers of coefficients, be-
cause the reconstruction would be naturally a probability distribution without any post-processing.
However, for the numbers of coefficients that we use in our reconstructions (≪ 1000), the squared norm performs better.
Figure 3.10: Examining the Reconstruction Performance with Varying Normalization Methods, Coefficient Numbers
We now want to see how many Laplacian coefficients we should train to reconstruct. We trained neural networks to predict varying numbers of coefficients, reconstructed the distribution with the Laplacian eigenvectors, and computed the JS divergence. The results are displayed in Figure 3.11.
Figure 3.11: Examining the Neural Network Reconstruction Performance by Laplacian Coefficient Count. (a) Reconstruction Performance by Laplacian Coefficient Count. (b) Same as (a), but Zoomed in on the Minimum Region.
We see a minimum at N = 50 coefficients used, so we choose to use that for our reconstruction.
3.5 Reconstruction Methods
We previously discussed the shape of the technical methods we used in our reconstruction; we now discuss how we used those methods to ask the questions we're interested in.
We begin with our primary task, the distribution reconstruction. We’re testing the following meth-
ods’ reconstruction accuracy (all beginning with the binary linear combination of genes):
1. Using a neural network to reconstruct the first 50 Laplacian coefficients, and then using the Laplacian eigenvectors to reconstruct the distribution
2. Using a neural network to reconstruct the whole transcriptome, then using the transcriptome to construct the distribution
3. Using a neural network to reconstruct the whole transcriptome, then using a neural network to reconstruct the first 50 Laplacian coefficients, and then using the Laplacian eigenvectors to reconstruct the distribution
For our positive control, we used the true first 3000 Laplacian coefficients, and then used the Laplacian eigenvectors to reconstruct the distribution.
To be clear, the positive control does not involve binary measurements, but instead uses the full
transcriptome for test-set cells to construct a distribution over the KNNG by finding the nearest
neighbor and constructing a distribution over that cell. We then take the first 3000 Laplacian coefficients of that distribution.
For our negative controls, we used:
1. A uniform distribution over cells
2. A random distribution over cells, p(i) ∝ R where R is a random real number in [0, 1]
Next is our secondary evaluation, the transcriptome reconstruction. We’re testing the following
methods’ reconstruction accuracy (all beginning with the binary linear combination of genes):
1. Using a neural network to reconstruct the first 50 Laplacian coefficients, then using the Laplacian eigenvectors to reconstruct the distribution, and taking the distribution-weighted average of training-set transcriptomes
2. Using a neural network to reconstruct the first 50 Laplacian coefficients, then using a neural network to reconstruct the transcriptome from those coefficients
For our positive controls, we used:
1. The ground truth transcriptome with multivariate Gaussian noise, with standard deviation calculated from each gene's expression levels
2. Using the true first 50 Laplacian coefficients, followed by a neural network reconstruction of
the transcriptome
2. A random transcriptome drawn from a multivariate Gaussian centered on the average tran-
scriptome with standard deviation calculated from each gene’s expression levels
To evaluate methods against each other, we first examined the average values of their error measures, evaluated on the test set. We then examined the CDF of the error measures over the cells in the test set.
3.6.2 Understanding
Then, to understand the performance of the most promising methods, we conducted an in-depth analysis of their reconstructions. We examined the distribution of their error measures. We then compared these distributions to the error distribution of a negative control. We examined the CDF of their error measures, and examined illuminating properties of these CDFs, such as skew and cumulative density below the negative control's error. We then visually compared the reconstructed distributions to the ground truth.
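The CDF-style comparison described above can be sketched as follows; the per-cell error values, the gamma distribution generating them, and the control level are all toy assumptions:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(5)
method_err = rng.gamma(2.0, 0.01, size=1000)   # toy per-cell JS divergences for a method
control_err = 0.15                              # toy negative-control error level

# Empirical CDF evaluated at the control's error: the fraction of cells whose
# reconstruction beats the negative control.
frac_below = float(np.mean(method_err < control_err))

# Skew of the error distribution, one of the CDF properties examined in the text.
err_skew = float(skew(method_err))
```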
4
Results
We now present the results of our reconstruction analyses. We first present high level results,
examining the results for our distribution reconstructions and our transcriptome reconstructions.
We then dive deeper into the two most promising methods for distribution reconstruction and analyze them in detail.
The goal is to determine the best method for reconstructing cell state / transcriptome data from a low-dimensional, binary combination of genes that is experimentally feasible to measure. Previous
analyses have found that the specific genes and combinations measured do not have a large impact on reconstruction accuracy; thus we use a pre-selected random matrix for our computations, with the experimentally feasible 50 genes per combination and 10 total combinations.
Our primary evaluation metric is the Jensen–Shannon divergence between the reconstructed distribution and the ground truth distribution (the Gaussian analogue we previously constructed). Our secondary evaluation metric is the mean squared error between the ground truth transcriptome and the reconstructed transcriptome.
For cells not in the training set, and thereby lacking a ground truth measurement (i.e. the validation and test sets), we use the closest cell in the training set to the given cell to construct a ground truth distribution, expressed in the basis of Laplacian eigenvectors.
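The nearest-training-cell assignment for held-out cells can be sketched as follows, with toy data standing in for the real expression matrices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
X_train = rng.normal(size=(300, 10))   # toy training transcriptomes
X_test = rng.normal(size=(50, 10))     # toy held-out transcriptomes

# A held-out cell has no node on the training KNNG, so we anchor its
# ground-truth distribution at the closest training cell.
nn = NearestNeighbors(n_neighbors=1).fit(X_train)
_, idx = nn.kneighbors(X_test)
anchor = idx[:, 0]                     # index of the closest training cell per test cell
```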
The JS divergence in the figure is scaled by the JS divergence of the negative control.
Figure 4.1: Mean JS Divergence Comparison by Method
Table 4.1: Description of Dist. Recon. Methods
We see that M1 and M5 are the best performing methods, so we choose to examine them in more
detail.
Figure 4.2: CDFs for our best methods, compared to the CDFs for true Laplacian coefficient reconstructions
We examine the CDF of the JS Divergence for M1 and M5 against the CDF of the JS Divergence
using different numbers of true Laplacian coefficients. The results are displayed in Figure 4.2. We
Table 4.2: Description of Transcriptome Recon. Methods
see that our methods are comparable to using 50–100 of the true Laplacian coefficients.
The transcriptome reconstruction results are displayed in Figure 4.3.
4.2 Diving Deeper into Distribution Reconstructions
The best-performing methods for distribution reconstruction were M1 (using a neural network to impute 50 Laplacian coefficients, then reconstructing with the Laplacian eigenvectors) and M5 (using a neural network to directly reconstruct the distribution). We now dive deeper into each of these methods. We examine the distribution of the (centered / scaled) divergence, so we can compare each method against the negative control on a per-cell basis.
Figure 4.5: Scatter-plot of JS Divergence, Centered on Negative Control and Rescaled
Figure 4.6: PDF and CDF of Centered and Scaled JS Divergence for M1
We see in Figure 4.6 that our method performs quite well. Nearly the entire JS divergence distribution is below the negative
control, and we have a large negative skew. We now want to gain some intuition for what the distri-
bution reconstruction looks like visually. We proceed to plot one sample from the top and bottom
10% performing cells, as well as a random reconstruction (Figure 4.7, Figure 4.8, and Figure 4.9
respectively).
Figure 4.8: A Sample Reconstruction from the Worst 10% of Reconstructions
We see that the good reconstructions capture the spatial localization in UMAP space, but tend to
be somewhat more diffuse than the ground truth distribution. The worst distribution reconstructions tend to barely capture the spatial localization, and are always significantly more diffuse than the ground truth.
To gain some more insight, we plot the reconstruction of marker genes against the ground truth
in Figure 4.10, Figure 4.11, and Figure 4.12. We reconstruct individual gene measurements by tak-
ing a weighted average according to the reconstructed distribution of the expression levels in our
training set over which our distribution is specified. The ground truth is the true expression levels of
the marker genes in the test set cells. Note that both are log values to allow for better visualization.
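The distribution-weighted gene reconstruction described here can be sketched as follows; the toy counts and the `log1p` transform are illustrative assumptions:

```python
import numpy as np

def reconstruct_expression(p, X_train):
    """Expected expression under the reconstructed distribution:
    a p-weighted average of the training-cell expression profiles."""
    return p @ X_train

rng = np.random.default_rng(4)
X_train = rng.poisson(2.0, size=(100, 5)).astype(float)   # toy counts: 100 cells x 5 genes
p = rng.random(100)
p /= p.sum()                                              # a reconstructed distribution

expr = reconstruct_expression(p, X_train)
log_expr = np.log1p(expr)        # log values for visualization, as in the text
```

Because it is a weighted average, each reconstructed gene value is bounded by the minimum and maximum expression of that gene across training cells.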
Figure 4.11: Megakaryocyte Marker Gene (ITGA2B) Reconstruction
Finally, we compute the joint UMAP of the ground truth and reconstructed distributions. We now dive deeper into M5, the direct reconstruction of the distribution. Again, we first examine the scatter-plot of the centered and scaled JS divergence.
We plot the PDF and CDF of the centered and scaled JS divergence in Figure 4.16.
As expected, this method performs very well compared to the control. It also slightly outperforms M1. Even our worst reconstructions have decent performance; some closely match the original distributions. We see that our random cell reconstructions maintain the problem we saw with M1 of being more diffuse than the ground truth.
Figure 4.15: Scatter-plot of JS Divergence, Centered on Negative Control and Rescaled
Figure 4.16: PDF and CDF of Centered and Scaled JS Divergence for M5
Figure 4.17: A Sample Reconstruction from the Best 10% of Reconstructions
Figure 4.19: A Random Reconstruction
Again, we plot the log marker gene reconstruction expression levels against the ground truth.
Figure 4.21: Megakaryocyte Marker Gene (ITGA2B) Reconstruction
Finally, we compute the joint UMAP of the ground truth transcriptomes with the reconstructed
transcriptomes, and then draw lines connecting each ground truth coordinate with the corresponding reconstructed coordinate. Our final examinations show that M5 performs similarly to M1, but slightly better on some evaluation metrics.
Figure 4.23: UMAP Displaying Reconstruction Differences
Remembering that I’ll be dead soon is the most important
tool I’ve ever encountered to help me make the big choices
in life.
Almost everything–all external expectations, all pride,
all fear of embarrassment or failure–these things just
fall away in the face of death, leaving only what is truly
important.
Remembering that you are going to die is the best way I
know to avoid the trap of thinking you have something
to lose. You are already naked. There is no reason not to
follow your heart.
Steve Jobs
5
Discussion
We first discuss the conclusions we gather from the results presented in the previous section.
5.1 Conclusions
Both our closely-examined methods for distribution reconstruction perform well. Indeed, they per-
form well enough to support our hypothesis that we can reconstruct the full transcriptome / state
of a cell from binary combinations of relatively few genes. The next steps are to take these methods forward by validating them and bringing them into practice.
There are a variety of further directions to pursue this research, both to validate the results presented here and to extend them.
From the results for M1, we can see that this method tends to predict more diffuse distributions
than the ground-truth. This may be a result of using the Laplacian coefficients as an intermediate representation, which smooths the reconstruction. If this is true, we're getting a result that we desired, but we'd also like to make the resulting distribution
tighter to have it fit our ground-truth distributions better. We could simply truncate the resulting distribution and renormalize.
We mentioned in the Methods section that we use scVI, a technique used to clean up raw scRNA-
seq data by imputing missing measurements and otherwise accounting for noise. One open prob-
lem in the field is that different imputation techniques are in use, and these techniques tend to give
different results. We’d like to see if our computational methods are robust to imputation technique,
as, if they are, it indicates that they're learning more biologically relevant information.
One long-standing open problem in the single-cell sequencing space is understanding how to
determine the accuracy of transcriptome reconstruction. This is one reason why we chose to re-
construct distributions as our primary endpoint as opposed to gene expression levels. It’s unclear
what a particular MSE in transcriptome reconstruction means biologically. This is a direction that
may prove fruitful for further research. A corollary of this problem is having some way to directly evaluate reconstructions in biologically meaningful terms.
One interesting approach to this problem could be to learn biologically relevant labels, rather
than the difficult-to-interpret transcriptome. For example, we could learn manually-labelled cell types as a classification problem, and it would thus be easier to gauge the performance of our methods.
We also would like to dig deeper into the assumption that our training data represents all pos-
sible transcriptomes we might see. One way to do this would be to simulate a variety of training-
validation-test set splits, then examine the distributions of distance of test set cells to the closest
training set cell. If we found that test set cells were consistently located quite close to their closest
training set neighbor, this would validate our assumption. Another way to bypass this assump-
tion entirely is simply to collect an enormous amount of training data, perhaps by figuring out how
to use the large amounts of scRNA-seq data that have already been generated for other purposes.
There are technical challenges involved with this, such as correcting for the effects of different experimental batches (so-called "batch effects"), but it appears to be a promising direction. One of
the core insights behind AlphaFold’s success in the protein folding domain was that there’s an enor-
mous amount of unlabelled biological data in the form of protein sequences, and that data was
leveraged to make headway on the protein folding problem. We posit that a similar situation holds
with the glut of single-cell data that's being produced on a regular basis at this moment, and an
interesting question is understanding how to leverage this data to solve interesting biological ques-
tions.
Next, we’d like to come up with a method to represent a test cell as a distribution over the train-
ing cells in a more accurate way, rather than simply assigning it to the closest cell in the training set.
One example could be to consider the training cells as bases for cell state space, and then represent
new cells using their coefficients in this space. The problem with this is that training becomes more difficult, as all the training cells are then just indicator functions, as opposed to the linear combinations used to represent new cells.
The final, and most important direction to take this research would be to use these computa-
tional techniques in concert with an experimental workflow to validate the full-stack thesis that binary combinations of genes can be used to impute the full spatial transcriptome. The experimental
workflow would need to be carefully designed, since it’s unclear how to gather both ground-truth
binary combination data from a targeted FISH method as well as spatial transcriptome data from
the same sample, as would be needed to truly validate that this method works. Nevertheless, it’s the
next problem that would need to be solved to bring the results of this thesis to fruition.
References
[1] Asp, M., Bergenstråhle, J., & Lundeberg, J. (2020). Spatially Resolved Transcriptomes—
Next Generation Tools for Tissue Exploration. BioEssays, 42(10), 1–16.
[3] Cable, D. M., Murray, E., Zou, L. S., Goeva, A., Macosko, E. Z., Chen, F., & Irizarry, R. A.
(2021). Robust decomposition of cell type mixtures in spatial transcriptomics. Nature
Biotechnology, (pp. 1–25).
[4] Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S., & Zhuang, X. (2015). Spatially re-
solved, highly multiplexed RNA profiling in single cells. Science, 348(6233), 1360–1363.
[5] Cleary, B., Murray, E., Alam, S., Sinha, A., Habibi, E., Simonton, B., Bezney, J., Marshall,
J., Lander, E. S., Chen, F., & Regev, A. (2019). Compressed sensing for imaging transcrip-
tomics. bioRxiv.
[6] Crick F. H. (1958). On protein synthesis. Symposia of the Society for Experimental Biology,
12, 138–163.
[7] Femino, A. M., Fay, F. S., Fogarty, K., & Singer, R. H. (1998). Visualization of single RNA
transcripts in situ. Science, 280(5363), 585–590.
[8] Haghverdi, L., Buettner, F., & Theis, F. J. (2015). Diffusion maps for high-dimensional
single-cell analysis of differentiation data. Bioinformatics, 31(18), 2989–2998.
[9] Harvey, W. (2015). Intuition for laplacian matrix of a graph’s eigenvectors and eigenvalues.
[10] Klein, A. M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., Peshkin, L., Weitz,
D. A., & Kirschner, M. W. (2015). Droplet barcoding for single-cell transcriptomics applied
to embryonic stem cells. Cell, 161(5), 1187–1201.
[11] Langer-Safer, P. R., Levine, M., & Ward, D. C. (1982). Immunological methods for mapping
genes on Drosophila polytene chromosomes. Proceedings of the National Academy of Sciences
of the United States of America, 79(14 I), 4381–4385.
[12] Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., & Yosef, N. (2018). Deep generative model-
ing for single-cell transcriptomics. Nature Methods, 15(12), 1053–1058.
[13] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and
Projection for Dimension Reduction. arXiv.
[14] Milo, R., Jorgensen, P., Moran, U., Weber, G., & Springer, M. (2009). BioNumbers The
database of key numbers in molecular and cell biology. Nucleic Acids Research, 38(SUPPL.1),
750–753.
[15] Rodriques, S. G., Stickels, R. R., Goeva, A., Martin, C. A., Murray, E., Vanderburg, C. R.,
Welch, J., Chen, L. M., Chen, F., & Macosko, E. Z. (2019). Slide-seq: A scalable technology
for measuring genome-wide expression at high spatial resolution. Science, 363(6434), 1463–
1467.
[16] Stahl, P. L., Salmén, F., Vickovic, S., Lundmark, A., Navarro, J. F., Magnusson, J., Gia-
comello, S., Asp, M., Westholm, J. O., Huss, M., Mollbrink, A., Linnarsson, S., Codeluppi,
S., Borg, Å., Pontén, F., Costea, P. I., Sahlén, P., Mulder, J., Bergmann, O., Lundeberg, J.,
& Frisén, J. (2016). Visualization and analysis of gene expression in tissue sections by spatial
transcriptomics. Science (New York, N.Y.), 353(6294), 78–82.
[17] Vickovic, S., Eraslan, G., Salmén, F., Klughammer, J., Stenbeck, L., Schapiro, D., Äijö, T.,
Bonneau, R., Bergenstråhle, L., Navarro, J. F., Gould, J., Griffin, G. K., Borg, Å., Ronaghi,
M., Frisén, J., Lundeberg, J., Regev, A., & Ståhl, P. L. (2019). High-definition spatial tran-
scriptomics for in situ tissue profiling. Nature Methods, 16(10), 987–990.
[18] Weber, A. P. (2015). Discovering new biology through sequencing of RNA. Plant Physiol-
ogy, 169(3), 1524–1531.