
Leveraging Low-Dimensional Structure to Enable Spatial Transcriptomics

a dissertation presented
by
Kushagra Sharma
to
The Department of Computer Science

in partial fulfillment of the requirements


for the degree of
Artium Baccalaureus
in the subject of
Computer Science

Harvard University
Cambridge, Massachusetts
December 2021
©2021 – Kushagra Sharma
all rights reserved.
Thesis advisor: Professor Sahand Hormoz
Kushagra Sharma

Leveraging Low-Dimensional Structure to Enable Spatial Transcriptomics

Abstract

Information about biological phenotype can be gleaned from a variety of sources. In the last decade, we've made rapid progress toward ever more accurate measurements of one central source: gene expression levels inside of cells. We're now able to rapidly and cheaply sequence the content and abundance of RNA transcripts down to single-cell resolution. However, in the process we lose information regarding the spatial context of the cell: where in the tissue it originated. Techniques have been developed in the last few years to remedy this problem by incorporating spatial information into gene expression measurements. However, these techniques tend to be restricted to the lab of origin due to their high degree of technical complexity. We aim to alleviate this problem by exploiting low-dimensional structure in gene expression profiles, using widely accessible low-dimensional experimental measurements to impute the full, high-dimensional spatial transcriptome.

Contents

0 Introduction 1
0.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
0.3 Our contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1 Related Work 6
1.1 Experimental Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Computational Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Methods 12
2.1 Experimental Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Computational Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Experiment Goals and Design 19


3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Error Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Reconstruction Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Evaluation, Comparison, and Understanding . . . . . . . . . . . . . . . . . . . 35

4 Results 36
4.1 High Level Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Diving Deeper into Distribution Reconstructions . . . . . . . . . . . . . . . . . 40

5 Discussion 51
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Further Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

References 56

Listing of figures

3.1 Examining the Distribution of N(p(i)) for σ = 5 . . . . . . . . . . . . . . . . . 22


3.2 Examining the Distribution of N(p(i)) for σ = 100 . . . . . . . . . . . . . . . . 23
3.3 Summary of the Properties of the Gaussian with Systematically Selected σ . . . . . 24
3.4 The Euclidean Distance to the kth Neighbor, Outlier and Median . . . . . . . . . 25
3.5 Visualizing the Distribution for Various Cells . . . . . . . . . . . . . . . . . . . 26
3.6 Plotting UMAP Visualizations and PDFs to Understand Distribution . . . . . . 27
3.7 Visualizing Neighbor Distances with K = √N . . . . . . . . . . . . . . . . . . 28
3.8 Plotting Marker Genes on the Original UMAP and our KNNG UMAP . . . . . 29
3.9 Visualizing Various Laplacian Eigenvectors . . . . . . . . . . . . . . . . . . . . . 31
3.10 Examining the Reconstruction Performance with Varying Normalization Methods,
Coefficient Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.11 Examining the Neural Network Reconstruction Performance by Laplacian Coeffi-
cient Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.1 Mean JS Divergence Comparison by Method . . . . . . . . . . . . . . . . . . . 37


4.2 CDFs for our best methods, compared to the CDFs for true Laplacian coefficient re-
constructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Mean MSE Comparison by Method . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Scatter-plot of JS Divergence by Cell . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5 Scatter-plot of JS Divergence, Centered on Negative Control and Rescaled . . . . 40
4.6 PDF and CDF of Centered and Scaled JS Divergence for M1 . . . . . . . . . . . 41
4.7 A Sample Reconstruction from the Best 10% of Reconstructions . . . . . . . . . 41
4.8 A Sample Reconstruction from the Worst 10% of Reconstructions . . . . . . . . 42
4.9 A Random Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.10 Erythroid Marker Gene (CA1) Reconstruction . . . . . . . . . . . . . . . . . . 43
4.11 Megakaryocyte Marker Gene (ITGA2B) Reconstruction . . . . . . . . . . . . . 44
4.12 HSC/MPP Marker Gene (CRHBP) Reconstruction . . . . . . . . . . . . . . . 44
4.13 UMAP Displaying Reconstruction Differences . . . . . . . . . . . . . . . . . . 45
4.14 Scatter-plot of JS Divergence by Cell . . . . . . . . . . . . . . . . . . . . . . . . 45
4.15 Scatter-plot of JS Divergence, Centered on Negative Control and Rescaled . . . . 46
4.16 PDF and CDF of Centered and Scaled JS Divergence for M1 . . . . . . . . . . . 46

4.17 A Sample Reconstruction from the Best 10% of Reconstructions . . . . . . . . . 47
4.18 A Sample Reconstruction from the Worst 10% of Reconstructions . . . . . . . . 47
4.19 A Random Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.20 Erythroid Marker Gene (CA1) Reconstruction . . . . . . . . . . . . . . . . . . 48
4.21 Megakaryocyte Marker Gene (ITGA2B) Reconstruction . . . . . . . . . . . . . 49
4.22 HSC/MPP Marker Gene (CRHBP) Reconstruction . . . . . . . . . . . . . . . 49
4.23 UMAP Displaying Reconstruction Differences . . . . . . . . . . . . . . . . . . 50

Acknowledgments

There are so many people in my life to whom credit for this accomplishment is due. Thinking of and recalling everyone who in some way contributed to my life and success was a pleasure.

Starting from the beginning, my heartfelt thanks to my mom for being one of the most independently-minded people I've ever met. You're never satisfied by thinking with the crowd, and I've come to realize how unique this is. You've taught me the same, and have always encouraged me to pursue what I find interesting, independent of what the world may think. To Papa: thanks for buying me my first computer, pushing me towards learning programming (even if it may not have worked at first), and the support throughout my life. You've never in my life failed to be there when I've needed it the most. Thanks for being a great father. Thanks to my sister for being a lifelong companion, and having one of the warmest hearts of anyone I've ever met. Your intelligence and maturity continue to surprise me and warm my heart.

Thanks to all my friends who've ever been by my side, exploring ideas, life, and this world together. To list but a few: Dominic Tanzillo, Kubilay Agi, Keaton Gibbs, Tanay Tandon, Evan Hart, Ari Hatzimemos, Casey Carter, Soumil Singh, Oscar Avatare, Kendall Zhu, Isabel Haro, Snigdha Roy, and Alex K. Chen. I'm grateful to have had the chance to share my life with you all.

Thanks to Tim Jaconette and Pai-Ling Yin for bringing a 15-year-old into their research group and giving me my first exposure to the academic world. Enormous gratitude to Carlos Garay, who was my first real mentor, and without whom I'd be a significantly worse software engineer. This thesis and its 10,000 lines of code would not have been possible without your early guidance. Re-reading some of our exchanges recently reminded me how brilliant a mentor you were for me.

A heartfelt thanks to Terri Bittner, who is the most courageous educator I've ever had the privilege to learn from. Without her selfless attempts to open the doors of higher mathematics to high school students, this thesis and my entire career path wouldn't have been possible. She resisted our educational system's drive to mediocrity and refused to accept 'good enough.'

Thanks to Kenneth Blum for taking me seriously as a scientist, and making me realize that there was a 'there' there when it came to scientific exploration.

Thanks to Laura Deming for inspiring me with an example of personal integrity, agency, and authenticity. You were critical for my development at a key fork in the road, and I'm deeply grateful.

Thanks to Lada Nuzhna for the deep support, encouragement, and nourishing conversations that are ever-reinvigorating. We share ambitions and value-systems to such an enormous degree, and I'm excited to keep building with you. I also can't forget your very instrumental contributions to this thesis: your machine learning expertise was key to the success of this research.

A deep thanks to Fei Chen for supporting my first explorations in the field of biology. You took an unproven student in and helped me build my abilities, interests, and confidence in research.

Thanks to Sahand Hormoz for guiding this research project, and David Jacobowitz for providing just the right degree of involvement: helping me succeed when I could do it by myself, and supporting me when I couldn't.

A final thanks to all who supported me over the last year of transitioning into biology with conversations, advice, and guidance, and who endowed me with the confidence that I could do this (Tony Kulesa, Sam Rodriques, Patrick Hsu, Adam Marblestone, Ankit Ranjan, Martin Borch Jensen, Ed Boyden, Daniel Goodwin, Rob Phillips, Tom Knight). Not all of you know it, but our conversations meant an enormous deal to me and played a part in putting me where I am now.

Onwards!

Introduction
0
0.1 Background

Science progresses by the mutually reinforcing feedback loops between higher-quality measurements, higher-quality predictions, and higher-quality deductions. As a field, biology has mostly aimed to make better measurements, particularly since the era of molecular biology began. A potential reason for this is that it has long been clear that there are better measurements to be made, and what those measurements are. It has been similarly clear in what ways those measurements would be useful if we were able to make them. We can trace the origins of the focus on measurement of genetic information to Frederick Griffith's identification of the transforming factor: a substance that was able to transform a nonvirulent strain of Streptococcus pneumoniae, the bacterial cause of pneumonia, into a virulent strain of the same bacteria, thereby carrying otherwise hereditary information from one strain to another. Once it became clear that there was a hereditary material, we were off to the races. We began to identify what that material was, to determine its structure, and to determine how it came to create phenotype.

In the process of doing so, we uncovered the central dogma of molecular biology: that information flows from DNA to RNA to protein, with the latter two (RNA and protein) creating most of the visible phenotypes of the cell. This dogma was established in 1958 [6], and we're still on the quest to measure each of these components in its full context. Genome (DNA) sequencing was the first to mature, although epigenetic sequencing is still largely beyond our grasp.

The product of DNA is RNA, also known as a transcript. Each RNA molecule encodes the information necessary to produce a single protein (in eukaryotes).* RNA is encoded in (roughly) the same four-character language as DNA, and each RNA corresponds to a specific segment of DNA known as the coding region, from which the RNA is synthesized (transcribed). Just as the full set of genes is known as the genome, the full set of transcripts is known as the transcriptome. Transcriptome (RNA) sequencing (transcriptomics) has more recently matured, and we now have the ability to measure 10-50% of the RNAs inside a single cell.

The same fundamental technology that was developed for genome sequencing was leveraged to sequence the transcriptome, since RNA can be enzymatically converted to corresponding DNA in a process known as reverse transcription, and then sequenced using genome sequencing technology.

* With some exceptions, such as RNAs that do not code for proteins but are useful molecules in their own right.

The key development in genome sequencing technology that enabled RNA sequencing (RNA-seq) was next-generation sequencing (NGS) [18]. NGS dramatically increased the throughput of genome sequencing, and made it possible to sequence large numbers of transcripts from different cells.

RNA-seq tells you how many RNA transcripts are present inside of a cell, and what the sequence of each transcript is. For RNA transcripts that are used to make proteins (mRNA), the sequence of the transcript tells you what protein is going to be made, and the quantity of a specific transcript gives you information about the abundance of the corresponding protein. Since proteins determine a large portion of the phenotype (observable characteristics and behavior) of a cell, RNA-seq provides extremely valuable information about the 'state' of a cell.

Despite the incredible power of RNA-seq, it has a few major weaknesses. To sequence the RNA from single cells in a tissue (scRNA-seq), we dissociate the cells from the tissue, thereby removing the individual cells from their spatial context. In the most popular scRNA-seq method, we then suspend each cell inside an oil droplet, which serves as a reaction chamber for all non-RNA material to be degraded and for the RNA inside a cell to be converted into DNA. However, once the cell is inside this droplet, we lose all knowledge of where the cell came from inside its tissue of origin.

These spatial locations inside the tissue are key to understanding the biological function of tissues, pathological and healthy. Cells tend to be localized, and they tend to interact with and influence the cells in their local neighborhood. This can be seen from the key importance of spatial patterns in embryonic development, tumor formation, the development of Alzheimer's disease, and nearly every other tissue-level biological problem.

Spatial transcriptomics refers to a broad class of techniques that share a common goal: to measure the transcriptome of individual cells while maintaining knowledge of their spatial context within a tissue.†

There have been many methods developed to solve this problem, which are covered in detail in the Related Work chapter. These techniques have one thing in common: they are technically complex to perform, and require a degree of implicit knowledge that is difficult to transfer out of the originating laboratory. Only 4 of the 11 published transcriptome-wide spatial transcriptomics methods have been reproduced outside of the originating lab, and none are in widespread usage. This is a flaw common to much of experimental biology, and is concerning for a method that has the potential to produce otherwise inaccessible information to aid our understanding of basic and translational biology [1].

0.2 Motivation

Right now, spatial transcriptomics data is only (in)accessible via the technically challenging and centralized methods previously mentioned. What if we could use insights about the underlying structure of transcriptomic data to broaden access to methods for generating this data? This is the question we ask in this research.

To be more specific about the difficulty of spatial transcriptomics methods, it is specifically full-transcriptome data that is difficult to generate. Eukaryotes have on the order of 10^5-10^6 transcripts per cell, with the number of unique transcripts on the order of 10^5 [14]. Measuring the entire range of possible genes is difficult. However, there are widely used methods that allow on the order of 10^2 genes to be measured with spatial context.

Our work seeks to make full spatial transcriptomic data more abundant by using data from widely accessible techniques that measure on the order of 10^2 genes to reconstruct the measurements for on the order of 10^5 genes.

† Transcripts also have associated temporal information. Transcripts are produced at a particular time, and have a half-life on the order of minutes. Recent efforts aim at resolving the age of transcripts in scRNA-seq to measure this information.

0.3 Our contributions

Our work seeks to use these "low-dimensional" (on the order of 10^2 genes) measurements to reconstruct, or impute, the "high-dimensional" (on the order of 10^5 genes) full transcriptome. To do this, we generate an experimentally realistic low-dimensional dataset and see whether we can faithfully reconstruct the true high-dimensional data. We use a variety of methods from statistics, machine learning, and mathematics that are elaborated on in Methods and Experiment Goals and Design. If successful, these computational techniques provide a framework for designing less technically challenging experimental paradigms for full-transcriptome measurement.

We construct methods for imputation of data, for measurement of imputation error, for representation of data, and for evaluation of performance. We then critically evaluate the performance of our methods.

Further directions for research are discussed in the Discussion chapter. All code for this thesis is located on GitHub.

It is not the critic who counts; not the man who points out how the strong man stumbles, or where the doer of deeds could have done them better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; who errs, who comes short again and again, because there is no effort without error and shortcoming; but who does actually strive to do the deeds; who knows great enthusiasms, the great devotions; who spends himself in a worthy cause; who at the best knows in the end the triumph of high achievement, and who at the worst, if he fails, at least fails while daring greatly, so that his place shall never be with those cold and timid souls who neither know victory nor defeat.

Theodore Roosevelt

Related Work
1

To provide context for the current state of spatial transcriptomics, the following sections describe the most popular transcriptome-wide and targeted spatial transcriptomics methods. We also describe some computational techniques that served as inspiration for the methods used in this research.

1.1 Experimental Methods

To provide context for the state of spatial transcriptomics, we sketch out methods for both transcriptome-wide measurement and targeted measurement of specific genes.

1.1.1 Transcriptome-Wide Measurement

We first describe techniques that are able to measure all gene expression profiles with spatial context. These techniques tend to be experimentally difficult and restricted to a few labs.

Spatial Transcriptomics

The original spatial transcriptomics method to bear the name is Spatial Transcriptomics, from 2016. This method reverse-transcribes mRNA from a tissue sample in place into cDNA, and uses spatially specific oligonucleotide primers to perform the reverse transcription. Thus, the cDNA formed from the mRNA encodes the spatial position of the mRNA in the original tissue sample. When the cDNA is sequenced, the original location of the mRNA can be resolved [16].

Slide-seq

One of the more recent spatial transcriptomics techniques with the potential to become accessible and widespread is Slide-seq. Slide-seq involves the following process. First, you make a "puck": a surface of 10 μm DNA-barcoded beads with a known sequence-to-location map. Then, you place a tissue slice on top of the puck, and allow RNA from the tissue to be captured by the individual beads. When you sequence the beads, you read out both the RNA and the DNA barcode. The DNA barcode allows you to resolve the spatial location of the RNA on the puck, creating a spatial transcriptome [15].

One of the main problems with the technique is the problem of 'doublets,' where more than one cell's RNA is captured in a bead, and it is thus difficult to resolve which transcripts belong to which cell. This is caused by the fact that the resolution of individual beads is 10 μm, which is around the average diameter of a cell, meaning that multiple cells can be included in a single bead. Computational approaches are currently being developed to resolve cells using benchmark measurements that indicate typical cell profiles [3].

HDST

HDST is very similar in technique to Slide-seq. The main difference is the size of the beads and the resulting resolution: HDST uses beads with a 2 μm spatial resolution, allowing for higher spatial resolution and fewer problems with resolving transcripts to individual cells [17].

Compressed sensing for imaging transcriptomics

Compressed sensing for imaging transcriptomics is the method most similar to our own. This method leverages compressed sensing, a technique from signal processing, to infer gene expression levels from sparse linear combinations of particular genes. The sparse measurements are collected with probes that are added in the stoichiometric combinations corresponding to the linear combinations needed. Then, the framework of compressed sensing is used to impute the overall gene expression profile. The main differences between that work and ours are the imputation method and the proposed experimental techniques [5].

1.1.2 Targeted Measurement

There are a variety of methods for targeted measurement of specific transcripts in situ, with the earliest method going back to 1982. All of the methods surveyed here share a core technique in common: fluorescent in situ hybridization (FISH).

FISH

The original FISH method is quite simple. It aimed to localize particular DNA sequences on Drosophila chromosomes. Biotin molecules were enzymatically incorporated into DNA probes. The DNA probes were then bound to the target sequences in situ, since two complementary DNA sequences will bind together in a process known as hybridization.

The probe is then fluorescently tagged using any of a variety of methods, most commonly using antibodies that bind to the biotin molecule. The fluorescence can then be read out using fluorescence microscopy, allowing for a readout of the localization of specific DNA sequences in situ [11]. The method is more generally applicable to identifying particular DNA sequences in any context, not just Drosophila chromosomes.

smFISH

FISH was then adapted to detect RNA molecules, since the basic methodology for targeting different nucleic acids is quite extensible. smFISH uses multiple probes that bind to different regions of the same mRNA molecule, creating a bright spot indicative of a transcript upon binding. This reduces the false-positive rate, since unbound probes are unlikely to localize to the same region and create spots with the same order of intensity as a correctly identified transcript [7].

MERFISH

smFISH was quite a powerful technique, but its main drawback was the lack of throughput. Even using multiple fluorophores that can be read out at unique wavelengths, there's a cap on the number of transcripts that can be targeted in a particular round of hybridization. Additionally, each round of hybridization degrades the quality of the sample, so you can't just run thousands of rounds to measure all the genes. Finally, each round takes a substantial amount of time, due to the need for high-resolution imaging of a large slide (relative to the resolution of the necessary imaging). MERFISH was created to solve some of these problems.

Instead of making binary 0/1 measurements of transcripts, MERFISH assigns each transcript of interest a K-bit barcode. There are K rounds of hybridization; during round i, all RNAs with a 1 in the ith position of their barcode have fluorophores bound to them. The combined output of all rounds is a read-out of binary barcodes at each spatial location where a transcript is located. This allows the transcript at each location to be reconstructed. The Hamming distance (a distance measure between two binary strings) between each pair of transcript barcodes is maximized to minimize errors.

This method allows several orders of magnitude higher scale than smFISH due to the efficiency of information gathering in each round of imaging, since rounds of imaging are the scarce quantity in each of these methods: they require time and degrade the sample quality [4].
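The error-correcting role of the Hamming distance can be illustrated with a minimal Python sketch of nearest-barcode decoding. The codebook and gene names below are hypothetical (real MERFISH codebooks use longer barcodes with additional constraints); this only shows the decoding principle.

```python
# Hypothetical 6-bit codebook with minimum pairwise Hamming distance 4,
# so any single-bit imaging error can be corrected.
CODEBOOK = {
    "GENE_A": "110100",
    "GENE_B": "011010",
    "GENE_C": "101001",
}

def hamming(a: str, b: str) -> int:
    """Number of positions in which two equal-length binary strings differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(readout: str) -> str:
    """Assign a measured readout to the codebook barcode nearest in Hamming
    distance; with minimum pairwise distance d, up to (d - 1) // 2 bit
    errors are correctable."""
    return min(CODEBOOK, key=lambda gene: hamming(CODEBOOK[gene], readout))

print(decode("110100"))  # exact match -> GENE_A
print(decode("110101"))  # one-bit imaging error, still -> GENE_A
```

With this codebook the minimum pairwise distance is 4, so a readout corrupted in one position still decodes to the intended gene.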

1.2 Computational Methods

1.2.1 Diffusion Maps

The main computational technique that influenced this work was the application of diffusion maps to single-cell RNA sequencing data.

Diffusion maps is a non-linear dimensionality reduction technique that uses the distances between points and the corresponding transition probabilities between points (with points viewed as states) to create lower-dimensional embeddings of the states.

This idea was applied to scRNA-seq data to define differentiation trajectories. Cells develop by travelling along a trajectory in transcriptome space from their undifferentiated state to a differentiated state, and the notions of distance and transition probabilities between states are clearly present in the fundamental biology [8]. We used this idea of non-linear dimensionality reduction based on transcriptome distance in our own work.
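The construction described above can be sketched in a few lines of numpy: a Gaussian affinity kernel on pairwise distances is row-normalized into a Markov transition matrix, and the leading non-trivial eigenvectors give the embedding. This is an illustrative sketch, not the implementation used in the cited work; the function name and toy data are assumptions.

```python
import numpy as np

def diffusion_map(X, sigma=1.0, n_components=2):
    """Toy diffusion map: Gaussian kernel on pairwise squared distances,
    row-normalized into transition probabilities, embedded with the
    top non-trivial eigenvectors."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-d2 / (2 * sigma ** 2))                    # affinity kernel
    P = K / K.sum(axis=1, keepdims=True)                  # Markov transition matrix
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)                       # sort by eigenvalue, descending
    # Skip the trivial constant eigenvector (eigenvalue 1) of a stochastic matrix.
    return evecs.real[:, order[1:n_components + 1]]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))    # 20 "cells" in a toy 5-gene expression space
Y = diffusion_map(X)            # 2-D embedding of the 20 cells
```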

You keep on learning and learning, and pretty soon you
learn something no one has learned before.

Richard Feynman

Methods
2
In this chapter, we describe the experimental methods used to generate the data we work with, as well as the computational methods we use to perform our research.

2.1 Experimental Methods

2.1.1 scRNA-seq

Our data was gathered by members of the Hormoz Lab using droplet-based scRNA-seq on patient bone marrow samples. The dataset was enriched for CD34+ cells (precursor cells) and processed using the 10x Genomics scRNA-seq platform.

In droplet-based scRNA-seq, individual cells are dissociated from the tissue of origin and encapsulated into nanoliter oil droplets. Once inside the droplets, the mRNA is reverse-transcribed into cDNA with a unique barcode sequence attached. The cDNA is then pooled together and sequenced, and the unique barcodes allow the sequencing results to be resolved into the single cells they originated from [10].

2.2 Computational Methods

2.2.1 Single-Cell Variational Inference (scVI)

One of the main weaknesses of scRNA-seq is that it suffers from large amounts of noise. In particular, because it captures at most 50% of the total transcripts inside a cell, it is prone to missing low-abundance transcripts (a phenomenon known as "dropout"). Single-cell variational inference (scVI) is a method used to impute more accurate transcriptome data from the raw data collected by scRNA-seq. It uses stochastic optimization and deep neural networks to approximate the underlying distribution that generates scRNA-seq measurements, i.e. the biological ground truth. We use it as a pre-processing step on our scRNA-seq data for all analyses in this research [12].

2.2.2 Random Binary Matrices

To simulate experimentally realistic low-dimensional measurements, we generate a variety of ran-

dom binary matrices which we use to construct a dataset from the full transcriptome measurements

we get from scRNA-seq measurements. We find, surprisingly, that the variance of the performance

of our randomized matrices is low, implying that using biological domain knowledge to select genes

to measure is not particularly important.

We use random binary matrices with 50 genes measured across 10 distinct measurements, i.e.

each feature in our vector is a binary combination of 50 genes, and we have 10 such features. We ran-

domly select the genes from the set of all genes in the transcriptome, without replacement for any

particular feature but with replacement across features. We chose this setting due to its experimental

plausibility and good performance for reconstruction.
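This sampling scheme can be sketched as follows. The function name, gene count, and toy expression vector are illustrative assumptions; only the 50-genes-per-feature, 10-feature setting comes from the text.

```python
import numpy as np

def random_binary_matrix(n_genes, n_features=10, genes_per_feature=50, seed=0):
    """Build a binary measurement matrix: each of the 10 features sums a random
    subset of 50 genes. Genes are sampled without replacement within a feature,
    but independently (with replacement) across features."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n_features, n_genes))
    for i in range(n_features):
        chosen = rng.choice(n_genes, size=genes_per_feature, replace=False)
        A[i, chosen] = 1
    return A

n_genes = 20000                                           # toy transcriptome size
A = random_binary_matrix(n_genes)
x = np.random.default_rng(1).poisson(2.0, size=n_genes)   # toy expression vector
y = A @ x                                                 # simulated 10-dimensional measurement
```

Each row of `A` selects exactly 50 genes, and `y` is the resulting 10-dimensional "experimental" feature vector for one cell.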

2.2.3 Uniform Manifold Approximation and Projection (UMAP)

UMAP is a non-linear dimensionality reduction technique commonly used in the analysis of scRNA-seq data. It creates a high-dimensional graph of the data and then tries to construct a low-dimensional graph that maintains as many properties of the high-dimensional graph as possible. We use it mainly for visualizations of the outputs of our reconstructions [13].

2.2.4 Neural Networks

Neural networks are a commonly used computational technique in computer science, used to predict an output (labels) from an input (features). They're 'trained' on a training set that is representative of the overall distribution the data is drawn from. The training set contains a large number of feature:label mappings.

The parameters of the neural network are optimized to minimize a loss function, which is computed by comparing the outputs of the neural network to the ground-truth labels for the training data. The loss function is a function of the parameters of the neural network and can be differentiated with respect to those parameters, allowing simple techniques like gradient descent to be used to minimize it.

There are various hyper-parameters of the neural network, such as the number of parameters, the orientation of parameters, etc. We hold out a small number of feature:label mappings, known as the validation set, and optimize the hyper-parameters by examining the validation loss for different settings of the hyper-parameters.

2.2.5 KNN Graphs

Our single-cell data sits in an N_genes-dimensional transcriptomic space, where N_genes is the number of genes measured. We have reason to believe that Euclidean distance between cells in this space has biological meaning. Cells in a small neighborhood may be of the same subtype; for example, CD8+ T-cells and microglia have distinct neighborhoods that can be identified via cluster analysis. Cells that are close together in transcriptomic space may also be part of the same cell lineage, since cell fate transitions occur via changes in the transcriptome of a cell. For example, multipotent hematopoietic stem cells (hemocytoblasts) are close in transcriptomic space to their direct descendant, the common lymphoid progenitor.

We can formalize the importance of Euclidean distance on this space by constructing a graph, where the nodes are cells and the edges are some representation of distance. We use the k-nearest neighbor graph (k-NNG) construction method.

The k-NNG construction method is quite simple. We compute Euclidean distances between all cells, and for each cell, we construct edges between that cell and each of its K nearest neighbors in Euclidean space. Note that since the K-nearest-neighbor relation is not symmetric, the number of edges can vary between nodes. We choose to make the graph symmetric manually for ease of further calculations. The graph is undirected and unweighted.
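The construction can be sketched as follows. This is an illustrative dense-matrix version, not the thesis code, which would typically use an optimized neighbor search for large cell counts.

```python
import numpy as np

def knn_graph(X, k):
    """Adjacency matrix of the symmetrized k-NN graph: connect each cell to
    its k nearest neighbors by Euclidean distance, then add reverse edges so
    the graph is undirected and unweighted."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                          # a cell is not its own neighbor
    A = np.zeros_like(d2)
    nearest = np.argsort(d2, axis=1)[:, :k]               # indices of k nearest neighbors
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nearest.ravel()] = 1                          # directed k-NN edges
    return np.maximum(A, A.T)                             # symmetrize

X = np.random.default_rng(0).normal(size=(30, 4))         # 30 toy cells, 4 genes
A = knn_graph(X, k=5)
```

Because the k-NN relation is asymmetric, symmetrization leaves every node with at least k edges, but some nodes may have more.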

2.2.6 Laplacian Matrices

Let A be the adjacency matrix for our KNNG and D be a diagonal matrix with the degrees of the

vertices down the diagonal (i.e. the degree matrix). Then the Laplacian matrix is defined as L ≡ D − A.

Let N be the number of vertices in our graph. 2

Since L is a real symmetric matrix, it's diagonalizable and has eigenvalues λ1, ..., λN and orthonormal

eigenvectors v⃗1, ..., v⃗N. Let us consider this list sorted by eigenvalue, such that λ1 ≤

λ2 ≤ ... ≤ λN.

The eigenvectors of L form a basis set for integrable functions on our graph. This is because the

eigenvectors of the Laplacian roughly converge to the eigenfunctions of the Laplace-Beltrami op-

erator on the underlying manifold, and these eigenfunctions form a basis set for functions on the

manifold. This is itself a generalization of the Fourier basis functions from functions on R to com-

pact Riemannian manifolds. Importantly, because our eigenvalues are monotonically increasing, the

corresponding eigenvectors capture higher-and-higher frequency features of the underlying func-

tion space, just as with Fourier basis functions. Thus, using the first N eigenvectors to represent a

function constitutes a low-pass filter on the function. 9

We use these Laplacian eigenvectors to represent, embed, and reconstruct functions on our cell

graph. We can project a function onto the Laplacian eigenvectors, and use the resulting coefficients

to capture the function. We can use the first N coefficients to reconstruct the lower-frequency struc-

ture of the function. We can try to impute those coefficients for some data that we know has a cor-

responding function on the graph, and use the imputation to compute an approximation of the

function. We use all of these capabilities in our research.
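A minimal sketch of this machinery, assuming a dense NumPy adjacency matrix (the function name is ours; large graphs would call for sparse eigensolvers):

```python
import numpy as np

def laplacian_lowpass(A, f, m):
    """Project a function f (one value per vertex) onto the first m
    eigenvectors of the graph Laplacian L = D - A, then reconstruct it
    from those m coefficients: a low-pass filter on the graph."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    # eigh returns eigenvalues in ascending order with orthonormal
    # eigenvectors, matching the convention lambda_1 <= ... <= lambda_N.
    _, V = np.linalg.eigh(L)
    coeffs = V[:, :m].T @ f   # the first m Laplacian coefficients
    return V[:, :m] @ coeffs  # low-frequency reconstruction of f
```

With m equal to the number of vertices the reconstruction is exact; smaller m discards high-frequency components, which is exactly how we later compress distributions into a small number of Laplacian coefficients.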

2.2.7 Distribution Functions in Cell Space

The specific class of functions on graphs that we’re interested in are distributions over cells. We use

the aforementioned binary combinations to reconstruct a cell state, and we specifically reconstruct

a discrete Gaussian analogue, centered at the ground truth cell. We considered using an indicator

function, but the sharp peak on a specific cell is unforgiving for statistical imputation methods,

and a somewhat-diffuse distribution over cell states may be closer to the biological ground truth,

assuming that our unseen cell can be represented as some linear combination of seen cells, which is a

less strict requirement than explicitly assuming that our unseen cell is in the training set. We present

this distribution in more detail in the following chapter.

2.2.8 Error Measures

We use two error measures, one to measure the accuracy of distribution reconstruction and one to

measure the accuracy of transcriptome reconstruction.

Jensen Shannon divergence

The Jensen Shannon divergence (JS divergence) is a symmetric measure of distance between two

distributions. It’s based on the Kullback–Leibler (KL) divergence. For two distributions P, Q, the JS

divergence is defined as:


JS(P∥Q) = ½ KL(P∥M) + ½ KL(Q∥M)

where M = ½(P + Q) and KL is the KL divergence, defined (with the sum running over the distributions' support) as:

KL(P∥Q) = ∑ᵢ p(i) log(p(i) / q(i))

Mean Squared Error

Mean squared error is a standard error measure in Euclidean space. It’s defined as:

MSE = (1/Ngenes) ∑ᵢ₌₁^Ngenes (X(i) − X̂(i))²
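Both error measures can be sketched directly in NumPy (illustrative, with our own function names; a small ε guards the logarithm in KL against zero entries):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two discrete distributions."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js(p, q):
    """Jensen-Shannon divergence: the symmetric, finite variant of KL."""
    m = 0.5 * (np.asarray(p) + np.asarray(q))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mse(x, x_hat):
    """Mean squared error between two transcriptome vectors."""
    x, x_hat = np.asarray(x), np.asarray(x_hat)
    return float(np.mean((x - x_hat) ** 2))
```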

But, look, I don’t think of myself, I guess for me, people tell
me, “Oh, you’re so smart about this or that thing,” but to
me, that’s not what it feels like to me. I’m always trying
to figure out things that are difficult for me to figure
out. Now, maybe some of those things are really, really
difficult for some other people to figure out. But for me,
I’m always kind of–I’m always struggling to figure stuff
out so it doesn’t…The, kind of, the internal perception is
not one of kind of—I mean, the fact that I’m, I’m always
trying to figure stuff out.

Stephen Wolfram

3

Experiment Goals and Design

We’re trying to understand whether or not it’s feasible to reconstruct the whole transcrip-

tome from low-dimensional measurements. The low-dimensional measurements are meant to

represent experimentally feasible measurements. We concretely simulate these experimental mea-

surements by taking binary combinations of genes from scRNA-seq data. We begin by describing,

at a high level, what we’re aiming to do and how we accomplish it.

3.1 Overview

Our goal is to reconstruct the expression levels of every gene in a cell from binary combinations of

genes. We frame this problem in two ways:

1. Reconstructing a distribution over cell states (transcriptomes) from our training set, in KNN

graph space

2. Reconstructing exact expression levels

The first is our primary problem in this research. We choose to approach the problem in this way

for a few reasons. We believe that the KNNG structure encodes important information about the

biological ground truth, that isn’t captured in direct inference of expression levels. First, it makes

explicit the relationships between cells and their nearest neighbors. This has biological meaning be-

cause cells transition through transcriptome states continuously, for example, during differentiation

or stimulus-response. Therefore, cells with nearby transcriptome states are biologically related to

each other.

Additionally, the Laplacian on this graph incorporates additional information with biological

meaning. The Laplacian Matrix can be interpreted as a discrete analog of the Laplace operator ∇2 ,

representing the average rate of change at a vertex (cell). We can think of this as the local transition

probabilities around any cell, or the rate of diffusion for a cell starting at a particular location on the

graph.

Since we want to reconstruct a distribution on the KNNG, we need a natural coordinate system

to reconstruct to. The eigenvectors of the Laplacian form such a coordinate system, since imputing

the first N coordinates in this coordinate system is equivalent to reconstructing the most important

basis coefficients of the distribution function, in the sense of capturing the most information about

the underlying distribution. Additionally, we posit that using the Laplacian coordinate system as an

intermediate reconstruction target will act as a form of regularization, preventing over-fitting.

Thus, we use a variety of methods to reconstruct the coordinates of our distribution in the Lapla-

cian coordinate system (the Laplacian coefficients) as well as to reconstruct the distribution directly.

We also reconstruct the gene expression levels directly as a secondary line of effort.

3.2 Data

We split the data into a training, validation, and test set, as is typical for statistical inference prob-

lems. We use the training set to learn directly learnable parameters such as neural network weights,

and we use the validation set to determine hyper-parameters, such as the number of layers in the

neural network. Our models deliberately have no exposure to the test set until their evaluation.

3.3 Error Measures

We needed error measures to measure reconstruction error both between two distributions and

between two transcriptomes.

To measure error between two distributions, we chose to use the Jensen Shannon Divergence.

As previously mentioned, the Jensen Shannon divergence (JS divergence) is based on the Kullback–

Leibler divergence, but is symmetric and finite. We used the JS divergence for a variety of use cases,

including parameter selection and performance evaluation.

To train our neural networks that aimed to predict the distribution directly, we used the Kullback-

Leibler divergence.

To measure errors between two transcriptome reconstructions, we used the mean squared error

between the two transcriptomes.

3.4 Parameter Selection

3.4.1 Distribution Representation

The primary target of our reconstruction is a function over transcriptome space. In particular, we

aim to reconstruct a discrete Gaussian distribution centered on the target cell. The "ground truth"

distribution centered on the jth cell is thus p(i) ∝ exp(−(∥x(j) − x(i)∥/σ)²) for the ith cell. Note that

the distance measure is the squared L2 norm in transcriptome space.*

This raises the question of how to select the standard deviation, σ. The purpose of using a Gaus-

sian distribution as opposed to, say, an indicator function is to provide some degree of smoothness

around our target cell, but the aim is still for our reconstruction to be highly similar to our target

cell. In other words, we want some small number of cells around the target cell to be in our ground

truth distribution. To select σ, we chose N = 10 as the number of cells, not including the target cell,

that we'd like to include in our ground truth distribution.

One way we could systematically choose σ is the following method. For a given distribution p(i), we define the number of cells included in the distribution as N(p(i)) = |{i : p(i) ≥ ε}|, the number of cells with non-negligible probability, using ε = 5 × 10⁻³. We could then choose the minimum value of σ such that N(p(i)) ≥ 10.

The value of σ that will give us the desired number of proximal cells in the ground truth distribution will vary by cell, since the local distributions of cells in transcriptome space vary, both in our dataset and biologically. Thus, we wanted to investigate the degree of variance in this respect for our dataset. We begin by considering the distribution of N(p(i)) for σ = 5 over the cells in our training set in Figure 3.1.

Figure 3.1: Examining the Distribution of N(p(i)) for σ = 5

* The fundamental assumption behind this is that the training set contains all cells that we can measure, since we reconstruct test set cells by first mapping them to the closest training set cell, and then reconstructing a Gaussian centered on that cell.

The above results indicate that there are a large number of cells that require a significantly larger

value of σ to have an acceptable N(p(i)). There is a large amount of variance in N(p(i)) in general

for σ = 5. This indicates to us that we’ll need to choose custom values of σ for each cell that meet

our desired criteria.

One way to do this, as previously mentioned, would be to run a binary search for the minimum value of σ such that our criterion, N(p(i)) ≥ 10, is met. Unfortunately, N(p(i)) is not monotonic in σ. We can see this by noting that for σ → ∞, we get a uniform distribution, with all probabilities lower than our threshold, and N(p(i)) = 0. This can also be seen in Figure 3.2, which shows the distribution of N(p(i)) with σ = 100.

Thus, N(p(i)) is not an appropriate measure to search over. One could alternatively search using a function that is monotonic in σ, for example, the entropy, or the number of cells inside the 2σ interval. We instead choose an elegant exact method, which is setting σj = ∥x(j) − x(j)₁₀∥, where x(j)₁₀ is the 10th nearest neighbor to x(j). This functionally creates a distribution with 10 cells in the ≈ 95% interval from the mean.

We now set σj to the above value. To ensure

that this setting gives us the desired results, we

plot the distribution of the number of cells

in the 95% interval, the distribution of the

entropy for each distribution, and for the sake

of consistency with our previous metric, the

distribution of N(p(i)), all in Figure 3.3.


Figure 3.2: Examining the Distribution of N(p(i)) for σ = 100
(a) Examining the Distribution of the Number of Cells in the 95% Interval
(b) Examining the Distribution of N(p(i)) for Systematic σ
(c) Examining the Number of Cells in the 95% Interval on a UMAP Visualization
(d) Examining the Distribution of the Entropy for Systematic σ

Figure 3.3: Summary of the Properties of the Gaussian with Systematically Selected σ

Table 3.1: Summary Statistics for Outlier Cell

Summary Statistic Value


Mean 0.000278
σ 0.000011
Min 0.000267
25% 0.000277
50% 0.000277
75% 0.000277
Max 0.000785

Unfortunately, as can be seen in the first figure, there is still more variance in the number of cells with non-negligible ground truth distribution probabilities than we'd like. This is

because our distribution is not precisely Gaussian, since it’s discrete. We can examine why this is the

case by looking in more detail at, for example, the cell at the rightmost end of the distribution, with

N = 3415 cells included in its 95% confidence interval. We'll call this the "Outlier Cell" for the remainder of this section. We can first examine the summary statistics of the distribution in Table 3.1.

It appears that we have a highly uniform distribution. To understand why, we look at the dis-

tances of the 10 nearest neighbors to the outlier cell relative to the median cell with respect to the

confidence interval distribution, which will give us insight into σ for the underlying distribution.

As we can see, there’s huge variance in dis-

tance to the 10th nearest neighbor (and nearest

neighbors in general), with our outlier cell be-

ing very far away from nearly all cells. Note

that even the median cell plotted here has

≈ N = 200 cells in its 95% confidence interval. Thus, our solution is to truncate the distribution for each cell at the 11 highest-probability cells (10 plus the true cell), appropriately normalized.

Figure 3.4: The Euclidean Distance to the kth Neighbor, Outlier and Median
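Putting the pieces together, the ground truth distribution with the per-cell σj and the top-11 truncation can be sketched as follows (an illustrative implementation; the function name and the n_keep parameter are our own):

```python
import numpy as np

def ground_truth_distribution(X_train, j, n_keep=10):
    """Discrete Gaussian analogue centered on training cell j.

    sigma_j is the distance from cell j to its n_keep-th nearest
    neighbor, and the distribution is truncated to the n_keep + 1
    highest-probability cells (the neighbors plus the cell itself)
    and renormalized.
    """
    d = np.linalg.norm(X_train - X_train[j], axis=1)
    sigma = np.sort(d)[n_keep]  # distance to the 10th nearest neighbor
    p = np.exp(-(d / sigma) ** 2)
    # Truncate: zero out everything outside the top n_keep + 1 cells.
    keep = np.argsort(p)[-(n_keep + 1):]
    trunc = np.zeros_like(p)
    trunc[keep] = p[keep]
    return trunc / trunc.sum()
```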

We can now visualize the resulting distribu-

tions by plotting probabilities on a UMAP visualization for a few cells. In Figure 3.5 we examine the

outlier cell, the median cell, and two randomly selected cells.

Figure 3.5: Visualizing the Distribution for Various Cells

As we can see, the resulting distributions are not perfect but are decently localized and peaked

on the target cell. Note that UMAP visualizations themselves make assumptions about cell-cell

distances and thus distance on a UMAP plot is not necessarily equivalent to biologically-relevant

distance in transcriptome space. We can see this by plotting key marker genes and seeing non-perfect

localization, as is done in our section on choosing K for our KNNG.

Finally, we plot a few distributions for randomly selected cells side by side with the UMAP in

Figure 3.6 so we can get a sense for the shape of the distribution.

(a) UMAP and PDF for First Random Cell

(b) UMAP and PDF for Second Random Cell

Figure 3.6: Plotting UMAP Visualizations and PDFs to Understand Distribution

3.4.2 Graph Construction



Standard KNN graph construction methods use K = √N, where N is the number of samples /

data points. We naively began with this setting, but found that this was likely too high considering

the structure of the data. Below, we draw the UMAP representation of the data in two settings - the

first connects each cell and its nearest neighbor, and the second connects each cell and its farthest

neighbor to which it shares an edge in the symmetric KNNG with K = √N.

(a) UMAP with the Farthest Neighbor Connected (b) UMAP with the Nearest Neighbor Connected

Figure 3.7: Visualizing Neighbor Distances with K = √N

We see in Figure 3.7 that the farthest neighbor is quite far in UMAP space, indicating that we’re

using too high a setting for K.

The UMAP dimensionality reduction method itself constructs such a graph, and uses K = 15 as

its standard. UMAP is a standard dimensionality reduction method for single cell sequencing data,

so we chose to follow its standard.

Since the UMAP method uses a graph representation of the data as an intermediate computa-

tion, and is a well accepted visualization method for single cell data, we can compare the distribution

of certain marker genes (genes that have a spatial pattern in UMAP-reduced data as well as biological sig-

nificance in differentiating between cell types in the bone marrow) between the original UMAP

representation and UMAP computed using our KNNG. These are plotted in Figure 3.8.

In Table 3.2 are the marker genes we chose to plot, by cell type. We plot these same marker genes

for all marker gene analyses in this thesis. Note that we plot log rather than absolute expression

levels for ease of visualization.

(a) Erythroid Marker Gene

(b) Megakaryocyte Marker Gene

(c) HSC/MPP Marker Gene


Figure 3.8: Plotting Marker Genes on the Original UMAP and our KNNG UMAP
Table 3.2: Marker Genes by Cell Type

Cell Type Marker Gene


Erythroid CA1
Megakaryocyte ITGA2B
HSC/MPP CRHBP

We see in Figure 3.8 that our graph maintains the expected spatial distributions for these marker genes, so

we conclude that our KNNG is appropriate to use for our analysis.

3.4.3 Laplacian Coefficient Selection

We want to understand how effective the Laplacian formulation is at allowing for a reconstruction

of the Gaussian distribution. We also want to get a sense for how the Laplacian reconstruction func-

tions.

We first address the first question by plotting a few Laplacian eigenvectors on the UMAP of the

training cells in Figure 3.9. The eigenvectors are bases for functions on the graph, and by plotting

them we can get a sense for how the reconstruction is functioning.

As we proceed from the first eigenvector to the last, each captures less and less variance of func-

tions on graph space. We can get a sense for what patterns each eigenvector is capturing from the

above visualization; some are capturing spatial patterns, some are more diffuse over the entire distri-

bution of cells.

Next, we examine the effectiveness of the Laplacian reconstruction and the number of coeffi-

cients needed to effectively reconstruct a distribution. To probe into this, we first look at the mean

JS Divergence when using different numbers of (true) Laplacian coefficients to reconstruct a distri-

bution, in the validation set.

Secondarily, when we perform our reconstruction using Laplacian coefficients, we need to turn

our reconstruction into a probability distribution. Since we’re not using all the Laplacian coeffi-

Figure 3.9: Visualizing Various Laplacian Eigenvectors

cients for many reconstructions, the resulting function is not constrained to be a distribution, so we

must make the function positive and normalized. There are a few ways we can make entries

positive: exponentiation, squaring, and taking the absolute value. We compare the JS Divergence

with respect to the number of coefficients used to make the reconstruction for all three of these

methods.

We see from Figure 3.10 that the absolute norm becomes better than the squared norm for a very

large number of coefficients. This makes sense: for large numbers of coefficients, negative entries in the reconstructed distribution are no longer a problem, because the reconstruction would naturally be a probability distribution without any post-processing.

However, for the numbers of coefficients that we use in our reconstructions (≪ 1000), the

squared method is better, so we proceed with that.
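The reconstruction-plus-normalization step can be sketched as follows (illustrative, with our own function name; we default to the squared method adopted above):

```python
import numpy as np

def coeffs_to_distribution(V, coeffs, method="square"):
    """Reconstruct a function from its first m Laplacian coefficients
    and post-process it into a probability distribution.

    V is the (N, N) eigenvector matrix with columns sorted by
    eigenvalue; coeffs is a length-m coefficient vector. With few
    coefficients the raw reconstruction can go negative, so entries
    are made positive by squaring, exponentiation, or absolute value,
    then normalized to sum to one.
    """
    f = V[:, :len(coeffs)] @ coeffs
    if method == "square":
        f = f ** 2
    elif method == "exp":
        f = np.exp(f)
    elif method == "abs":
        f = np.abs(f)
    else:
        raise ValueError(f"unknown method: {method}")
    return f / f.sum()
```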

We now want to see what the validation accuracy of our neural network Laplacian reconstruction method is when we vary the number of Laplacian coefficients we train to reconstruct. We trained neural networks to predict varying numbers of Laplacian coefficients, reconstructed the distribution with the Laplacian eigenvectors, and computed the JS divergence. The results are shown in Figure 3.11.

Figure 3.10: Examining the Reconstruction Performance with Varying Normalization Methods, Coefficient Numbers

(a) Neural Network Reconstruction Performance by Laplacian Coefficient Count
(b) Same as (a), but Zoomed in on Minimum Region

Figure 3.11: Examining the Neural Network Reconstruction Performance by Laplacian Coefficient Count

We see a minimum at N = 50 coefficients used, so we choose to use that for our reconstruction.

3.5 Reconstruction Methods

We previously discussed the shape of the technical methods we used in our reconstruction; we now

discuss how we used those methods to ask the questions we're interested in.

3.5.1 Distribution Reconstruction Methods

We begin with our primary task, the distribution reconstruction. We’re testing the following meth-

ods’ reconstruction accuracy (all beginning with the binary linear combination of genes):

1. Using a neural network to reconstruct the first 50 Laplacian coefficients, and then using the

Laplacian eigenvectors to reconstruct the distribution

2. Using a neural network to reconstruct the whole transcriptome, then using the transcrip-

tome to reconstruct a distribution over cells

3. Using a neural network to reconstruct the whole transcriptome, then using a neural network

to reconstruct the first 50 Laplacian coefficients, and then using the Laplacian eigenvectors to

reconstruct the distribution

4. Using a neural network to directly reconstruct the distribution over cells

For our positive control, we used the true first 3000 Laplacian coefficients, and then used the

Laplacian eigenvectors to reconstruct the distribution.

To be clear, the positive control does not involve binary measurements, but instead uses the full

transcriptome for test-set cells to construct a distribution over the KNNG by finding the nearest

neighbor and constructing a distribution over that cell. We then take the first 3000 Laplacian coeffi-

cients of that distribution.

We used the following negative controls:

1. A uniform distribution over cells

2. A random distribution over cells, p(i) ∝ R where R is a random real number in [0, 1]
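The two negative controls above are straightforward to generate (an illustrative sketch with our own function names):

```python
import numpy as np

def uniform_control(n_cells):
    """Negative control 1: uniform distribution over the cells."""
    return np.full(n_cells, 1.0 / n_cells)

def random_control(n_cells, rng):
    """Negative control 2: p(i) proportional to a uniform random
    number in [0, 1], normalized into a distribution."""
    r = rng.random(n_cells)
    return r / r.sum()
```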

3.5.2 Transcriptome Reconstruction Methods

Next is our secondary evaluation, the transcriptome reconstruction. We’re testing the following

methods’ reconstruction accuracy (all beginning with the binary linear combination of genes):

1. Using a neural network to reconstruct the first 50 Laplacian coefficients, then using the

Laplacian eigenfunctions to reconstruct the distribution, then taking a weighted average

of the transcriptomes of cells in the distribution

2. Using a neural network to reconstruct the first 50 Laplacian coefficients, then using a neural

network to reconstruct the whole transcriptome

3. Using a neural network to directly reconstruct the whole transcriptome

4. Using a bottlenecked neural network to reconstruct the whole transcriptome

We’re comparing these methods to the following positive controls:

1. The ground truth transcriptome with multivariate Gaussian noise, with standard deviation

calculated from each gene’s expression levels

2. Using the true first 50 Laplacian coefficients, followed by a neural network reconstruction of

the transcriptome

And the following negative controls:

1. The average transcriptome of all cells on the graph

2. A random transcriptome drawn from a multivariate Gaussian centered on the average tran-

scriptome with standard deviation calculated from each gene’s expression levels

3. A random cell’s transcriptome drawn from the training set

3.6 Evaluation, Comparison, and Understanding

3.6.1 Evaluation and Comparison

To evaluate methods against each other, we first examined the average values of their error measures

against each other, evaluated on the test set. We then examined the CDF of the error measures over

cells for the methods we were most interested in.

3.6.2 Understanding

Then, to understand the performance of the most promising methods, we conducted an in-depth

analysis of their reconstructions. We examined the distribution of their error measures. We then

compared these distributions to the distributions of error a negative control. We examined the CDF

of their error measures, and examined illuminating properties of these CDFs, such as skew and cu-

mulative density below the negative control’s error. We then visually compared the reconstructed

distributions to the original distributions for a variety of situations:

1. The best reconstructions (according to the error measure)

2. The worst reconstructions

3. Reconstructions of marker genes

Finally, we visually examined the differences in reconstruction in UMAP space.

4
Results

We now present the results of our reconstruction analyses. We first present high level results,

examining the results for our distribution reconstructions and our transcriptome reconstructions.

We then dive deeper into the two most promising methods for distribution reconstruction, and

critically analyze their successes and failures.

The goal is to determine the best method for reconstructing cell state / transcriptome data from

a low-dimensional, binary combination of genes that is experimentally feasible to measure. Previous

analyses have found that the specific genes and combinations that are being taken do not have a large

impact on reconstruction accuracy; thus, we use a pre-selected random matrix for our computations,

with the experimentally feasible 50 genes per combination and 10 total combinations.
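Such a measurement matrix can be simulated as follows (a hedged sketch; the function name and seed handling are our own):

```python
import numpy as np

def binary_measurement_matrix(n_genes, n_combos=10, genes_per_combo=50,
                              seed=0):
    """Random binary design matrix B: each row selects genes_per_combo
    genes, so y = B @ x gives the n_combos summed binary-combination
    measurements for a transcriptome vector x."""
    rng = np.random.default_rng(seed)
    B = np.zeros((n_combos, n_genes), dtype=int)
    for r in range(n_combos):
        cols = rng.choice(n_genes, size=genes_per_combo, replace=False)
        B[r, cols] = 1
    return B
```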

Our primary evaluation metric is the Jensen Shannon divergence between the reconstructed distribution

and the ground truth distribution, the Gaussian analogue we previously constructed. Our sec-

ondary evaluation metric is the mean squared error between the ground truth transcriptome and the

reconstructed transcriptome. We determine the reconstructed transcriptome by taking a weighted

average of the training cells as indicated by the reconstructed distribution.

For cells not in the training set, which thereby lack a ground truth measurement (e.g. the validation and test sets), we use the closest training set cell to the given cell to construct a ground truth distribution to evaluate against.
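These two steps (mapping a held-out cell to its nearest training cell, and reconstructing a transcriptome as a distribution-weighted average of training cells) can be sketched as follows (illustrative; function names are our own):

```python
import numpy as np

def nearest_training_cell(x, X_train):
    """Index of the closest training cell to transcriptome x, used to
    center the ground truth distribution for validation/test cells."""
    return int(np.argmin(np.linalg.norm(X_train - x, axis=1)))

def reconstruct_transcriptome(p, X_train):
    """Weighted average of training transcriptomes, weighted by the
    reconstructed distribution p (one weight per training cell)."""
    return p @ X_train
```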

4.1 High Level Results

4.1.1 Distribution Reconstruction Results

As a reminder, we're testing the methods in Table 4.1. Whenever we reconstruct Laplacian coefficients, we use them to reconstruct the distribution by taking the linear combination of the coefficients with their respective eigenvectors.

Figure 4.1: Mean JS Divergence Comparison by Method

Table 4.1: Description of Dist. Recon. Methods

Method Code Description

M1 NN Reconstruction of 50 Laplacian Coeff.
M2 NN Reconstruction of 3595 Laplacian Coeff.
M3 NN T-ome Reconstruction, then NN Dist. Reconstruction
M4 NN T-ome Reconstruction, then NN Reconstruction of 50 Laplacian Coeff.
M5 NN Distribution Reconstruction
PC Positive Control: 3000 Laplacian Coeff.
NC1 Negative Control: Uniform Distribution
NC2 Negative Control: Random Distribution

We first examine the mean JS divergence for each method in Figure 4.1. Note that the JS divergence in the figure is scaled by the JS divergence for the first negative control.

We see that M1 and M5 are the best performing methods, so we choose to examine them in more

detail.

Figure 4.2: CDFs for our best methods, compared to the CDFs for true Laplacian coefficient reconstructions

We examine the CDF of the JS Divergence for M1 and M5 against the CDF of the JS Divergence

using different numbers of true Laplacian coefficients. The results are displayed in Figure 4.2. We

Table 4.2: Description of Transcriptome Recon. Methods

Method Code Description


M1 NN Reconstruction of 50 Laplacian Coeff. ->Weighted Average of Distribution
M2 NN Reconstruction of 50 Laplacian Coeff. ->NN Transcriptome Reconstruction
M3 NN Transcriptome Reconstruction
M4 Bottlenecked NN Transcriptome Reconstruction
PC1 Positive Control: Ground Truth Transcriptome with Gaussian Noise
PC2 Positive Control: 3000 Laplacian Coeff. + NN
NC1 Negative Control: The Average Transcriptome
NC2 Negative Control: A Random Transcriptome from a Multivariate Gaussian
NC3 Negative Control: A Randomly Sampled Transcriptome

see that our methods are comparable to using 50–100 of the true Laplacian coefficients.

4.1.2 Transcriptome Reconstruction Results

For transcriptome reconstruction, we're testing the methods in Table 4.2.

We examine the mean MSE for each method in Figure 4.3.

From this, we see, somewhat surprisingly, that the best method for transcriptomic reconstruction is in fact using the weighted average from the distribution resulting from our reconstruction of Laplacian coefficients.

Figure 4.3: Mean MSE Comparison by Method

4.2 Diving Deeper into Distribution Reconstructions

Our best methods for the primary reconstruction were M1 (using a neural network to impute 50 Laplacian coefficients, then using Laplacian eigenvectors to reconstruct the distribution) and M5 (using a neural network to directly reconstruct the distribution). We now dive deeper into each of these methods to critically evaluate their performance.

Figure 4.4: Scatter‐plot of JS Divergence by Cell

4.2.1 Method One

We first examine a scatter plot of the JS Diver-

gence and overlay the control JS Divergence as

an initial exploration in Figure 4.4.

We’d like to rescale the measure, so we center

it on the control JS Divergence and rescale by

the standard deviation in Figure 4.5.

We'd next like to understand the distribution of the (centered / scaled) divergence, so we plot both the PDF and CDF in Figure 4.6.

Figure 4.5: Scatter‐plot of JS Divergence, Centered on Negative Control and Rescaled

Figure 4.6: PDF and CDF of Centered and Scaled JS Divergence for M1

We see that relative to the negative control,

our method performs quite well. Nearly the entire JS divergence distribution is below the negative

control, and we have a large negative skew. We now want to gain some intuition for what the distri-

bution reconstruction looks like visually. We proceed to plot one sample from the top and bottom

10% performing cells, as well as a random reconstruction (Figure 4.7, Figure 4.8, and Figure 4.9

respectively).

Figure 4.7: A Sample Reconstruction from the Best 10% of Reconstructions

Figure 4.8: A Sample Reconstruction from the Worst 10% of Reconstructions

Figure 4.9: A Random Reconstruction

We see that the good reconstructions capture the spatial localization in UMAP space, but tend to

be somewhat more diffuse than the ground truth distribution. The worst distribution reconstruc-

tions tend to barely capture the spatial localization, and are always significantly more diffuse than

the ground truth distributions.

To gain some more insight, we plot the reconstruction of marker genes against the ground truth

in Figure 4.10, Figure 4.11, and Figure 4.12. We reconstruct individual gene measurements by tak-

ing a weighted average according to the reconstructed distribution of the expression levels in our

training set over which our distribution is specified. The ground truth is the true expression levels of

the marker genes in the test set cells. Note that both are log values to allow for better visualization.

Figure 4.10: Erythroid Marker Gene (CA1) Reconstruction

Figure 4.11: Megakaryocyte Marker Gene (ITGA2B) Reconstruction

Figure 4.12: HSC/MPP Marker Gene (CRHBP) Reconstruction

Finally, we compute the joint UMAP of the ground truth transcriptomes with the reconstructed transcriptomes, and then draw lines connecting each ground truth coordinate with the corresponding reconstructed coordinate. We do this to see how much reconstructions have shifted in UMAP space. We see that reconstructions and clusters are overall quite stable.

Figure 4.13: UMAP Displaying Reconstruction Differences

4.2.2 Method Five

We now want to perform the same analysis for our best performing method: the neural network reconstruction of the distribution.

Again, we first examine a scatter plot of the JS Divergence and overlay the control JS Divergence as an initial exploration in Figure 4.14.

Figure 4.14: Scatter‐plot of JS Divergence by Cell

Next, the centered and scaled JS divergence in Figure 4.15.

We plot the PDF and CDF of the centered and scaled JS divergence in Figure 4.16. As expected, this method performs very well compared to the control. It also slightly outperforms M1 on the skew.

We again plot a sample from the best and worst reconstructions side-by-side with the ground truth distributions, as well as a random cell. For our best performing reconstructions, unlike M1, we appear to have tight distributions that closely match the localization of the original distributions. Even our worst reconstructions have decent performance: some degree of the original pattern is being reconstructed. We see that our random cell reconstructions maintain the problem we saw with M1, with distributions being diffuse.

Figure 4.15: Scatter‐plot of JS Divergence, Centered on Negative Control and Rescaled

Figure 4.16: PDF and CDF of Centered and Scaled JS Divergence for M5 ((a) PDF, (b) CDF)

Figure 4.17: A Sample Reconstruction from the Best 10% of Reconstructions

Figure 4.18: A Sample Reconstruction from the Worst 10% of Reconstructions

Figure 4.19: A Random Reconstruction

Again, we plot the reconstructed marker gene expression levels (log scale) against the ground truth.

Figure 4.20: Erythroid Marker Gene (CA1) Reconstruction

Figure 4.21: Megakaryocyte Marker Gene (ITGA2B) Reconstruction

Figure 4.22: HSC/MPP Marker Gene (CRHBP) Reconstruction

Finally, we compute the joint UMAP of the ground truth transcriptomes with the reconstructed transcriptomes, and then draw lines connecting each ground truth coordinate with the corresponding reconstructed coordinate.

Our final examinations show that M5 performs similarly to M1, but slightly better on some evaluation metrics.

Figure 4.23: UMAP Displaying Reconstruction Differences

5

Discussion

Remembering that I’ll be dead soon is the most important tool I’ve ever encountered to help me make the big choices in life. Almost everything–all external expectations, all pride, all fear of embarrassment or failure–these things just fall away in the face of death, leaving only what is truly important. Remembering that you are going to die is the best way I know to avoid the trap of thinking you have something to lose. You are already naked. There is no reason not to follow your heart.

Steve Jobs

We first discuss the conclusions we gather from the results presented in the previous section.

5.1 Conclusions

Both our closely-examined methods for distribution reconstruction perform well. Indeed, they perform well enough to support our hypothesis that we can reconstruct the full transcriptome / state of a cell from binary combinations of relatively few genes. The next steps are to take these methods forward by validating them and bringing them into practice.

5.2 Further Directions

There are a variety of further directions to pursue this research, both to validate the results presented in this thesis and to expand the scope of the results.

From the results for M1, we can see that this method tends to predict more diffuse distributions than the ground truth. This may be a result of using the Laplacian coefficients as an intermediate reconstruction, as they may be acting as regularization on the overall distribution reconstruction. If this is true, we're getting a result that we desired, but we'd also like to make the resulting distribution tighter so that it fits our ground-truth distributions better. We could simply truncate the resulting distribution in the same way we do for our ground-truth distributions.
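A truncation of this kind could be sketched as follows. This is a minimal illustration; the keep-fraction convention is our assumption, not necessarily the one used for the ground-truth distributions:

```python
import numpy as np

def truncate_distribution(p, keep_frac=0.2):
    """Zero out all but the largest entries (covering keep_frac of the bins)
    and renormalize, producing a tighter, less diffuse distribution."""
    p = np.asarray(p, dtype=float)
    k = max(1, int(np.ceil(keep_frac * p.size)))
    cutoff = np.sort(p)[-k]           # k-th largest probability value
    q = np.where(p >= cutoff, p, 0.0)
    return q / q.sum()

# Example: a diffuse reconstruction concentrated in three bins after truncation.
p = np.array([0.02, 0.03, 0.05, 0.40, 0.30, 0.20])
q = truncate_distribution(p, keep_frac=0.5)
```
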

We mentioned in the Methods section that we use scVI, a technique for cleaning up raw scRNA-seq data by imputing missing measurements and otherwise accounting for noise. One open problem in the field is that different imputation techniques are in use, and these techniques tend to give different results. We'd like to see whether our computational methods are robust to the choice of imputation technique; if they are, it indicates that they're learning more biologically relevant information.

One long-standing open problem in the single-cell sequencing space is understanding how to determine the accuracy of transcriptome reconstruction: it's unclear what a particular MSE in transcriptome reconstruction means biologically. This is one reason why we chose to reconstruct distributions, rather than gene expression levels, as our primary endpoint, and it is a direction that may prove fruitful for further research. A related problem is finding some way to directly compare our distribution reconstructions with our transcriptome reconstructions.

One interesting approach to this problem could be to learn biologically relevant labels, rather than the difficult-to-interpret transcriptome. For example, we could learn manually-labelled cell types as a classification problem, which would make it easier to gauge the performance of our methods.

We also would like to dig deeper into the assumption that our training data represents all possible transcriptomes we might see. One way to do this would be to simulate a variety of training-validation-test set splits, then examine the distribution of distances from test set cells to their closest training set cell. If we found that test set cells were consistently located quite close to their closest training set neighbor, this would validate our assumption.

Another way to bypass this assumption entirely is simply to collect an enormous amount of training data, perhaps by figuring out how to use the large amounts of scRNA-seq data that have already been generated for other purposes. There are technical challenges involved with this, such as correcting for the effects of different experimental batches (so-called "batch effects"), but it appears to be a promising direction. One of the core insights behind AlphaFold's success in the protein folding domain was that there's an enormous amount of unlabelled biological data in the form of protein sequences, and that data was leveraged to make headway on the protein folding problem. We posit that a similar situation holds for the glut of single-cell data that's being produced on a regular basis, and an interesting question is understanding how to leverage this data to solve interesting biological questions.
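The split-and-measure check proposed above could be sketched like this (NumPy only; the random matrix and Euclidean metric are placeholders for real scRNA-seq profiles and whatever distance the analysis actually uses):

```python
import numpy as np

def nearest_train_distances(cells, test_frac=0.2, n_splits=5, seed=0):
    """For several random train/test splits, return each test cell's
    distance to its closest training-set neighbor."""
    rng = np.random.default_rng(seed)
    all_dists = []
    n = cells.shape[0]
    n_test = int(test_frac * n)
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = cells[perm[:n_test]], cells[perm[n_test:]]
        # Pairwise Euclidean distances, shape (n_test, n_train).
        d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1)
        all_dists.append(d.min(axis=1))
    return np.concatenate(all_dists)

# Placeholder "expression profiles": 500 cells x 20 features.
rng = np.random.default_rng(2)
cells = rng.random((500, 20))

dists = nearest_train_distances(cells)
# A consistently small dists distribution would support the coverage assumption.
```
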

Next, we'd like to come up with a more accurate method to represent a test cell as a distribution over the training cells, rather than simply assigning it to the closest cell in the training set. One approach could be to consider the training cells as bases for cell state space, and then represent new cells using their coefficients in this space. The difficulty is that training becomes harder: the training cells are then just indicator functions, as opposed to the linear combinations that other cells are.
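One simple instantiation of this idea (our own sketch, not the thesis method) is to express a new cell as nonnegative, normalized coefficients over the training cells, e.g. via a softmax over negative squared distances; hard nearest-neighbor assignment is recovered as the bandwidth shrinks to zero:

```python
import numpy as np

def coefficients_over_training(cell, train_cells, bandwidth=1.0):
    """Represent `cell` as a distribution over training cells using a
    softmax of negative squared Euclidean distances."""
    d2 = np.sum((train_cells - cell) ** 2, axis=1)
    logits = -d2 / bandwidth
    logits -= logits.max()            # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Placeholder data: 50 training "cells" in a 10-dimensional state space.
rng = np.random.default_rng(3)
train_cells = rng.random((50, 10))
cell = train_cells[7] + 0.01 * rng.random(10)   # a test cell near training cell 7

w = coefficients_over_training(cell, train_cells, bandwidth=0.05)
```
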

The final, and most important, direction to take this research would be to use these computational techniques in concert with an experimental workflow to validate the full-stack thesis that binary combinations of genes can be used to impute the full spatial transcriptome. The experimental workflow would need to be carefully designed, since it's unclear how to gather both ground-truth binary combination data from a targeted FISH method and spatial transcriptome data from the same sample, as would be needed to truly validate that this method works. Nevertheless, it's the next problem that would need to be solved to bring the results of this thesis to fruition.

