Professional Documents
Culture Documents
Count QC
Normalization
The workflow is dataset-specific:
• Research question
PCA • Batches
• Experimental Conditions
• Sequencing method
tSNE / UMAP • …
Clustering
Diff. Expression
…
Why dimensionality reduction?
M genes 2 dim.
• Simplify complexity, so it becomes easier to work with.
N samples
N samples
Reduce number of features (genes)
In some: Transform non-linear relationships to linear
• Data visualization
… and more …
Some dimensionality reduction algorithms
They can be divided into 3 major groups:
PCA linear Matrix Factorization
ICA linear Matrix Factorization
MDS non-linear Matrix Factorization
https://pdfs.semanticscholar.org/664d/40258f12ad28ed0b7d4
Sparce NNMF non-linear Matrix Factorization 2010 c272935ad72a150db.pdf
cPCA non-linear Matrix Factorization 2018 https://doi.org/10.1038/s41467-018-04608-8
ZIFA non-linear Matrix Factorization 2015 https://doi.org/10.1186/s13059-015-0805-z
ZINB-WaVE non-linear Matrix Factorization 2018 https://doi.org/10.1038/s41467-017-02554-5
samples
genes
genes
Original Center
Scaling samples Planar
data Rotation rotation
data PC$rotation PC$x
0 0
sa
sa
sa
sa
−1 −1 −1 −1
−3
−2 −2
−3
−3 −3
−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
−3 −2 −1 0 1 2 3 −3 −2 −1
sample 1 0 1 2 3 sample 1
original data (Z-score) sample 1 sample 1
33 33 3 3
322 3
22
(0,0) 2 2
2
1
gene_B
PC
PC
11 211 1 2 1
sample 22
sample 2
sample 2
sample 2
2
sample
00 samplesample
2 100 0
1 0
sample 2
−1
−1 −1
−1 −1 −1
0 0
−2
−2 −2
−2 −2 −2
−1 −1
−3
−3 −3
−3 −3 −3
−3 −2
−3 −2 −1
gene_A
−1 00
sample 11
sample
11 22 33 −2 −3 −2
−3 −2 −1
−1 00
sample 11
sample
11 22 33 −3
−2 −2 −1 0
sample 1
1 2 3 −3 −2 −1 0
sample 1
1 2 3
−3 rotation to another
−3
0coordinate3 system
33 33
2
−3 −2 −1 1 2 −3 −2 −1 0 1 2 3
22 22 sample 1 sample 1
PCA
1
11 11
sample 22
sample 22
2
sample
sample
PC2
00 00
0
1.0 1.0
−1
−1 −1
−1
1
−1
0.8 0.8
−2
−2 −2
−2 percentage
0.6 0.6
of variance
PC2
PC2
−2
−3
−3 −3
−3
0
0.2 0.2
−1
22
0.0 0.0
PC1 PC2 PC1
PC1 PC2
PC2
−2
11
−3 −2 −1 PC1
0 1 2 3
PC1
22
0 0
sa
sa
−1 −1
−3
−2
−3
−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
3 3
2 2
2
1
1 PC thus represents 2 genes very well
PC
PC
1 1
sample 2
sample 2
“Removing” redundancy 0 0
−1 −1
−2 −2
Could be disregarded
−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
sample 1 sample 1
2
In real life …
1
PC2
0
1.0 1.0
−1
0.8 0.8
percentage
0.6 0.6
of variance
−2
0.4
−3
−3 −2 −1 0 explained
1 2 3 0.4
PC1
0.2 0.2
0.0 0.0
PC1 PC2 PC1
PC1 PC2
PC2
Seurat Pipeline
PCA in single cell data
PC1
• Data is usually SCALED prior to PCA (Z-score | see ScaleData in the Seurat)
• The TOP principal components contain higher variance from the data
Problems:
• It performs poorly to separate cells in 0-inflated data types (because of it non-linearity
nature)
• Cell sizes and sequencing depth are usually captured in the top principal components
A very brief intro to
graphs
Graphs
Manifold:
distance in PCA space
(euclidean distance)
In other words, t-SNE calculates the distances based on the distance to the neighbor cell
j j
i i
t-distribution
= pj|i = qj|i
Higher KL divergence
(cost / error)
iterations
gene B
iterations
iterations
gene A
Lower KL divergence
(cost / error)
The definition of the t-SNE and the chances of converging correctly depends on the
hyper-parameters (“tuning” parameters).
t-SNE has over 10 hyper-parameters that can be optimized for your specific data.
250k cells
1 hour
To keep in mind:
Problems:
• It’s cost function is not convex – This means that the optimal t-SNE cannot be computed
Points are connected with a line (edge) if the distance between them is below a
threshold:
- Any distance metric can be used (euclidean)
(0,1) …
(g1,g2,g3,g4,g5, …)
The distance in the manifold are the same, but not in the REAL space.
The distance is now “variable” in the REAL space for each point (t-SNE was fixed)
gene A
Correlation mapping
Cell number
positions
McInnes et al (2018) BioRxiv
Becht & McInnes et al (2019) Nat Biot
UMAP hyper-parameters
UMAP assumes that there is a manifold in the dataset, it could also tend to cluster noise.
250k cells
7 min
To keep in mind:
3. tSNE: What happens if you use only 5 PCs as input for your tSNE?
4. tSNE: What happens if you increase perplexity to 50 ? And with 5?
5. tSNE: What happens if you decrease theta? Try between 0.1 - 0.2.
6. tSNE: What happens if decrease the number of iteration to 400?