Preprocessing of Genotyping & scRNA-seq data

Checkpoint (1) 25/09/2023


Director: Pr. Lluis Quintana-Murci
Co-supervisor: Dr. Maxime Rotival
Human Evolutionary Genetics Unit
Marwan Sharawy
Pipeline progress
• Utilizing a combination of KNN10 and doublet detection for cryptic doublets.

• Filtering out cells with more than 20% mitochondrial reads.

• Excluding cells with a low number of detected genes: computing the 2nd percentile (0.02 quantile) of gene counts per library and
removing cells below this threshold.

• Total cells removed: 86,039 (11% of total cells).

• Performing scran normalization, feature selection, dimensionality reduction (PCA), and clustering.
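
The two QC filters listed above (mitochondrial fraction and per-library gene-count percentile) can be sketched in plain Python. This is only an illustration under stated assumptions: the cell records, the field names `library`, `n_genes`, `pct_mito`, and the helper names are hypothetical and not taken from the actual pipeline.

```python
from collections import defaultdict


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def qc_filter(cells, max_mito_pct=20.0, gene_pct=2.0):
    """Keep cells with at most max_mito_pct mitochondrial reads and a gene
    count at or above the gene_pct-th percentile of their own library."""
    genes_by_lib = defaultdict(list)
    for c in cells:
        genes_by_lib[c["library"]].append(c["n_genes"])
    # One cutoff per library, as described in the bullet list.
    cutoff = {lib: percentile(v, gene_pct) for lib, v in genes_by_lib.items()}
    return [
        c for c in cells
        if c["pct_mito"] <= max_mito_pct and c["n_genes"] >= cutoff[c["library"]]
    ]


# Toy data (hypothetical): 50 clean cells plus one high-mitochondrial cell.
cells = [{"library": "L1", "n_genes": 100.0 * (i + 1), "pct_mito": 5.0}
         for i in range(50)]
cells.append({"library": "L1", "n_genes": 3000.0, "pct_mito": 35.0})
kept = qc_filter(cells)
```

Computing the cutoff per library rather than globally keeps a shallowly sequenced library from losing a disproportionate share of its cells.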
Total merged libraries

726k total cells after filtering / 88 libraries


Filtered_out 86k

• True_DBL 64226
• low_Gene_Counts 16280
• Cryptic_doublet 5443
Cryptic doublets accuracy

DD&knn10 &scr
DD&KNN5 TPR 89%
TPR 84% DD&scr
TPR 0.81%

DD&knn5&scr
DD&knn10 TPR 85 %
True positive rate (TPR): 88%
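
As a reminder of how a true positive rate is obtained, here is a minimal sketch: TPR is the fraction of known doublets that a method flags, TP / (TP + FN). The barcodes and the function name are hypothetical, and how the slide's method combinations merge their individual calls is not specified here.

```python
def true_positive_rate(flagged, known_doublets):
    """TPR = TP / (TP + FN), computed over sets of cell barcodes."""
    if not known_doublets:
        raise ValueError("known_doublets must not be empty")
    true_positives = len(flagged & known_doublets)
    return true_positives / len(known_doublets)


# Toy example (hypothetical barcodes): 3 of 4 known doublets are flagged.
known = {"c1", "c2", "c3", "c4"}
flagged = {"c1", "c2", "c3", "c9"}
tpr = true_positive_rate(flagged, known)
```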
Cryptic doublets accuracy

Filtered_out
• True_DBL 64226
• low_Gene_Counts 16280
• Cryptic_doublet 5443

• Anticipated cryptic doublets across all libraries: 5,320 (computed as TRUE_DBL * 1/12).

• Considering a 12% false positive rate, there are 642 expected false positives in the dataset.

• Out of 725,000 total cells, 642 cells are identified as cryptic doublets (0.08% of total cells).
Feature selection

• Perform feature selection by computing the coefficient of variation for all genes.
• Select N genes based on their coefficient of variation and mean expression, and use these features for
downstream analysis.
• means > 0.0125 & dispersions > 0.2
• means > 0.001 & dispersions < 1
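
A threshold-based selection of this kind can be sketched as follows. The sketch assumes that "dispersion" means the variance-to-mean ratio of normalized expression, which the slides do not state, and it applies only the first threshold pair (mean > 0.0125 and dispersion > 0.2). Gene names and values are made up for illustration.

```python
from statistics import mean, pvariance


def select_genes(expr, min_mean=0.0125, min_disp=0.2):
    """expr maps gene name -> list of normalized expression values (one per cell).
    Returns the genes whose mean and dispersion both exceed the thresholds."""
    selected = []
    for gene, values in expr.items():
        m = mean(values)
        if m <= 0:
            continue  # dispersion is undefined for unexpressed genes
        dispersion = pvariance(values) / m  # assumed definition: variance / mean
        if m > min_mean and dispersion > min_disp:
            selected.append(gene)
    return selected


# Toy data (hypothetical genes and values).
expr = {
    "GENE_A": [0.0, 0.0, 4.0, 0.0],   # variable: mean 1.0, dispersion 3.0
    "GENE_B": [1.0, 1.0, 1.0, 1.0],   # constant: dispersion 0.0
    "GENE_C": [0.0, 0.0, 0.0, 0.01],  # mean 0.0025, below the mean threshold
}
selected = select_genes(expr)
```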
Feature selection

• Testing deviance-based feature selection, which works on raw counts [Germain et al., 2020]
• Quantifies whether genes show a constant expression profile across cells

Additionally, any gene labeled as highly deviant (top 12,000 deviant genes) within
all 88 libraries is kept.

• means > 0.0125 & dispersions > 0.2
• means > 0.001 & dispersions < 1
• Highly deviant in all 88 libraries

• Total genes filtered: 21,241

• Total genes retained: 15,885
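
To make the deviance idea concrete, here is a minimal sketch assuming a binomial deviance against a null model in which each gene takes a constant proportion of every cell's total counts. The slides do not say which deviance is used, so this choice, the function name, and the toy matrix are assumptions.

```python
import math


def binomial_deviance(counts):
    """counts: list of cells, each a list of raw counts per gene.
    Returns one deviance per gene; 0 means the gene's share of total
    counts is identical in every cell."""
    totals = [sum(cell) for cell in counts]
    grand_total = sum(totals)
    n_genes = len(counts[0])
    deviances = []
    for j in range(n_genes):
        pi = sum(cell[j] for cell in counts) / grand_total  # null proportion
        d = 0.0
        for cell, n in zip(counts, totals):
            y = cell[j]
            if y > 0:
                d += y * math.log(y / (n * pi))
            if n - y > 0:
                d += (n - y) * math.log((n - y) / (n * (1.0 - pi)))
        deviances.append(2.0 * d)
    return deviances


# Toy raw counts (hypothetical): gene 0 is constant, genes 1 and 2 alternate.
counts = [
    [10, 10, 0],
    [10, 0, 10],
    [10, 10, 0],
    [10, 0, 10],
]
dev = binomial_deviance(counts)
# Rank genes from most to least deviant.
ranked = sorted(range(len(dev)), key=lambda j: dev[j], reverse=True)
```

Genes would then be kept by taking the top of this ranking within each library.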
Clustering

• Computing the neighborhood graph by calculating a Euclidean distance matrix on the PC-reduced
expression space for all cells and then connecting each cell to its K most similar cells.

('PCA + euclidean + kNN + Leiden')


• Currently testing:

('OT + kNN + Leiden')


• OT (optimal transport) defines distances to compare high-dimensional data represented as probability distributions.
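
The Euclidean kNN step can be illustrated with a brute-force sketch on toy coordinates standing in for PCA scores. The function name and data are hypothetical; a real run on hundreds of thousands of cells would need an approximate neighbor search, and Leiden clustering is not shown.

```python
import math


def knn_graph(points, k):
    """Connect each point to its k nearest neighbors by Euclidean distance.
    Returns a dict: point index -> list of neighbor indices, nearest first."""
    graph = {}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Toy "PC space" (hypothetical): two well-separated groups of three cells.
pcs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
graph = knn_graph(pcs, k=2)
```

Swapping `math.dist` for another distance function is the only change needed to try a different metric on the same graph construction.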
Integration

• scVI integration of the 88 libraries


Integration

• Leiden clustering: 20 clusters at resolution 0.5


Cell type annotation

• Using CellID

• 2 references used to automatically annotate cell types (hao20, “same one Yann used”)
Cell type annotation
• Using CellID

• 2 references used to automatically annotate cell types (“using 150k cells from Yann's dataset”)
To Do

• Manual cell annotations

• Rerun Yann's scripts to confirm my results (“multiBatchNorm followed by harmony integration”)

• Enhance integration accuracy by re-running integration using scANVI, an scVI model that incorporates
annotated cell types for more precise results.

• Conduct an in-depth analysis of Pilot 5 libraries, with a specific focus on comparing individuals
sampled in both Pilot 5 and V3, involving different strains of COVID and influenza.

• Obtain preliminary insights into the effects of different viruses by November 28.

• 13-15 October: preparing for exam on 16 October

• 27 November: mandatory poster session (Sorbonne)