How to analyze single-cell sequencing data with a Python package (Scanpy)
group
date
Nov 5, 2024
slug
scanpy
status
Published
tags
bioinformatics
scRNA-sequencing
10X genomics
python
summary
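Single-cell sequencing analysis has traditionally relied on the R package Seurat. However, bioinformaticians who are less fluent in R, or who work mainly in Python, often find Seurat hard to use and understand. This post therefore walks through single-cell sequencing data analysis with the Python package Scanpy. I hope it helps with your own data processing 🎉🎈🍾👇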
type
Post
Preprocessing and clustering
Load the required packages
Read the data
Quality control
- QC violin plots

- Scatter plot combining the QC metrics

Based on the QC metric plots, one could now remove cells that have too many mitochondrial genes expressed or too many total counts by setting manual or automatic thresholds. However, sometimes what appears to be poor QC metrics can be driven by real biology, so we suggest starting with a very permissive filtering strategy and revisiting it at a later point. We therefore now only filter out cells with fewer than 100 genes expressed and genes that are detected in fewer than 3 cells.
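To make the two thresholds concrete, here is a minimal plain-Python sketch of what this filter does to a cells x genes count matrix. The helper name `filter_cells_and_genes` and the toy matrix are illustrative assumptions, not code from this post. In Scanpy itself the same step is normally written as `sc.pp.filter_cells(adata, min_genes=100)` followed by `sc.pp.filter_genes(adata, min_cells=3)`.

```python
def filter_cells_and_genes(counts, min_genes, min_cells):
    """Apply the two permissive filters to a cells x genes count matrix.

    counts: list of rows, one row per cell, one column per gene.
    Returns (filtered matrix, kept cell indices, kept gene indices).
    """
    # A gene counts as "expressed" in a cell when its count is above zero.
    kept_cells = [
        i for i, row in enumerate(counts)
        if sum(1 for v in row if v > 0) >= min_genes
    ]
    n_genes = len(counts[0]) if counts else 0
    # Gene detection is evaluated on the cells that survived the first filter.
    kept_genes = [
        j for j in range(n_genes)
        if sum(1 for i in kept_cells if counts[i][j] > 0) >= min_cells
    ]
    filtered = [[counts[i][j] for j in kept_genes] for i in kept_cells]
    return filtered, kept_cells, kept_genes


# Toy data with tiny thresholds so the effect is visible; the text uses
# min_genes=100 and min_cells=3 on real data.
toy_counts = [
    [1, 0, 3, 0],  # 2 genes detected
    [0, 0, 0, 5],  # 1 gene detected
    [2, 1, 4, 0],  # 3 genes detected
    [0, 2, 1, 0],  # 2 genes detected
]
filtered, kept_cells, kept_genes = filter_cells_and_genes(
    toy_counts, min_genes=2, min_cells=2
)
print("kept cells:", kept_cells)
print("kept genes:", kept_genes)
```

Note that the last gene is removed only because the single cell expressing it was dropped first, which is why the order of the two filters matters.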
Doublet detection
As a next step, we run a doublet detection algorithm. Identifying doublets is crucial as they can lead to misclassifications or distortions in downstream analysis steps. Scanpy contains the doublet detection method Scrublet [Wolock et al., 2019]. Scrublet predicts cell doublets using a nearest-neighbor classifier of observed transcriptomes and simulated doublets. scanpy.pp.scrublet() adds doublet_score and predicted_doublet to .obs. One can now either filter directly on predicted_doublet or use the doublet_score later during clustering to filter clusters with high doublet scores.
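The two ways of using the Scrublet output can be sketched in plain Python as follows. The per-cell records stand in for the `doublet_score` and `predicted_doublet` columns in `.obs`; the barcodes, cluster labels, scores, and the `score_cutoff` value are all made-up assumptions for illustration.

```python
from collections import defaultdict


def drop_predicted_doublets(cells):
    """Option 1: filter directly on the predicted_doublet flag."""
    return [c for c in cells if not c["predicted_doublet"]]


def mean_doublet_score_per_cluster(cells):
    """Option 2: summarize doublet_score per cluster for later inspection."""
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for c in cells:
        totals[c["cluster"]] += c["doublet_score"]
        sizes[c["cluster"]] += 1
    return {k: totals[k] / sizes[k] for k in totals}


# Hypothetical per-cell annotations, mimicking columns of adata.obs.
cells = [
    {"barcode": "c1", "cluster": "0", "doublet_score": 0.05, "predicted_doublet": False},
    {"barcode": "c2", "cluster": "0", "doublet_score": 0.15, "predicted_doublet": False},
    {"barcode": "c3", "cluster": "1", "doublet_score": 0.70, "predicted_doublet": True},
    {"barcode": "c4", "cluster": "1", "doublet_score": 0.50, "predicted_doublet": False},
]

singlets = drop_predicted_doublets(cells)
cluster_scores = mean_doublet_score_per_cluster(cells)

score_cutoff = 0.4  # hypothetical cutoff; choose it by inspecting your own data
suspect_clusters = sorted(k for k, s in cluster_scores.items() if s > score_cutoff)
print("singlets:", [c["barcode"] for c in singlets])
print("clusters with a high mean doublet score:", suspect_clusters)
```

The second option keeps cell `c4`, which was not flagged individually, under suspicion because it sits in a cluster dominated by high scores.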
Normalization
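As a rough sketch of what normalization does, the following plain-Python example scales every cell to the median total count and then applies a log1p transform. This assumes the common count-depth scaling approach, which is what `sc.pp.normalize_total(adata)` (with its default target) followed by `sc.pp.log1p(adata)` performs; the helper name and toy counts are illustrative only.

```python
import math
from statistics import median


def normalize_and_log(counts):
    """Scale each cell to the median total count, then apply log(1 + x)."""
    totals = [sum(row) for row in counts]
    target = median(totals)
    normalized = []
    for row, total in zip(counts, totals):
        # Cells with zero counts are left at zero instead of dividing by zero.
        scale = target / total if total > 0 else 0.0
        normalized.append([math.log1p(v * scale) for v in row])
    return normalized


toy_counts = [
    [10, 0, 10],   # total 20
    [5, 5, 0],     # total 10
    [20, 10, 10],  # total 40
]
normalized = normalize_and_log(toy_counts)
for row in normalized:
    print([round(v, 3) for v in row])
```

After scaling, every cell has the same total (20 in this toy case), so differences between cells reflect relative expression and not sequencing depth.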
Feature selection

Dimensionality reduction
Choosing a suitable number of PCs
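The number of PCs is usually chosen by eye from the variance-ratio plot (`sc.pl.pca_variance_ratio`), looking for the elbow. As a complementary numeric heuristic, the sketch below picks the smallest number of PCs whose cumulative explained variance reaches a target. The helper name, the 0.85 target, and the toy ratios are assumptions for illustration; with Scanpy, the ratios would come from `adata.uns["pca"]["variance_ratio"]` after running PCA.

```python
def n_pcs_for_variance(variance_ratios, target=0.85):
    """Return the smallest number of PCs whose cumulative ratio reaches target."""
    cumulative = 0.0
    for n, ratio in enumerate(variance_ratios, start=1):
        cumulative += ratio
        if cumulative >= target:
            return n
    # The target was never reached: fall back to using every available PC.
    return len(variance_ratios)


# Toy explained-variance ratios, sorted from the first PC downwards.
toy_ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
n_pcs = n_pcs_for_variance(toy_ratios, target=0.85)
print("number of PCs to keep:", n_pcs)
```

In practice, overestimating the number of PCs tends to cost little, so treat the result as a lower bound to check against the plot.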
You can also plot the principal components to see if there are any potentially undesired features (e.g. batch, QC metrics) driving significant variation in this dataset. In this case, there isn’t anything too alarming, but it’s a good idea to explore this.

Nearest neighbor graph construction and visualization

Even though the data considered in this tutorial includes two different samples, we only observe a minor batch effect and we can continue with clustering and annotation of our data. If you do observe batch effects in your UMAP, it can be beneficial to integrate across samples and perform batch correction/integration. We recommend checking out scanorama and scvi-tools for batch integration.
Clustering

Re-assess quality control and cell filtering


Manual annotation
We have now reached a point where we have obtained a set of cells with decent quality, and we can proceed to their annotation to known cell types. Typically, this is done using genes that are exclusively expressed by a given cell type, or in other words these genes are the marker genes of the cell types, and are thus used to distinguish the heterogeneous groups of cells in our data. Previous efforts have collected and curated various marker genes into available resources, such as CellMarker, TF-Marker, and PanglaoDB. The cellxgene gene expression tool can also be quite useful to see which cell types a gene has been expressed in across many existing datasets.
The number of clusters is controlled by the resolution parameter

Marker gene set
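A marker gene set is typically written as a dictionary from cell type to gene symbols, and it helps to drop any marker that is missing from the dataset before plotting. The sketch below shows that check in plain Python. The dictionary holds a few widely used immune-cell markers purely as an example and is not the set used in this post; the helper name and the `var_names` list are likewise assumptions.

```python
def markers_in_dataset(marker_genes, var_names):
    """Keep only markers present in the dataset; drop cell types left empty."""
    available = set(var_names)
    present = {}
    for cell_type, genes in marker_genes.items():
        found = [g for g in genes if g in available]
        if found:
            present[cell_type] = found
    return present


# Illustrative markers only; replace with a curated set for your tissue.
marker_genes = {
    "B cells": ["MS4A1", "CD79A"],
    "T cells": ["CD3D", "CD3E"],
    "NK cells": ["GNLY", "NKG7"],
    "CD14+ monocytes": ["CD14", "LYZ"],
}
# Hypothetical gene names, standing in for adata.var_names.
var_names = ["MS4A1", "CD79A", "CD3E", "NKG7", "GNLY", "ACTB"]

present = markers_in_dataset(marker_genes, var_names)
for cell_type, genes in present.items():
    print(cell_type, genes)
```

The resulting dictionary can be passed as the gene argument to `sc.pl.dotplot`, which accepts a mapping from group label to gene list and draws one bracket per cell type.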

There are fairly clear patterns of expression for our markers shown here, which we can use to label our coarsest clustering with broad lineages.

This seems like a resolution that is suitable to distinguish most of the different cell types in our data. As such, let’s try to annotate those manually using the dotplot above, together with the UMAP of our clusters. Ideally, one would also look specifically into each cluster, and attempt to subcluster those if required.
Differentially expressed genes as markers
Dot plot of the top 5 differentially expressed genes per cluster, ranked with the Wilcoxon test
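For intuition about the Wilcoxon ranking, here is a simplified plain-Python sketch of a rank-sum test for one gene, comparing the cells of one cluster against all other cells. It uses the normal approximation without a tie correction to the variance and is not a copy of Scanpy's implementation; the function names and toy expression values are assumptions.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Rank values starting at 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def rank_sum_test(group, rest):
    """Return (U statistic, z-score, two-sided p-value) for group vs. rest."""
    n1, n2 = len(group), len(rest)
    ranks = average_ranks(list(group) + list(rest))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean_u) / sd_u
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, z, p


# Toy expression of one gene: 4 cells in the cluster versus 4 other cells.
in_cluster = [5, 6, 7, 8]
other_cells = [1, 2, 3, 4]
u_small, z_small, p_small = rank_sum_test(in_cluster, other_cells)

# The same pattern observed in 25 times as many cells.
u_large, z_large, p_large = rank_sum_test(in_cluster * 25, other_cells * 25)
print("8 cells:   z = %.2f, p = %.4g" % (z_small, p_small))
print("200 cells: z = %.2f, p = %.4g" % (z_large, p_large))
```

The z-score grows with the number of cells even though the expression pattern is unchanged, which is one reason p-values from single-cell comparisons become very small.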

- To inspect the differentially expressed genes of a specific group, use the scanpy.get.rank_genes_groups_df() method.


You may have noticed that the p-values found here are extremely low. This is due to the statistical test being performed considering each cell as an independent sample. For a more conservative approach you may want to consider “pseudo-bulking” your data by sample (e.g. sc.get.aggregate(adata, by=["sample", "cell_type"], func="sum", layer="counts")) and using a more powerful differential expression tool, like pydeseq2.
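What pseudo-bulking does can be sketched in plain Python: raw counts are summed over all cells that share the same sample and cell type, so that each sample contributes one profile per cell type. This mirrors the idea behind the `sc.get.aggregate(..., func="sum")` call quoted above; the helper name and the toy data are assumptions for illustration.

```python
from collections import defaultdict


def pseudobulk_sum(counts, samples, cell_types):
    """Sum raw counts over cells sharing the same (sample, cell type) pair."""
    n_genes = len(counts[0]) if counts else 0
    sums = defaultdict(lambda: [0] * n_genes)
    for row, sample, cell_type in zip(counts, samples, cell_types):
        profile = sums[(sample, cell_type)]
        for j, value in enumerate(row):
            profile[j] += value
    return dict(sums)


# Toy raw counts: 4 cells x 2 genes, with per-cell sample and cell type labels.
toy_counts = [[1, 2], [3, 4], [5, 6], [7, 8]]
samples = ["s1", "s1", "s2", "s2"]
cell_types = ["T", "T", "T", "B"]

pseudobulk = pseudobulk_sum(toy_counts, samples, cell_types)
for key, profile in pseudobulk.items():
    print(key, profile)
```

Each summed profile then acts as one observation, so the statistical test sees the number of samples and not the number of cells.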
