Single-Cell Atlas Construction

Full single-cell RNA-seq pipeline from raw counts to biological interpretation. Covers QC, normalization, batch integration, clustering, annotation, pseudobulk DE, trajectory inference, cell-cell communication, and TF activity. Dual-language: Seurat v5 (R) and scanpy (Python).

When to Use This Skill

Activate when the user requests:

Single-cell RNA-seq quality control and preprocessing
10x Chromium data processing (Cell Ranger output)
Doublet detection and removal
Normalization of UMI count matrices
Feature selection for dimensionality reduction
Processing scRNA-seq data before integration, clustering, or annotation
Batch integration across samples or experiments
Clustering and resolution selection
Automated or marker-based cell type annotation
UMAP visualization
Differential expression between conditions (pseudobulk)
Trajectory inference and RNA velocity
Cell-cell communication analysis
Transcription factor activity and gene regulatory networks

Inputs

Data Type	Format	Source
Count matrix	10x HDF5 (`.h5`), MEX (matrix.mtx + barcodes + features)	Cell Ranger, STARsolo
Count matrix	AnnData (`.h5ad`) for scanpy, Seurat object (`.rds`) for R	Processed datasets
Reference	PBMC 3k, Tabula Muris, HCA datasets	10x Genomics, HCA

Loading Data

Seurat v5 (R)

library(Seurat)  # v5.4+

# From Cell Ranger output (filtered_feature_bc_matrix/)
obj <- Read10X("path/to/filtered_feature_bc_matrix/") |>
  CreateSeuratObject(project = "sample1", min.cells = 3, min.features = 200)

# From HDF5
obj <- Read10X_h5("path/to/filtered_feature_bc_matrix.h5") |>
  CreateSeuratObject(project = "sample1", min.cells = 3, min.features = 200)

# Seurat v5 uses Assay5 (layers: counts, data, scale.data)
# Access counts: obj[["RNA"]]$counts or LayerData(obj, layer = "counts")

scanpy (Python)

import scanpy as sc  # v1.12+

# From Cell Ranger output
adata = sc.read_10x_mtx("path/to/filtered_feature_bc_matrix/", var_names="gene_symbols")

# From HDF5
adata = sc.read_10x_h5("path/to/filtered_feature_bc_matrix.h5")

# Store raw counts for later
adata.raw = adata.copy()

Quality Control

Computing QC Metrics

Seurat

obj[["percent.mt"]] <- PercentageFeatureSet(obj, pattern = "^MT-")
# Mouse: pattern = "^mt-"
# Ribosomal: pattern = "^RP[SL]" (optional, rarely used for filtering)

# QC stats are in obj@meta.data:
#   nCount_RNA = total UMI per cell
#   nFeature_RNA = genes detected per cell
#   percent.mt = mitochondrial fraction

scanpy

adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=True, inplace=True)

# Adds to adata.obs:
#   total_counts, n_genes_by_counts, pct_counts_mt
#   log1p_total_counts, log1p_n_genes_by_counts

MAD-Based Filtering

Why MAD over fixed thresholds:
  Fixed thresholds (>200 genes, <5% MT) are arbitrary and dataset-specific.
  A threshold that works for PBMCs fails for tumor tissues (higher MT baseline)
  or deep-sequenced datasets (more genes detected per cell).

  MAD (median absolute deviation) adapts to the actual data distribution.
  Flag cells that deviate > N MADs from the median in the problematic direction.
  Default: 3 MADs (~99% of values under normality).
  Permissive: 5 MADs (for heterogeneous tissues).

Seurat (manual MAD implementation)

# Compute MAD thresholds per metric
mad_outlier <- function(x, nmads = 3, direction = "both") {
  med <- median(x, na.rm = TRUE)
  mad_val <- mad(x, na.rm = TRUE)
  if (direction == "lower") return(x < med - nmads * mad_val)
  if (direction == "higher") return(x > med + nmads * mad_val)
  (x < med - nmads * mad_val) | (x > med + nmads * mad_val)
}

obj$outlier_mt <- mad_outlier(obj$percent.mt, nmads = 3, direction = "higher")
obj$outlier_ncount <- mad_outlier(log1p(obj$nCount_RNA), nmads = 3, direction = "both")
obj$outlier_nfeat <- mad_outlier(log1p(obj$nFeature_RNA), nmads = 3, direction = "lower")

obj$discard <- obj$outlier_mt | obj$outlier_ncount | obj$outlier_nfeat

cat(sprintf("Cells flagged: %d / %d (%.1f%%)\n",
  sum(obj$discard), ncol(obj), 100 * mean(obj$discard)))

obj <- subset(obj, subset = discard == FALSE)

scanpy (MAD filtering)

import numpy as np

def mad_outlier(adata, metric, nmads=3, direction="higher"):
    """Flag outliers beyond N MADs from median."""
    vals = adata.obs[metric].values
    med = np.median(vals)
    mad = np.median(np.abs(vals - med)) * 1.4826  # scale to match SD
    if direction == "higher":
        return vals > med + nmads * mad
    elif direction == "lower":
        return vals < med - nmads * mad
    else:
        return (vals < med - nmads * mad) | (vals > med + nmads * mad)

adata.obs["outlier_mt"] = mad_outlier(adata, "pct_counts_mt", nmads=3, direction="higher")
adata.obs["outlier_counts"] = mad_outlier(adata, "log1p_total_counts", nmads=3, direction="both")
adata.obs["outlier_genes"] = mad_outlier(adata, "log1p_n_genes_by_counts", nmads=3, direction="lower")

adata.obs["discard"] = adata.obs["outlier_mt"] | adata.obs["outlier_counts"] | adata.obs["outlier_genes"]

print(f"Cells flagged: {adata.obs['discard'].sum()} / {adata.n_obs} "
      f"({100 * adata.obs['discard'].mean():.1f}%)")

adata = adata[~adata.obs["discard"]].copy()

Bioconductor (scater — dedicated function)

library(scater)  # Bioconductor
sce <- as.SingleCellExperiment(obj)
sce <- addPerCellQCMetrics(sce, subsets = list(mt = grep("^MT-", rownames(sce))))

# isOutlier handles log-transformation and batch-aware thresholds
discard <- isOutlier(sce$subsets_mt_percent, nmads = 3, type = "higher") |
  isOutlier(sce$detected, nmads = 3, type = "lower", log = TRUE) |
  isOutlier(sce$sum, nmads = 3, type = "both", log = TRUE)

sce <- sce[, !discard]

Gene Filtering

# Remove genes detected in very few cells
# min.cells = 3 in CreateSeuratObject handles this for Seurat

# For scanpy:
# sc.pp.filter_genes(adata, min_cells=3)

# Also consider removing:
#   Mitochondrial genes (after using them for QC)
#   Ribosomal genes (optional, sometimes informative)
#   Sex-linked genes (if confounding — XIST, DDX3Y, etc.)

Doublet Detection

scDblFinder (R — top performer in benchmarks)

library(scDblFinder)  # Bioconductor

sce <- as.SingleCellExperiment(obj)
sce <- scDblFinder(sce)

# Results in colData(sce)$scDblFinder.class: "singlet" or "doublet"
# Also: scDblFinder.score (continuous), scDblFinder.cxds_score
table(sce$scDblFinder.class)

# Expected doublet rate for 10x Chromium: ~0.8% per 1000 cells captured
# 5000 cells -> ~4% doublets, 10000 cells -> ~8% doublets

# Transfer back to Seurat
obj$doublet_class <- sce$scDblFinder.class
obj$doublet_score <- sce$scDblFinder.score
obj <- subset(obj, subset = doublet_class == "singlet")

Scrublet (Python — integrated in scanpy)

# Scrublet is now in the core scanpy API (not external)
sc.pp.scrublet(adata, expected_doublet_rate=0.06)

# Adds to adata.obs:
#   predicted_doublet (bool), doublet_score (float)
print(f"Detected doublets: {adata.obs['predicted_doublet'].sum()}")

adata = adata[~adata.obs["predicted_doublet"]].copy()

Doublet detection order:
  Run BEFORE normalization and feature selection.
  Doublets with mixed transcriptomes distort HVG selection
  and normalization size factors.

  For multi-sample experiments:
    Run doublet detection PER SAMPLE, not on merged data.
    Cross-sample doublets don't exist in droplet-based protocols.

Normalization

Decision Tree

Which normalization method?

  Seurat v5 workflow, plan to integrate multiple samples?
    -> SCTransform(obj) — v2 is default in Seurat v5
    -> Replaces NormalizeData + FindVariableFeatures + ScaleData in one call
    -> Returns 3000 variable features by default
    -> Requires glmGamPoi package for speed

  Seurat v5, single sample or exploratory analysis?
    -> NormalizeData(obj) + FindVariableFeatures(obj) + ScaleData(obj)
    -> Log-normalization: counts / total * 10000 -> log1p
    -> Fast, simple, adequate for most single-sample analyses

  scanpy workflow?
    -> sc.pp.normalize_total(adata, target_sum=1e4) + sc.pp.log1p(adata)
    -> Standard log-normalization, analogous to Seurat's LogNormalize

  Need principled size factor estimation (composition bias)?
    -> scran::computeSumFactors() on a SingleCellExperiment
    -> Requires pre-clustering with scran::quickCluster()
    -> Slowest method but handles composition bias best

  Dealing with very large datasets (>500k cells)?
    -> SCTransform with BPCells on-disk storage (Seurat v5)
    -> Or scanpy with backed mode: sc.read_h5ad("data.h5ad", backed="r")

SCTransform v2 (Seurat)

# SCTransform v2 is the default in Seurat v5
# Uses glmGamPoi for fast regularized negative binomial regression
# install.packages("glmGamPoi")  # or BiocManager::install("glmGamPoi")

obj <- SCTransform(obj, verbose = FALSE)
# Replaces NormalizeData + FindVariableFeatures + ScaleData
# Sets default assay to "SCT"
# Variable features stored in VariableFeatures(obj)

# For multiple samples, run SCTransform on each before integration
# Then use IntegrateLayers() or SelectIntegrationFeatures()

Log-Normalization (Seurat)

obj <- NormalizeData(obj)  # LogNormalize, scale.factor = 10000
obj <- FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000)
obj <- ScaleData(obj)      # centers and scales, optionally regress out variables
# ScaleData(obj, vars.to.regress = "percent.mt")  # regress out MT% if needed

scanpy Normalization

# Store raw counts first
adata.layers["counts"] = adata.X.copy()

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# adata.X now contains log-normalized data
# adata.layers["counts"] retains raw counts for DE analysis later

scran Deconvolution (Bioconductor)

library(scran)
sce <- as.SingleCellExperiment(obj)

# Pre-cluster to avoid violating the non-DE majority assumption
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters = clusters)
# Size factors in sizeFactors(sce), centered around 1

sce <- logNormCounts(sce)
# logcounts(sce) now contains normalized, log-transformed values

Feature Selection

Seurat (VST)

# After NormalizeData:
obj <- FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000)
# After SCTransform: already done (default 3000 features)

# Inspect top variable genes
head(VariableFeatures(obj), 20)

# Visualize variance-mean relationship
VariableFeaturePlot(obj)

scanpy

# Flavor depends on input: seurat_v3 needs raw counts, seurat needs log-normalized
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=2000)
# Adds adata.var["highly_variable"], adata.var["means"], adata.var["variances"]

# Subset to HVGs for PCA (keep all genes in adata.raw)
adata = adata[:, adata.var["highly_variable"]].copy()

Deviance-Based (scry — more principled alternative)

library(scry)  # Bioconductor
sce <- as.SingleCellExperiment(obj)

# Compute binomial deviance per gene on raw counts
sce <- devianceFeatureSelection(sce, assay = "counts", sorted = TRUE)
# Top genes by deviance are most informative
# Select top 2000-4000 genes
hvgs <- head(rownames(sce), 2000)

# Pairs with GLM-PCA instead of standard PCA
# glmpca <- GLMPCA(counts(sce[hvgs, ]), L = 30, fam = "poi")

Feature selection guidelines:
  2000-3000 HVGs works for most datasets.
  More HVGs (4000-5000) can help for heterogeneous tissues or rare cell types.
  No single method consistently outperforms others (2025 Nature Methods benchmark).
  Deviance-based selection (scry) is more principled than variance-based
  but less widely adopted.

Dimensionality Reduction

PCA

Seurat

obj <- RunPCA(obj, npcs = 50, verbose = FALSE)
# Determine how many PCs to use downstream
ElbowPlot(obj, ndims = 50)
# Pick the elbow — typically 15-30 PCs for most datasets
# Alternatively: use JackStraw (slow) or molecular cross-validation

scanpy

sc.tl.pca(adata, n_comps=50, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)
# Pick PCs where variance ratio flattens — typically 15-30

UMAP

UMAP is for visualization only. Never cluster on UMAP coordinates.
Distances between clusters in UMAP do NOT reflect biological similarity.
Cluster size does not reflect heterogeneity.

Parameters that matter:
  n_neighbors: 10-30 (default 15). Lower = more local detail, higher = broader topology.
  min_dist: 0.1-0.5 (default 0.1). Lower = tighter clusters, higher = more spread.
  For cell type identification: min_dist = 0.1-0.3
  For trajectory visualization: min_dist = 0.3-0.5

Seurat

obj <- FindNeighbors(obj, dims = 1:30)
obj <- RunUMAP(obj, dims = 1:30, min.dist = 0.3, n.neighbors = 30)
DimPlot(obj, reduction = "umap", group.by = "cell_type")

scanpy

sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata, min_dist=0.3)
sc.pl.umap(adata, color=["cell_type", "batch"])

Batch Integration

Method Selection

Which integration method?

  Few batches (2-5), similar cell compositions?
    -> Harmony: fast, works in PCA space, no GPU needed.
    -> RPCA (Seurat): conservative, preserves biology well.

  Many batches (>5), different cell compositions across batches?
    -> scVI: deep generative model, handles complex batch effects.
    -> CCA (Seurat): finds shared correlation structure across batches.

  Have partial cell type labels?
    -> scANVI: semi-supervised, uses labels for better integration.
    -> Best overall performer when labels are available (scIB benchmark).

  Very large dataset (>500k cells)?
    -> Harmony: scales linearly, runs in seconds on PCA.
    -> Seurat v5 sketch-based: subsamples, integrates sketch, projects back.

  Simple/fast first pass?
    -> Harmony. Works well for most cases and takes < 1 minute.

  Evaluate integration quality?
    -> scib-metrics: batch ASW, graph iLISI (mixing), cell type ASW, NMI (conservation).
    -> Scoring: 40% batch removal + 60% biological conservation.

Harmony (R)

library(harmony)  # v1.2+

# Run on PCA embeddings — fast, no GPU needed
obj <- RunHarmony(obj, group.by.vars = "batch", dims.use = 1:30)
# theta = 2 (default): diversity penalty. Increase to 3-4 for stubborn batch effects.
# lambda = 1 (default): ridge regression penalty. Decrease for stronger correction.

# Use harmony reduction for downstream steps
obj <- FindNeighbors(obj, reduction = "harmony", dims = 1:30)
obj <- RunUMAP(obj, reduction = "harmony", dims = 1:30)
obj <- FindClusters(obj, resolution = 0.5)

Harmony (scanpy)

import scanpy.external as sce

sce.pp.harmony_integrate(adata, key="batch", basis="X_pca", adjusted_basis="X_pca_harmony")
# Use the corrected embedding for neighbors/UMAP
sc.pp.neighbors(adata, use_rep="X_pca_harmony", n_pcs=30)
sc.tl.umap(adata)

Seurat v5 IntegrateLayers

# Split layers by batch (if not already split after merge)
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$batch)

# Standard workflow: normalize, HVGs, scale, PCA, then integrate
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)

# Integration — pick one method:
obj <- IntegrateLayers(obj, method = HarmonyIntegration,
  orig.reduction = "pca", new.reduction = "harmony")

# Or CCA/RPCA:
# obj <- IntegrateLayers(obj, method = CCAIntegration,
#   orig.reduction = "pca", new.reduction = "integrated.cca")

# Rejoin layers before DE analysis
obj <- JoinLayers(obj)

scVI (Python)

import scvi  # v1.4+

# Setup — requires raw counts in a specific layer
scvi.model.SCVI.setup_anndata(adata, batch_key="batch", layer="counts")

model = scvi.model.SCVI(adata, n_latent=30, n_layers=2, gene_likelihood="nb")
model.train(max_epochs=400, accelerator="auto")
# GPU speeds training ~10x but CPU works for < 100k cells

# Get latent representation
adata.obsm["X_scVI"] = model.get_latent_representation()

# Use for neighbors/UMAP
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)

scANVI (semi-supervised, when partial labels available)

# Start from a trained scVI model
scvi.model.SCANVI.setup_anndata(adata, batch_key="batch",
  labels_key="cell_type", unlabeled_category="Unknown")

scanvi_model = scvi.model.SCANVI.from_scvi_model(model,
  labels_key="cell_type", unlabeled_category="Unknown")
scanvi_model.train(max_epochs=20)

adata.obsm["X_scANVI"] = scanvi_model.get_latent_representation()
adata.obs["predicted_label"] = scanvi_model.predict()

Clustering

Leiden (preferred over Louvain)

Why Leiden over Louvain:
  Louvain can produce badly connected communities (internal disconnections).
  Leiden guarantees well-connected communities.
  Louvain is no longer maintained. Leiden is the default in Seurat and scanpy.

Seurat

obj <- FindNeighbors(obj, reduction = "harmony", dims = 1:30)
obj <- FindClusters(obj, algorithm = 4, resolution = 0.5)
# algorithm = 4 is Leiden. algorithm = 1 is Louvain (legacy).
# resolution: 0.3-0.5 = coarse, 0.8-1.0 = fine, 1.2+ = very fine

scanpy

sc.tl.leiden(adata, resolution=0.5)
# Requires leidenalg package
# sc.tl.louvain() is deprecated as of scanpy 1.12

Resolution Selection

Picking resolution is one of the hardest choices in scRNA-seq analysis.
Too low: biologically distinct cell types merged.
Too high: single cell types split into meaningless subclusters.

Approaches:
  1. Clustree (R): visualize cluster stability across resolutions 0.1-1.5
     library(clustree)
     clustree(obj, prefix = "RNA_snn_res.")
     Pick the resolution where clusters stabilize (no excessive splitting).

  2. Known biology: if you expect ~9 cell types (PBMCs), resolution 0.4-0.6
     gives ~8-12 clusters. If you expect ~20 (complex tissue), go higher.

  3. Marker gene validation: run FindAllMarkers at each resolution.
     Clusters that lack distinct markers at high resolution were over-split.

  4. Silhouette scores: compute per-cluster silhouette on PCA embedding.
     Low silhouette = poorly separated cluster = likely over-split.

# Clustree: test multiple resolutions
library(clustree)
obj <- FindClusters(obj, algorithm = 4, resolution = seq(0.1, 1.5, 0.1))
clustree(obj, prefix = "RNA_snn_res.")
# Pick the highest resolution with stable cluster assignments

Cell Type Annotation

Decision Tree

Which annotation method?

  Have a reference atlas for the same tissue?
    -> CellTypist (Python): pre-trained models for immune, lung, gut, brain, etc.
    -> scANVI: transfer labels from reference during integration.

  Working in Seurat (R) with known tissue type?
    -> scType: marker-gene database, scores clusters directly.
    -> Fast, customizable, works on Seurat scale.data.

  No reference, no database matches your tissue?
    -> Manual annotation with FindAllMarkers + known canonical markers.
    -> Still the gold standard for novel tissues or rare cell types.

  Want a quick first pass?
    -> CellTypist with majority_voting=True. Then validate with markers.

CellTypist (Python)

import celltypist
from celltypist import models

# Download models (run once)
models.download_models(force_update=True)

# Input must be log-normalized to 10,000 counts per cell
# If adata.X is already log-normalized, proceed directly
predictions = celltypist.annotate(
  adata,
  model="Immune_All_Low.pkl",   # 90 immune subtypes across 20 tissues
  majority_voting=True           # refines via local subcluster consensus
)

adata = predictions.to_adata()
# Adds: adata.obs["predicted_labels"], adata.obs["over_clustering"],
#   adata.obs["majority_voting"], adata.obs["conf_score"]

scType (R)

# Source scType functions from GitHub
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")

# Prepare marker gene database
db <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx"
gs_list <- gene_sets_prepare(db, "Immune system")

# Score each cell against all cell types
es <- sctype_score(
  scRNAseqData = obj[["RNA"]]$scale.data,  # Seurat v5 layer access
  gs = gs_list$gs_positive,
  gs2 = gs_list$gs_negative
)

# Assign cell types to clusters
cl_results <- do.call("rbind", lapply(unique(obj$seurat_clusters), function(cl) {
  es_cl <- sort(rowSums(es[, rownames(obj@meta.data[obj$seurat_clusters == cl, ])]),
    decreasing = TRUE)
  data.frame(cluster = cl, type = names(es_cl[1]), score = es_cl[1])
}))

obj$sctype_annotation <- cl_results$type[match(obj$seurat_clusters, cl_results$cluster)]

Manual Annotation with Markers

# Find markers for each cluster
markers <- FindAllMarkers(obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.5)

# Top 5 markers per cluster
top5 <- markers |> group_by(cluster) |> top_n(5, wt = avg_log2FC)

# Known PBMC markers for reference:
#   CD4+ T: IL7R, CCR7 (naive), S100A4 (memory)
#   CD8+ T: CD8A, CD8B
#   B cells: MS4A1 (CD20), CD79A
#   NK: GNLY, NKG7, KLRD1
#   CD14 Mono: CD14, LYZ
#   FCGR3A Mono: FCGR3A, MS4A7
#   DC: FCER1A, CST3
#   Platelets: PPBP

# Dot plot of markers across clusters
DotPlot(obj, features = c("IL7R", "CD8A", "MS4A1", "GNLY", "CD14",
  "FCGR3A", "FCER1A", "PPBP"), group.by = "seurat_clusters") +
  RotatedAxis()

Differential Expression (Pseudobulk)

Why pseudobulk and not Wilcoxon/t-test on single cells:
  Cells from the same donor are not independent.
  Cell-level tests treat 5000 cells from 3 donors as n=5000, not n=3.
  This inflates significance: thousands of false positives.
  Squair et al. 2021, Zimmerman et al. 2021 both showed pseudobulk methods
  outperform cell-level tests for cross-condition comparisons.

  When cell-level tests are OK:
    Marker gene discovery within a single sample (no cross-sample comparison).
    FindAllMarkers for cluster markers is fine — it's within-sample.

  When pseudobulk is required:
    Any comparison across conditions, treatments, or donors.

Pseudobulk with DESeq2 (R)

library(DESeq2)

# Aggregate raw counts: sum per gene, grouped by sample + cell_type
# Seurat v5: JoinLayers first, then extract counts
obj <- JoinLayers(obj)
counts <- obj[["RNA"]]$counts

pb <- aggregate(t(as.matrix(counts)),
  by = list(sample = obj$sample_id, cell_type = obj$cell_type),
  FUN = sum)
# Reshape to genes x pseudobulk_samples matrix
# Or use scuttle::aggregateAcrossCells (cleaner)

# With scuttle (recommended)
library(scuttle)
sce <- as.SingleCellExperiment(obj)
pb_sce <- aggregateAcrossCells(sce,
  ids = DataFrame(sample = sce$sample_id, cell_type = sce$cell_type))

# Run DESeq2 on one cell type
ct <- "CD8+ T"
pb_ct <- pb_sce[, pb_sce$cell_type == ct]

dds <- DESeqDataSetFromMatrix(
  countData = counts(pb_ct),
  colData = colData(pb_ct),
  design = ~ condition  # e.g., treatment vs control
)
dds <- DESeq(dds)
res <- results(dds, alpha = 0.05)

Pseudobulk with decoupleR (Python)

import decoupler as dc

# Aggregate counts by sample + cell_type
pdata = dc.get_pseudobulk(
    adata, sample_col="sample_id", groups_col="cell_type",
    layer="counts", min_cells=10, min_counts=1000
)

# Run DESeq2 via decoupler's wrapper
dc.deseq2(pdata, design="~condition")
# Results in pdata.uns with log2FC, padj per gene per cell type

Trajectory Inference

Method Selection

Which trajectory method?

  Want cluster-level connectivity (which cell types connect)?
    -> PAGA (scanpy, fast, no assumptions about topology)
    -> Run first to understand the coarse structure.

  Want continuous pseudotime along a lineage?
    -> Diffusion pseudotime (scanpy sc.tl.dpt) — needs a root cell
    -> Monocle3 order_cells() — learns a graph, picks root interactively

  Want RNA velocity (transcriptional dynamics)?
    -> scVelo dynamical mode — requires spliced/unspliced counts from velocyto/STARsolo
    -> Caveat: assumes active transcription. Fails for stable/mature populations.
    -> Validate with CellRank 2 for fate probabilities.

  Branching trajectories?
    -> Monocle3: learns principal graph with automatic branch point detection.
    -> PAGA + DPT: use PAGA for topology, DPT for pseudotime per branch.

PAGA + Diffusion Pseudotime (scanpy)

# PAGA: cluster-level trajectory graph
sc.tl.paga(adata, groups="leiden")
sc.pl.paga(adata, threshold=0.03, show=True)
# Edges = statistically supported connections between clusters
# threshold: remove weak edges (default 0.03)

# Initialize UMAP with PAGA for better trajectory layout
sc.tl.umap(adata, init_pos="paga")

# Diffusion pseudotime (requires a root cell)
adata.uns["iroot"] = int(np.where(adata.obs["cell_type"] == "HSC")[0][0])
sc.tl.diffmap(adata)
sc.tl.dpt(adata)
sc.pl.umap(adata, color=["dpt_pseudotime", "cell_type"])

Monocle3 (R)

library(monocle3)

# Convert from Seurat
cds <- as.cell_data_set(obj)  # requires SeuratWrappers
cds <- preprocess_cds(cds, num_dim = 30)
cds <- reduce_dimension(cds)
cds <- cluster_cells(cds)

# Learn trajectory graph
cds <- learn_graph(cds)

# Order cells (select root interactively or programmatically)
cds <- order_cells(cds, root_cells = root_cell_ids)
# root_cell_ids: barcodes of cells at the trajectory start (e.g., stem cells)

plot_cells(cds, color_cells_by = "pseudotime", cell_size = 0.5)

# Find genes that change along the trajectory
trajectory_genes <- graph_test(cds, neighbor_graph = "principal_graph")
sig_genes <- trajectory_genes[trajectory_genes$q_value < 0.05, ]

RNA Velocity (scVelo)

import scvelo as scv  # v0.3+

# Requires spliced/unspliced counts (from velocyto or STARsolo)
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

# Dynamical mode — fits full kinetic ODE per gene
scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode="dynamical")
scv.tl.velocity_graph(adata)

scv.pl.velocity_embedding_stream(adata, basis="umap", color="cell_type")

# Velocity confidence: low confidence = unreliable direction
scv.tl.velocity_confidence(adata)
# adata.obs["velocity_confidence"] — filter genes/cells with low values

RNA velocity caveats:
  Requires active transcription dynamics (splicing timescale ~ hours).
  Fails for: PBMCs (mature, stable), slow diseases, steady-state populations.
  Contradictory results reported on some datasets (opposite directions).
  Always validate with known biology before interpreting velocity arrows.
  Consider CellRank 2 for fate analysis — it works with or without velocity.

Cell-Cell Communication

CellChat v2 (R)

library(CellChat)  # v2.1+, from jinworks/CellChat

# Create CellChat object from Seurat
cellchat <- createCellChat(obj, group.by = "cell_type")

# Set ligand-receptor database
CellChatDB <- CellChatDB.human  # or CellChatDB.mouse
cellchat@DB <- CellChatDB

# Inference pipeline
cellchat <- subsetData(cellchat)
cellchat <- identifyOverExpressedGenes(cellchat)
cellchat <- identifyOverExpressedInteractions(cellchat)
cellchat <- computeCommunProb(cellchat)
cellchat <- computeCommunProbPathway(cellchat)
cellchat <- aggregateNet(cellchat)

# Visualization
netVisual_aggregate(cellchat, signaling = "CXCL")
netVisual_bubble(cellchat, sources.use = c("CD8+ T"), targets.use = c("Macrophage"))
netAnalysis_computeCentrality(cellchat, slot.name = "netP")

LIANA+ (Python — consensus across methods)

import liana as li

# LIANA wraps 7 methods and computes consensus rankings
li.mt.rank_aggregate(
    adata,
    groupby="cell_type",
    resource_name="consensus",  # OmniPath consensus resource
    verbose=True
)

# Results in adata.uns["liana_res"]
liana_res = adata.uns["liana_res"]
# Columns: source, target, ligand_complex, receptor_complex,
#   magnitude_rank, specificity_rank (consensus scores)

# Filter top interactions
top = liana_res[(liana_res["magnitude_rank"] < 0.01)]

# Visualization
li.pl.dotplot(adata, colour="magnitude_rank", size="specificity_rank",
  source_labels=["CD8+ T"], target_labels=["Macrophage"])

CellChat vs LIANA:
  CellChat: single method, rich curated database (1000+ interactions), R only,
    built-in spatial support (v2), best visualization.
  LIANA: consensus across 7 methods, flexible databases (OmniPath, CellPhoneDB, etc.),
    Python + R, higher confidence (cross-method agreement).
  For publication: run both and report concordant interactions.

Transcription Factor Activity

For TF activity scoring, decoupleR + CollecTRI is now preferred over pySCENIC
for most use cases. It is faster, less prone to overfitting (uses pre-defined
regulons rather than inferring from the same data), and benchmarks favorably.

Use pySCENIC only when you need to discover novel TF-target relationships
specific to your dataset.

decoupleR (Python — recommended for TF activity)

import decoupler as dc

# Get CollecTRI regulons (curated TF-target interactions)
net = dc.get_collectri(organism="human", split_complexes=False)

# Score TF activities per cell
dc.run_ulm(adata, net=net, source="source", target="target", weight="weight")
# Results in adata.obsm["ulm_estimate"] (activity scores)
# and adata.obsm["ulm_pvals"]

# Visualize TF activities on UMAP
acts = dc.get_acts(adata, obsm_key="ulm_estimate")
sc.pl.umap(acts, color=["PAX5", "SPI1", "GATA1"], vcenter=0, cmap="RdBu_r")
# PAX5 high in B cells, SPI1 in myeloid, GATA1 in erythroid

decoupleR (R)

library(decoupleR)

net <- get_collectri(organism = "human", split_complexes = FALSE)
# Run ULM (univariate linear model) on the expression matrix
acts <- run_ulm(mat = obj[["RNA"]]$data, net = net, .source = "source",
  .target = "target", .mor = "mor", minsize = 5)

pySCENIC (for novel GRN discovery)

# Only use when discovering dataset-specific regulatory networks
# Requires: pyscenic, arboreto, ctxcore

from pyscenic.utils import modules_from_adjacencies
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell
from arboreto.algo import grnboost2

# Step 1: infer co-expression modules (slow — hours for large datasets)
adjacencies = grnboost2(expression_data, tf_names=tf_list, seed=42)

# Step 2: prune with cis-regulatory motif analysis
modules = modules_from_adjacencies(adjacencies, expression_data)
# Requires ranking databases (hg38 feather files from cisTarget)
df = prune2df(dbs, modules, "motifs-v10-nr.hgnc-m0.001-o0.0.tbl")
regulons = df2regulons(df)

# Step 3: score regulon activity per cell
auc_mtx = aucell(expression_data, regulons)

Output Specification

Output	Format	Description
`qc_metrics.csv`	CSV	Per-cell QC: total counts, genes detected, MT%, outlier flags
`doublet_scores.csv`	CSV	Doublet probability per cell, singlet/doublet classification
`filtered_object.rds`	RDS	Seurat object after QC, doublet removal, normalization
`filtered_adata.h5ad`	H5AD	AnnData after QC, doublet removal, normalization
`hvg_list.csv`	CSV	Selected highly variable genes
`integrated_object.rds`	RDS	Seurat object after batch integration
`integrated_adata.h5ad`	H5AD	AnnData after integration with latent embedding
`cluster_markers.csv`	CSV	Top markers per cluster from FindAllMarkers
`cell_annotations.csv`	CSV	Per-cell type assignments (automated + manual)
`umap_plot.pdf`	PDF	UMAP colored by cluster, batch, cell type
`pseudobulk_de.csv`	CSV	DESeq2 results per cell type (log2FC, padj)
`trajectory_pseudotime.csv`	CSV	Per-cell pseudotime values
`velocity_embedding.pdf`	PDF	RNA velocity stream plot on UMAP
`cellchat_network.rds`	RDS	CellChat object with communication probabilities
`liana_results.csv`	CSV	Consensus L-R interaction rankings
`tf_activities.csv`	CSV	decoupleR TF activity scores per cell/cluster
`qc_violin.pdf`	PDF	Violin plots of nCount, nFeature, percent.mt

Validation Checks

After preprocessing PBMC 3k (10x reference dataset), verify:

Cell counts:
  Starting cells: ~2,700
  After QC (MAD 3): ~2,500-2,650 (lose ~2-7%)
  After doublet removal: ~2,400-2,600 (lose ~2-5%)

QC distributions:
  Median genes/cell: 800-1200
  Median UMI/cell: 2000-4000
  Median MT%: 2-5%
  Cells with > 10% MT: < 5% of total (PBMCs have low MT content)

Normalization:
  After log-normalization, expression values should be 0-8 range (log scale)
  SCTransform residuals centered near 0 with SD ~1

Feature selection:
  2000 HVGs should include known markers: IL7R, CD14, MS4A1, CD8A, GNLY, NKG7
  Housekeeping genes (ACTB, GAPDH) should NOT be in top HVGs

Doublet detection:
  Expected doublets for 3k cells: 20-60 (~1-2%)
  If > 10% flagged: check threshold or consider the data has quality issues

Integration (multi-sample):
  After Harmony/scVI, UMAP should show batches interleaved, not separated
  Cell types should remain distinct (batch correction ≠ biological erasure)
  If all cell types collapse into one blob: over-correction (reduce theta or lambda)

Clustering (PBMC 3k at resolution 0.5):
  Expect 8-12 clusters
  CD4 T, CD8 T, B cells, NK, CD14 Mono, FCGR3A Mono, DC, Platelets should separate
  If only 3-4 clusters: resolution too low
  If >20 clusters: resolution too high, check with clustree

Annotation:
  CellTypist with Immune_All_Low on PBMCs should identify all major types
  scType with "Immune system" database should score CD14 Mono, B cells, NK correctly
  Manual markers: IL7R for CD4 T, CD8A for CD8 T, MS4A1 for B, CD14 for Mono, GNLY for NK

Common Pitfalls

QC

Fixed MT% threshold on all datasets: A 5% cutoff works for PBMCs but removes real cardiomyocytes (30-40% MT) or hepatocytes (15-20% MT). Use MAD-based filtering or at least check the distribution before setting a threshold.
Not filtering genes: Removing cells but keeping genes detected in 0-2 cells. These contribute noise to normalization and feature selection. Filter with min.cells = 3 or sc.pp.filter_genes(adata, min_cells=3).
QC on merged data without batch awareness: MAD thresholds should ideally be computed per sample/batch. A sample with lower quality shifts the median and hides outliers in better samples. Use scater::isOutlier(..., batch = sample_id) or compute MAD per sample.

Doublets

Doublet detection on merged multi-sample data: Droplet-based doublets only form within a sample. Running scDblFinder or Scrublet on merged data creates false doublets between cell types that were never in the same droplet.
Trusting doublet scores blindly for rare populations: Rare cell types with intermediate profiles can be flagged as doublets. Cross-check flagged cells against known markers before removal.

Normalization

Using TPM/FPKM normalization for scRNA-seq: Single-cell UMI counts are not length-biased. Standard bulk normalization (TPM, FPKM) is wrong for UMI data. Use library-size normalization or SCTransform.
Running SCTransform without glmGamPoi: SCTransform v2 requires glmGamPoi for the fast GLM fitting. Without it, the function falls back to a much slower implementation. Install it from Bioconductor.
ScaleData on all genes: ScaleData() only needs to run on variable features for PCA. Scaling all genes wastes memory and time. ScaleData(obj, features = VariableFeatures(obj)).

Feature Selection

Running HVG selection on normalized data with seurat_v3 flavor: The seurat_v3 flavor in scanpy expects raw counts, not log-normalized data. Using the wrong flavor silently produces bad HVG rankings.

Integration

Integrating before QC and doublet removal: Batch integration amplifies artifacts from low-quality cells and doublets. Always QC and remove doublets per sample first, then merge and integrate.
Over-correction erasing biology: If cell types that should be distinct merge after integration, the method is too aggressive. Reduce Harmony theta, increase scVI n_latent, or try a more conservative method (RPCA).
Not joining layers before DE in Seurat v5: After IntegrateLayers(), each sample's counts are still in separate layers. Call JoinLayers() before FindAllMarkers() or DE analysis will fail or use only one sample's data.

Clustering and Annotation

Clustering on UMAP coordinates: UMAP distorts distances and densities. Always cluster on PCA or integrated embeddings (Harmony, scVI latent space), never on UMAP dimensions.
Single resolution without validation: Picking resolution = 0.5 without checking is arbitrary. Use clustree, silhouette scores, or marker gene validation to support the choice.
Automated annotation without marker validation: CellTypist and scType provide a starting point, not ground truth. Always check top markers per cluster against known biology for the tissue. Automated tools can mislabel rare or novel cell types.

Differential Expression

Wilcoxon/t-test for cross-condition DE: Treating cells as independent replicates inflates significance massively. Use pseudobulk (aggregate by sample + cell type, then DESeq2). Cell-level tests are only appropriate for marker discovery within a single sample.
Pseudobulk on normalized data: Aggregate raw counts, not normalized values. DESeq2 needs integers and handles normalization internally.

Trajectory and Communication

RNA velocity on stable populations: Velocity requires active transcription dynamics. PBMCs, terminally differentiated cells, and chronic disease states produce meaningless velocity arrows. Check velocity confidence scores and validate against known biology.
Over-interpreting cell communication rankings: Both CellChat and LIANA rank L-R pairs by statistical enrichment, not by biological activity. High-ranked interactions are hypotheses for validation, not confirmed communication events.
pySCENIC for TF activity scoring: pySCENIC infers GRNs from the same data it scores, creating overfitting risk. For TF activity scoring alone (not GRN discovery), decoupleR with CollecTRI regulons is faster and more reproducible.

Related Skills

cancer-multiomics: Bulk RNA-seq analysis, TCGA data retrieval
immune-deconvolution: Bulk-level immune estimation (compare with single-cell ground truth)

Public Datasets for Testing

Dataset	Cells	Use Case
PBMC 3k (10x)	2,700	Standard tutorial dataset, 9 cell types, fast processing
PBMC 10k (10x)	10,000	Larger PBMC, tests scalability
Tabula Muris	100k+	Multi-organ mouse atlas, tests batch integration
Human Cell Atlas (HCA)	varies	Reference atlases for annotation

ナビゲーション

Skillsとは？

リンク

Single-Cell Atlas Construction

Single-Cell Atlas Construction

When to Use This Skill

Inputs

Loading Data

Seurat v5 (R)

scanpy (Python)

Quality Control

Computing QC Metrics

Seurat

scanpy

MAD-Based Filtering

Seurat (manual MAD implementation)

scanpy (MAD filtering)

Bioconductor (scater — dedicated function)

Gene Filtering

Doublet Detection

scDblFinder (R — top performer in benchmarks)

Scrublet (Python — integrated in scanpy)

Normalization

Decision Tree

SCTransform v2 (Seurat)

Log-Normalization (Seurat)

scanpy Normalization

scran Deconvolution (Bioconductor)

Feature Selection

Seurat (VST)

scanpy

Deviance-Based (scry — more principled alternative)

Dimensionality Reduction

PCA

Seurat

scanpy

UMAP

Seurat

scanpy

Batch Integration

Method Selection

Harmony (R)

Harmony (scanpy)

Seurat v5 IntegrateLayers

scVI (Python)

scANVI (semi-supervised, when partial labels available)

Clustering

Leiden (preferred over Louvain)

Seurat

scanpy

Resolution Selection

Cell Type Annotation

Decision Tree

CellTypist (Python)

scType (R)

Manual Annotation with Markers

Differential Expression (Pseudobulk)

Pseudobulk with DESeq2 (R)

Pseudobulk with decoupleR (Python)

Trajectory Inference

Method Selection

PAGA + Diffusion Pseudotime (scanpy)

Monocle3 (R)

RNA Velocity (scVelo)

Cell-Cell Communication

CellChat v2 (R)

LIANA+ (Python — consensus across methods)

Transcription Factor Activity

decoupleR (Python — recommended for TF activity)

decoupleR (R)

pySCENIC (for novel GRN discovery)

Output Specification

Validation Checks

Common Pitfalls

QC

Doublets

Normalization

Feature Selection

Integration

Clustering and Annotation