name: celltypeannotation
description: Annotates cell clusters with biological cell type labels using multiple methods: direct assignment, ScType, scCATCH, hitype, or CellTypist. This process is essential for interpreting clustering results by assigning meaningful biological identities to each cluster.
CellTypeAnnotation Process Configuration
Purpose
Annotates cell clusters with biological cell type labels using multiple methods: direct assignment, ScType, scCATCH, hitype, or CellTypist. This process is essential for interpreting clustering results by assigning meaningful biological identities to each cluster.
When to Use
- After clustering: When you have cluster assignments but need biological cell type labels
- Automated annotation: When manual annotation is too time-consuming or subjective
- Consistent nomenclature: When you need standardized cell type names across multiple samples
- Reference-based annotation: When you have well-characterized reference datasets or marker databases
- Cross-sample comparison: When analyzing multiple samples with the same cell type definitions
- Alternative to SeuratMap2Ref: When you prefer database-based annotation over reference dataset mapping
Configuration Structure
Process Enablement
[CellTypeAnnotation]
cache = true # Cache results for faster re-runs
Input Specification
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"] # Path or reference to Seurat object
Environment Variables
Core Parameters
[CellTypeAnnotation.envs]
# Annotation method selection
tool = "direct" # Options: "direct", "sctype", "hitype", "sccatch", "celltypist"
# Cluster identity column (required for h5ad input, optional for Seurat objects)
ident = "seurat_clusters" # Column name in metadata representing clusters
# Backup column name (stores original cluster labels)
backup_col = "seurat_clusters_id" # Default: "seurat_clusters_id"
# New column name for annotated cell types
# If specified, original identity is kept; otherwise, it's replaced
newcol = "" # Default: empty (overwrite identity)
# Merge clusters with same predicted cell types
merge = false # Default: false; suffixes (.1, .2) added for duplicate labels
# Output file type
outtype = "input" # Options: "input", "rds", "qs", "qs2", "h5ad"
Direct Annotation Parameters
[CellTypeAnnotation.envs]
tool = "direct"
# Cell type assignments (one per cluster, in order)
# Use "-" or "" to keep original cluster name
# Use "NA" to remove cluster from downstream analysis (only without newcol)
cell_types = ["CD4+ T cells", "CD8+ T cells", "-", "B cells"] # Default: []
# Additional annotations (multiple cell type columns)
# Written as a sub-table because a TOML inline table must stay on one line
[CellTypeAnnotation.envs.more_cell_types] # Table: new_column = [cell_types]
cell_type_broad = ["T cells", "T cells", "NK cells", "B cells"]
cell_type_detailed = ["CD4+ naive", "CD8+ effector", "NK", "B naive"]
ScType Annotation Parameters
[CellTypeAnnotation.envs]
tool = "sctype"
# Tissue type (must match tissueType column in database)
sctype_tissue = "Immune system" # Required for sctype
# Database file path (Excel format compatible with ScType)
sctype_db = "/path/to/ScTypeDB_full.xlsx" # Optional: uses default if not specified
hitype Annotation Parameters
[CellTypeAnnotation.envs]
tool = "hitype"
# Tissue type (must match tissueType column in database)
hitype_tissue = "Immune system" # Required for hitype
# Database file path or built-in database name
# Built-in options: "hitypedb_short", "hitypedb_full", "hitypedb_pbmc3k"
hitype_db = "hitypedb_full" # Default: built-in database
scCATCH Annotation Parameters
[CellTypeAnnotation.envs]
tool = "sccatch"
[CellTypeAnnotation.envs.sccatch_args]
# Species (Human or Mouse)
species = "Human" # Required
# Tissue origin
tissue = "Blood" # Required
# Cancer type (if cancer tissue)
cancer = "Normal" # Default: "Normal"
# Custom marker genes (RDS file or list)
marker = "" # Optional
# Use custom marker instead of database
if_use_custom_marker = false # Default: false
# Additional scCATCH::findmarkergene() arguments
# See: https://rdrr.io/cran/scCATCH/man/findmarkergene.html
CellTypist Annotation Parameters
[CellTypeAnnotation.envs]
tool = "celltypist"
[CellTypeAnnotation.envs.celltypist_args]
# Model file path (download from https://celltypist.cog.sanger.ac.uk/models/models.json)
model = "Immune_All_Low.pkl" # Required
# Python interpreter where celltypist is installed
python = "python" # Default: "python"
# Majority voting refinement for local subclusters
majority_voting = true # Default: true
# Over-clustering column (for majority voting)
# Set to false to disable over-clustering
over_clustering = "seurat_clusters" # Auto: identity for Seurat, false for h5ad
# Assay for Seurat-to-AnnData conversion
assay = "" # Auto: RNA for h5seurat, default assay for Seurat
Annotation Methods
1. Direct Annotation
Assigns cell types manually to each cluster. Best when you have well-defined marker genes or want complete control over annotations.
Pros:
- Full control over annotations
- Fast and deterministic
- Works with any clustering result
Cons:
- Requires domain knowledge
- Time-consuming for many clusters
- Subjective
Use cases:
- Small number of well-separated clusters
- Known marker genes
- Reproducible annotation needed
2. ScType
Uses pre-defined cell type markers from ScType database. Annotates based on enrichment of known marker genes in each cluster.
Databases:
- ScTypeDB_short.xlsx: Compact database (~70 cell types)
- ScTypeDB_full.xlsx: Full database (~200+ cell types)
- Custom database: Provide your own Excel file
Pros:
- Automated annotation
- Tissue-specific filtering available
- Well-curated marker database
Cons:
- Limited to predefined cell types
- Requires tissue specification
- May miss rare cell types
Reference: https://github.com/IanevskiAleksandr/sc-type
Use cases:
- Immune tissue datasets
- When tissue type is well-defined
- Need for comprehensive annotation
3. hitype
Flexible annotation tool compatible with ScType database format. Supports both file-based and built-in databases.
Built-in databases:
- hitypedb_short: Compact marker set
- hitypedb_full: Comprehensive marker set
- hitypedb_pbmc3k: PBMC-specific markers (from 10X PBMC3k dataset)
Pros:
- Faster than ScType
- Multiple built-in databases
- Tissue-specific filtering
Cons:
- Limited to database cell types
- Requires tissue specification
Reference: https://github.com/pwwang/hitype
Use cases:
- PBMC datasets (use hitypedb_pbmc3k)
- General immune annotation
- When speed matters
4. scCATCH
Identifies cell types by matching cluster marker genes to cell type-specific marker database.
Workflow:
- Finds marker genes for each cluster
- Matches markers to cell type database
- Assigns best matching cell type
Parameters:
- species: Human or Mouse
- tissue: Tissue origin (required)
- cancer: Cancer type (if applicable)
Pros:
- Automated marker identification
- Species-specific databases
- Cancer type support
Cons:
- Requires tissue specification
- Slower (finds markers first)
- Limited database
Reference: https://github.com/ZJUFanLab/scCATCH
Use cases:
- When you want marker discovery + annotation
- Cancer tissue datasets
- Species-specific annotation
5. CellTypist
Machine learning-based annotation using pre-trained models. Requires Python environment and celltypist2 package.
Models:
- Download from: https://celltypist.cog.sanger.ac.uk/models/models.json
- Common models: Immune_All_Low.pkl, Immune_All_High.pkl, Tissue-specific models
Key features:
- majority_voting: Refines annotations within local subclusters
- over_clustering: Over-cluster first, then merge by majority vote
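To make the idea behind majority_voting concrete, the sketch below relabels every cell with the most common predicted label in its over-cluster. It is a simplified, pure-Python illustration and not CellTypist's own implementation; the function name majority_vote and the toy inputs are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(predicted, over_clusters):
    """Relabel each cell with the most common predicted label in its over-cluster.

    predicted:     per-cell labels from the classifier
    over_clusters: per-cell over-cluster ids (same length as predicted)
    """
    if len(predicted) != len(over_clusters):
        raise ValueError("predicted and over_clusters must have the same length")

    # Count predicted labels inside each over-cluster
    counts = defaultdict(Counter)
    for label, cluster in zip(predicted, over_clusters):
        counts[cluster][label] += 1

    # Pick the winning label per over-cluster, then broadcast it to the cells
    winners = {c: counter.most_common(1)[0][0] for c, counter in counts.items()}
    return [winners[c] for c in over_clusters]


# Toy example: six cells in two over-clusters
predicted = ["T", "T", "B", "B", "B", "T"]
over_clusters = ["0", "0", "0", "1", "1", "1"]
refined = majority_vote(predicted, over_clusters)
print(refined)
```

Isolated misclassifications inside an otherwise homogeneous over-cluster are smoothed out, which is why the choice of over_clustering column matters.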
Pros:
- State-of-the-art ML models
- Handles complex datasets well
- Majority voting improves accuracy
Cons:
- Requires Python environment
- Model files need download
- Longer runtime with majority voting
Reference: https://celltypist.org/
Use cases:
- Large complex datasets
- When ScType/hitype annotation is insufficient
- High-throughput annotation
Configuration Examples
Example 1: Minimal Configuration (No Annotation)
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
Result: Tool defaults to "direct" with empty cell_types. Original cluster names are preserved.
Example 2: Direct Annotation for T Cell Subsets
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ naive", "CD4+ memory", "CD8+ naive", "CD8+ effector", "-", "Regulatory T"]
Result: Clusters 0-3 and 5 get specified labels. Cluster 4 keeps original name (placeholder "-").
Example 3: ScType for Immune Tissue
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
sctype_db = "/data/databases/ScTypeDB_full.xlsx"
merge = true # Merge clusters with same annotation
Result: Uses full ScType database for immune tissue. Merges clusters with identical annotations.
Example 4: hitype with Built-in PBMC Database
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "hitype"
hitype_tissue = "Blood"
hitype_db = "hitypedb_pbmc3k" # Built-in PBMC database
merge = true
Result: Fast PBMC annotation using built-in database optimized for 10X PBMC data.
Example 5: scCATCH for Cancer Tissue
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "sccatch"
[CellTypeAnnotation.envs.sccatch_args]
species = "Human"
tissue = "Lung"
cancer = "Lung adenocarcinoma"
Result: Annotates lung adenocarcinoma dataset with cancer-specific cell types.
Example 6: CellTypist with Majority Voting
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "celltypist"
[CellTypeAnnotation.envs.celltypist_args]
model = "/data/models/Immune_All_Low.pkl"
majority_voting = true
over_clustering = "seurat_clusters" # Use clusters for majority voting
python = "/usr/bin/python3" # Specify Python interpreter
Result: Uses ML model with majority voting refinement for robust annotation.
Example 7: Multiple Annotation Methods (Keep Original)
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
newcol = "celltype_sctype" # Create new column, keep original
Result: Annotated cell types saved in celltype_sctype column. Original seurat_clusters unchanged.
Example 8: Multiple Annotation Columns
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ T", "CD8+ T", "NK", "B", "Monocyte"]
[CellTypeAnnotation.envs.more_cell_types]
celltype_broad = ["T cells", "T cells", "NK cells", "B cells", "Monocytes"]
celltype_subset = ["CD4+ naive", "CD8+ effector", "NK", "B naive", "CD14+ Mono"]
Result: cell_types replaces the cluster identity (newcol is not set), and two additional metadata columns are created: celltype_broad and celltype_subset.
Example 9: Exclude Clusters with NA
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ T", "CD8+ T", "NA", "B cells"]
Result: Cluster 2 is removed from downstream analysis (NA excludes cluster). Note: Only works without newcol.
Example 10: H5AD Input with CellTypist
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["seurat_clustering.h5ad"] # H5AD file
[CellTypeAnnotation.envs]
tool = "celltypist"
ident = "clusters" # Required for H5AD: cluster column name
[CellTypeAnnotation.envs.celltypist_args]
model = "Immune_All_Low.pkl"
majority_voting = true
Result: Annotates H5AD file. ident specifies which metadata column contains clusters.
Common Patterns
Pattern 1: Standard T Cell Annotation Workflow
# Step 1: Cluster T cells
[SeuratClusteringOfAllCells]
[TOrBCellSelection]
[SeuratClustering] # Clustering on T cells only
# Step 2: Annotate T cell subsets
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["Naive CD4+", "Memory CD4+", "Effector CD8+", "Tregs", "Progenitor"]
Pattern 2: Automated Immune Annotation with Backup
# Use hitype for annotation, keep original clusters
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "hitype"
hitype_tissue = "Blood"
hitype_db = "hitypedb_pbmc3k"
newcol = "celltype_hitype" # Keep original seurat_clusters
merge = true
Pattern 3: Combine Multiple Annotation Methods
# First annotation: ScType
[CellTypeAnnotation]
[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
newcol = "celltype_sctype"
# Second annotation: CellTypist for comparison
[CellTypeAnnotation2]
# Note: Must define separate process for second annotation
# See immunopipe-config.md for multi-process setup
Pattern 4: Refine Annotation with CellTypist
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "celltypist"
[CellTypeAnnotation.envs.celltypist_args]
model = "Immune_All_Low.pkl"
majority_voting = true
over_clustering = "seurat_clusters" # Use clustering result
python = "python"
Pattern 5: Tissue-Specific ScType Annotation
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Brain" # Brain-specific annotation
sctype_db = "/data/brain_markers.xlsx" # Custom brain marker database
merge = true
Dependencies
Upstream Processes
- Required:
- Required: SeuratClustering (or process that produces Seurat object with clusters)
- Optional: SeuratClusteringOfAllCells (if using T/B cell selection)
- Optional: SeuratMap2Ref (can combine multiple annotation methods)
- Optional: TOrBCellSelection (T/B-specific annotation)
Downstream Processes
- SeuratClusterStats: Uses annotated cell types for visualization
- ClusterMarkers: Finds markers for each cell type
- TopExpressingGenes: Top genes per cell type
- MarkersFinder: Flexible marker finding by cell type
- CellCellCommunication: Uses cell types for ligand-receptor analysis
- ScFGSEA: GSEA by cell type
- PseudoBulkDEG: DE analysis by cell type
- ScrnaMetabolicLandscape: Metabolic analysis by cell type
- ScRepCombiningExpression: Integrates with TCR/BCR data
External Dependencies
- ScType: Requires sctype R package
- hitype: Requires hitype R package
- scCATCH: Requires scCATCH R package
- CellTypist: Requires celltypist2 Python package and Python interpreter
Validation Rules
Tool-Specific Validation
- ScType:
  - sctype_tissue must be specified (or empty string to use all tissues)
  - sctype_db must be a valid Excel file path (or empty for default)
  - Database must contain tissueType, cellType, and gene_short columns
- hitype:
  - hitype_tissue must be specified (or empty string to use all tissues)
  - hitype_db must be valid file path or built-in name
  - Built-in names: hitypedb_short, hitypedb_full, hitypedb_pbmc3k
- scCATCH:
  - species must be "Human" or "Mouse"
  - tissue must be specified
  - At least 2 clusters required (scCATCH limitation)
- CellTypist:
  - model must be a valid .pkl file path
  - python must be valid Python interpreter path
  - CellTypist must be installed in specified Python environment
- Direct:
  - cell_types list length should match number of clusters (shorter OK, longer not)
  - Placeholders "-" or "" keep original names
  - "NA" removes cluster (only without newcol)
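Some of the rules above can be checked before a run is started. The following is a minimal sketch, assuming the envs section has already been loaded into a plain dict; validate_envs is a hypothetical helper, not part of the pipeline, and it checks only the configuration values (not whether files or interpreters exist).

```python
VALID_TOOLS = {"direct", "sctype", "hitype", "sccatch", "celltypist"}


def validate_envs(envs):
    """Return a list of problems found in a CellTypeAnnotation envs mapping."""
    problems = []
    tool = envs.get("tool", "direct")  # the process defaults to "direct"
    if tool not in VALID_TOOLS:  # comparison is case-sensitive
        problems.append(f"unknown tool: {tool!r}")
        return problems

    # ScType and hitype need a tissue key; an empty string means "all tissues"
    if tool in ("sctype", "hitype") and f"{tool}_tissue" not in envs:
        problems.append(f"{tool}_tissue must be specified (use '' for all tissues)")

    if tool == "sccatch":
        args = envs.get("sccatch_args", {})
        if args.get("species") not in ("Human", "Mouse"):
            problems.append("sccatch_args.species must be 'Human' or 'Mouse'")
        if not args.get("tissue"):
            problems.append("sccatch_args.tissue must be specified")

    if tool == "celltypist":
        model = envs.get("celltypist_args", {}).get("model", "")
        if not model.endswith(".pkl"):
            problems.append("celltypist_args.model must point to a .pkl file")

    return problems


problems = validate_envs({"tool": "sccatch", "sccatch_args": {"species": "Rat"}})
print(problems)
```

Returning a list of messages instead of raising on the first error lets all configuration problems be reported in one pass.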
Input Validation
- Seurat object must have valid identity/clustering column
- H5AD input requires ident parameter (cluster column name)
- Output directory must be writable
Output Validation
- cluster2celltype.tsv generated for ScType/hitype/scCATCH/CellTypist
- Output file format matches outtype specification
- Metadata contains annotated cell types
Troubleshooting
Common Issues and Solutions
Issue: "No tissues found in database" (ScType/hitype)
Cause: sctype_tissue or hitype_tissue doesn't match tissueType column in database.
Solutions:
- Check available tissues: Open database Excel file, read tissueType column
- Use exact match (case-sensitive)
- Set tissue to empty string "" to use all rows in database
- Verify database file path is correct
Issue: "Not enough clusters for scCATCH"
Cause: scCATCH requires at least 2 clusters.
Solutions:
- Ensure clustering result has ≥2 clusters
- Increase clustering resolution in SeuratClustering
- Use alternative tool (ScType, hitype, CellTypist)
Issue: CellTypist Python not found
Cause: CellTypist requires Python environment with celltypist2 installed.
Solutions:
- Specify correct Python path: celltypist_args.python = "/usr/bin/python3"
- Install celltypist2: pip install celltypist2
- Verify Python environment: python -c "import celltypist; print(celltypist.__version__)"
Issue: CellTypist model file not found
Cause: Model path is incorrect or model not downloaded.
Solutions:
- Download model from: https://celltypist.cog.sanger.ac.uk/models/models.json
- Use absolute path for celltypist_args.model
- Verify model file exists and is readable
Issue: "Unknown tool" error
Cause: Invalid tool value specified.
Solutions:
- Check valid options: direct, sctype, hitype, sccatch, celltypist
- Verify spelling is correct (case-sensitive)
- Check tool is installed in environment
Issue: Annotations overwritten by multiple annotation processes
Cause: Multiple annotation processes write to same metadata column.
Solutions:
- Use newcol parameter to create separate columns:
[CellTypeAnnotation.envs]
newcol = "celltype_method1"
- Or use backup_col to preserve original:
backup_col = "original_clusters_id"
Issue: Ambiguous cell type assignments
Cause: Clusters have similar marker expression patterns.
Solutions:
- Increase clustering resolution for finer separation
- Use merge = false to keep cluster-specific labels
- Compare multiple annotation methods for consensus
- Manual inspection of top marker genes
Issue: Missing cell types in results
Cause: Clusters removed by "NA" placeholder or filtering.
Solutions:
- Check cell_types list for "NA" entries
- Verify newcol is not set (NA removal only works without newcol)
- Check downstream processes for filtering
Issue: H5AD input annotation fails
Cause: ident parameter not specified for H5AD files.
Solutions:
- Specify cluster column: ident = "clusters" (or your cluster column name)
- Check H5AD metadata for cluster column name
- Or convert H5AD to RDS format first
Issue: Wrong number of cell types assigned
Cause: cell_types list length doesn't match cluster count.
Solutions:
- Check number of clusters in Seurat object
- Ensure cell_types list has correct number of entries
- Use placeholders "-" or "" for clusters to keep original names
- Shorter lists OK (extra clusters keep original names)
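The length and placeholder rules for cell_types can be illustrated with a short sketch. It models only the case where newcol is not set (so "NA" drops the cluster); apply_cell_types is a hypothetical helper written for this illustration, not the pipeline's code.

```python
def apply_cell_types(clusters, cell_types):
    """Map cluster names to labels following the direct-annotation rules.

    "-" or "" keeps the original cluster name, "NA" drops the cluster,
    and clusters beyond the end of cell_types keep their original names.
    """
    if len(cell_types) > len(clusters):
        raise ValueError(
            f"{len(cell_types)} cell types given for {len(clusters)} clusters"
        )
    mapping = {}
    for i, cluster in enumerate(clusters):
        # Clusters without an entry behave as if "-" had been given
        label = cell_types[i] if i < len(cell_types) else "-"
        if label == "NA":
            continue  # cluster is excluded from the result
        mapping[cluster] = cluster if label in ("-", "") else label
    return mapping


clusters = ["0", "1", "2", "3", "4"]
mapping = apply_cell_types(clusters, ["CD4+ T", "-", "NA", "B cells"])
print(mapping)
```

Here cluster 1 keeps its name because of the "-" placeholder, cluster 2 is dropped, and cluster 4 keeps its name because the list is shorter than the number of clusters.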
Verification Steps
After annotation, verify:
- Check output file:
# View cluster to cell type mapping
cat .pipen/Immunopipe/CellTypeAnnotation/0/output/cluster2celltype.tsv
- Check Seurat object metadata:
library(Seurat)
obj <- readRDS(".pipen/Immunopipe/CellTypeAnnotation/0/output/annotated.rds")
head(obj@meta.data)
# Look for cell type column (seurat_clusters or newcol name)
- Validate annotation quality:
# Check distribution of cell types
table(Idents(obj))
# Visualize UMAP with cell types
DimPlot(obj, group.by = "celltype_hitype", label = TRUE, repel = TRUE)
- Compare multiple methods:
# Compare ScType vs hitype annotations
table(obj$celltype_sctype, obj$celltype_hitype)
Best Practices
Method Selection
- Start with hitype: Fast, good for PBMC/immune datasets
- Compare with ScType: Alternative database-based method
- Use CellTypist for complex datasets: ML-based and handles complex datasets well
- Manual refinement: Use direct annotation for corrections
Multi-Method Workflow
- Run multiple annotation methods in parallel
- Compare results for consensus
- Manually refine discrepancies using direct annotation
- Keep original cluster names for traceability
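The "compare results for consensus" step can be quantified once two label columns have been exported per cell. The sketch below is a rough illustration under the assumption that both methods use the same label vocabulary (different databases often name the same cell type differently, so exact string matching gives a lower bound on agreement). The function agreement and the toy labels are hypothetical.

```python
from collections import Counter


def agreement(labels_a, labels_b):
    """Return (fraction of cells with identical labels, most common disagreements)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        raise ValueError("label lists must not be empty")
    pairs = list(zip(labels_a, labels_b))
    same = sum(a == b for a, b in pairs)
    # Count each (method A label, method B label) pair that disagrees
    mismatches = Counter((a, b) for a, b in pairs if a != b)
    return same / len(pairs), mismatches.most_common()


# Toy per-cell labels from two annotation runs
sctype = ["T cells", "T cells", "B cells", "NK cells"]
hitype = ["T cells", "NK cells", "B cells", "NK cells"]
fraction, mismatches = agreement(sctype, hitype)
print(fraction, mismatches)
```

The list of mismatching pairs points at the clusters worth inspecting manually before refining them with direct annotation.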
Tissue-Specific Annotation
- Always specify tissue when using ScType/hitype
- Use custom databases for non-standard tissues
- Verify database contains relevant cell types
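Because tissue names must match the database exactly, a quick check before running can save a failed job. This is a minimal sketch, assuming the tissueType values have already been read from the database into a list (for example with a spreadsheet reader); check_tissue is a hypothetical helper and the tissue names are illustrative.

```python
import difflib


def check_tissue(tissue, available):
    """Check a tissue name against the tissueType values of a marker database.

    Returns (ok, suggestions). An empty tissue means "use all rows".
    """
    if tissue == "" or tissue in available:
        return True, []
    # Compare in lower case so that "immune system" suggests "Immune system"
    by_lower = {name.lower(): name for name in available}
    close = difflib.get_close_matches(tissue.lower(), list(by_lower), n=3, cutoff=0.6)
    return False, [by_lower[name] for name in close]


tissues = ["Immune system", "Brain", "Liver"]
result = check_tissue("immune system", tissues)
print(result)
```

The exact match stays case-sensitive, as the tools require; the case-insensitive comparison is used only to suggest the likely intended spelling.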
Reproducibility
- Save cluster-to-celltype mapping (cluster2celltype.tsv)
- Document which tool/database was used
- Keep original cluster names using newcol or backup_col
External References
Tool Documentation
- ScType: https://github.com/IanevskiAleksandr/sc-type
- hitype: https://github.com/pwwang/hitype
- scCATCH: https://github.com/ZJUFanLab/scCATCH
- CellTypist: https://celltypist.org/
Database Downloads
- ScType databases:
- CellTypist models: https://celltypist.cog.sanger.ac.uk/models/models.json
Related Processes
- SeuratClustering: Clustering before annotation
- SeuratMap2Ref: Reference-based annotation (alternative)
- ClusterMarkers: Find markers for each cell type
- SeuratClusterStats: Visualize annotated clusters