Module 3: Cell Type Annotation

Introduction

In this module, we identify cell types by analysing marker genes that distinguish each cluster. Cell type annotation is a critical step in single cell analysis, as it transforms abstract cluster numbers into biologically meaningful cell populations.

There are two main approaches to cell type annotation:

Marker-based annotation (manual): Examine differentially expressed genes in each cluster and match them to known cell type markers from the literature. This approach requires domain knowledge but provides full control and interpretability.
Reference-based annotation (automated): Use tools like Azimuth, SingleR, or scArches to transfer labels from a reference dataset. This is faster but may miss context-specific populations.

In this workshop, we use marker-based annotation to teach the fundamental concepts. The human heart contains several major cell types:

Cardiomyocytes: The contractile muscle cells, marked by TNNT2, TTN, MYH7
Fibroblasts: Structural cells producing extracellular matrix (DCN, COL1A1)
Endothelial cells: Line blood vessels (PECAM1, VWF, CDH5)
Pericytes/Smooth muscle: Support vascular structures (ACTA2, RGS5)
Immune cells: Macrophages and lymphocytes (PTPRC, CD68, CD163)
Epicardial cells: Cover the heart surface (WT1, TBX18)
Neural cells: Neurons and Schwann cells (NRXN1, PLP1)

Load Libraries and Data

library(Seurat)
library(ggplot2)
library(dplyr)
library(tidyr)
library(patchwork)
library(RColorBrewer)
library(pheatmap)
library(future)

We load the integrated and clustered Seurat object from Module 2.

# Load the clustered data. Resolve `data/` whether wd is the repo root
# (interactive RStudio) or `vignettes/` (knit context).
data_dir <- if (dir.exists("data")) "data" else "../data"
seu <- readRDS(file.path(data_dir, "processed/02_integrated_clustered.rds"))

# Verify the data
cat("Loaded Seurat object:\n")

## Loaded Seurat object:

cat("- Cells:", ncol(seu), "\n")

## - Cells: 10000

cat("- Genes:", nrow(seu), "\n")

## - Genes: 17726

cat("- Clusters:", length(unique(seu$seurat_clusters)), "\n")

## - Clusters: 17

Let us first examine the cluster distribution to understand what we are annotating.

# Cluster sizes
cluster_sizes <- table(seu$seurat_clusters)

cat("Cluster sizes:\n")

## Cluster sizes:

for (i in seq_along(cluster_sizes)) {
    cat(sprintf("  Cluster %s: %d cells\n", names(cluster_sizes)[i], cluster_sizes[i]))
}

##   Cluster 0: 2669 cells
##   Cluster 1: 1940 cells
##   Cluster 2: 1119 cells
##   Cluster 3: 693 cells
##   Cluster 4: 580 cells
##   Cluster 5: 556 cells
##   Cluster 6: 543 cells
##   Cluster 7: 367 cells
##   Cluster 8: 227 cells
##   Cluster 9: 205 cells
##   Cluster 10: 193 cells
##   Cluster 11: 182 cells
##   Cluster 12: 166 cells
##   Cluster 13: 151 cells
##   Cluster 14: 145 cells
##   Cluster 15: 141 cells
##   Cluster 16: 123 cells

Define Colour Palette

We maintain consistent colours throughout our analysis.

# Group colours (developmental stage - lowercase to match data)
group_colors <- c(
    "fetal" = "#E64B35",
    "young" = "#4DBBD5",
    "adult" = "#3C5488"
)

# Generate a palette for clusters
n_clusters <- length(unique(seu$seurat_clusters))
cluster_colors <- colorRampPalette(brewer.pal(12, "Paired"))(n_clusters)
names(cluster_colors) <- levels(seu$seurat_clusters)

Visualise Current Clusters

Before annotation, let us examine the clusters and their composition by developmental stage.

# UMAP coloured by cluster and by group
p1 <- DimPlot(seu, group.by = "seurat_clusters", label = TRUE,
              label.size = 4, repel = TRUE, cols = cluster_colors) +
    ggtitle("Clusters") +
    theme(legend.position = "none")

p2 <- DimPlot(seu, group.by = "group", cols = group_colors) +
    ggtitle("Developmental Stage")

p1 + p2

Understanding the developmental composition of each cluster provides important context for annotation.

# Calculate cluster composition by group
comp_data <- seu@meta.data %>%
    group_by(seurat_clusters, group) %>%
    summarise(n = n(), .groups = "drop") %>%
    group_by(seurat_clusters) %>%
    mutate(pct = 100 * n / sum(n))

ggplot(comp_data, aes(x = seurat_clusters, y = pct, fill = group)) +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = group_colors) +
    labs(x = "Cluster", y = "Percentage", fill = "Stage",
         title = "Developmental Stage Composition per Cluster") +
    theme_minimal() +
    theme(axis.text.x = element_text(size = 10))

We observe that some clusters are highly enriched for specific developmental stages. For example, clusters 0, 7, 9, 10, 13, and 14 are nearly exclusively fetal (>95%), while cluster 2 contains predominantly young and adult cells. Cluster 8 (proliferating cells) is 96% fetal, consistent with the high proliferative capacity of fetal cardiac cells. This developmental segregation reflects biological differences in cell state, particularly among cardiomyocytes where fetal cells have distinct transcriptional profiles from mature cells.

Find Cluster Markers

We use FindAllMarkers() to identify genes that distinguish each cluster from all other cells. This is the foundation of marker-based annotation.

# Set memory limits for large datasets
options(future.globals.maxSize = 8000 * 1024^2)

# Use sequential processing to avoid memory issues
library(future)
plan("sequential")

# Set default assay to SCT
DefaultAssay(seu) <- "SCT"

# Prepare for marker finding with SCT assay
seu <- PrepSCTFindMarkers(seu)

# Find markers for all clusters
# We use only.pos = TRUE to find upregulated markers
# min.pct = 0.25 requires the gene to be expressed in at least 25% of cells
# logfc.threshold = 0.5 requires at least 0.5 log2 fold change
markers <- FindAllMarkers(
    seu,
    only.pos = TRUE,
    min.pct = 0.25,
    logfc.threshold = 0.5,
    test.use = "wilcox"
)

cat("Found", nrow(markers), "marker genes across all clusters\n")

## Found 18163 marker genes across all clusters

Let us examine the top markers for each cluster.

# Get top 5 markers per cluster
top_markers <- markers %>%
    group_by(cluster) %>%
    slice_max(order_by = avg_log2FC, n = 5) %>%
    select(cluster, gene, avg_log2FC, pct.1, pct.2, p_val_adj)

# Display top markers for each cluster
for (cl in levels(seu$seurat_clusters)) {
    cl_markers <- top_markers %>% filter(cluster == cl)
    cat("\n--- Cluster", cl, "---\n")
    for (i in seq_len(nrow(cl_markers))) {
        cat(sprintf("  %s: log2FC=%.2f, pct.1=%.0f%%, pct.2=%.0f%%\n",
                    cl_markers$gene[i],
                    cl_markers$avg_log2FC[i],
                    cl_markers$pct.1[i] * 100,
                    cl_markers$pct.2[i] * 100))
    }
}

## 
## --- Cluster 0 ---
##   LINC02008: log2FC=2.95, pct.1=39%, pct.2=8%
##   RNF175: log2FC=2.64, pct.1=44%, pct.2=8%
##   GPR39: log2FC=2.64, pct.1=49%, pct.2=9%
##   GLP1R: log2FC=2.57, pct.1=47%, pct.2=9%
##   BANCR: log2FC=2.54, pct.1=71%, pct.2=15%
## 
## --- Cluster 1 ---
##   SCARA5: log2FC=5.71, pct.1=43%, pct.2=2%
##   VIT: log2FC=5.62, pct.1=39%, pct.2=2%
##   ADH1B: log2FC=5.60, pct.1=52%, pct.2=5%
##   ABCA9-AS1: log2FC=5.35, pct.1=67%, pct.2=5%
##   ABCA9: log2FC=5.31, pct.1=92%, pct.2=17%
## 
## --- Cluster 2 ---
##   LINC01880: log2FC=5.91, pct.1=32%, pct.2=1%
##   APOB: log2FC=5.57, pct.1=26%, pct.2=1%
##   ADRB1: log2FC=5.39, pct.1=42%, pct.2=1%
##   UGT2B4: log2FC=5.16, pct.1=45%, pct.2=1%
##   MYOM3: log2FC=4.95, pct.1=90%, pct.2=5%
## 
## --- Cluster 3 ---
##   FHL5: log2FC=6.61, pct.1=28%, pct.2=1%
##   EGFLAM: log2FC=6.40, pct.1=77%, pct.2=5%
##   AGAP2: log2FC=6.39, pct.1=40%, pct.2=1%
##   LINC02237: log2FC=6.10, pct.1=38%, pct.2=2%
##   FAM162B: log2FC=6.05, pct.1=68%, pct.2=2%
## 
## --- Cluster 4 ---
##   S100A1: log2FC=3.93, pct.1=32%, pct.2=4%
##   ACTA1: log2FC=3.79, pct.1=80%, pct.2=16%
##   HLA-DQB1: log2FC=3.52, pct.1=34%, pct.2=3%
##   HLA-DRA: log2FC=3.36, pct.1=57%, pct.2=7%
##   HLA-DPA1: log2FC=3.34, pct.1=45%, pct.2=6%
## 
## --- Cluster 5 ---
##   NR5A2: log2FC=6.95, pct.1=47%, pct.2=1%
##   CA4: log2FC=6.55, pct.1=39%, pct.2=1%
##   NOTCH4: log2FC=6.52, pct.1=75%, pct.2=2%
##   BTNL9: log2FC=6.45, pct.1=72%, pct.2=3%
##   CYYR1: log2FC=6.40, pct.1=92%, pct.2=4%
## 
## --- Cluster 6 ---
##   MARCO: log2FC=7.68, pct.1=29%, pct.2=0%
##   PLEK: log2FC=7.25, pct.1=51%, pct.2=1%
##   F13A1: log2FC=7.20, pct.1=92%, pct.2=12%
##   LILRB5: log2FC=7.14, pct.1=45%, pct.2=1%
##   SIGLEC1: log2FC=6.97, pct.1=65%, pct.2=2%
## 
## --- Cluster 7 ---
##   CSMD1: log2FC=5.31, pct.1=70%, pct.2=19%
##   OPCML: log2FC=4.77, pct.1=69%, pct.2=12%
##   BRINP3: log2FC=3.64, pct.1=94%, pct.2=26%
##   ZMAT4: log2FC=2.82, pct.1=66%, pct.2=17%
##   P2RX1: log2FC=2.59, pct.1=41%, pct.2=8%
## 
## --- Cluster 8 ---
##   NEK2: log2FC=7.17, pct.1=27%, pct.2=0%
##   TOP2A: log2FC=6.91, pct.1=95%, pct.2=4%
##   KIF18B: log2FC=6.88, pct.1=91%, pct.2=2%
##   KIF20A: log2FC=6.88, pct.1=35%, pct.2=0%
##   TROAP: log2FC=6.80, pct.1=47%, pct.2=1%
## 
## --- Cluster 9 ---
##   PANCR: log2FC=7.04, pct.1=50%, pct.2=1%
##   KCNJ3: log2FC=6.75, pct.1=91%, pct.2=2%
##   KCNH7: log2FC=6.66, pct.1=89%, pct.2=9%
##   VWDE: log2FC=6.65, pct.1=63%, pct.2=1%
##   ZNF385B: log2FC=6.51, pct.1=85%, pct.2=9%
## 
## --- Cluster 10 ---
##   SRARP: log2FC=7.02, pct.1=41%, pct.2=0%
##   OTUD1: log2FC=5.39, pct.1=74%, pct.2=6%
##   ATF3: log2FC=4.63, pct.1=68%, pct.2=7%
##   FOSB: log2FC=4.47, pct.1=79%, pct.2=5%
##   XIRP1: log2FC=4.43, pct.1=82%, pct.2=12%
## 
## --- Cluster 11 ---
##   PKHD1L1: log2FC=8.67, pct.1=88%, pct.2=3%
##   SMOC1: log2FC=7.21, pct.1=74%, pct.2=2%
##   MMRN1: log2FC=6.95, pct.1=42%, pct.2=1%
##   LINC02388: log2FC=6.31, pct.1=69%, pct.2=10%
##   PCDH15: log2FC=6.31, pct.1=80%, pct.2=13%
## 
## --- Cluster 12 ---
##   NRXN1: log2FC=8.28, pct.1=100%, pct.2=12%
##   TFAP2A: log2FC=8.25, pct.1=38%, pct.2=0%
##   INSC: log2FC=8.22, pct.1=75%, pct.2=1%
##   XKR4: log2FC=7.85, pct.1=99%, pct.2=10%
##   GRIK3: log2FC=7.76, pct.1=60%, pct.2=1%
## 
## --- Cluster 13 ---
##   MYH7: log2FC=2.79, pct.1=99%, pct.2=66%
##   TNNI1: log2FC=2.79, pct.1=98%, pct.2=36%
##   DHFR: log2FC=2.77, pct.1=91%, pct.2=46%
##   LRRC10: log2FC=2.48, pct.1=47%, pct.2=10%
##   SRP14: log2FC=2.41, pct.1=89%, pct.2=34%
## 
## --- Cluster 14 ---
##   E2F1: log2FC=4.74, pct.1=67%, pct.2=4%
##   CDC45: log2FC=4.66, pct.1=53%, pct.2=2%
##   DTL: log2FC=4.62, pct.1=87%, pct.2=6%
##   MCM10: log2FC=4.50, pct.1=53%, pct.2=3%
##   EXO1: log2FC=4.36, pct.1=53%, pct.2=3%
## 
## --- Cluster 15 ---
##   XIRP2: log2FC=4.01, pct.1=47%, pct.2=23%
##   GCLM: log2FC=2.54, pct.1=28%, pct.2=11%
##   GBE1: log2FC=2.51, pct.1=82%, pct.2=73%
##   LMCD1: log2FC=2.47, pct.1=35%, pct.2=23%
##   PCBP3: log2FC=2.44, pct.1=26%, pct.2=9%
## 
## --- Cluster 16 ---
##   SKAP1: log2FC=8.23, pct.1=81%, pct.2=3%
##   THEMIS: log2FC=7.85, pct.1=48%, pct.2=2%
##   SCML4: log2FC=7.65, pct.1=51%, pct.2=1%
##   CD2: log2FC=7.50, pct.1=42%, pct.2=0%
##   ITK: log2FC=7.34, pct.1=60%, pct.2=1%

Visualise Top Markers with DotPlot

The DotPlot below shows the top 5 differentially expressed genes for each cluster. This provides an unbiased view of what distinguishes each cluster before we examine the cell type markers from the original study.

# Get top 5 unique markers per cluster, ordered by cluster
top5_genes <- top_markers %>%
    arrange(cluster, desc(avg_log2FC)) %>%
    pull(gene) %>%
    unique()

DotPlot(seu, features = top5_genes, group.by = "seurat_clusters") +
    RotatedAxis() +
    scale_colour_gradient2(low = "blue", mid = "white", high = "red",
                           midpoint = 0) +
    labs(title = "Top 5 Marker Genes per Cluster (from FindAllMarkers)") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 7),
          axis.text.y = element_text(size = 10))

Marker Heatmap

A heatmap of these top markers provides a comprehensive view of the expression patterns across clusters.

# Get top 3 markers per cluster for heatmap
top3_markers <- markers %>%
    group_by(cluster) %>%
    slice_max(order_by = avg_log2FC, n = 3) %>%
    pull(gene) %>%
    unique()

# Downsample for visualisation (50 cells per cluster max)
set.seed(42)
cells_subset <- seu@meta.data %>%
    mutate(cell_id = rownames(seu@meta.data)) %>%
    group_by(seurat_clusters) %>%
    slice_sample(n = 50) %>%
    pull(cell_id)

# Create heatmap
DoHeatmap(subset(seu, cells = cells_subset),
          features = top3_markers,
          group.by = "seurat_clusters",
          size = 3) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0) +
  theme(axis.text.y = element_text(size = 6))

Cell Type Marker Analysis

While the differentially expressed genes provide a starting point, we typically compare against known marker genes for the cell types expected in our tissue. The human heart contains several well-characterised cell populations.

Cell Type Marker Genes

The marker genes below were identified by Sim et al. 2021 (Circulation) from their analysis of the full human heart development dataset. These markers distinguish the major cell populations across foetal, young, and adult hearts.

Visualise with DotPlot

The DotPlot is a powerful visualisation that shows both the percentage of cells expressing a gene (dot size) and the average expression level (colour intensity).

# Define markers and cell types from Sim et al. 2021
marker_info <- data.frame(
    gene = c(
        # Cardiomyocytes
        "TNNT2", "ACTN2", "TNNI3", "RYR2", "MYH6", "TTN", "MYBPC3", "PLN", "MYL2",
        # Fibroblasts
        "COL1A1", "COL3A1", "DCN", "PDGFRA", "POSTN", "TCF21", "VIM",
        # Pericytes
        "KCNJ8", "VTN", "ABCC9", "HEYL",
        # Endothelial
        "PECAM1", "KDR", "CDH5", "EMCN", "FLT1",
        # Smooth Muscle
        "TAGLN", "MYH11", "MYLK", "LMOD1",
        # Macrophages
        "CD68", "CD163", "CD74", "F13A1", "LGALS3",
        # T Cells
        "CD3D", "CD3G", "CD3E", "IL7R",
        # Schwann/Neural
        "PLP1", "CNP", "S100B",
        # Epicardial
        "UPK3B", "WT1",
        # Proliferating
        "MKI67", "TOP2A"
    ),
    celltype = c(
        rep("Cardiomyocytes", 9),
        rep("Fibroblasts", 7),
        rep("Pericytes", 4),
        rep("Endothelial", 5),
        rep("Smooth Muscle", 4),
        rep("Macrophages", 5),
        rep("T Cells", 4),
        rep("Schwann", 3),
        rep("Epicardial", 2),
        rep("Proliferating", 2)
    )
)

# Define colors for each cell type
celltype_colors <- c(
    "Cardiomyocytes" = "#D62728",
    "Fibroblasts" = "#1F77B4",
    "Pericytes" = "#2CA02C",
    "Endothelial" = "#9467BD",
    "Smooth Muscle" = "#FF7F0E",
    "Macrophages" = "#8C564B",
    "T Cells" = "#E377C2",
    "Schwann" = "#7F7F7F",
    "Epicardial" = "#17BECF",
    "Proliferating" = "#333333"
)

# Filter to available genes
marker_info <- marker_info[marker_info$gene %in% rownames(seu), ]
marker_genes <- marker_info$gene
gene_colors <- celltype_colors[marker_info$celltype]
marker_info$celltype_f <- factor(marker_info$celltype, levels = names(celltype_colors))

# Order clusters by cell type identity for diagonal pattern
# Preferred order: CMs first (fetal, mature, atrial), then stromal, immune, neural
preferred_order <- c("0", "7", "10", "14", "9", "2", "4", "8", "13", "1", "3", "15", "5", "11", "6", "16", "12")
# Filter to only clusters that exist in our data
actual_clusters <- as.character(unique(seu$seurat_clusters))
cluster_order <- preferred_order[preferred_order %in% actual_clusters]
# Add any clusters not in preferred order (shouldn't happen, but safety)
cluster_order <- c(cluster_order, setdiff(actual_clusters, cluster_order))
seu$cluster_ordered <- factor(seu$seurat_clusters, levels = cluster_order)

# Calculate cell type label positions
celltype_positions <- marker_info %>%
    mutate(pos = row_number()) %>%
    group_by(celltype_f) %>%
    summarise(start = min(pos) - 0.5, end = max(pos) + 0.5,
              mid = mean(pos), .groups = "drop")

# Create DotPlot with cell type annotations
DotPlot(seu, features = marker_genes, group.by = "cluster_ordered") +
    RotatedAxis() +
    scale_colour_gradient2(low = "#2166AC", mid = "white", high = "#B2182B",
                           midpoint = 0, name = "Average\nExpression") +
    scale_size_continuous(name = "Percent\nExpressed", range = c(0.5, 6)) +
    labs(x = NULL, y = "Cluster",
         title = "Cell Type Marker Expression (Sim et al. 2021)") +
    theme_bw() +
    theme(
        plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 10,
                                   color = gene_colors, face = "italic"),
        axis.text.y = element_text(size = 12),
        axis.title.y = element_text(size = 14, face = "bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_rect(color = "grey40", linewidth = 0.8),
        legend.position = "right",
        legend.title = element_text(size = 11, face = "bold"),
        legend.text = element_text(size = 10),
        plot.margin = margin(10, 10, 45, 10)
    ) +
    # Add cell type color bars
    annotate("rect",
             xmin = celltype_positions$start, xmax = celltype_positions$end,
             ymin = -1.8, ymax = -1.2,
             fill = celltype_colors[as.character(celltype_positions$celltype_f)]) +
    # Add cell type labels inside bars
    annotate("text",
             x = celltype_positions$mid, y = -1.5,
             label = celltype_positions$celltype_f,
             size = 3.5, angle = 0, hjust = 0.5, fontface = "bold",
             color = "white") +
    coord_cartesian(clip = "off", ylim = c(0.5, length(cluster_order) + 0.5))

The DotPlot reveals a clear diagonal pattern where each cell type shows specific marker expression. Gene names are coloured by their cell type, matching the annotation bar at the bottom:

Cardiomyocytes (red): TNNT2, ACTN2, TNNI3, RYR2, MYH6, TTN, MYBPC3, PLN, MYL2
Fibroblasts (blue): COL1A1, COL3A1, DCN, PDGFRA, POSTN, TCF21, VIM
Pericytes (green): KCNJ8, VTN, ABCC9, HEYL
Endothelial (purple): PECAM1, KDR, CDH5, EMCN, FLT1
Smooth Muscle (orange): TAGLN, MYH11, MYLK, LMOD1
Macrophages (brown): CD68, CD163, CD74, F13A1, LGALS3
T Cells (pink): CD3D, CD3G, CD3E, IL7R
Schwann (grey): PLP1, CNP, S100B
Epicardial (teal): UPK3B, WT1
Proliferating (black): MKI67, TOP2A

Feature Plots for Key Markers

FeaturePlots show the spatial distribution of marker expression on the UMAP, helping us understand the relationship between clusters.

# Cardiomyocyte markers
FeaturePlot(seu,
            features = c("TNNT2", "TTN", "MYH7", "MYH6"),
            ncol = 2, order = TRUE) &
    scale_colour_gradient(low = "lightgrey", high = "darkred") &
    theme_minimal()

The cardiomyocyte markers reveal an interesting pattern: MYH6 (atrial isoform) is enriched in clusters 2 (mature atrial CM) and 9 (fetal atrial CM), while MYH7 (ventricular isoform) is more expressed in clusters 0, 7, 10, and 14 (fetal ventricular CM populations). This suggests we can distinguish atrial from ventricular cardiomyocytes.

# Stromal cell markers
FeaturePlot(seu,
            features = c("DCN", "PECAM1", "KCNJ8", "ACTA2"),
            ncol = 2, order = TRUE) &
    scale_colour_gradient(low = "lightgrey", high = "darkblue") &
    theme_minimal()

# Immune and other markers
FeaturePlot(seu,
            features = c("PTPRC", "F13A1", "NRXN1", "MKI67"),
            ncol = 2, order = TRUE) &
    scale_colour_gradient(low = "lightgrey", high = "darkgreen") &
    theme_minimal()

Assign Cell Type Annotations

Based on the marker analysis above, we can now assign cell type labels to each cluster. We consider both the marker gene expression and the developmental stage composition.

Note: These annotations are based on the 10,000 cell subset used for this workshop. The full dataset (~54,000 cells) may reveal additional rare populations or finer subclusters.

The table below summarises our annotations with the supporting marker evidence:

Cluster	Cell Type	Key Markers	Notes
0	Fetal CM	TNNT2+, TTN+, MYH7+	Fetal ventricular cardiomyocytes (100% fetal)
1	Fibroblasts	DCN+, COL1A1+, SCARA5+	Present across all developmental stages
2	Atrial CM	TTN+, MYH6++, ADRB1+	Mature atrial CM (62% young, 37% adult)
3	Epicardial	KCNJ8+, EGFLAM+, UPK3B+	Epicardial/pericyte population
4	Activated CM	HLA-DRA+, HLA-DQB1+, TNNT2+	CM with MHC class II expression (75% young)
5	Endothelial	PECAM1+, NOTCH4+, CA4+	Main endothelial population
6	Macrophages	F13A1+, MARCO+, CD68+	Tissue-resident macrophages
7	Fetal CM	TNNT2+, TTN+, CSMD1+	Fetal cardiomyocytes (100% fetal)
8	Proliferating	TOP2A+, MKI67+, KIF18B+	Proliferating cells (96% fetal)
9	Fetal Atrial CM	TTN+, MYH6+, KCNJ3+	Fetal atrial cardiomyocytes (100% fetal)
10	Fetal CM (stress)	TNNT2+, ATF3+, FOSB+	Stress-response CM (100% fetal)
11	Endothelial	PECAM1+, PKHD1L1+, SMOC1+	Vascular endothelial subset
12	Neurons	NRXN1++, XKR4+, TFAP2A+	Neural cells
13	Cycling CM	E2F1+, CDC45+, DTL+	S-phase cycling CM (99% fetal)
14	Fetal Ventricular CM	MYH7++, TNNI1+, DHFR+	Fetal ventricular CM (98% fetal)
15	Mixed/Transitional	XIRP2+, GCLM+	Mixed epicardial/CM (50% fetal)
16	T cells	CD2+, SKAP1+, THEMIS+	T lymphocytes

# Define cell type annotations for all 17 clusters
cluster_annotations <- c(
  "0" = "Fetal CM",
  "1" = "Fibroblasts",
  "2" = "Atrial CM",
  "3" = "Epicardial",
  "4" = "Activated CM",
  "5" = "Endothelial",
  "6" = "Macrophages",
  "7" = "Fetal CM",
  "8" = "Proliferating",
  "9" = "Fetal Atrial CM",
  "10" = "Fetal CM (stress)",
  "11" = "Endothelial",
  "12" = "Neurons",
  "13" = "Cycling CM",
  "14" = "Fetal Ventricular CM",
  "15" = "Mixed/Transitional",
  "16" = "T cells"
)

cell_type_df <- data.frame(
  cell_type = cluster_annotations[as.character(seu$seurat_clusters)],
  row.names = colnames(seu)
)
seu <- AddMetaData(seu, metadata = cell_type_df)

# Broad categories for summary analyses
broad_annotations <- c(
  "0" = "Cardiomyocytes",
  "1" = "Fibroblasts",
  "2" = "Cardiomyocytes",
  "3" = "Epicardial",
  "4" = "Cardiomyocytes",
  "5" = "Endothelial",
  "6" = "Immune",
  "7" = "Cardiomyocytes",
  "8" = "Proliferating",
  "9" = "Cardiomyocytes",
  "10" = "Cardiomyocytes",
  "11" = "Endothelial",
  "12" = "Neural",
  "13" = "Cardiomyocytes",
  "14" = "Cardiomyocytes",
  "15" = "Mixed",
  "16" = "Immune"
)

cell_type_broad_df <- data.frame(
  cell_type_broad = broad_annotations[as.character(seu$seurat_clusters)],
  row.names = colnames(seu)
)
seu <- AddMetaData(seu, metadata = cell_type_broad_df)

Visualise Annotated Cell Types

With cell type annotations assigned, we now examine the results to verify our annotations make biological sense and to understand the cellular composition of the heart across developmental stages.

Annotated UMAP

The UMAP visualisation with cell type labels provides the key output of our annotation workflow. We show both detailed annotations (distinguishing fetal from adult cardiomyocytes, atrial from ventricular) and broad categories.

# Define colours for cell types (13 unique detailed types for 17 clusters)
celltype_colors <- c(
    "Fetal CM" = "#E64B35",
    "Fibroblasts" = "#00A087",
    "Atrial CM" = "#3C5488",
    "Activated CM" = "#F4A582",
    "Epicardial" = "#F39B7F",
    "Endothelial" = "#8491B4",
    "Macrophages" = "#91D1C2",
    "Fetal Atrial CM" = "#B09C85",
    "Fetal CM (stress)" = "#FF6B6B",
    "Fetal Ventricular CM" = "#4DBBD5",
    "Neurons" = "#CD534C",
    "Cycling CM" = "#9370DB",
    "Proliferating" = "#DDA0DD",
    "Mixed/Transitional" = "#868686",
    "T cells" = "#7570B3"
)

# Broad categories
broad_colors <- c(
    "Cardiomyocytes" = "#E64B35",
    "Fibroblasts" = "#00A087",
    "Endothelial" = "#8491B4",
    "Epicardial" = "#F39B7F",
    "Immune" = "#91D1C2",
    "Neural" = "#CD534C",
    "Proliferating" = "#9370DB",
    "Mixed" = "#868686"
)

p1 <- DimPlot(seu, group.by = "cell_type", label = TRUE,
              label.size = 3, repel = TRUE) +
    scale_color_manual(values = celltype_colors) +
    ggtitle("Detailed Cell Types") +
    theme(legend.position = "none")

p2 <- DimPlot(seu, group.by = "cell_type_broad", label = TRUE,
              label.size = 4, repel = TRUE) +
    scale_color_manual(values = broad_colors) +
    ggtitle("Broad Cell Types")

p1 + p2

The annotated UMAP reveals a well-organised cellular landscape:

Cardiomyocytes form the dominant population, with fetal cells (red tones) clustering separately from mature atrial and ventricular populations
Stromal cells (fibroblasts, pericytes, endothelial) occupy distinct regions of the UMAP space
Immune and neural cells form smaller but clearly defined clusters

Cluster Size Distribution

Before examining developmental patterns, we assess how many cells belong to each cluster. This helps interpret whether observed patterns reflect genuine biology or sampling effects.

# Cluster sizes with cell type labels
cluster_size_data <- seu@meta.data %>%
    group_by(seurat_clusters, cell_type) %>%
    summarise(n = n(), .groups = "drop")

ggplot(cluster_size_data, aes(x = seurat_clusters, y = n, fill = cell_type)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = celltype_colors) +
  labs(x = "Cluster", y = "Number of Cells", fill = "Cell Type",
       title = "Cell Count per Cluster") +
  theme_minimal() +
  theme(legend.position = "right",
        legend.text = element_text(size = 8))

Cluster 0 (Fetal CM) is the largest population with ~2,700 cells, consistent with fetal samples contributing the most cells. Clusters 12 (Neurons), 15 (Mixed/Transitional), and 16 (T cells) are among the smallest, representing specialised or transitional populations.

Developmental Composition per Cluster

A critical question in developmental studies is: which cell populations are stage-specific versus shared across development? This stacked bar chart shows the fetal/young/adult breakdown within each cluster.

# Calculate developmental composition per cluster
cluster_dev_comp <- seu@meta.data %>%
    group_by(seurat_clusters, group) %>%
    summarise(n = n(), .groups = "drop") %>%
    group_by(seurat_clusters) %>%
    mutate(pct = 100 * n / sum(n))

ggplot(cluster_dev_comp, aes(x = seurat_clusters, y = pct, fill = group)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = group_colors) +
  labs(x = "Cluster", y = "Percentage", fill = "Stage",
       title = "Developmental Stage Composition per Cluster") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10))

This figure reveals striking developmental patterns:

Fetal-specific clusters (0, 7, 9, 10, 13, 14): These cardiomyocyte populations are almost exclusively fetal (>95%), reflecting the distinct transcriptional state of immature heart cells
Mature CM cluster (2): Atrial cardiomyocytes predominantly from young (62%) and adult (37%) hearts
Activated CM (4): Cardiomyocytes with MHC class II expression, predominantly from young samples (75%), potentially reflecting immune activation
Proliferating cells (8): Nearly exclusively fetal (96%), consistent with the high proliferative capacity of fetal cardiac cells
Shared populations (1, 3, 5, 6, 11, 12, 16): Fibroblasts, epicardial, endothelial, immune, and neural cells are present across all developmental stages
Mixed/Transitional (15): A heterogeneous population containing both epicardial and cardiomyocyte signatures

Cell Type Composition by Developmental Stage

We can also view this relationship from the opposite perspective: what is the cellular composition within each developmental stage?

# Calculate composition
stage_comp <- seu@meta.data %>%
    group_by(group, cell_type_broad) %>%
    summarise(n = n(), .groups = "drop") %>%
    group_by(group) %>%
    mutate(pct = 100 * n / sum(n))

ggplot(stage_comp, aes(x = group, y = pct, fill = cell_type_broad)) +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = broad_colors) +
    labs(x = "Developmental Stage", y = "Percentage", fill = "Cell Type",
         title = "Cell Type Composition by Developmental Stage") +
    theme_minimal()

This complementary view shows:

Cardiomyocytes dominate all stages but show distinct subtypes (fetal vs mature)
Proliferating cells are predominantly fetal, consistent with the known proliferative capacity of fetal cardiomyocytes
Stromal populations (fibroblasts, endothelial, pericytes) are relatively stable across development

Split UMAP by Developmental Stage

Finally, we split the UMAP by developmental stage to directly visualise which cell types are present at each time point. This view clearly shows how the cellular landscape changes from fetal to adult heart.

DimPlot(seu, group.by = "cell_type", split.by = "group",
        label = FALSE, cols = celltype_colors) +
  ggtitle("Cell Types by Developmental Stage") +
  theme(legend.position = "right",
        legend.text = element_text(size = 8))

The split view highlights several key observations:

Fetal heart is dominated by Fetal CM populations, with proliferating and cycling cells visible (clusters 8, 13)
Young heart shows a transition state with mature atrial CM (cluster 2) and activated CM with MHC class II expression (cluster 4)
Adult heart shows predominantly mature atrial cardiomyocytes (cluster 2)
Stromal and immune populations (fibroblasts, endothelial, macrophages, T cells) are consistent across all developmental stages

Save Annotated Object

# Resolve results/ alongside data/ for the same wd-portability reason.
results_dir <- if (dir.exists("results") || dir.exists("data")) "results" else "../results"
dir.create(results_dir, recursive = TRUE, showWarnings = FALSE)

# Save the annotated Seurat object
saveRDS(seu, file.path(data_dir, "processed/03_annotated.rds"))

# Also save the markers table
write.csv(markers, file.path(results_dir, "cluster_markers.csv"), row.names = FALSE)

message("Saved annotated object to: data/processed/03_annotated.rds")
message("Saved marker genes to: results/cluster_markers.csv")

Summary

In this module, we performed cell type annotation on the clustered data. We:

Identified marker genes for each of the 17 clusters using FindAllMarkers()
Compared cluster markers against known heart cell type markers from Sim et al. 2021
Visualised marker expression using DotPlots and FeaturePlots
Assigned cell type labels based on marker gene expression and developmental composition
Created both detailed (13 types) and broad (8 types) cell type annotations

The annotated dataset is now ready for differential expression and composition analysis in Module 4.

Session Information

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] future_1.68.0      pheatmap_1.0.13    RColorBrewer_1.1-3 patchwork_1.3.2   
##  [5] tidyr_1.3.2        dplyr_1.1.4        ggplot2_4.0.1      Seurat_5.4.0      
##  [9] SeuratObject_5.3.0 sp_2.2-0          
## 
## loaded via a namespace (and not attached):
##   [1] deldir_2.0-4           pbapply_1.7-4          gridExtra_2.3         
##   [4] rlang_1.1.7            magrittr_2.0.4         RcppAnnoy_0.0.23      
##   [7] otel_0.2.0             spatstat.geom_3.6-1    matrixStats_1.5.0     
##  [10] ggridges_0.5.7         compiler_4.5.2         png_0.1-8             
##  [13] systemfonts_1.3.1      vctrs_0.6.5            reshape2_1.4.5        
##  [16] stringr_1.6.0          pkgconfig_2.0.3        fastmap_1.2.0         
##  [19] labeling_0.4.3         promises_1.5.0         rmarkdown_2.30        
##  [22] ragg_1.5.0             purrr_1.2.1            xfun_0.55             
##  [25] cachem_1.1.0           jsonlite_2.0.0         goftest_1.2-3         
##  [28] later_1.4.5            spatstat.utils_3.2-1   irlba_2.3.5.1         
##  [31] parallel_4.5.2         cluster_2.1.8.1        R6_2.6.1              
##  [34] ica_1.0-3              spatstat.data_3.1-9    stringi_1.8.7         
##  [37] bslib_0.9.0            limma_3.66.0           reticulate_1.44.1     
##  [40] spatstat.univar_3.1-5  parallelly_1.46.1      lmtest_0.9-40         
##  [43] jquerylib_0.1.4        scattermore_1.2        Rcpp_1.1.1-1          
##  [46] knitr_1.51             tensor_1.5.1           future.apply_1.20.1   
##  [49] zoo_1.8-15             sctransform_0.4.3      httpuv_1.6.16         
##  [52] Matrix_1.7-4           splines_4.5.2          igraph_2.2.1          
##  [55] tidyselect_1.2.1       abind_1.4-8            yaml_2.3.12           
##  [58] spatstat.random_3.4-3  spatstat.explore_3.6-0 codetools_0.2-20      
##  [61] miniUI_0.1.2           listenv_0.10.0         plyr_1.8.9            
##  [64] lattice_0.22-7         tibble_3.3.1           withr_3.0.2           
##  [67] shiny_1.12.1           S7_0.2.1-1             ROCR_1.0-11           
##  [70] evaluate_1.0.5         Rtsne_0.17             fastDummies_1.7.5     
##  [73] desc_1.4.3             survival_3.8-3         polyclip_1.10-7       
##  [76] fitdistrplus_1.2-4     pillar_1.11.1          BiocManager_1.30.27   
##  [79] KernSmooth_2.23-26     renv_1.1.5             plotly_4.11.0         
##  [82] generics_0.1.4         RcppHNSW_0.6.0         scales_1.4.0          
##  [85] globals_0.18.0         xtable_1.8-4           glue_1.8.0            
##  [88] lazyeval_0.2.2         tools_4.5.2            data.table_1.18.0     
##  [91] RSpectra_0.16-2        RANN_2.6.2             fs_1.6.6              
##  [94] dotCall64_1.2          cowplot_1.2.0          grid_4.5.2            
##  [97] nlme_3.1-168           cli_3.6.5              spatstat.sparse_3.1-0 
## [100] textshaping_1.0.4      spam_2.11-3            viridisLite_0.4.2     
## [103] uwot_0.2.4             gtable_0.3.6           sass_0.4.10           
## [106] digest_0.6.39          progressr_0.18.0       ggrepel_0.9.6         
## [109] htmlwidgets_1.6.4      farver_2.1.2           htmltools_0.5.9       
## [112] pkgdown_2.2.0          lifecycle_1.0.5        httr_1.4.7            
## [115] statmod_1.5.1          mime_0.13              MASS_7.3-65

Single Cell Workshop

2026-05-06