Using SuperCellCyto for Stratified Summarising
Givanna Putri
Have you been following the vignette on how to create supercells and
wondered whether SuperCellCyto can be used as a replacement for
stratified sampling to avoid overcrowding UMAP/tSNE plots?
The short answer is yes. We call this stratified summarising, and
SuperCellCyto can absolutely be used for this purpose. All we need to
do is set the sample column of our data not to the biological sample
each cell came from, but to the column we want to stratify the data by.
For example, when drawing a UMAP or tSNE plot, we commonly subsample
each cluster or cell type to avoid crowding the plot. Instead of
subsampling, we can generate supercells for each cluster or cell type
simply by specifying the column that denotes the cluster or cell type
each cell belongs to as the sample_colname parameter!
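For comparison, the stratified subsampling that supercells would replace can be sketched in base R as follows. This is only an illustration and not part of SuperCellCyto: the toy data frame, its `cluster` column, and the cap `n_per_cluster` are all hypothetical.

```r
set.seed(42)

# Hypothetical toy data: 1,000 cells, one marker, three unevenly sized clusters.
toy <- data.frame(
  Marker_1 = rnorm(1000),
  cluster = sample(
    c("A", "B", "C"), size = 1000, replace = TRUE, prob = c(0.7, 0.2, 0.1)
  )
)

# Keep at most this many cells per cluster (an arbitrary choice).
n_per_cluster <- 50

# Split the row indices by cluster, then draw without replacement within each.
rows_by_cluster <- split(seq_len(nrow(toy)), toy$cluster)
kept_rows <- unlist(lapply(rows_by_cluster, function(idx) {
  idx[sample.int(length(idx), size = min(length(idx), n_per_cluster))]
}))

toy_subsampled <- toy[kept_rows, ]
table(toy_subsampled$cluster)
```

Subsampling like this discards every cell that is not drawn, whereas stratified summarising aggregates all cells in a stratum into supercells.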
Let’s illustrate this using toy data clustered with k-means.
library(SuperCellCyto)
set.seed(42)
# Simulate some data
dat <- simCytoData()
markers_col <- paste0("Marker_", seq_len(10))
cell_id_col <- "Cell_Id"
# Run kmeans
clust <- kmeans(
  x = dat[, markers_col, with = FALSE],
  centers = 5
)
clust_col <- "kmeans_clusters"
dat[[clust_col]] <- paste0("cluster_", clust$cluster)
To perform stratified summarising, we supply the cluster column
(kmeans_clusters in the example above) as runSuperCellCyto’s
sample_colname parameter.
supercells <- runSuperCellCyto(
  dt = dat,
  markers = markers_col,
  sample_colname = clust_col,
  cell_id_colname = cell_id_col
)
Now, if we look at the supercell_expression_matrix, each row (each
supercell) is labelled with the cluster it belongs to, not the
biological sample it came from:
# Inspect the first 3 and last 3 rows of the expression matrix for selected columns.
rbind(
  head(supercells$supercell_expression_matrix, n = 3),
  tail(supercells$supercell_expression_matrix, n = 3)
)[, c("kmeans_clusters", "SuperCellId", "Marker_10")]
#>    kmeans_clusters                    SuperCellId Marker_10
#>             <char>                         <char>     <num>
#> 1:       cluster_4   SuperCell_1_Sample_cluster_4  14.64662
#> 2:       cluster_4   SuperCell_2_Sample_cluster_4  14.66858
#> 3:       cluster_4   SuperCell_3_Sample_cluster_4  14.41837
#> 4:       cluster_5 SuperCell_498_Sample_cluster_5  16.99003
#> 5:       cluster_5 SuperCell_499_Sample_cluster_5  17.09864
#> 6:       cluster_5 SuperCell_500_Sample_cluster_5  15.85447
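With one row per supercell, the expression matrix can be passed to an embedding method in place of subsampled cells. The sketch below assumes the uwot package, which this vignette does not otherwise use. So that it runs on its own, it substitutes a random stand-in table with the same column layout; in practice supercells$supercell_expression_matrix would take its place.

```r
set.seed(42)
markers_col <- paste0("Marker_", seq_len(10))

# Stand-in for supercells$supercell_expression_matrix: 500 rows of random
# marker values plus a cluster label (hypothetical data, for illustration).
supercell_mat <- as.data.frame(
  matrix(rnorm(500 * 10), ncol = 10, dimnames = list(NULL, markers_col))
)
supercell_mat$kmeans_clusters <- sample(
  paste0("cluster_", 1:5), size = 500, replace = TRUE
)

embedding <- NULL
if (requireNamespace("uwot", quietly = TRUE)) {
  # Embed the supercells using only the marker columns.
  embedding <- uwot::umap(as.matrix(supercell_mat[, markers_col]))

  # Colour each supercell by the cluster it summarises.
  plot(
    embedding,
    col = as.integer(factor(supercell_mat$kmeans_clusters)),
    pch = 16, xlab = "UMAP 1", ylab = "UMAP 2"
  )
}
```

The stand-in here is a plain data frame. When selecting the marker columns from the data.table used in this vignette, pass `with = FALSE`, as in the kmeans call above.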
If we compare the number of supercells created with the number of
cells in each cluster, we find that each cluster yields approximately
n_cells_in_the_cluster/20 supercells, where 20 is the value of the gam
parameter used by runSuperCellCyto (this is the default).
# Count the cells per cluster and divide by 20, the gam value.
table(dat$kmeans_clusters) / 20
#>
#> cluster_1 cluster_2 cluster_3 cluster_4 cluster_5
#>    120.25    130.30    119.75    129.70    500.00
table(supercells$supercell_expression_matrix$kmeans_clusters)
#>
#> cluster_1 cluster_2 cluster_3 cluster_4 cluster_5
#>       120       130       120       130       500
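The correspondence between the two tables can be checked with a little arithmetic. The sketch below recovers the per-cluster cell counts from the output above (each ratio multiplied by 20) and rounds n_cells / gam to the nearest integer. That rounding rule is an assumption made for illustration rather than a confirmed description of the package internals, although it does reproduce the counts shown.

```r
# Cells per cluster, recovered from the first table (ratio * 20).
n_cells <- c(
  cluster_1 = 2405, cluster_2 = 2606, cluster_3 = 2395,
  cluster_4 = 2594, cluster_5 = 10000
)
gam <- 20

# Assumed rule: supercells per cluster is n_cells / gam, rounded to nearest.
expected_supercells <- round(n_cells / gam)
expected_supercells

# Supercell counts reported by the second table above.
observed_supercells <- c(
  cluster_1 = 120, cluster_2 = 130, cluster_3 = 120,
  cluster_4 = 130, cluster_5 = 500
)
all(expected_supercells == observed_supercells)
```

Because the number of supercells scales with 1/gam, a smaller gam should give more supercells per cluster and a larger gam fewer.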
Session information
sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] SuperCellCyto_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.6-5 jsonlite_1.8.8 compiler_4.3.3
#> [4] Rcpp_1.0.12 parallel_4.3.3 SuperCell_1.0
#> [7] jquerylib_0.1.4 systemfonts_1.0.6 textshaping_0.3.7
#> [10] BiocParallel_1.36.0 yaml_2.3.8 fastmap_1.1.1
#> [13] lattice_0.22-5 R6_2.5.1 plyr_1.8.9
#> [16] igraph_2.0.3 knitr_1.46 desc_1.4.3
#> [19] bslib_0.7.0 rlang_1.1.3 cachem_1.0.8
#> [22] RANN_2.6.1 xfun_0.43 fs_1.6.3
#> [25] sass_0.4.9 memoise_2.0.1 cli_3.6.2
#> [28] pkgdown_2.0.7 magrittr_2.0.3 digest_0.6.35
#> [31] grid_4.3.3 lifecycle_1.0.4 vctrs_0.6.5
#> [34] evaluate_0.23 glue_1.7.0 data.table_1.15.4
#> [37] codetools_0.2-19 ragg_1.3.0 rmarkdown_2.26
#> [40] purrr_1.0.2 tools_4.3.3 pkgconfig_2.0.3
#> [43] htmltools_0.5.8.1