Have you been following the vignette on how to create supercells and wondered whether SuperCellCyto can replace stratified sampling to avoid overcrowding a UMAP or tSNE plot?

The short answer is yes. We call this stratified summarising. All we need to do is set the sample column to the column we want to stratify the data by, rather than the biological sample each cell came from.

For example, when drawing a UMAP or tSNE plot, we commonly subsample each cluster or cell type to avoid crowding the plot. Instead of subsampling, we can generate supercells for each cluster or cell type by passing the column that denotes the cluster or cell type each cell belongs to as the sample_colname parameter.
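For comparison, the conventional stratified sampling that supercells would replace can be sketched in base R as below. This is only an illustration: the `toy` data frame, its `cluster` and `marker` columns, and the `keep_fraction` value are hypothetical and are not part of SuperCellCyto or of the data used in this vignette.

```r
set.seed(1)

# Hypothetical toy data: three clusters of very different sizes.
toy <- data.frame(
  cluster = rep(c("a", "b", "c"), times = c(1000, 200, 60)),
  marker = rnorm(1260)
)

# Fraction of cells to keep per cluster (assumed value, for illustration).
keep_fraction <- 0.05

# Split the row indices by cluster, then draw the same fraction from each.
rows_by_cluster <- split(seq_len(nrow(toy)), toy$cluster)
idx <- unlist(lapply(rows_by_cluster, function(rows) {
  n_keep <- max(1, round(length(rows) * keep_fraction))
  # Index via sample.int so a one-row cluster is handled correctly.
  rows[sample.int(length(rows), n_keep)]
}))

subsampled <- toy[idx, ]
table(subsampled$cluster)
```

Sampling keeps a subset of the cells and discards the rest, whereas stratified summarising replaces each group of similar cells with one supercell, so every cell contributes to what ends up on the plot.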

Let’s illustrate this using toy data clustered with k-means.

library(SuperCellCyto)

set.seed(42)

# Simulate some data
dat <- simCytoData()
markers_col <- paste0("Marker_", seq_len(10))
cell_id_col <- "Cell_Id"

# Run kmeans
clust <- kmeans(
  x = dat[, markers_col, with = FALSE],
  centers = 5
)

clust_col <- "kmeans_clusters"
dat[[clust_col]] <- paste0("cluster_", clust$cluster)

To perform stratified summarising, we supply the cluster column (kmeans_clusters in the example above) as runSuperCellCyto’s sample_colname parameter.

supercells <- runSuperCellCyto(
  dt = dat,
  markers = markers_col,
  sample_colname = clust_col,
  cell_id_colname = cell_id_col
)

Now, if we look at the supercell_expression_matrix, each row (each supercell) is labelled with the cluster it belongs to, not the biological sample it came from:

# Inspect the first 3 and last 3 rows of the expression matrix, keeping a few columns.
rbind(
  head(supercells$supercell_expression_matrix, n = 3),
  tail(supercells$supercell_expression_matrix, n = 3)
)[, c("kmeans_clusters", "SuperCellId", "Marker_10")]
#>    kmeans_clusters                    SuperCellId Marker_10
#>             <char>                         <char>     <num>
#> 1:       cluster_4   SuperCell_1_Sample_cluster_4  14.64662
#> 2:       cluster_4   SuperCell_2_Sample_cluster_4  14.66858
#> 3:       cluster_4   SuperCell_3_Sample_cluster_4  14.41837
#> 4:       cluster_5 SuperCell_498_Sample_cluster_5  16.99003
#> 5:       cluster_5 SuperCell_499_Sample_cluster_5  17.09864
#> 6:       cluster_5 SuperCell_500_Sample_cluster_5  15.85447

If we compare the number of supercells created with the number of cells in each cluster, we find that each cluster yields approximately n_cells_in_the_cluster/20 supercells, where 20 is the value of the gam parameter used by runSuperCellCyto (the default).

# Compute how many cells per cluster, and divide by 20, the gamma value.
table(dat$kmeans_clusters) / 20
#> 
#> cluster_1 cluster_2 cluster_3 cluster_4 cluster_5 
#>    120.25    130.30    119.75    129.70    500.00
table(supercells$supercell_expression_matrix$kmeans_clusters)
#> 
#> cluster_1 cluster_2 cluster_3 cluster_4 cluster_5 
#>       120       130       120       130       500
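The relationship between the two tables above can be checked with a little arithmetic. The sketch below hard-codes the cell counts implied by the first table (each value multiplied by 20) and assumes that the supercell count is the cell count divided by gam, rounded to the nearest integer. That rounding rule is an assumption consistent with the output shown here, not something this vignette states.

```r
# Cells per cluster, back-calculated from the first table above (value * 20).
cells_per_cluster <- c(
  cluster_1 = 2405, cluster_2 = 2606, cluster_3 = 2395,
  cluster_4 = 2594, cluster_5 = 10000
)

# The gam value used in this vignette (the default).
gam <- 20

# Assumption: supercells per cluster = cells / gam, rounded to nearest integer.
expected_supercells <- round(cells_per_cluster / gam)

# Supercell counts reported in the second table above.
observed_supercells <- c(
  cluster_1 = 120, cluster_2 = 130, cluster_3 = 120,
  cluster_4 = 130, cluster_5 = 500
)

# Compare the two, and report the overall reduction in points to plot.
all(expected_supercells == observed_supercells)
sum(cells_per_cluster)
sum(observed_supercells)
```

Under that assumption the 20,000 cells are reduced to 1,000 supercells, and each cluster keeps its share of the total, much as stratified sampling at a fixed rate would.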

Session information

sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] SuperCellCyto_0.1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Matrix_1.6-5        jsonlite_1.8.8      compiler_4.3.3     
#>  [4] Rcpp_1.0.12         parallel_4.3.3      SuperCell_1.0      
#>  [7] jquerylib_0.1.4     systemfonts_1.0.6   textshaping_0.3.7  
#> [10] BiocParallel_1.36.0 yaml_2.3.8          fastmap_1.1.1      
#> [13] lattice_0.22-5      R6_2.5.1            plyr_1.8.9         
#> [16] igraph_2.0.3        knitr_1.46          desc_1.4.3         
#> [19] bslib_0.7.0         rlang_1.1.3         cachem_1.0.8       
#> [22] RANN_2.6.1          xfun_0.43           fs_1.6.3           
#> [25] sass_0.4.9          memoise_2.0.1       cli_3.6.2          
#> [28] pkgdown_2.0.7       magrittr_2.0.3      digest_0.6.35      
#> [31] grid_4.3.3          lifecycle_1.0.4     vctrs_0.6.5        
#> [34] evaluate_0.23       glue_1.7.0          data.table_1.15.4  
#> [37] codetools_0.2-19    ragg_1.3.0          rmarkdown_2.26     
#> [40] purrr_1.0.2         tools_4.3.3         pkgconfig_2.0.3    
#> [43] htmltools_0.5.8.1