How to create supercells

Introduction

This vignette describes how to reduce the size of very large, high-dimensional cytometry data using SuperCellCyto, an R package built on the SuperCell R package from the David Gfeller lab at the University of Lausanne.

Please note that we are still actively updating this vignette (and the package itself), and we welcome any feedback on how to improve them. There are myriad ways to use SuperCell. While we try to cover as many use cases as possible, we are bound to miss some. If that happens, please reach out by creating an issue on the GitHub repository.

Installation

To install SuperCellCyto, we need the devtools package from CRAN. You can install devtools using the install.packages("devtools") command.

Thereafter, you can install SuperCellCyto using devtools::install_github("phipsonlab/SuperCellCyto").

SuperCellCyto requires the SuperCell R package to run properly. If you use the devtools::install_github command above to install SuperCellCyto, SuperCell should be installed automatically. If it is not, you can install it manually using devtools::install_github("GfellerLab/SuperCell").
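The installation steps above can be collected into a small helper that only installs what is missing. This is a minimal sketch: missing_pkgs and install_supercellcyto are hypothetical helper names introduced here for illustration, and the install function is defined but not called, so nothing is downloaded until you run it yourself.

```r
# Return the subset of `pkgs` that is not installed yet.
# (Hypothetical helper, not part of SuperCellCyto.)
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install devtools, SuperCell and SuperCellCyto only if they are missing.
# (Hypothetical helper; call install_supercellcyto() to run it.)
install_supercellcyto <- function() {
  if (length(missing_pkgs("devtools")) > 0) {
    install.packages("devtools")
  }
  if (length(missing_pkgs("SuperCell")) > 0) {
    devtools::install_github("GfellerLab/SuperCell")
  }
  if (length(missing_pkgs("SuperCellCyto")) > 0) {
    devtools::install_github("phipsonlab/SuperCellCyto")
  }
}
```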

Preparing your dataset

The function which creates supercells is called runSuperCellCyto, and it operates on a data.table object, an enhanced version of R’s native data.frame. We may add support for SummarizedExperiment or flowFrame objects in the future if there is enough demand for it.

If the raw data is stored in a csv file, we can import it into a data.table object using data.table’s fread function.
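As a minimal sketch of the csv import, the snippet below writes a tiny csv file to a temporary location and reads it back with fread. The file contents and the dat_csv variable name are made up for illustration; with real data you would pass the path to your own file.

```r
library(data.table)

# Write a tiny stand-in csv file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    Marker_1 = c(1.2, 3.4),
    Marker_2 = c(5.6, 7.8),
    Sample = c("Sample_1", "Sample_1")
  ),
  csv_path,
  row.names = FALSE
)

# Read the csv file into a data.table object.
dat_csv <- fread(csv_path)
class(dat_csv)
```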

If the raw data is stored across multiple csv files or FCS files (more common for cytometry), then we will need the help of the Spectre R package to import them as a data.table object. Specifically, we need to:

Run read.files function to read in the FCS or csv files.
Run do.merge.files to merge the resulting data.table objects into one.

If you are unsure as to how these steps will work out, have a look at an example in this Spectre vignette.

Using the vignette above, if you have csv files, you can run the steps in that vignette as they are, but only after changing the InputDirectory variable. If you have FCS files, you need to change the file.type parameter for the read.files function to .fcs.
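For csv files only, a similar merged table can be sketched with data.table alone, using fread and rbindlist. This is an alternative shown for illustration, not what Spectre does internally, and it does not handle FCS files. The directory, file names, and the choice to derive the Sample column from the file name are all assumptions.

```r
library(data.table)

# Create two small stand-in csv files in a temporary directory.
input_dir <- file.path(tempdir(), "csv_input")
dir.create(input_dir, showWarnings = FALSE)
for (s in c("Sample_1", "Sample_2")) {
  fwrite(
    data.table(Marker_1 = runif(3), Marker_2 = runif(3)),
    file.path(input_dir, paste0(s, ".csv"))
  )
}

# List the csv files and name each one after its file (minus extension).
csv_files <- list.files(input_dir, pattern = "\\.csv$", full.names = TRUE)
names(csv_files) <- tools::file_path_sans_ext(basename(csv_files))

# Read every file and stack them, recording the file name in a Sample column.
dat_merged <- rbindlist(lapply(csv_files, fread), idcol = "Sample")
dat_merged
```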

For this vignette, we will simulate some toy data using the simCytoData function.

n_markers <- 15
n_samples <- 3
dat <- SuperCellCyto::simCytoData(nmarkers = n_markers, ncells = rep(10000, n_samples))
head(dat)
#>     Marker_1 Marker_2 Marker_3  Marker_4 Marker_5 Marker_6 Marker_7 Marker_8
#> 1:  9.922964 18.77430 12.37885  8.606616 19.89643 19.67688 5.309088 5.653116
#> 2:  9.660632 21.21645 10.63614  9.084591 18.40098 18.06670 5.606506 5.504536
#> 3:  8.356894 18.48199 11.33545  8.693703 19.09837 19.96394 7.031845 7.939611
#> 4: 10.295952 19.09034 10.06999  9.420504 17.96347 17.58355 7.095184 5.274024
#> 5:  9.013735 20.48332 10.90067  8.938739 18.65707 15.36199 6.107704 6.172049
#> 6:  8.235404 17.96216 11.27431 10.265208 16.65202 19.32516 4.718395 6.223309
#>    Marker_9 Marker_10 Marker_11 Marker_12 Marker_13 Marker_14 Marker_15
#> 1: 9.192645  5.637332  10.78582  17.75642  8.392398  12.77173  5.807003
#> 2: 8.661358  5.074771  12.28062  19.96957  8.673914  13.08128  4.815719
#> 3: 7.024692  6.132604  10.90555  18.22023  7.009536  14.05435  5.230946
#> 4: 7.498126  6.252804  10.26942  16.71395  7.307800  14.29799  4.062768
#> 5: 9.647637  4.944890  11.49003  16.60555  8.057238  15.34386  6.582094
#> 6: 9.875673  5.632924  10.94819  16.50517  8.445060  15.63214  5.950597
#>      Sample Cell_Id
#> 1: Sample_1  Cell_1
#> 2: Sample_1  Cell_2
#> 3: Sample_1  Cell_3
#> 4: Sample_1  Cell_4
#> 5: Sample_1  Cell_5
#> 6: Sample_1  Cell_6

There are several things to note about our dataset. Let’s go through them one by one in each sub-section below.

The markers

The runSuperCellCyto function does not perform any data transformation or scaling. Thus, we must ensure that our dataset has already been appropriately transformed using either the arc-sinh transformation or linear binning (using FlowJo). This tutorial explains the data transformation process in great detail: https://wiki.centenary.org.au/display/SPECTRE/Data+transformation. Please have a read if you are unsure how to transform your data.

For our toy dataset, we will transform our data using the arc-sinh transformation implementation provided by the base R asinh function:

# Specify which columns are the markers to transform
marker_cols <- paste0("Marker_", seq_len(n_markers))
# The co-factor for arc-sinh
cofactor <- 5

# Do the transformation
dat_asinh <- asinh(dat[, marker_cols, with = FALSE] / cofactor)

# Rename the new columns
marker_cols_asinh <- paste0(marker_cols, "_asinh")
names(dat_asinh) <- marker_cols_asinh

# Add them to our previously loaded data
dat <- cbind(dat, dat_asinh)

head(dat[, marker_cols_asinh, with = FALSE])
#>    Marker_1_asinh Marker_2_asinh Marker_3_asinh Marker_4_asinh Marker_5_asinh
#> 1:       1.436724       2.033476       1.638194       1.311582       2.089677
#> 2:       1.412863       2.152090       1.499128       1.358628       2.014081
#> 3:       1.286217       2.018321       1.557077       1.320298       2.050023
#> 4:       1.469797       2.049616       1.449878       1.390569       1.990879
#> 5:       1.351774       2.117894       1.521409       1.344475       2.027425
#> 6:       1.273675       1.990808       1.552131       1.467108       1.918055
#>    Marker_6_asinh Marker_7_asinh Marker_8_asinh Marker_9_asinh Marker_10_asinh
#> 1:       2.078919      0.9244168      0.9707903       1.369000       0.9686972
#> 2:       1.996400      0.9646000      0.9509570       1.317068       0.8919084
#> 3:       2.092962      1.1416784      1.2425647       1.140849       1.0328413
#> 4:       1.970301      1.1489973      0.9196004       1.194555       1.0479433
#> 5:       1.841095      1.0296905      1.0378167       1.411668       0.8735583
#> 6:       2.061447      0.8409828      1.0442539       1.432460       0.9681120
#>    Marker_11_asinh Marker_12_asinh Marker_13_asinh Marker_14_asinh
#> 1:        1.511791        1.979715        1.289858        1.667228
#> 2:        1.630811        2.093235        1.318323        1.689563
#> 3:        1.521816        2.004558        1.139090        1.756879
#> 4:        1.467477        1.921610        1.173250        1.773087
#> 5:        1.569483        1.915378        1.255035        1.839972
#> 6:        1.525364        1.909574        1.295236        1.857686
#>    Marker_15_asinh
#> 1:       0.9910259
#> 2:       0.8550708
#> 3:       0.9136600
#> 4:       0.7424410
#> 5:       1.0884223
#> 6:       1.0096323

Breaking down the steps, we:

Identify the columns denoting the markers.
Set the co-factor to 5.
Do the transformation and store it in dat_asinh variable.
Set the dat_asinh column names to reflect that the values in each column (marker) have undergone an arc-sinh transformation.
Combine dat and dat_asinh using cbind.
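To make the transformation itself concrete, the sketch below applies it to a single value (the first Marker_1 value printed earlier) and checks it against the closed form asinh(z) = log(z + sqrt(z^2 + 1)). It also shows that the transformation can be undone with sinh. The variable names are chosen here for illustration.

```r
x <- 9.922964   # first Marker_1 value from the toy data shown earlier
cofactor <- 5

# Scale by the co-factor, then apply arc-sinh.
z <- x / cofactor
transformed <- asinh(z)

# The same value computed from the closed-form definition of arc-sinh.
manual <- log(z + sqrt(z^2 + 1))
all.equal(transformed, manual)

# The transformation is invertible: sinh followed by re-scaling recovers x.
all.equal(cofactor * sinh(transformed), x)
```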

Cell id column

To create supercells, we must provide a column which uniquely identifies each cell, akin to the Cell_Id column in the toy data we generated above:

head(dat$Cell_Id, n = 10)
#>  [1] "Cell_1"  "Cell_2"  "Cell_3"  "Cell_4"  "Cell_5"  "Cell_6"  "Cell_7" 
#>  [8] "Cell_8"  "Cell_9"  "Cell_10"

The purpose of cell id is to allow SuperCell to uniquely identify each cell in the dataset. This ID will come in super handy later when/if we need to work out which cells belong to which supercells.

Generally, we will need to create this ID ourselves. Most datasets won’t come with this ID already embedded. A simple cell id can be made up by concatenating the word Cell with the row number. Something like the following:

dat$Cell_id_dummy <- paste0("Cell_", seq_len(nrow(dat)))
head(dat$Cell_id_dummy, n = 10)
#>  [1] "Cell_1"  "Cell_2"  "Cell_3"  "Cell_4"  "Cell_5"  "Cell_6"  "Cell_7" 
#>  [8] "Cell_8"  "Cell_9"  "Cell_10"

Here, we store the cell id in a column called Cell_id_dummy. It has values such as Cell_1, Cell_2, all the way until Cell_x where x is the number of cells in the dataset.

Note that we can name the cell id column however we like, e.g., id or cell_identity. Importantly, we need to make sure we note the column name, as we will need to pass it to the runSuperCellCyto function later.
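Since the ID must be unique, it can be worth checking this before creating supercells. The sketch below builds the ID on a tiny stand-in table (toy is a made-up data.frame, not the vignette's dataset) and uses base R's anyDuplicated to confirm there are no repeated IDs.

```r
# A tiny stand-in dataset with three cells.
toy <- data.frame(Marker_1 = c(0.1, 0.5, 0.9))

# Build the cell id from the row number.
toy$Cell_id_dummy <- paste0("Cell_", seq_len(nrow(toy)))

# anyDuplicated returns 0 when every value is unique.
ids_are_unique <- anyDuplicated(toy$Cell_id_dummy) == 0
ids_are_unique
```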

Sample column

You will notice that in the toy data above, we have a column called Sample. By default, this column refers to the biological sample the cells come from. In the toy data above, we have 3 samples, Sample_1, Sample_2, Sample_3:

unique(dat$Sample)
#> [1] "Sample_1" "Sample_2" "Sample_3"

and we have 10,000 cells per sample:

table(dat$Sample)
#> 
#> Sample_1 Sample_2 Sample_3 
#>    10000    10000    10000

To create supercells, it is necessary to have this Sample column in our dataset. We can name the column however we like, e.g., Samp or Cell_Samp. However, we must make sure we note the column name, as we will need to pass it to the runSuperCellCyto function. More on this in the Creating supercells section.

But what if we only have 1 biological sample in our dataset? It does not matter. We still need to have this column in our dataset, and pass the column name to the runSuperCellCyto function. The only difference is that this column will only have 1 unique value.

Why do we need to do this? To ensure that each supercell only contains cells from exactly 1 sample. This is because, in general, it does not make sense to mix cells from different biological samples in one supercell. Additionally (not as important), the runSuperCellCyto function can process all the samples in parallel if you set its BPPARAM parameter to a BiocParallelParam class that leverages parallel processing. More on this in the Running runSuperCellCyto in parallel section below.

However, if you want each supercell to contain cells from different biological samples, then you need to create a new Sample column containing exactly 1 unique value, and pass the column name to runSuperCellCyto function.
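A minimal sketch of adding such a single-valued column is shown below. The toy table and the column name Sample_pooled are hypothetical; any column name works as long as the same name is later passed as sample_colname.

```r
# A tiny stand-in dataset with cells from two biological samples.
toy <- data.frame(
  Marker_1 = c(0.1, 0.5, 0.9, 1.3),
  Sample = c("Sample_1", "Sample_1", "Sample_2", "Sample_2")
)

# Add a column holding the same value for every cell, so all cells
# are treated as belonging to one "sample".
toy$Sample_pooled <- "pooled"

unique(toy$Sample_pooled)
```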

You may wonder whether it is possible to use SuperCellCyto to reduce the number of cells in each cluster (or cell type) so that a UMAP/tSNE plot is less crowded. Commonly in cytometry, we use stratified sampling to subsample our clusters before drawing a UMAP/tSNE plot to avoid overcrowding it.

The short answer is, yes you can. See the Using runSuperCellCyto for stratified summarising section for more information.

Creating supercells

Now that we have imported our data, let’s create some supercells.

First, let’s store the markers, sample, and cell id column in variables:

markers_col <- paste0("Marker_", seq_len(n_markers), "_asinh")
sample_col <- "Sample"
cell_id_col <- "Cell_id_dummy"

Then pass all of that, together with the dataset, into the runSuperCellCyto function to create supercells:

supercells <- runSuperCellCyto(
  dt = dat,
  markers = markers_col,
  sample_colname = sample_col,
  cell_id_colname = cell_id_col
)
#> Warning in SCimplify(X = mt, genes.use = rownames(mt), do.scale = FALSE, : colnames(X) is Null, 
#> Gene expression matrix X is expected to have cellIDs as colnames! 
#> CellIDs will be created automatically in a form 'cell_i'

#> Warning in SCimplify(X = mt, genes.use = rownames(mt), do.scale = FALSE, : colnames(X) is Null, 
#> Gene expression matrix X is expected to have cellIDs as colnames! 
#> CellIDs will be created automatically in a form 'cell_i'

#> Warning in SCimplify(X = mt, genes.use = rownames(mt), do.scale = FALSE, : colnames(X) is Null, 
#> Gene expression matrix X is expected to have cellIDs as colnames! 
#> CellIDs will be created automatically in a form 'cell_i'

Now let’s dig deeper into the object it created:

class(supercells)
#> [1] "list"

It is a list containing 3 elements:

names(supercells)
#> [1] "supercell_expression_matrix" "supercell_cell_map"         
#> [3] "supercell_object"

Supercell object

The supercell_object contains the metadata used to create the supercells. It is a list, and each element contains the metadata used to create the supercells for a sample. This will come in handy if we need to debug the supercells later down the line.

Supercell expression matrix

The supercell_expression_matrix contains the marker expression of each supercell. These are calculated by taking the average of the marker expression of all the cells contained within a supercell.

head(supercells$supercell_expression_matrix)
#>    Marker_1_asinh Marker_2_asinh Marker_3_asinh Marker_4_asinh Marker_5_asinh
#> 1:       1.469085       2.072055       1.567400       1.377552       2.027787
#> 2:       1.263866       2.047135       1.581732       1.418801       2.034719
#> 3:       1.274929       2.047144       1.573172       1.421295       2.026800
#> 4:       1.202416       2.066525       1.689606       1.356282       2.027473
#> 5:       1.286750       2.048629       1.601158       1.296497       2.019125
#> 6:       1.326727       2.056175       1.582990       1.324014       2.031440
#>    Marker_6_asinh Marker_7_asinh Marker_8_asinh Marker_9_asinh Marker_10_asinh
#> 1:       2.041505      0.9584494      1.2089978       1.186689       0.8284389
#> 2:       2.034187      1.1049543      1.0451931       1.375561       1.0387258
#> 3:       2.027067      1.2336162      0.9864861       1.383786       1.0083728
#> 4:       2.030019      1.0997869      0.9987753       1.312165       1.0554807
#> 5:       2.031943      1.0839941      1.1662184       1.260295       0.9705766
#> 6:       2.018531      1.0085722      1.0633266       1.272978       1.1422201
#>    Marker_11_asinh Marker_12_asinh Marker_13_asinh Marker_14_asinh
#> 1:        1.615054        1.980432        1.241044        1.796577
#> 2:        1.558310        1.967110        1.323674        1.779298
#> 3:        1.633094        1.970961        1.418573        1.827996
#> 4:        1.543092        1.994770        1.272785        1.812418
#> 5:        1.564709        1.947356        1.335419        1.799435
#> 6:        1.597838        1.966346        1.304000        1.804822
#>    Marker_15_asinh   Sample                 SuperCellId
#> 1:       1.0170310 Sample_1 SuperCell_1_Sample_Sample_1
#> 2:       0.7927207 Sample_1 SuperCell_2_Sample_Sample_1
#> 3:       0.9798060 Sample_1 SuperCell_3_Sample_Sample_1
#> 4:       0.8796199 Sample_1 SuperCell_4_Sample_Sample_1
#> 5:       1.1062859 Sample_1 SuperCell_5_Sample_Sample_1
#> 6:       0.8164972 Sample_1 SuperCell_6_Sample_Sample_1

Therein, we will have the following columns:

names(supercells$supercell_expression_matrix)
#>  [1] "Marker_1_asinh"  "Marker_2_asinh"  "Marker_3_asinh"  "Marker_4_asinh" 
#>  [5] "Marker_5_asinh"  "Marker_6_asinh"  "Marker_7_asinh"  "Marker_8_asinh" 
#>  [9] "Marker_9_asinh"  "Marker_10_asinh" "Marker_11_asinh" "Marker_12_asinh"
#> [13] "Marker_13_asinh" "Marker_14_asinh" "Marker_15_asinh" "Sample"         
#> [17] "SuperCellId"

All the markers we previously specified in the markers_col variable.
A column (Sample in this case) denoting which sample a supercell belongs to (note the column name is the same as what is stored in the sample_col variable).
The SuperCellId column denoting the unique ID of the supercell.
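To illustrate what "taking the average" means for the marker columns, the sketch below computes per-supercell means on a hand-made table using base R's aggregate. The four cells and their supercell assignments are invented for illustration; this shows the arithmetic only, not how runSuperCellCyto is implemented.

```r
# Four made-up cells, two per supercell.
cells <- data.frame(
  SuperCellId = c("SuperCell_1", "SuperCell_1", "SuperCell_2", "SuperCell_2"),
  Marker_1_asinh = c(1.0, 1.2, 2.0, 2.4)
)

# Average the marker expression of the cells within each supercell.
supercell_means <- aggregate(Marker_1_asinh ~ SuperCellId, data = cells, FUN = mean)
supercell_means
```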

SuperCellId

Let’s have a look at SuperCellId:

head(unique(supercells$supercell_expression_matrix$SuperCellId))
#> [1] "SuperCell_1_Sample_Sample_1" "SuperCell_2_Sample_Sample_1"
#> [3] "SuperCell_3_Sample_Sample_1" "SuperCell_4_Sample_Sample_1"
#> [5] "SuperCell_5_Sample_Sample_1" "SuperCell_6_Sample_Sample_1"

Let’s break down one of them, SuperCell_1_Sample_Sample_1. SuperCell_1 is a numbering (1 to however many supercells there are in a sample) used to uniquely identify each supercell in a sample. Notably, you may encounter this (SuperCell_1, SuperCell_2) being repeated across different samples, e.g.,

supercell_ids <- unique(supercells$supercell_expression_matrix$SuperCellId)
supercell_ids[grep("SuperCell_1_", supercell_ids)]
#> [1] "SuperCell_1_Sample_Sample_1" "SuperCell_1_Sample_Sample_2"
#> [3] "SuperCell_1_Sample_Sample_3"

While the IDs of these 3 supercells are all prefixed with SuperCell_1, this does not make them equal to one another! SuperCell_1_Sample_Sample_1 will only contain cells from Sample_1 while SuperCell_1_Sample_Sample_2 will only contain cells from Sample_2.

By now, you may have noticed that we appended the sample name into each supercell id. This aids in differentiating the supercells in different samples.

Supercell cell map

supercell_cell_map maps each cell in our dataset to the supercell it belongs to.

head(supercells$supercell_cell_map)
#>                      SuperCellID   Sample
#> 1: SuperCell_285_Sample_Sample_1 Sample_1
#> 2:  SuperCell_61_Sample_Sample_1 Sample_1
#> 3: SuperCell_170_Sample_Sample_1 Sample_1
#> 4: SuperCell_217_Sample_Sample_1 Sample_1
#> 5: SuperCell_308_Sample_Sample_1 Sample_1
#> 6: SuperCell_129_Sample_Sample_1 Sample_1

This map is very useful if we later need to expand the supercells out. This is also the reason why we need a column in the dataset which uniquely identifies each cell.
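As a sketch of what expanding the supercells out can look like, the snippet below joins a supercell-level annotation back onto individual cells through the map, using base R's merge. Both tables are hand-made stand-ins that mimic the column names seen in this vignette (SuperCellID in the map, SuperCellId in the expression matrix), and the cell_type annotation is hypothetical.

```r
# Stand-in for supercell_cell_map: one row per cell.
cell_map <- data.frame(
  SuperCellID = c(
    "SuperCell_1_Sample_Sample_1",
    "SuperCell_1_Sample_Sample_1",
    "SuperCell_2_Sample_Sample_1"
  ),
  CellId = c("Cell_1", "Cell_2", "Cell_3"),
  Sample = "Sample_1"
)

# Stand-in for a supercell-level result, e.g. an annotation per supercell.
supercell_annot <- data.frame(
  SuperCellId = c("SuperCell_1_Sample_Sample_1", "SuperCell_2_Sample_Sample_1"),
  cell_type = c("T cell", "B cell")
)

# Join on the supercell ID; note the different capitalisation of the two columns.
expanded <- merge(cell_map, supercell_annot, by.x = "SuperCellID", by.y = "SuperCellId")
expanded[, c("CellId", "SuperCellID", "cell_type")]
```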

Running `runSuperCellCyto` in parallel

By default, runSuperCellCyto will process each sample one after the other. As each sample is processed independent of one another, we can process all of them in parallel.

To do this, we need to create a BiocParallelParam object that leverages parallel processing. Additionally, we will set the number of tasks to the number of samples, and set the load_balancing parameter to TRUE so that workers busy supercelling large samples are not also assigned small samples (those are instead given to workers that are supercelling smaller samples).

Notably, we should not set more workers than the total number of cores in the computer, as doing so will leave the computer unusable for anything else (and may exhaust its RAM). To find out the total number of cores, we can use detectCores from the parallel package.

library(parallel)      # provides detectCores
library(BiocParallel)  # provides MulticoreParam

n_cores <- detectCores()
supercell_par <- runSuperCellCyto(
  dt = dat,
  markers = markers_col,
  sample_colname = sample_col,
  cell_id_colname = cell_id_col,
  BPPARAM = MulticoreParam(
    workers = n_cores - 1,
    tasks = n_samples
  ),
  load_balancing = TRUE
)
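When choosing the number of workers, it can help to cap it both by the available cores and by the number of samples, since there is nothing for extra workers to do. The sketch below is one way to do this and is an assumption about sensible defaults, not a recommendation from the package. Note also that MulticoreParam relies on forking, which is not available on Windows; BiocParallel's SnowParam is the usual alternative there.

```r
n_samples <- 3

# detectCores can return NA on some platforms; fall back to 1 core.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) {
  n_cores <- 1
}

# Leave one core free, never use more workers than samples,
# and always keep at least one worker.
n_workers <- max(1, min(n_cores - 1, n_samples))
n_workers
```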

Controlling the supercells’ granularity

This is described in the runSuperCellCyto function’s documentation, but let’s briefly go through it here.

The runSuperCellCyto function is equipped with various parameters which can be customised to alter the composition of the supercells. The one most likely to be used is the gam parameter.

The gam parameter controls how many supercells to generate and, indirectly, how many cells are captured within each supercell. It is defined by the formula gam = n_cells / n_supercells, where n_cells denotes the number of cells and n_supercells denotes the number of supercells.

In general, the larger the gam parameter, the fewer supercells we will get. Say, for instance, we have 10,000 cells. If gam is set to 10, we will end up with about 1,000 supercells, whereas if gam is set to 50, we will end up with about 200 supercells.

You may have noticed, after reading the sections above, that runSuperCellCyto is run on each sample independently, and that we can only set 1 value for the gam parameter. Indeed, for now, the same gam value is used across all samples, so depending on how many cells we have in each sample, we will end up with a different number of supercells for each sample. For instance, say we have 10,000 cells for sample 1, and 100,000 cells for sample 2. If gam is set to 10, for sample 1, we will get 1,000 supercells (10,000/10) while for sample 2, we will get 10,000 supercells (100,000/10).
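The arithmetic above can be reproduced directly by rearranging the formula to n_supercells = n_cells / gam. The sketch below uses the cell counts and gam values from the examples in this section; the variable names are chosen for illustration, and the results are approximate targets rather than guaranteed counts.

```r
# One dataset of 10,000 cells, two different gam values.
n_cells_single <- 10000
n_supercells_by_gam <- n_cells_single / c(gam_10 = 10, gam_50 = 50)
n_supercells_by_gam

# Two samples of different sizes, one shared gam value.
n_cells <- c(Sample_1 = 10000, Sample_2 = 100000)
gam <- 10
n_supercells_by_sample <- n_cells / gam
n_supercells_by_sample
```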

In the future, we may add the ability to specify a different gam value for each sample. For now, if we want to do this, we will need to break down our data into multiple data.table objects, each containing data from 1 sample, and run the runSuperCellCyto function on each of them with a different gam parameter value. Something like the following:

n_markers <- 10
dat <- simCytoData(nmarkers = n_markers)
markers_col <- paste0("Marker_", seq_len(n_markers))
sample_col <- "Sample"
cell_id_col <- "Cell_Id"

samples <- unique(dat[[sample_col]])
gam_values <- c(10, 20, 10)

supercells_diff_gam <- lapply(seq_along(samples), function(i) {
  sample <- samples[i]
  gam <- gam_values[i]
  dat_samp <- dat[dat[[sample_col]] == sample, ]
  supercell_samp <- runSuperCellCyto(
    dt = dat_samp,
    markers = markers_col,
    sample_colname = sample_col,
    cell_id_colname = cell_id_col,
    gam = gam
  )
  return(supercell_samp)
})

Subsequently, to extract and combine the supercell_expression_matrix and supercell_cell_map, we will need to use rbind:

supercell_expression_matrix <- do.call(
  "rbind", lapply(supercells_diff_gam, function(x) x[["supercell_expression_matrix"]])
)

supercell_cell_map <- do.call(
  "rbind", lapply(supercells_diff_gam, function(x) x[["supercell_cell_map"]])
)

rbind(head(supercell_expression_matrix, n = 3), tail(supercell_expression_matrix, n = 3))
#>     Marker_1 Marker_2 Marker_3 Marker_4  Marker_5 Marker_6  Marker_7  Marker_8
#> 1:  4.688545 12.96876 13.56623 12.56293 17.266424 16.03250  6.786881 18.118704
#> 2:  5.494283 12.34399 13.92597 13.89454 15.735730 15.24414  8.698235 15.753595
#> 3:  6.100729 12.42840 14.81436 13.88304 16.577030 16.12862  7.456026 16.464623
#> 4: 17.734554 17.88825 16.16063 16.21216  8.169350 13.78371 13.510645  8.174193
#> 5: 17.309954 15.25007 15.96462 17.61992  6.750592 15.52790 11.322850  8.925017
#> 6: 16.193427 16.24684 17.28793 16.24450  6.350078 16.41798 11.073143  8.655675
#>    Marker_9 Marker_10   Sample                   SuperCellId
#> 1: 16.00083 14.682485 Sample_1   SuperCell_1_Sample_Sample_1
#> 2: 16.12070 14.505695 Sample_1   SuperCell_2_Sample_Sample_1
#> 3: 15.55512 15.843026 Sample_1   SuperCell_3_Sample_Sample_1
#> 4: 13.73575  8.234455 Sample_2 SuperCell_498_Sample_Sample_2
#> 5: 15.40221  8.963820 Sample_2 SuperCell_499_Sample_Sample_2
#> 6: 16.04450  8.768232 Sample_2 SuperCell_500_Sample_Sample_2

rbind(head(supercell_cell_map, n = 3), tail(supercell_cell_map, n = 3))
#>                      SuperCellID     CellId   Sample
#> 1: SuperCell_857_Sample_Sample_1     Cell_1 Sample_1
#> 2: SuperCell_249_Sample_Sample_1     Cell_2 Sample_1
#> 3: SuperCell_465_Sample_Sample_1     Cell_3 Sample_1
#> 4:   SuperCell_2_Sample_Sample_2 Cell_19998 Sample_2
#> 5:  SuperCell_30_Sample_Sample_2 Cell_19999 Sample_2
#> 6: SuperCell_257_Sample_Sample_2 Cell_20000 Sample_2

Using `runSuperCellCyto` for stratified summarising

As previously mentioned, we can use runSuperCellCyto to perform stratified summarising, i.e., to summarise (well, meaningfully sub-sample) each cluster or cell type. To do this, we need to change the sample column such that it denotes the cell type or the cluster a cell belongs to.

As an example, let’s first cluster some toy data with k-means:

set.seed(42)

# Simulate some data
dat <- simCytoData()
markers_col <- paste0("Marker_", seq_len(10))
cell_id_col <- "Cell_Id"

# Run kmeans
clust <- kmeans(
  x = dat[, markers_col, with = FALSE],
  centers = 5
)

clust_col <- "kmeans_clusters"
dat[[clust_col]] <- paste0("cluster_", clust$cluster)

To perform stratified summarising, we supply the cluster column (kmeans_clusters in the example above) as runSuperCellCyto’s sample_colname parameter.

supercells <- runSuperCellCyto(
  dt = dat,
  markers = markers_col,
  sample_colname = clust_col,
  cell_id_colname = cell_id_col
)

Now, if we look at the supercell_expression_matrix, each row (each supercell) will be denoted with the cluster it belongs to, and not the biological sample it came from:

# Inspect the top 3 and bottom 3 of the expression matrix and some columns.
rbind(
  head(supercells$supercell_expression_matrix, n = 3),
  tail(supercells$supercell_expression_matrix, n = 3)
)[, c("kmeans_clusters", "SuperCellId", "Marker_10")]
#>    kmeans_clusters                    SuperCellId Marker_10
#> 1:       cluster_4   SuperCell_1_Sample_cluster_4  14.64662
#> 2:       cluster_4   SuperCell_2_Sample_cluster_4  14.66858
#> 3:       cluster_4   SuperCell_3_Sample_cluster_4  14.41837
#> 4:       cluster_5 SuperCell_498_Sample_cluster_5  16.99003
#> 5:       cluster_5 SuperCell_499_Sample_cluster_5  17.09864
#> 6:       cluster_5 SuperCell_500_Sample_cluster_5  15.85447

If we look at the number of supercells created and check how many cells there were in each cluster, we will find that, for each cluster, we get approximately n_cells/20 where 20 is the gam parameter value we used for runSuperCellCyto (this is the default).

# Compute how many cells per cluster, and divide by 20, the gamma value.
table(dat$kmeans_clusters) / 20
#> 
#> cluster_1 cluster_2 cluster_3 cluster_4 cluster_5 
#>    120.25    130.30    119.75    129.70    500.00

table(supercells$supercell_expression_matrix$kmeans_clusters)
#> 
#> cluster_1 cluster_2 cluster_3 cluster_4 cluster_5 
#>       120       130       120       130       500

Session information

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] BiocParallel_1.36.0 SuperCellCyto_0.1.0 BiocStyle_2.30.0   
#> 
#> loaded via a namespace (and not attached):
#>  [1] Matrix_1.6-1.1      jsonlite_1.8.8      compiler_4.3.2     
#>  [4] BiocManager_1.30.22 Rcpp_1.0.12         stringr_1.5.1      
#>  [7] SuperCell_1.0       jquerylib_0.1.4     systemfonts_1.0.5  
#> [10] textshaping_0.3.7   yaml_2.3.8          fastmap_1.1.1      
#> [13] lattice_0.21-9      plyr_1.8.9          R6_2.5.1           
#> [16] igraph_1.6.0        knitr_1.45          bookdown_0.37      
#> [19] desc_1.4.3          bslib_0.6.1         rlang_1.1.3        
#> [22] cachem_1.0.8        stringi_1.8.3       RANN_2.6.1         
#> [25] xfun_0.41           fs_1.6.3            sass_0.4.8         
#> [28] memoise_2.0.1       cli_3.6.2           pkgdown_2.0.7      
#> [31] magrittr_2.0.3      digest_0.6.34       grid_4.3.2         
#> [34] lifecycle_1.0.4     vctrs_0.6.5         evaluate_0.23      
#> [37] glue_1.7.0          data.table_1.14.10  codetools_0.2-19   
#> [40] ragg_1.2.7          rmarkdown_2.25      purrr_1.0.2        
#> [43] pkgconfig_2.0.3     tools_4.3.2         htmltools_0.5.7

Givanna Putri