How to Prepare Data for SuperCellCyto

Performing Quality Control

Prior to creating supercells, it’s crucial to ensure that your dataset has undergone thorough quality control (QC). We want to retain only single, live cells and remove any debris, doublets, or dead cells. Additionally, it is also important to perform compensation to correct for fluorescence spillover (for Flow data) or to adjust for signal overlap or spillover between different metal isotopse (for Cytof data). A well-prepared dataset is key to obtaining reliable supercells from SuperCellCyto.

Several R packages are available for performing QC on cytometry data. Notable among these are PeacoQC, CATALYST, and CytoExploreR. These packages are well maintained and are continuously updated. To make sure that the information we provide do not quickly go out of date, we highly recommend you to consult the packages’ respective vignettes for detailed guidance on how to use them to QC your data.

In our manuscript, we used CytoExploreR to QC the Oetjen_bcell flow cytometry data and CATALYST to QC the Trussart_cytofruv Cytof data.

The specific scripts used can be found in Github:

b_cell_identification/gate_flow_data.R for Oetjen_bcell data.
batch_correction/prepare_data.R for Trussart_cytofruv data. These scripts were adapted from those used in the CytofRUV manuscript.

For Oetjen_bcell data, we used the following gating strategy post compensation:

FSC-H and FSC-A to isolate only the single events. (Also check SSC-H vs SSC-A).
FSC-A and SSC-A to remove debris.
Live/Dead and SSC-A to isolate live cells.

The following is the resulting single live cells manually gated for the Oetjen_bcell data.

knitr::include_graphics(
    "figures/oetjen_bcell_single_live_cells.png", 
    error = FALSE
)

After completing the QC process, you will have clean data in either CSV or FCS file formats. The next section will guide you on how to load these files and proceed with preparing your data for SuperCellCyto.

Preparing FCS/CSV files for SuperCellCyto

To use SuperCellCyto, your input data must be formatted as a data.table object. Briefly, data.table is an enhanced version of R native data.frame object. It is a package that offers fast processing of large data.frame.

Cell ID column

Additionally, each cell in your data.table must also have a unique identifier. The purpose of this ID is to allow SuperCell to uniquely identify each cell in the dataset. It will come in super handy later when/if we need to work out which cells belong to which supercells, i.e., when we need to expand the supercells out. Generally, we will need to create this ID ourselves. Most dataset won’t come with this ID already embedded in.

For this tutorial, we will call the column that denotes the cell ID cell_id. For your own dataset, you can name this column however you like, e.g., id, cell_identity, etc. Just make sure you note the column name as we will need it later to create supercells.

Sample column

Lastly, each cell in the data.table object must also be associated with a sample. This information must be stored in a column that we later on pass to the function that creates supercells. Generally, sample here typically refers to the biological sample the cell came from.

To create supercells, it is necessary to have this column in our dataset. This is to ensure that each supercell will only have cells from exactly one sample. In most cases, it does not make sense to mix cells from different biological samples in one supercell. Additionally (not as important), SuperCellCyto can process multiple samples in parallel, and for it to do that, it needs to know the sample information.

But what if we only have 1 biological sample in our dataset? It does not matter. We still need to have the sample column in our dataset. The only difference is that this column will only have 1 unique value.

You can name the column however we like, e.g., Samp, Cell_Samp, etc. For this tutorial, we will call the column sample. Just make sure you note the column name as we will need it later to create supercells.

Preparing CSV files

Loading CSV files into a data.table object is straightforward. We can use the fread function from the data.table package.

Here’s how to install it:

install.packages("data.table")

For this example, let’s load two CSV files containing subsampled data from the Levine_32dim dataset we used in SuperCellCyto manuscript. Each file represents a sample (H1 and H2), with the sample name appended to the file name:

library(data.table)

csv_files <- c("data/Levine_32dim_H1_sub.csv", "data/Levine_32dim_H2_sub.csv")
samples <- c("H1", "H2")

dat <- lapply(seq_len(length(samples)), function(i) {
    csv_file <- csv_files[i]
    sample <- samples[i]

    dat_a_sample <- fread(csv_file)
    dat_a_sample$sample <- sample

    return(dat_a_sample)
})
dat <- rbindlist(dat)

dat[, cell_id := paste0("Cell_", seq_len(nrow(dat)))]

head(dat)
#>      Time Cell_length      DNA1     DNA2   CD45RA       CD133       CD19
#>     <num>       <int>     <num>    <num>    <num>       <num>      <num>
#> 1: 307428          27 169.91125 262.3192 2.338830 -0.15333985 -0.2056334
#> 2:  80712          13  50.91230 181.1320 2.129232  3.35638666 -0.1013980
#> 3: 111390          16 126.93545 269.4199 1.613719 -0.07193317  0.1116483
#> 4:  14088          31 142.67317 283.4645 4.100985  0.09366111 19.4974289
#> 5: 284190          53  98.28069 187.2090 4.289627  0.56254190 12.2265682
#> 6: 481997          35 112.29634 162.4416 6.089430  0.01665318 -0.1943735
#>           CD22       CD11b        CD4          CD8        CD34        Flt3
#>          <num>       <num>      <num>        <num>       <num>       <num>
#> 1: -0.19720075 32.13040161 0.78105438 -0.071934469  1.53498471  0.84833205
#> 2:  3.05647945 14.23928833 0.53373063 -0.007943562 -0.09401329 -0.13234507
#> 3: -0.07421356  2.20701027 9.75063324 -0.034266483  0.53720337 -0.05749827
#> 4:  5.07963181 -0.07880223 0.05995781  0.009049721 -0.19206744  3.36803102
#> 5: 10.81768703  1.82670891 1.30010796 -0.187664956  2.05419374  2.72891521
#> 6:  1.43817198  5.79350042 0.64789140 54.004249573  0.28843120  1.01514077
#>           CD20    CXCR4     CD235ab     CD45       CD123     CD321        CD14
#>          <num>    <num>       <num>    <num>       <num>     <num>       <num>
#> 1: -0.08676258 3.488938  0.82301176 313.8038  0.30909532 46.484669  0.05072345
#> 2: -0.04217101 1.364644 -0.13094166 207.2459  1.76594567 22.532978 -0.19256826
#> 3:  0.09777651 3.880993  2.00220966 750.4200 -0.06809702  9.515447 -0.05956535
#> 4:  0.64118648 2.911314 -0.08744399 169.4798  1.25776207  9.218699  1.09861076
#> 5: 15.34162998 9.303430  6.34135485 751.0563  0.05031190 10.463912  1.11993504
#> 6:  3.84020925 3.520693  2.93023992 868.4937 -0.04488884 19.107010  0.62120903
#>           CD33     CD47       CD11c          CD7      CD15        CD16
#>          <num>    <num>       <num>        <num>     <num>       <num>
#> 1:  2.09802437 20.96871 20.76318550 -0.007966662 0.7279212 -0.03067662
#> 2:  7.35230541 27.49848 15.13398170 -0.087256350 0.7187206  0.41139653
#> 3: -0.16046160 53.51268 -0.19080050  1.044164538 2.1075230 -0.14510959
#> 4:  0.18614264 55.07846 -0.07061907  0.948859751 1.2470639  1.12294865
#> 5:  0.15872155 40.63973  4.64010382 -0.195279136 4.5712810 -0.10192144
#> 6: -0.09832545 29.65497  6.15759659 12.104630470 0.5801706 -0.11606000
#>         CD44        CD38      CD13         CD3       CD61       CD117     CD49d
#>        <num>       <num>     <num>       <num>      <num>       <num>     <num>
#> 1:  95.71002   5.1124768 5.1056433   0.5827813 -0.1684093 -0.02967962  6.557199
#> 2: 185.51929   7.4784145 0.3580886   1.8861074  1.9233229 -0.14122920  1.088500
#> 3:  33.95839   0.6161237 0.3045178 462.1258240  0.7625037 -0.03500306  5.997476
#> 4:  32.46420 249.4612885 1.2526705   0.7302832  3.2274778 -0.18526185  8.533935
#> 5:  98.09428  43.5352974 2.8327518   0.1868679  2.1032026  0.01776284 12.400333
#> 6:  65.91293   2.0126576 1.2817017 390.3737793  2.4605207  0.33154550  5.214703
#>        HLA-DR       CD64         CD41  Viability file_number event_number
#>         <num>      <num>        <num>      <num>       <int>        <int>
#> 1: 112.467545  6.9157209  0.083808646  1.7268630          94       257088
#> 2:  12.206795 30.7242870  7.753727913  3.7120194          94        80655
#> 3:  -0.046793 -0.1739236 -0.080375805  0.7011412          94       116394
#> 4:   8.965122  0.3391838 -0.005531122  0.2978864          94         5618
#> 5: 174.952667  0.4361930  1.834125400 13.2743187          94       241699
#> 6:   0.648035 -0.1803290  0.389085352  0.4543665          94       363564
#>    sample cell_id
#>    <char>  <char>
#> 1:     H1  Cell_1
#> 2:     H1  Cell_2
#> 3:     H1  Cell_3
#> 4:     H1  Cell_4
#> 5:     H1  Cell_5
#> 6:     H1  Cell_6

Let’s break down what we have done.

We specify the location of the csv files in csv_files vector and their corresponding sample names in samples vector. data/Levine_32dim_H1_sub.csv belongs to sample H1 while data/Levine_32dim_H2_sub.csv belongs to sample H2.

We use lapply to simultaneously iterate over each element in the csv_files and samples vector. For each csv file and the corresponding sample, we read the csv file into the variable dat_a_sample using fread function. We then assign the sample id in a new column called sample. As a result, we get a list dat containing 2 data.table objects, 1 object per csv file.

We use rbindlist function from the data.table package to merge list into one data.table object.

We create a new column cell_id which gives each cell a unique id such as Cell_1, Cell_2, etc.

Preparing FCS files

FCS files, commonly used in cytometry, require specific handling. You can read in FCS files using the flowCore package available from Bioconductor and convert it to a data.table object.

Let’s load two small FCS files for the Anti-PD1 data from FlowRepository.

library(flowCore)
library(data.table)

fs <- read.flowSet(
    path = "data",
    pattern = "\\.fcs$"
)

dat_list <- lapply(seq_along(fs), function(i) {
    df <- as.data.table(exprs(fs[[i]]))

    # concatenate channel and marker name as column names
    names(df) <- markernames(fs[[i]])

    # add a column showing the filename
    df$file_name <- sampleNames(fs)[i]

    return(df)
})

# collate all the files into one
dat <- rbindlist(dat_list)

dat
#>      209Bi_CD11b 162Dy_CD11c  163Dy_CD7 166Er_CD209  167Er_CD38 151Eu_CD123
#>            <num>       <num>      <num>       <num>       <num>       <num>
#>   1:    2.596444    0.000000  24.233248   2.4271879   0.2497845  0.01359761
#>   2:   20.172476   29.508638 253.930161   2.0283859  63.4845619  0.00000000
#>   3:  224.771057  377.262421  20.755060   5.5264821 111.7728882 10.33144855
#>   4:  711.252808  192.961227  16.039482   2.7453365  13.4806919  1.71946943
#>   5:  435.131073  203.836838   5.983910   0.0000000   4.2271118  1.08193946
#>  ---                                                                       
#> 569:  337.737152  160.982407   1.724550   0.8164008  10.8179998  1.87148142
#> 570:   48.965187    3.454711  55.719162   0.7539494  20.6741104  0.00000000
#> 571:  326.666168   58.792435   4.358847   0.0000000   8.6697922  0.00000000
#> 572:    7.255682    0.000000   3.552515   0.0000000   2.9367595  0.43651727
#> 573:  349.350769   31.072611   1.050743   0.0000000   5.4545450  5.87748337
#>      153Eu_CD62L 152Gd_CD66b 154Gd_ICAM-1 155Gd_CD1c  156Gd_CD86 160Gd_CD14
#>            <num>       <num>        <num>      <num>       <num>      <num>
#>   1:  276.736237   0.0000000    6.3661394 0.72167069   0.0000000   0.000000
#>   2:    7.522890   0.0000000   16.6949806 0.00000000   0.9723848   2.370285
#>   3:    2.523962   0.0000000  480.9555969 2.87207913 109.6869812  41.803619
#>   4:    1.439798   0.0000000  125.3002014 0.00000000   5.1308088  19.226034
#>   5:    3.030660   0.0000000   54.1368217 0.03324713  23.3382854  14.204736
#>  ---                                                                       
#> 569:    0.000000   0.0000000  158.1088257 2.28826094  15.2002096   5.657256
#> 570:    3.414483   0.0000000    0.9569107 2.11766744   0.7849152   4.082072
#> 571:    0.000000   0.0000000    1.1182016 0.00000000   0.0000000   0.000000
#> 572:    2.556602   0.6154923    0.0000000 0.00000000   0.0000000   0.000000
#> 573:    3.925082   0.0000000   43.5083618 0.00000000   4.2216253   8.216695
#>       165Ho_CD16 191Ir_DNA1 193Ir_DNA2 175Lu_PD-L1 142Nd_CD19 146Nd_CD64
#>            <num>      <num>      <num>       <num>      <num>      <num>
#>   1:   0.0000000   40.33384   76.70573   0.0000000 0.00000000   5.452389
#>   2: 126.4256592   55.11267  150.46278   3.4527724 0.00000000   1.922374
#>   3:   2.0556698   91.42932  161.59381   8.3969393 0.00000000  12.956987
#>   4:   0.0000000   75.32122  160.41399   0.1078656 0.00000000  12.841355
#>   5:   0.5523136   79.88800  142.12471   0.6882498 0.00000000  19.308508
#>  ---                                                                    
#> 569:   0.0000000   65.31769  148.84435   0.1064020 0.00000000  22.941177
#> 570:  51.3577194   65.16866  129.46790   3.5496852 0.36609724   2.852356
#> 571:   0.9855582   92.97505  151.65977   0.0000000 0.03485066   6.290884
#> 572:   0.0000000   65.07745  111.99138   0.5479890 0.63119560   0.000000
#> 573:   0.0000000  114.67633  191.50166   0.0000000 0.00000000  12.638340
#>          195Pt     196Pt 198Pt_Dead 147Sm_CD303 148Sm_CD34 149Sm_CD141
#>          <num>     <num>      <num>       <num>      <num>       <num>
#>   1: 0.0000000 0.0000000  15.199163   0.0000000  1.2188276    2.482402
#>   2: 2.6318610 0.0000000  32.325874   0.0000000  0.2467884    0.000000
#>   3: 0.2854315 0.3602256  10.119688   0.7021694  7.2453775   17.055792
#>   4: 1.7483448 1.6977243   2.457203   0.0000000  1.1050161    6.811901
#>   5: 0.0000000 2.7839208  13.680239   0.8908378  0.0000000   18.400421
#>  ---                                                                  
#> 569: 0.0000000 0.0000000  16.751446   2.6801469  0.6307272    2.710867
#> 570: 0.3765990 0.0000000  20.695446   0.0000000  3.0734873    1.968072
#> 571: 0.1897834 0.0000000  16.076082   0.9444249  0.4878764    6.945232
#> 572: 0.0000000 0.0000000  23.279974   0.0000000  0.0000000    0.000000
#> 573: 0.0000000 4.0132923  12.463143   0.8065992  1.6200637    4.190233
#>      150Sm_CD61  169Tm_CD33  89Y_CD45  170Yb_CD3  173Yb_CD56 174Yb_HLA-DR
#>           <num>       <num>     <num>      <num>       <num>        <num>
#>   1:  65.015724   1.1326860  37.52760  9.0316811  0.00000000     6.807954
#>   2:  25.455738   0.1979908 225.64279  0.9135754  8.38446426     5.601559
#>   3:  75.056625 134.7255249 458.79022 12.2988071 10.31508160   403.328308
#>   4:  21.287003  58.7627487 387.78836  1.1870043  0.00000000    88.822212
#>   5:  26.795689  35.0521393 107.94828  0.3782265  5.60699320   104.159081
#>  ---                                                                     
#> 569:  36.163387  22.8519592 227.01973  2.7277973  0.02519703   188.378601
#> 570:  37.537052   3.3848262  48.12588  4.6816015  2.54012346    10.622149
#> 571:  16.739002  10.3474064  60.64515  0.6677586  0.00000000    15.465885
#> 572:  15.469506   0.0000000  11.47586  8.9600000  0.00000000     0.000000
#> 573:   8.587144  18.8689251  52.50787  0.6892268  1.88265443    13.542062
#>                                file_name
#>                                   <char>
#>   1: Data23_Panel3_base_NR4_Patient9.fcs
#>   2: Data23_Panel3_base_NR4_Patient9.fcs
#>   3: Data23_Panel3_base_NR4_Patient9.fcs
#>   4: Data23_Panel3_base_NR4_Patient9.fcs
#>   5: Data23_Panel3_base_NR4_Patient9.fcs
#>  ---                                    
#> 569: Data23_Panel3_base_R5_Patient15.fcs
#> 570: Data23_Panel3_base_R5_Patient15.fcs
#> 571: Data23_Panel3_base_R5_Patient15.fcs
#> 572: Data23_Panel3_base_R5_Patient15.fcs
#> 573: Data23_Panel3_base_R5_Patient15.fcs

The code above used flowCore’s read.flowSet function to first read FCS files into a flowSet object.

lapply and rbindlist is then used to convert it to one data.table object containing data from all FCS files.

The FCS files belong to two different patients, patient 9 and 15. We shall use that as the sample ID. To make sure that we correctly map the filenames to the patients, we will first create a new data.table object containing the mapping of FileName and the sample name, and then using merge.data.table to add them into our data.table object.

We will also to create a new column cell_id which gives each cell a unique id such as Cell_1, Cell_2, etc.

sample_info <- data.table(
    sample = c("patient9", "patient15"),
    file_name = c(
        "Data23_Panel3_base_NR4_Patient9.fcs",
        "Data23_Panel3_base_R5_Patient15.fcs"
    )
)
dat <- merge.data.table(
    x = dat,
    y = sample_info,
    by = "file_name"
)

dat[, cell_id := paste0("Cell_", seq_len(nrow(dat)))]

With CSV and FCS files loaded as data.table objects, the next step is to transform the data appropriately for SuperCellCyto.

Data Transformation

Before using SuperCellCyto, it’s essential to apply appropriate data transformations.

A common method for data transformation in cytometry is the arcsinh transformation, an inverse hyperbolic arcsinh transformation. The transformation requires specifying a cofactor, which affects the representation of the low-end data. Typically, a cofactor of 5 is used for Cytof data and 150 for Flow data. This vignette will focus on the transformation process rather than cofactor selection.

We’ll use the Levine_32dim dataset loaded earlier from CSV files.

First, we need to select the markers to be transformed. Usually, all markers should be transformed for SuperCellCyto. However, you can choose to exclude specific markers if needed:

markers <- c(
    "209Bi_CD11b", "162Dy_CD11c", "163Dy_CD7", "166Er_CD209", "167Er_CD38",
    "151Eu_CD123", "153Eu_CD62L", "152Gd_CD66b", "154Gd_ICAM-1", "155Gd_CD1c",
    "156Gd_CD86", "160Gd_CD14", "165Ho_CD16", "191Ir_DNA1", "193Ir_DNA2",
    "175Lu_PD-L1", "142Nd_CD19", "146Nd_CD64", "195Pt", "196Pt",
    "198Pt_Dead", "147Sm_CD303", "148Sm_CD34", "149Sm_CD141", "150Sm_CD61",
    "169Tm_CD33", "89Y_CD45", "170Yb_CD3", "173Yb_CD56", "174Yb_HLA-DR"
)

For transformation, we’ll use a cofactor of 5 and apply the arcsinh transformation.

new_cols <- paste0(markers, "_asinh")
cf <- 5
dat[, (new_cols) := lapply(.SD, function(x) asinh(x / cf)), .SDcols = markers]

After transformation, new columns with “_asinh” appended indicate the transformed markers.

With your data now transformed, you’re ready to create supercells using SuperCellCyto. Please refer to How to create supercells vignette for detailed instructions.

Session information

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] flowCore_2.20.0   data.table_1.17.8 BiocStyle_2.36.0 
#> 
#> loaded via a namespace (and not attached):
#>  [1] cli_3.6.5           knitr_1.50          rlang_1.1.6        
#>  [4] xfun_0.52           png_0.1-8           generics_0.1.4     
#>  [7] textshaping_1.0.1   jsonlite_2.0.0      RProtoBufLib_2.20.0
#> [10] S4Vectors_0.46.0    htmltools_0.5.8.1   stats4_4.5.1       
#> [13] ragg_1.4.0          sass_0.4.10         Biobase_2.68.0     
#> [16] rmarkdown_2.29      evaluate_1.0.4      jquerylib_0.1.4    
#> [19] fastmap_1.2.0       yaml_2.3.10         lifecycle_1.0.4    
#> [22] bookdown_0.43       BiocManager_1.30.26 compiler_4.5.1     
#> [25] fs_1.6.6            htmlwidgets_1.6.4   systemfonts_1.2.3  
#> [28] digest_0.6.37       R6_2.6.1            cytolib_2.20.0     
#> [31] bslib_0.9.0         tools_4.5.1         matrixStats_1.5.0  
#> [34] BiocGenerics_0.54.0 pkgdown_2.1.3       cachem_1.1.0       
#> [37] desc_1.4.3

Givanna Putri