Assign clones to cells — assign_and_embed

This function assigns clone barcodes to cells and optionally embeds the assignment into a SingleCellExperiment object. It calculates a cell-by-clone matrix from a provided data.table where each row corresponds to a read associated with a cell, UMI, and clone barcode. See details for the assignment method.

assign_and_embed_clones(
  cell_by_gene_mat,
  cell_clone_reads_dt,
  cell_bcode_col = "CellBarcode",
  barcode_edit_dist_col = "BarcodeEditDist",
  clone_bcode_col = "CloneBarcode",
  umi_col = "UMI",
  most_dominant_threshold = 0.5,
  umi_clone_consensus_threshold = 0.7,
  embed_to_mat = TRUE
)

Arguments

cell_by_gene_mat: A SingleCellExperiment object containing the cell-by-gene matrix. It must have a 'Barcode' column in 'colData' that uniquely identifies each cell. The barcodes should match those in the `cell_clone_reads_dt` data table.
cell_clone_reads_dt: A data.table object representing the reads. Each row includes information about a cell, UMI, and clone barcode. For data produced by NextClone, use `fread` from the data.table package and pass the resulting object to this parameter.
cell_bcode_col: Name of the column in `cell_clone_reads_dt` that indicates the cell barcode for each read.
barcode_edit_dist_col: Name of the column in `cell_clone_reads_dt` that indicates the edit distance for the clone barcode of each read.
clone_bcode_col: Name of the column in `cell_clone_reads_dt` that specifies the clone barcode for each read.
umi_col: Name of the column in `cell_clone_reads_dt` that specifies the UMI barcode for each read.
most_dominant_threshold: A numeric value between 0 and 1 specifying the proportion threshold of reads that must be associated with a clone barcode to assign it to a cell in cases where multiple clone barcodes are detected.
umi_clone_consensus_threshold: A numeric value between 0 and 1 specifying the proportion threshold of reads for collapsing UMIs when computing the cell-by-clone matrix. See details for more information.
embed_to_mat: A boolean indicating whether to embed the clone barcode assignment directly into the `cell_by_gene_mat` object.

Value

Depending on the `embed_to_mat` parameter, this function returns either an updated SingleCellExperiment object with clone barcode assignments embedded or a data.table with the cell-clone assignments.

Details

Clone barcode assignment to cells follows a tiered approach: Cells with a single detected clone barcode are straightforwardly assigned to that clone. In cases where multiple clone barcodes are present, the most dominant clone barcode (constituting over 50% of reads, adjustable via `most_dominant_threshold`) is assigned. Remaining cells, where no clone barcode is sufficiently dominant, are assigned based on the lowest average barcode edit distance. If edit distances are equal, the clone barcode with the higher read count prevails. To use the default 50% threshold, set `most_dominant_threshold` to 0.5.

The cell-by-clone matrix construction first collapses reads with the same cell and UMI barcodes For a group of reads have the same cell barcode and UMI barcode, if the reads are mapped to several clone barcodes, by default, they are collapsed into one read and assigned to the clone barcode comprising 70% or more of its group's reads. This threshold modifiable by the `umi_clone_consensus_threshold` parameter. To apply the default threshold of 70%, set this parameter to 0.7.

Examples

library(scater)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: ‘MatrixGenerics’
#> The following objects are masked from ‘package:matrixStats’:
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: ‘S4Vectors’
#> The following object is masked from ‘package:utils’:
#> 
#>     findMatches
#> The following objects are masked from ‘package:base’:
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: ‘Biobase’
#> The following object is masked from ‘package:MatrixGenerics’:
#> 
#>     rowMedians
#> The following objects are masked from ‘package:matrixStats’:
#> 
#>     anyMissing, rowMedians
#> Loading required package: scuttle
#> Loading required package: ggplot2
library(data.table)
#> 
#> Attaching package: ‘data.table’
#> The following object is masked from ‘package:SummarizedExperiment’:
#> 
#>     shift
#> The following object is masked from ‘package:GenomicRanges’:
#> 
#>     shift
#> The following object is masked from ‘package:IRanges’:
#> 
#>     shift
#> The following objects are masked from ‘package:S4Vectors’:
#> 
#>     first, second

set.seed(42)

sce <- mockSCE(ncells = 4, ngenes = 100)
colData(sce)$Barcode <- colnames(sce)

cell_clone_reads_dt <- data.table(
    CellBarcode = c(
        rep("Cell_001", 6),
        rep("Cell_002", 7),
        rep("Cell_003", 4),
        rep("Cell_004", 5)
    ),
    CloneBarcode = c(
        rep("AA", 6),
        rep("BB1", 4), rep("BB2", 3),
        rep("CC1", 2), rep("CC2", 2),
        rep("CC1", 2), "DD1", "DD2", "DD3"
    ),
    BarcodeEditDist = c(
        rep(0L, 6),
        rep(0L, 7),
        c(0, 1, 2, 1),
        rep(0, 5)

    )
)
cell_clone_reads_dt[, UMI := seq_len(nrow(cell_clone_reads_dt))]
#>     CellBarcode CloneBarcode BarcodeEditDist UMI
#>  1:    Cell_001           AA               0   1
#>  2:    Cell_001           AA               0   2
#>  3:    Cell_001           AA               0   3
#>  4:    Cell_001           AA               0   4
#>  5:    Cell_001           AA               0   5
#>  6:    Cell_001           AA               0   6
#>  7:    Cell_002          BB1               0   7
#>  8:    Cell_002          BB1               0   8
#>  9:    Cell_002          BB1               0   9
#> 10:    Cell_002          BB1               0  10
#> 11:    Cell_002          BB2               0  11
#> 12:    Cell_002          BB2               0  12
#> 13:    Cell_002          BB2               0  13
#> 14:    Cell_003          CC1               0  14
#> 15:    Cell_003          CC1               1  15
#> 16:    Cell_003          CC2               2  16
#> 17:    Cell_003          CC2               1  17
#> 18:    Cell_004          CC1               0  18
#> 19:    Cell_004          CC1               0  19
#> 20:    Cell_004          DD1               0  20
#> 21:    Cell_004          DD2               0  21
#> 22:    Cell_004          DD3               0  22
#>     CellBarcode CloneBarcode BarcodeEditDist UMI

sce_with_clone <- assign_and_embed_clones(
    cell_by_gene_mat = sce,
    cell_clone_reads_dt = cell_clone_reads_dt,
    cell_bcode_col = "CellBarcode",
    barcode_edit_dist_col = "BarcodeEditDist",
    clone_bcode_col = "CloneBarcode",
    umi_col = "UMI"
)
colData(sce_with_clone)
#> DataFrame with 4 rows and 6 columns
#>          Mutation_Status  Cell_Cycle   Treatment     Barcode clone_barcode
#>              <character> <character> <character> <character>   <character>
#> Cell_001        positive          G1      treat1    Cell_001            AA
#> Cell_002        positive          G0      treat2    Cell_002           BB1
#> Cell_003        negative          G0      treat1    Cell_003           CC1
#> Cell_004        positive         G2M      treat2    Cell_004           CC1
#>               clone_barcode_criteria
#>                             <factor>
#> Cell_001 single_clone               
#> Cell_002 dominant_clone_moreThan_0_5
#> Cell_003 clone_from_edit_distance   
#> Cell_004 clone_from_edit_distance