NextClone

Nextflow pipeline for extracting and counting clonal barcodes

View the Project on GitHub phipsonlab/NextClone

NextClone is a Nextflow pipeline for rapid extraction and quantification of clonal barcodes from both DNA-seq and scRNAseq data. Here, DNA-seq data means dedicated DNA barcoding data, in which only the synthetic lineage tracing clone barcodes are sequenced using next-generation sequencing.

The pipeline comprises two distinct workflows, one for DNA-seq data and the other for scRNAseq data. Both workflows are modular: individual software components can be substituted as required, and parameters can be tailored through the nextflow.config file to suit diverse needs. The pipeline is optimised for use on high-performance computing (HPC) platforms.

Usage

In general, to run the pipeline, you can use the following command:

nextflow run phipsonlab/Nextclone -r main

However, you will most likely need to adjust some of the default parameters in the nextflow.config file. Read on to find out what parameters are available and what they mean.

Additionally, see the following more specific instructions for extracting clone barcodes from various barcoding libraries:

  1. For ClonMapper library
  2. For splintr library

Output

For DNA-seq data, you will get a clone_barcode_counts.csv file in the folder set by the publish_dir parameter. Each row of this CSV file gives a unique clone barcode and its frequency in the sample.

For scRNAseq data, you will get a clone_barcodes.csv file in the folder set by the publish_dir parameter. Each row of this CSV file corresponds to a read, together with its 10x cell barcode, UMI, and clone barcode.
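Because the scRNAseq output is read-level, a common next step is to collapse it to the number of distinct UMIs per cell and clone barcode. The sketch below shows one way to do that in plain Python. The column names (CellBarcode, UMI, CloneBarcode) and the example rows are made up for illustration and are assumptions, not the documented header of clone_barcodes.csv, so check your own file before reusing it.

```python
import csv
import io
from collections import defaultdict


def count_umis_per_cell_and_clone(rows, cell_col, umi_col, clone_col):
    """Collapse read-level rows into the number of distinct UMIs per (cell, clone)."""
    umis = defaultdict(set)
    for row in rows:
        umis[(row[cell_col], row[clone_col])].add(row[umi_col])
    return {pair: len(seen) for pair, seen in umis.items()}


# Made-up rows with hypothetical column names; check the header of your own file.
example_csv = io.StringIO(
    "CellBarcode,UMI,CloneBarcode\n"
    "AAACCTGA,TTGCA,CGGAGTAATACATTTTGCCT\n"
    "AAACCTGA,TTGCA,CGGAGTAATACATTTTGCCT\n"
    "AAACCTGA,GGACT,CGGAGTAATACATTTTGCCT\n"
    "CCGTTAGC,ACGTA,TCGGAGTTGGCTGTCGTTTC\n"
)

counts = count_umis_per_cell_and_clone(
    csv.DictReader(example_csv), "CellBarcode", "UMI", "CloneBarcode"
)
for (cell, clone), n_umis in sorted(counts.items()):
    print(cell, clone, n_umis)
```

Reads that share the same cell barcode, UMI, and clone barcode are counted once, so PCR duplicates do not inflate the support for a clone assignment.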

Clone barcode reference file

This is a text file where each row represents the sequence of a clone barcode.

It should look something like this:

CGGAGTAATACATTTTGCCT
TCGGAGTTGGCTGTCGTTTC
GTTGTCTCGGGGGGTGGAGA
CCATGATAAGGGAGTTCCGG
AGGGGAGTCGCGTGGTAGGC
TGTCTAATGGGGGTGTCACT

Generally, you will need to get this reference file from the lab that generates the lineage tracing barcode library for you.
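Before starting a run, it can be worth sanity-checking the reference file. The sketch below is not part of NextClone. It assumes that every barcode should contain only the letters A, C, G and T, should match the barcode_length setting (20 by default), and should appear only once. These checks are assumptions made for illustration; relax them if your library differs.

```python
from collections import Counter


def check_barcode_reference(lines, barcode_length=20):
    """Return a list of (line_number, message) problems found in a reference file."""
    problems = []
    seen = Counter()
    for number, raw in enumerate(lines, start=1):
        barcode = raw.strip()
        if not barcode:
            continue  # blank lines are ignored
        if len(barcode) != barcode_length:
            problems.append(
                (number, f"length {len(barcode)}, expected {barcode_length}")
            )
        if set(barcode) - set("ACGT"):
            problems.append((number, "contains characters other than A, C, G, T"))
        seen[barcode] += 1
        if seen[barcode] == 2:
            # report a repeated barcode once, at its second occurrence
            problems.append((number, "duplicate barcode"))
    return problems


# In-memory example; for a real file, pass an open file handle instead.
example_lines = [
    "CGGAGTAATACATTTTGCCT\n",
    "TCGGAGTTGGCTGTCGTTTC\n",
    "GTTGTCTCGGGGGGTGGAGA\n",
]
print(check_barcode_reference(example_lines))
```

An empty list means no problems were found under the assumptions above.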

Parallelisation within sample

The Python package within NextClone divides the reads into several FASTA files. The number of files defaults to two and can be increased through the n_chunks parameter.

Each FASTA file then undergoes a mapping process against a reference list of clonal barcode sequences using Flexiplex running on multiple threads. Each mapping task is submitted as an individual job to the HPC scheduler, allowing all the tasks to be processed simultaneously as resources permit.
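To make the chunking idea concrete, here is a minimal sketch of splitting reads into n_chunks groups and writing each group as FASTA text. How NextClone actually assigns reads to chunks is not described here; the round-robin assignment and the function names below are assumptions chosen for illustration.

```python
def split_into_chunks(records, n_chunks=2):
    """Distribute (header, sequence) records round-robin into n_chunks lists."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)
    return chunks


def to_fasta(records):
    """Render (header, sequence) records as FASTA text."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)


# Made-up reads for illustration only.
reads = [
    ("read1", "AAAA"),
    ("read2", "CCCC"),
    ("read3", "GGGG"),
    ("read4", "TTTT"),
    ("read5", "ACGT"),
]
for chunk_number, chunk in enumerate(split_into_chunks(reads, n_chunks=2), start=1):
    print(f"chunk {chunk_number}: {len(chunk)} reads")
```

Each chunk can then be mapped independently, which is what lets the scheduler run the mapping tasks side by side.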

Do not set n_chunks too high, as you risk overloading the HPC system. A value of around 20 is generally reasonable.

Bear in mind that, by default, n_chunks applies to every sample. For instance, if you have 10 samples and set n_chunks to 20, you will end up with 10 * 20 = 200 mapping jobs.
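The arithmetic can be turned around to pick n_chunks from a job budget. In the sketch below, job_budget is a hypothetical figure for how many mapping jobs you are comfortable submitting at once; it is not a NextClone parameter.

```python
def total_mapping_jobs(n_samples, n_chunks):
    """Number of mapping jobs when n_chunks applies to every sample."""
    return n_samples * n_chunks


def max_chunks_for_budget(n_samples, job_budget):
    """Largest n_chunks that keeps the job count within job_budget (at least 1)."""
    return max(1, job_budget // n_samples)


print(total_mapping_jobs(n_samples=10, n_chunks=20))
print(max_chunks_for_budget(n_samples=10, job_budget=200))
```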

Parameters

These are all the default parameters used by NextClone. They are all in the nextflow.config file.

You can change the value of any of these to suit your needs.

publish_dir_mode = 'copy'

mode = "DNAseq"

// generic
publish_dir = "${projectDir}/output"
clone_barcodes_reference = "${projectDir}/data/known_barcodes_subset.txt"
barcode_edit_distance = 2
n_chunks = 2
barcode_length = 20
// mapping can take a long time, so use either long_mapping or regular_mapping
mapping_process_profile = "regular_mapping"


// for DNA-seq data
dnaseq_fastq_files = "${projectDir}/data/dnaseq_fastq_files"
fastp_percent_bases_unqualified = 20
fastp_phred_for_qualified_reads = 30

// for clonmapper single cell data
// change me if required
scrnaseq_bam_files = "${projectDir}/data/scrnaseq_bam_files"
phred_thres = 30
adapter_edit_distance = 6
adapter_5prime_clonmapper = "ATCTTGTGGAAAGGACGAAACACCG"
adapter_3prime_clonmapper = "GTTTCAGAGCTATGCTGGAAACAGC"

Explanation:
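As a rough illustration of what barcode_edit_distance = 2 allows, the sketch below computes the Levenshtein edit distance between a read's barcode and each known barcode, and accepts the closest one only if it is within the limit. It assumes the parameter is a maximum number of substitutions, insertions and deletions. It is a plain Python stand-in, not Flexiplex's implementation, and tie handling in the real tool may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion from a
                    current[j - 1] + 1,      # insertion into a
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def closest_barcode(read_barcode, reference, max_distance=2):
    """Return (barcode, distance) for the nearest reference barcode, or None."""
    best = min(reference, key=lambda known: edit_distance(read_barcode, known))
    distance = edit_distance(read_barcode, best)
    return (best, distance) if distance <= max_distance else None


reference = ["CGGAGTAATACATTTTGCCT", "TCGGAGTTGGCTGTCGTTTC"]
# Same as the first reference barcode except for the final base.
print(closest_barcode("CGGAGTAATACATTTTGCCA", reference))
```

A larger limit rescues more reads that contain sequencing errors, but it also raises the chance of assigning a read to the wrong clone when reference barcodes are similar to each other.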

Setting up conda environment

NextClone relies on the following Python packages installed in a conda environment to work:

If you are running NextClone for the very first time, it will attempt to create a conda environment and install the packages listed above. For this to work, conda must be installed on the system.

If you are using Slurm, you can load the conda module with module load miniconda3/latest (the module name may differ between systems). If this does not work, contact the team that manages your HPC system for help.
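A quick way to confirm that conda is visible before launching the pipeline is to look for it on PATH. The helper below is a small sketch that is not part of NextClone; it relies only on shutil.which from the Python standard library.

```python
import shutil


def conda_is_available(find_executable=shutil.which):
    """Return True if an executable named 'conda' is found on PATH."""
    return find_executable("conda") is not None


if __name__ == "__main__":
    if conda_is_available():
        print("conda found on PATH")
    else:
        print("conda not found; load or install it before running NextClone")
```

Run it in the same shell session, and after any module load command, that you will use to start Nextflow, since PATH can differ between sessions.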

In the future, the conda environment will be replaced with a Singularity container.