Algorithms Overview
barcode/UMI extraction
SeekSoulTools is able to extract the barcode and UMI sequences based on different Read1 structures. The pipeline processes the barcodes and filters Read1 and its corresponding Read2, and an updated fastq file is generated at the end of step 1.
Structure design and description
Barcode and UMI descriptions: Describing the basic structure of Read1 using letters and numbers, where the letters represent the meaning of the nucleotides and the numbers represent the length of the nucleotides.
B: barcode bases
L: linker bases
U: UMI bases
X: other arbitrary bases used as placeholders
The Read1 structure of the SeekOne™ DD Single Cell Immune Profiling Kit is B17U12:
Workflow
Barcode and linker correction
After determining the starting position of the barcode, the corresponding sequence is extracted based on the structural design. When the extracted barcode sequence is found in the whitelist, it is considered a valid barcode and the read count with valid barcodes is recorded. When the barcode is not found in the whitelist, it is considered an invalid barcode
During sequencing process, there is a certain probability that sequencing error occurs. With the presense of a whitelist, SeekSoulTools are able to do barcode correction. When correction is enabled, if sequences that are one base differ (one humming distance apart) from the invalid barcode appear in the whitelist, we will consider the following circumstances:
If there is one and only sequence exists in the whitelist, we will correct the invalid barcode to the barcode in the whitelist.
If multiple sequences exist in the whitelist, we will correct the invalid barcode to the sequence with the highest read support.
After the processing procedure described above, the metrics is shown below:
total: Total reads
valid: Number of reads without correction and number of successfully corrected reads
Q30 Bases in Barcode: The q30 ratio of the barcode sequence
Q30 Bases in UMI: The q30 ratio of the UMI sequence
Q30 Bases in R2 Read: The q30 ratio of reads 2
Statistics on the proportion of VDJ gene
Construct STAR index based on ref and compare reads, count the number of TRA, TRB, TRD or IGK, IGL, IGH and the corresponding ratio based on bam files
Sequence Assembly
1、Use the fastq-extractor module of trust4 to extract the reads that may contain TCR/BCR sequences, and then do downsample for each barcode, and only take the first 80,000 reads for cells with more than 80,000 reads.
2、According to the barcode, filter out the umi that support reads less than n50, and then use umitools to correct the umi.
3、Based on the above reads were assembled with trust4.
Please refer to trust4 for detailed description.
VDJ annotation
The annotator of trust4 was utilized for the annotation of VDJC genes and CDR3 positions.For the case of multiple annotations for per chain in per cell, the contig with the highest umi value was selected as the primary_chain.The UMI count is calculated based on the reads information of each contig, with a filtering step performed before counting: when the same UMI is used for multiple contig assemblies, these contigs are sorted by reads in descending order. If the ratio of reads between the second highest contig and the highest contig is less than 0.15, that UMI is only counted for the highest contig, and no UMI counting is performed for the others.
Calculation Clonotype
1. Select contigs that meet chain requirements; filter out contigs with 1-UMI/reads ratio less than 0.94; remove cells where the sum of UMIs supporting the contigs is less than 3.
2. If 5' matrix information is provided, take the intersection of cell barcodes.
3. Utilize dandelion to determine clones based on CDR3 amino acid sequence similarity, with similarity thresholds set at 85% for BCR and 100% for TCR. For detailed criteria, refer to::sc-dandelion
4. For single-chain clonotypes, identify all other contigs sharing the same VJ genes and CDR3 amino acid sequence. Calculate a threshold as 1% of the n50 UMI count of these contigs. Discard the single-chain if its UMI count falls below this threshold.
5. For the retained single-chain clonotypes, identify all dual-chain clonotypes sharing the same VJ genes and CDR3 amino acid sequence. Sort by cell count. If the ratio of cell counts between the second largest and the largest dual-chain clonotype is less than 0.15, merge the single-chain clonotype into the largest dual-chain clonotype; otherwise, maintain them separately.
6. Sort by new clonotypes based on cell counts, generate new clone IDs, and provide exact IDs.