Run SeekSoulTools

Run tests

Example 1: T cell receptor analysis

seeksoultools vdj run \
--fq1 /path/to/demo_tcr/demo_tcr_R1.fq.gz \
--fq2 /path/to/demo_tcr/demo_tcr_R2.fq.gz \
--samplename demo_tcr \
--chain TR \
--core 16 \
--outdir /path/to/output/demo_tcr \
--organism human

Example 2: B cell receptor analysis

seeksoultools vdj run \
--fq1 /path/to/demo_bcr/demo_bcr_R1.fq.gz \
--fq2 /path/to/demo_bcr/demo_bcr_R2.fq.gz \
--samplename demo_bcr \
--chain IG \
--core 16 \
--outdir /path/to/output/demo_bcr \
--organism human

Parameter descriptions

Parameter	Description
--fq1	Paths to R1 FASTQ files.
--fq2	Paths to R2 FASTQ files.
--samplename	Sample name. Only digits, letters, and underscores are supported.
--organism	Organism. Available options: human, mouse.
--chain	Chain type. Available options: IG and TR. Use IG for B cell receptor and TR for T cell receptor.
--core	Number of threads used for the analysis.
--outdir	Output directory. Use an absolute path, and make sure the outdir for each task is unique.
--cfg	Path to a cfg file. When the species is not human or mouse, you must build the reference files yourself and list the three reference paths in this file.
--read_pair	Whether to use paired reads for the assembly. Defaults to False.
--keep_tmp	Whether to keep the TRUST4 intermediate files. By default they are not kept; when this parameter is specified, they are kept in compressed form.
--matrix	Path to the filtered matrix of the RNA data. If this parameter is specified, only the barcodes shared by the VDJ data and the matrix are used before clones are identified.
--rna_wd	When analyzing RNA data, if this parameter is specified together with --matrix, a combined websummary for RNA and VDJ is generated. Make sure the RNA and VDJ sample names are consistent.
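The constraints in the table can be checked before a job is launched. The following is a minimal sketch, not part of SeekSoulTools: the helper name build_vdj_command is hypothetical, it only covers the built-in organisms (human, mouse), and it only assembles the argument list shown in the examples above without running anything.

```python
import re
import shlex
from pathlib import PurePosixPath

VALID_CHAINS = {"IG", "TR"}
BUILTIN_ORGANISMS = {"human", "mouse"}


def build_vdj_command(fq1, fq2, samplename, chain, outdir, organism, core=16):
    """Hypothetical helper: validate the arguments, then return an argv list."""
    # Sample names may contain only digits, letters, and underscores.
    if not re.fullmatch(r"[A-Za-z0-9_]+", samplename):
        raise ValueError("samplename may contain only digits, letters, and underscores")
    if chain not in VALID_CHAINS:
        raise ValueError("chain must be IG or TR")
    # Assumption: other species need --cfg, which this sketch does not handle.
    if organism not in BUILTIN_ORGANISMS:
        raise ValueError("organism must be human or mouse in this sketch")
    if not PurePosixPath(outdir).is_absolute():
        raise ValueError("outdir must be an absolute path")
    return [
        "seeksoultools", "vdj", "run",
        "--fq1", fq1,
        "--fq2", fq2,
        "--samplename", samplename,
        "--chain", chain,
        "--core", str(core),
        "--outdir", outdir,
        "--organism", organism,
    ]


cmd = build_vdj_command(
    "/path/to/demo_tcr/demo_tcr_R1.fq.gz",
    "/path/to/demo_tcr/demo_tcr_R2.fq.gz",
    "demo_tcr", "TR", "/path/to/output/demo_tcr", "human",
)
# Print the command as a single shell-quoted line for inspection.
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to subprocess.run without shell quoting issues.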

Example of building a ref file

Approach One: Use the TRUST4 scripts to retrieve data from the IMGT database and build the required reference sequences.

BuildImgtAnnot.pl Homo_sapien > IMGT+C.fa
grep ">" IMGT+C.fa | cut -f2 -d'>' | cut -f1 -d'*' | sort | uniq > bcr_tcr_gene_name.txt
BuildDatabaseFa.pl refdata-GRCh38/fasta/genome.fa refdata-GRCh38/genes/genes.gtf bcr_tcr_gene_name.txt > bcrtcr.fa
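For readers who want to see what the grep/cut/sort/uniq step produces, here is a rough Python equivalent. It is a sketch under the assumption that every header line has the form ">GENE*ALLELE"; the function name gene_names_from_fasta is hypothetical, and the sort order may differ from sort depending on locale.

```python
def gene_names_from_fasta(lines):
    """Collect unique gene names from FASTA header lines (allele suffix dropped)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            # Drop the leading ">" and everything from the first "*" onward.
            names.add(line[1:].split("*")[0].strip())
    return sorted(names)


# Illustrative input only; real headers come from IMGT+C.fa.
example = [">IGHV1-2*01", "ACGT", ">IGHV1-2*02", "ACGA", ">TRBC1*01", "GGCC"]
print("\n".join(gene_names_from_fasta(example)))
```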

Approach Two: If the gene names in the genes.gtf file do not cover the required BCR/TCR gene names, build the reference directly from the IMGT sequences.

### Retrieve all species sequences from the IMGT database
wget -c https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP

### Extract the required sequences based on the species information in the ID and format them according to the requirements of TRUST4.
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# Species string to match in the IMGT header; change it to the species you need.
species = "Bos taurus_Holstein"

with open("IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP", "r") as file, open("IMGT+C.fa", "w") as out:
    for record in SeqIO.parse(file, "fasta"):
        if species in record.description:
            # The second "|"-separated field of the header is the gene/allele name.
            new_id = record.description.strip().split("|")[1]
            # Write the sequence in upper case under the shortened ID.
            new_record = SeqRecord(record.seq.upper(), id=new_id, description="")
            SeqIO.write(new_record, out, "fasta")
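The script relies on the "|"-separated layout of IMGT FASTA headers. The sketch below shows how such a header splits into fields. The function name parse_imgt_header is hypothetical and the header string is an illustrative example of the IMGT layout, so check it against the headers in your downloaded file.

```python
def parse_imgt_header(description):
    """Split an IMGT FASTA header into the fields used for filtering and renaming."""
    fields = description.strip().lstrip(">").split("|")
    return {
        "accession": fields[0],
        "allele": fields[1],   # used as the new sequence ID
        "species": fields[2],  # the field that contains the species string
    }


# Illustrative header in the IMGT layout.
header = ">M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1| | | | |296+24=320| | |"
info = parse_imgt_header(header)
print(info["allele"], info["species"])
```

Note that the script above matches the species string anywhere in the description, whereas comparing against the species field alone would be stricter.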

Prepare the leader file.

Download “L-PART1+L-PART2” from IMGT.

IG and TR must be downloaded separately and saved as two different files:
Download IGH, IGK, and IGL separately, and then merge them into a single file.
Download TRA and TRB separately, and then merge them into a single file.
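Merging the per-locus downloads is a plain concatenation. A minimal sketch follows; the function name merge_fasta and the file names in the usage comment are assumptions, not names required by the tool.

```python
from pathlib import Path


def merge_fasta(inputs, output):
    """Concatenate several FASTA files into one, keeping records on separate lines."""
    with open(output, "w") as out:
        for path in inputs:
            text = Path(path).read_text()
            out.write(text)
            # Make sure the next file's first header starts on a new line.
            if text and not text.endswith("\n"):
                out.write("\n")


# Usage (hypothetical file names):
# merge_fasta(["IGH_leader.fa", "IGK_leader.fa", "IGL_leader.fa"], "IG_L-PART1+L-PART2.fa")
# merge_fasta(["TRA_leader.fa", "TRB_leader.fa"], "TR_L-PART1+L-PART2.fa")
```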

cat IG_L-PART1+L-PART2.fa | awk '/^>/ {print; next} {printf "%s", toupper($0)}' | awk 'BEGIN{RS=">"; FS="\n"} NR>1 {print ">" $1 "\n" $2}'
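The awk pipeline upper-cases each sequence and joins multi-line sequences so that every record is one header line followed by one sequence line. The same idea can be sketched in Python. The function name normalize_fasta is hypothetical, and unlike the awk version it also strips surrounding whitespace from sequence lines.

```python
def normalize_fasta(text):
    """Return FASTA text with one upper-case sequence line per record."""
    records = []
    header, parts = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line, []
        elif header is not None:
            parts.append(line.strip().upper())
    if header is not None:
        records.append((header, "".join(parts)))
    return "".join(f"{h}\n{s}\n" for h, s in records)


# Illustrative input only.
print(normalize_fasta(">a\nacg\ntt\n>b\nGG\n"), end="")
```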

Write into the cfg file.

The format of the cfg file is as follows:
fa:"/your/path/bcrtcr.fa"
ref:"/your/path/IMGT+C.fa"
leader:"/your/path/IG_L-PART1+L-PART2.fa"

If the reference is built using Approach Two, both fa and ref can point to IMGT+C.fa.
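The cfg file can also be generated from a script. This is a small sketch that reproduces the three-line format shown above; the function name render_cfg is hypothetical, and the trailing newline is an assumption.

```python
def render_cfg(fa, ref, leader):
    """Return cfg text with the fa, ref, and leader paths in the key:"path" format."""
    return f'fa:"{fa}"\nref:"{ref}"\nleader:"{leader}"\n'


# Approach One: fa and ref are different files.
cfg_one = render_cfg(
    "/your/path/bcrtcr.fa",
    "/your/path/IMGT+C.fa",
    "/your/path/IG_L-PART1+L-PART2.fa",
)

# Approach Two: fa and ref both point to IMGT+C.fa.
cfg_two = render_cfg(
    "/your/path/IMGT+C.fa",
    "/your/path/IMGT+C.fa",
    "/your/path/IG_L-PART1+L-PART2.fa",
)
print(cfg_one, end="")
```

The returned text can be written to any file and passed as the value of --cfg.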