Run SeekSoulTools
Run tests
Example 1: T cell receptor analysis example
seeksoultools vdj run \
--fq1 /path/to/demo_tcr/demo_tcr_R1.fq.gz \
--fq2 /path/to/demo_tcr/demo_tcr_R2.fq.gz \
--samplename demo_tcr \
--chain TR \
--core 16 \
--outdir /path/to/output/demo_tcr \
--organism human
Example 2: B cell receptor analysis example
seeksoultools vdj run \
--fq1 /path/to/demo_bcr/demo_bcr_R1.fq.gz \
--fq2 /path/to/demo_bcr/demo_bcr_R2.fq.gz \
--samplename demo_bcr \
--chain IG \
--core 16 \
--outdir /path/to/output/demo_bcr \
--organism human
Parameter descriptions
Parameters | Descriptions |
---|---|
--fq1 | Paths to R1 fastq files. |
--fq2 | Paths to R2 fastq files. |
--samplename | Sample name. Only digits, letters, and underscores are supported. |
--organism | Organism. Available options: human, mouse. |
--chain | Chain type. Available options: IG and TR. Use IG for B cell receptor and TR for T cell receptor. |
--core | Number of threads used for the analysis. |
--outdir | Output directory. Use an absolute path, and make sure the outdir for each task is unique. |
--cfg | When the species is not human or mouse, you must configure the reference files yourself; the value of this parameter is the path to the cfg file that records the three reference files. |
--read_pair | Whether to use paired reads for the assembly. Defaults to False. |
--keep_tmp | Whether to keep the intermediate files of TRUST4. By default they are not kept; when this parameter is specified, the intermediate files are kept in compressed form. |
--matrix | Path to the filtered matrix of the RNA data. If this parameter is specified, the intersection of the VDJ barcodes and the matrix barcodes is taken before clones are identified. |
--rna_wd | When RNA data is also analyzed, specifying this parameter together with --matrix generates a combined websummary for RNA and VDJ. Make sure the sample names for RNA and VDJ are consistent. |
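The constraints in the table above can be checked before a run is launched. The following is a minimal sketch, not part of SeekSoulTools: `build_vdj_command` is a hypothetical helper that only assembles the flags shown in the examples, and it assumes that `--organism` is always given (the `--cfg` case is not covered).

```python
import re
from pathlib import PurePosixPath

# Option values taken from the parameter table.
VALID_CHAINS = {"IG", "TR"}
VALID_ORGANISMS = {"human", "mouse"}


def build_vdj_command(fq1, fq2, samplename, chain, outdir, organism, core=16):
    """Hypothetical helper: validate arguments and return the argv list."""
    # Sample names may contain only digits, letters, and underscores.
    if not re.fullmatch(r"[A-Za-z0-9_]+", samplename):
        raise ValueError(f"invalid sample name: {samplename!r}")
    if chain not in VALID_CHAINS:
        raise ValueError(f"chain must be one of {sorted(VALID_CHAINS)}")
    if organism not in VALID_ORGANISMS:
        raise ValueError(f"organism must be one of {sorted(VALID_ORGANISMS)}")
    # The output directory should be an absolute path.
    if not PurePosixPath(outdir).is_absolute():
        raise ValueError("outdir must be an absolute path")
    return [
        "seeksoultools", "vdj", "run",
        "--fq1", fq1,
        "--fq2", fq2,
        "--samplename", samplename,
        "--chain", chain,
        "--core", str(core),
        "--outdir", outdir,
        "--organism", organism,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with paths.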
Example of building a ref file
Approach One: Use the scripts shipped with TRUST4 to retrieve data from the IMGT database and build the required reference sequences.
BuildImgtAnnot.pl Homo_sapien > IMGT+C.fa
grep ">" IMGT+C.fa | cut -f2 -d'>' | cut -f1 -d'*' | sort | uniq > bcr_tcr_gene_name.txt
BuildDatabaseFa.pl refdata-GRCh38/fasta/genome.fa refdata-GRCh38/genes/genes.gtf bcr_tcr_gene_name.txt > bcrtcr.fa
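For readers less familiar with the `grep | cut | sort | uniq` pipeline used above to produce bcr_tcr_gene_name.txt, the same extraction can be sketched in Python. The function name is hypothetical, and the sketch assumes headers of the form `>GENE*ALLELE`.

```python
def gene_names_from_fasta_lines(lines):
    """Return the sorted, unique gene names found in FASTA header lines."""
    names = set()
    for line in lines:
        if ">" not in line:
            continue  # grep ">" keeps only lines containing ">"
        field = line.rstrip("\n").split(">")[1]  # cut -f2 -d'>'
        names.add(field.split("*")[0])           # cut -f1 -d'*'
    return sorted(names)                         # sort | uniq


example = [">IGHV1-2*02", "ACGT", ">IGHV1-2*04", ">TRAV1*01"]
print(gene_names_from_fasta_lines(example))  # ['IGHV1-2', 'TRAV1']
```

Python sorts by code point, whereas `sort` follows the current locale, so the order of the names may differ between the two even though the set of names is the same.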
Approach Two: If the gene names in the genes.gtf file do not cover the required BCR/TCR gene names, build the reference sequences yourself from the IMGT database, as follows.
### Retrieve all species sequences from the IMGT database
wget -c https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP
### Extract the required sequences based on the species information in the ID and format them according to the requirements of TRUST4.
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
with open("IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP", "r") as file, open('IMGT+C.fa', 'w') as out:
    for record in SeqIO.parse(file, "fasta"):
        #if "IG" not in record.description and "TR" not in record.description:
        # Keep only the records of the target species (replace with your own species string).
        if 'Bos taurus_Holstein' in record.description:
            # The second '|'-separated field of the IMGT header is the gene/allele name.
            new_id = record.description.strip().split('|')[1]
            new_seq = record.seq.upper()
            new_record = SeqRecord(Seq(new_seq), id=new_id, description="")
            SeqIO.write(new_record, out, "fasta")
Prepare the leader file.
Download “L-PART1+L-PART2” from IMGT.
Separate downloads for IG and TR are required, with IG and TR sequences being saved as two different files.
Download IGH, IGK, and IGL separately, and then merge them into a single file.
Download TRA and TRB separately, and then merge them into a single file.
cat IG_L-PART1+L-PART2.fa |awk '/^>/ {print; next} {printf "%s", toupper($0)}' | awk 'BEGIN{RS=">"; FS="\n"} NR>1 {print ">" $1 "\n" $2}'
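The awk pipeline above uppercases each sequence and joins wrapped sequence lines so that every record occupies exactly two lines (header, then sequence). A rough Python equivalent, with a hypothetical function name, might look like this:

```python
def linearize_fasta(lines):
    """Uppercase sequences and put each one on a single line after its header."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            # Sequence text before the first header is ignored.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return "".join(f"{h}\n{s}\n" for h, s in records)


example = [">a|L-PART1+L-PART2\n", "atg\n", "cct\n", ">b\n", "ggg\n"]
print(linearize_fasta(example), end="")
```

Like the awk command, this sketch only returns the normalized text; writing it to the leader file referenced in the cfg is left to the caller.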
Write into the cfg file.
The format of the cfg file is as follows:
fa:"/your/path/bcrtcr.fa"
ref:"/your/path/IMGT+C.fa"
leader:"/your/path/IG_L-PART1+L-PART2.fa"
If the reference (ref) information is configured through Approach Two, both the fa and ref can be specified as IMGT+C.fa.
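As an illustration of the cfg layout, the three lines can be generated from Python. This is a sketch under the assumption that the cfg is a plain text file containing exactly the three `key:"path"` lines shown above; `render_cfg` is a hypothetical helper, not part of SeekSoulTools.

```python
def render_cfg(fa, ref, leader):
    """Return the cfg text with the fa, ref, and leader entries."""
    return f'fa:"{fa}"\nref:"{ref}"\nleader:"{leader}"\n'


# Approach Two: fa and ref both point to IMGT+C.fa.
cfg_text = render_cfg(
    fa="/your/path/IMGT+C.fa",
    ref="/your/path/IMGT+C.fa",
    leader="/your/path/IG_L-PART1+L-PART2.fa",
)
print(cfg_text, end="")
```

The resulting text can be saved to a file and its path passed to `--cfg`.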