Running

Example runs

Example 1: T-cell receptor (TCR) analysis

seeksoultools vdj run \
--fq1 /path/to/demo_tcr/demo_tcr_R1.fq.gz \
--fq2 /path/to/demo_tcr/demo_tcr_R2.fq.gz \
--samplename demo_tcr \
--chain TR \
--core 16 \
--outdir /path/to/output/demo_tcr \
--organism human

Example 2: B-cell receptor (BCR) analysis

seeksoultools vdj run \
--fq1 /path/to/demo_bcr/demo_bcr_R1.fq.gz \
--fq2 /path/to/demo_bcr/demo_bcr_R2.fq.gz \
--samplename demo_bcr \
--chain IG \
--core 16 \
--outdir /path/to/output/demo_bcr \
--organism human

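Parameter reference

Parameter	Description
--fq1	Path to the R1 FASTQ file.
--fq2	Path to the R2 FASTQ file.
--samplename	Sample name. Only digits, letters, and underscores are allowed.
--organism	Species. Allowed values: human, mouse.
--chain	Chain type. Allowed values: IG, TR. IG corresponds to the B-cell receptor; TR corresponds to the T-cell receptor.
--core	Number of threads used for the analysis.
--outdir	Output directory. Must be an absolute path, and each task must use its own unique outdir.
--cfg	Required when the species is neither human nor mouse. In that case you must prepare the ref files yourself; the value of this parameter is the file that records the three ref files.
--read_pair	Whether to use paired reads for assembly. Default: False.
--keep_tmp	Whether to keep the TRUST4 intermediate files. They are not kept by default; when this parameter is specified, the intermediate files are compressed and kept.
--matrix	RNA matrix data. If specified, the VDJ barcodes are intersected with the matrix barcodes before the find clone step. If you need this, pass the path to the filtered matrix of the RNA data.
--rna_wd	Analysis directory of the RNA data. If both this parameter and --matrix are specified, a combined RNA and VDJ websummary is generated. When using this parameter, make sure the RNA and VDJ sample names match.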
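
The constraints on --samplename and --outdir in the table above can be checked before a run is launched. The following is a minimal sketch, not part of seeksoultools: the helper names `check_samplename` and `check_outdir` are hypothetical, and the tool's own validation may differ.

```python
import os
import re

# Hypothetical helpers for illustration; they are not provided by seeksoultools.
_SAMPLENAME_RE = re.compile(r"[0-9A-Za-z_]+")


def check_samplename(name: str) -> bool:
    """Return True if the name contains only digits, letters, and underscores."""
    return _SAMPLENAME_RE.fullmatch(name) is not None


def check_outdir(path: str) -> bool:
    """Return True if the output directory is given as an absolute path."""
    return os.path.isabs(path)
```

Checking that each task has a unique outdir is left to the caller, since it depends on how tasks are scheduled.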

Example: building the ref files

Method 1: Use TRUST4 to fetch the IMGT database and build the required ref files.

BuildImgtAnnot.pl Homo_sapien > IMGT+C.fa
grep ">" IMGT+C.fa | cut -f2 -d'>' | cut -f1 -d'*' | sort | uniq > bcr_tcr_gene_name.txt
BuildDatabaseFa.pl refdata-GRCh38/fasta/genome.fa refdata-GRCh38/genes/genes.gtf bcr_tcr_gene_name.txt > bcrtcr.fa
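
For readers less familiar with the grep/cut/sort pipeline in the second command, the gene-name extraction can be sketched in Python as follows. This is an illustrative equivalent under stated assumptions: the function name `gene_names` is hypothetical, only lines that start with ">" are treated as headers, and the sort order follows Python string ordering, which may differ from a locale-aware `sort`.

```python
def gene_names(fasta_lines):
    """Collect unique gene names from FASTA headers, dropping the allele suffix after '*'."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            # ">IGHV1-2*01" -> "IGHV1-2"
            names.add(line.rstrip("\n")[1:].split("*")[0])
    return sorted(names)


def write_gene_names(fasta_path, out_path):
    """Read a FASTA file and write one unique gene name per line."""
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for name in gene_names(fasta):
            out.write(name + "\n")
```

Under these assumptions, calling `write_gene_names("IMGT+C.fa", "bcr_tcr_gene_name.txt")` would produce a gene name list comparable to the one built by the shell command.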

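Method 2: If the gene names in genes.gtf do not include the required BCR/TCR gene names, the ref can be configured separately:

### Download the sequences of all species from the IMGT database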
wget -c https://www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP

### Extract the required sequences based on the species information in each ID and convert them to the format TRUST4 expects
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

with open("IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP", "r") as file, open('IMGT+C.fa', 'w') as out:
    for record in SeqIO.parse(file, "fasta"):
        #if "IG" not in record.description and "TR" not in record.description:
        # Keep only records of the target species (Bos taurus_Holstein in this example; replace as needed)
        if 'Bos taurus_Holstein' in record.description:
            # The second '|'-separated field of the header is used as the new ID
            new_id = record.description.strip().split('|')[1]
            # record.seq.upper() already returns a Seq object
            new_seq = record.seq.upper()
            new_record = SeqRecord(new_seq, id=new_id, description="")
            SeqIO.write(new_record, out, "fasta")

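Prepare the leader file

Download "L-PART1+L-PART2" from IMGT.

Download IG and TR separately, so that IG and TR end up in two files:
Download IGH, IGK, and IGL separately and merge them into one file.
Download TRA and TRB separately and merge them into one file.

Convert the bases to uppercase: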

cat IG_L-PART1+L-PART2.fa |awk '/^>/ {print; next} {printf "%s", toupper($0)}' | awk 'BEGIN{RS=">"; FS="\n"} NR>1 {print ">" $1 "\n" $2}'
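
The awk pipeline above uppercases the bases and joins each sequence onto a single line. The same step can be sketched in Python, which may be easier to adapt. This is an illustrative, roughly equivalent version; the function names `normalize_fasta` and `write_normalized` are hypothetical, and blank lines are skipped, which the awk version does not do explicitly.

```python
def normalize_fasta(lines):
    """Yield (header, sequence) pairs with each sequence uppercased and joined onto one line."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None:
                yield header, "".join(chunks).upper()
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the last record
    if header is not None:
        yield header, "".join(chunks).upper()


def write_normalized(in_path, out_path):
    """Write the normalized records of in_path to out_path in FASTA format."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in normalize_fasta(src):
            dst.write(f">{header}\n{seq}\n")
```

Writing to a separate output path and then replacing the original file avoids reading and overwriting the same file at once.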

Write the cfg file

The cfg file has the following format:
fa:"/your/path/bcrtcr.fa"
ref:"/your/path/IMGT+C.fa"
leader:"/your/path/IG_L-PART1+L-PART2.fa"

If the ref was prepared using Method 2, both fa and ref can point to IMGT+C.fa.
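
A cfg file in the layout shown above can also be generated programmatically. The following is a minimal sketch; the function names `render_cfg` and `write_cfg` are hypothetical, and it assumes the file consists of exactly the three key:"value" lines shown above.

```python
def render_cfg(fa, ref, leader):
    """Return the cfg text with one key:"value" line each for fa, ref, and leader."""
    return f'fa:"{fa}"\nref:"{ref}"\nleader:"{leader}"\n'


def write_cfg(path, fa, ref, leader):
    """Write the cfg text to the given path."""
    with open(path, "w") as handle:
        handle.write(render_cfg(fa, ref, leader))
```

For a Method 2 setup, the same IMGT+C.fa path would be passed for both `fa` and `ref`, and the resulting file path would then be given to --cfg.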