How to build reference genome?

Preparing the files required to build a reference genome

When using Seeksoultools rna or fast module, the species' reference genome sequence and the corresponding gtf annotation file are required. The relevant file formats are as follows

Genome sequence file

The genome sequence files requires a fa format file. The chromosome id must be consistent with the seqname of the gtf file, and the gtf seqname must to be a subset of the chromosome id of the fa file. Note that the file cannot have blank lines.

gtf file

The gtf file format is described as follows

seqname: sequence name, usually chromosome or contig ID
source: the source of the annotation, which can be the name of a database, such as from the RefSeq database, or the name of a software, such as predicted by GeneScan software. Of course, it can also be empty and filled with a dot.
Feature: represents the feature type corresponding to the interval. In GTF, common types include: gene, transcript, CDS, exon, start_codon, stop_codon, etc.
start: the starting position of the feature
end: the end position of the feature
score: indicates the confidence of feature existence and coordinates, which can be a floating point number or an integer. "." means it is empty, which means it is not needed.
Strand: The feature is located on the positive strand (+) or negative strand (-) of the reference genome
frame: 0 means the first complete codon of the reading frame is located at the 5' end, 1 means there is an extra base before the first complete codon, and 2 means there are two extra bases before the first complete codon. Note that frame is not the remainder of the CDS length divided by 3. If the strand is '-', the first base value of the region is 'end', because the corresponding coding region will be from end to start on the antisense strand.
attribute: The format should be attributes_name 'attributes_values'; each attribute must end with a semicolon and be separated from the next attribute by a space, and the attribute value is surrounded by double quotes. Generally, the following three attributes are required

attribute	means
gene_id "value";	The unique ID of the locus of the transcript on the genome. gene_id and value are separated by a space. If the value is empty, it means there is no corresponding gene.
transcript_id "value";	The unique ID of the predicted transcript. transcript_id and value are separated by a space, and an empty value indicates no transcript.
gene_type "value";	Biological type of gene, protein coding, lncRNA, etc.

Notes on gtf file preparation:

The feature column of each gene in the gtf file must contain three pieces of information: gene, transcript, and exon;
When the feature column is 'gene', the attributes column must contain gene_id and gene_type. If there is no gene_name, the value of gene_id will be used as the gene name. When the feature column is 'transcript', the attributes column must contain transcript_id. When the feature column is 'exon', the attributes column must contain exon_id, otherwise it will affect the processing when reads are annotated to multiple genes.
GTF files should also not have blank lines.
The gene_name of mitochondrial genes in the gtf file must start with "Mt-" or "mt-", otherwise the mito section in the report will be 0.

Scenario 1: Building a reference genome that is compatible with single-cell data from different platforms.

If you have both single-cell data from 10X Genomics and SeekOne™ products, it is recommended to use 10X CellRanger to build the reference genome. SeekSoulTools is compatible with the reference genome built by CellRanger. The code for processing gene annotation files (GTF files) is as follows:

The code for processing gene annotation files (GTF files) is as follows:

/path/to/cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
    --attribute=gene_biotype:protein_coding \
    --attribute=gene_biotype:lncRNA \
    --attribute=gene_biotype:antisense \
    --attribute=gene_biotype:IG_LV_gene \
    --attribute=gene_biotype:IG_V_gene \
    --attribute=gene_biotype:IG_V_pseudogene \
    --attribute=gene_biotype:IG_D_gene \
    --attribute=gene_biotype:IG_J_gene \
    --attribute=gene_biotype:IG_J_pseudogene \
    --attribute=gene_biotype:IG_C_gene \
    --attribute=gene_biotype:IG_C_pseudogene \
    --attribute=gene_biotype:TR_V_gene \
    --attribute=gene_biotype:TR_V_pseudogene \
    --attribute=gene_biotype:TR_D_gene \
    --attribute=gene_biotype:TR_J_gene \
    --attribute=gene_biotype:TR_J_pseudogene \
    --attribute=gene_biotype:TR_C_gene
cellranger mkref --genome=GRCh38 --fasta=GRCh38.fa --genes=GRCh38-filtered-ensembl.gtf
cd GRCh38/genes
gunzip -dc genes.gtf.gz > genes.gtf

Note

If the reference genome built by CellRanger is not compatible with the STAR version of SeekSoulTools, you can specify the STAR path of CellRanger for SeekSoulTools with --star_path /path/to/cellranger-5.0.1/lib/bin/STAR.
The chromosome names in fasta files must match the chromosome names in the gtf file. For example, if the name of chromosome 1 in fasta files is chr1, then the name of chromosome 1 in the gtf file must also be chr1.

Scenario 2: if you only have SeekOne™ products, there is no need to consider platform compatibility.

The code for building genome index using STAR is as follows:

/demo/seeksoultools.1.2.0/bin/STAR \
  --runMode genomeGenerate \
  --runThreadN 16 \                        
  --genomeDir /path/to/star \             
  --genomeFastaFiles /path/to/genome.fa \  
  --sjdbGTFfile /path/to/genome.gtf \      
  --sjdbOverhang 149 \                     
  --limitGenomeGenerateRAM 17179869184     

Note

The chromosome names in fasta files must match the chromosome names in the gtf file. For example, if the name of chromosome 1 in fasta files is chr1, then the name of chromosome 1 in the gtf file must also be chr1.