How to build reference genome?

Preparing the files required to build a reference genome

When using Seeksoultools rna or fast module, the species’ reference genome sequence and the corresponding gtf annotation file are required. The relevant file formats are as follows

Genome sequence file

The genome sequence files requires a fa format file. The chromosome id must be consistent with the seqname of the gtf file, and the gtf seqname must to be a subset of the chromosome id of the fa file. Note that the file cannot have blank lines.

gtf file

The gtf file format is described as follows

  1. seqname: sequence name, usually chromosome or contig ID

  2. source: the source of the annotation, which can be the name of a database, such as from the RefSeq database, or the name of a software, such as predicted by GeneScan software. Of course, it can also be empty and filled with a dot.

  3. Feature: represents the feature type corresponding to the interval. In GTF, common types include: gene, transcript, CDS, exon, start_codon, stop_codon, etc.

  4. start: the starting position of the feature

  5. end: the end position of the feature

  6. score: indicates the confidence of feature existence and coordinates, which can be a floating point number or an integer. “.” means it is empty, which means it is not needed.

  7. Strand: The feature is located on the positive strand (+) or negative strand (-) of the reference genome

  8. frame: 0 means the first complete codon of the reading frame is located at the 5’ end, 1 means there is an extra base before the first complete codon, and 2 means there are two extra bases before the first complete codon. Note that frame is not the remainder of the CDS length divided by 3. If the strand is ‘-’, the first base value of the region is ‘end’, because the corresponding coding region will be from end to start on the antisense strand.

  9. attribute: The format should be attributes_name ‘attributes_values’; each attribute must end with a semicolon and be separated from the next attribute by a space, and the attribute value is surrounded by double quotes. Generally, the following three attributes are required

attribute

means

gene_id “value”;

The unique ID of the locus of the transcript on the genome. gene_id and value are separated by a space. If the value is empty, it means there is no corresponding gene.

transcript_id “value”;

The unique ID of the predicted transcript. transcript_id and value are separated by a space, and an empty value indicates no transcript.

gene_type “value”;

Biological type of gene, protein coding, lncRNA, etc.

Notes on gtf file preparation:

  • The feature column of each gene in the gtf file must contain three pieces of information: gene, transcript, and exon;

  • 当feature列为’gene’时,attributes列需要包含gene_id,gene_type,如果没有gene_name,则将gene_id的值当作genename;当feature列为’ transcript’时,attributes列需要包含transcript_id;当feature列为’exon’时,attributes列需要包含exon_id,否则会影响read注释到多个基因时的处理。

  • GTF files should also not have blank lines.

  • gtf文件中线粒体基因的gene_name需要以“Mt-”或者“mt-”开头,否则在报告中mito部分均为0.

Scenario 1: Building a reference genome that is compatible with single-cell data from different platforms.

If you have both single-cell data from 10X Genomics and SeekOne® products, it is recommended to use 10X CellRanger to build the reference genome. SeekSoulTools is compatible with the reference genome built by CellRanger. The code for processing gene annotation files (GTF files) is as follows:

The code for processing gene annotation files (GTF files) is as follows:

/path/to/cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
    --attribute=gene_biotype:protein_coding \
    --attribute=gene_biotype:lncRNA \
    --attribute=gene_biotype:antisense \
    --attribute=gene_biotype:IG_LV_gene \
    --attribute=gene_biotype:IG_V_gene \
    --attribute=gene_biotype:IG_V_pseudogene \
    --attribute=gene_biotype:IG_D_gene \
    --attribute=gene_biotype:IG_J_gene \
    --attribute=gene_biotype:IG_J_pseudogene \
    --attribute=gene_biotype:IG_C_gene \
    --attribute=gene_biotype:IG_C_pseudogene \
    --attribute=gene_biotype:TR_V_gene \
    --attribute=gene_biotype:TR_V_pseudogene \
    --attribute=gene_biotype:TR_D_gene \
    --attribute=gene_biotype:TR_J_gene \
    --attribute=gene_biotype:TR_J_pseudogene \
    --attribute=gene_biotype:TR_C_gene
cellranger mkref --genome=GRCh38 --fasta=GRCh38.fa --genes=GRCh38-filtered-ensembl.gtf
cd GRCh38/genes
gunzip -dc genes.gtf.gz > genes.gtf

Note

  • If the reference genome built by CellRanger is not compatible with the STAR version of SeekSoulTools, you can specify the STAR path of CellRanger for SeekSoulTools with --star_path /path/to/cellranger-5.0.1/lib/bin/STAR.

  • The chromosome names in fasta files must match the chromosome names in the gtf file. For example, if the name of chromosome 1 in fasta files is chr1, then the name of chromosome 1 in the gtf file must also be chr1.


Scenario 2: if you only have SeekOne® products, there is no need to consider platform compatibility.

The code for building genome index using STAR is as follows:

/demo/seeksoultools.1.2.0/bin/STAR \
  --runMode genomeGenerate \
  --runThreadN 16 \                        
  --genomeDir /path/to/star \             
  --genomeFastaFiles /path/to/genome.fa \  
  --sjdbGTFfile /path/to/genome.gtf \      
  --sjdbOverhang 149 \                     
  --limitGenomeGenerateRAM 17179869184     

Note

  • The chromosome names in fasta files must match the chromosome names in the gtf file. For example, if the name of chromosome 1 in fasta files is chr1, then the name of chromosome 1 in the gtf file must also be chr1.