How to Build a Reference Genome
Preparing Files Required for Genome Construction
When using SeekSoulTools for transcriptome and whole-sequence analysis, you need to prepare the reference genome sequence and the corresponding GTF annotation file for the species. The file format specifications are as follows:
Genome Sequence
The genome sequence must be in FASTA format. Every seqname in the first column of the GTF file must match a chromosome ID in the FASTA file; that is, the GTF seqnames must be a subset of the FASTA chromosome IDs. The FASTA file must not contain empty lines.
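The two requirements above (GTF seqnames are a subset of the FASTA IDs, and no empty lines) can be checked before building the index. The following is a minimal sketch and not part of SeekSoulTools; the function names are hypothetical, and it assumes that a FASTA ID is the header text between ">" and the first whitespace.

```python
from typing import Iterable, Set


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect sequence IDs: header text after '>' up to the first whitespace (assumption)."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gtf_seqnames(lines: Iterable[str]) -> Set[str]:
    """Collect column 1 of every GTF record, skipping '#' comments and blank lines."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def missing_seqnames(fasta_lines: Iterable[str], gtf_lines: Iterable[str]) -> Set[str]:
    """Return GTF seqnames with no matching FASTA ID; an empty set means the files agree."""
    return gtf_seqnames(gtf_lines) - fasta_ids(fasta_lines)


def has_empty_lines(lines: Iterable[str]) -> bool:
    """Return True if any line is empty or contains only whitespace."""
    return any(not line.strip() for line in lines)
```

All four functions only iterate over lines, so open file handles can be passed directly instead of lists.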
GTF File
The GTF file format specifications are as follows:
seqname: Sequence name, typically chromosome or Contig ID
source: Annotation source, either a database name (e.g., RefSeq database) or a software name (e.g., GeneScan prediction); if there is none, fill in a dot (.)
feature: Represents the feature type corresponding to the interval. Common types in GTF include: gene, transcript, CDS, exon, start_codon, stop_codon, etc.
start: Starting position of the feature
end: Ending position of the feature
score: Represents the confidence level of feature existence and coordinates, can be a floating-point number or integer, "." indicates empty or not required
strand: Indicates whether the feature is on the positive (+) or negative (-) strand of the reference genome
frame: 0 indicates that the first complete codon of the reading frame starts at the 5' end of the feature, 1 indicates one extra base before the first complete codon, and 2 indicates two extra bases before the first complete codon. Note that frame is not the remainder of the CDS length divided by 3. If the strand is '-', the first base of the feature is its 'end' coordinate, because the coding region is read on the antisense strand from end to start
attribute: Should have the format attributes_name "attributes_values"; each attribute must end with a semicolon and be separated from the next attribute by a space, and the attribute value must be enclosed in double quotes. Contains the following three attributes:
attribute | Meaning |
---|---|
gene_id "value"; | Represents the unique ID of the gene locus where the transcript is located on the genome. gene_id and value are separated by a space. If the value is empty, it indicates no corresponding gene. |
transcript_id "value"; | Unique ID of the predicted transcript. transcript_id and value are separated by a space. Empty indicates no transcript. |
gene_type "value"; | Biological type of the gene, such as protein coding, lncRNA, etc. |
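To make the nine-column layout and the attribute syntax concrete, here is a minimal parsing sketch. The names `parse_gtf_line`, `COLUMNS`, and `ATTRIBUTE` are hypothetical, and the regular expression only accepts attributes written exactly as described above (name, space, double-quoted value, semicolon); unquoted values are ignored.

```python
import re
from typing import Dict

# Column names as listed in the specification above.
COLUMNS = ("seqname", "source", "feature", "start", "end",
           "score", "strand", "frame", "attribute")

# One attribute: name, a space, a double-quoted value, then a semicolon.
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line: str) -> Dict[str, object]:
    """Split one GTF record into its nine columns and parse the attribute column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} tab-separated columns, found {len(fields)}"
        )
    record: Dict[str, object] = dict(zip(COLUMNS, fields))
    record["start"] = int(fields[3])
    record["end"] = int(fields[4])
    # If an attribute name repeats, the last value wins in this sketch.
    record["attribute"] = dict(ATTRIBUTE.findall(fields[8]))
    return record
```

For a record whose last column is `gene_id "GENE0001"; gene_type "protein_coding";`, the `attribute` entry becomes a dictionary with the keys `gene_id` and `gene_type`.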
Notes for GTF file preparation:
The feature column in the GTF file must include gene, transcript, and exon information for each gene;
When the feature column is 'gene', the attributes column must include gene_id and gene_type. If there is no gene_name, the gene_id value will be used as the gene name. When the feature column is 'transcript', the attributes column must include transcript_id. When the feature column is 'exon', the attributes column must include exon_id, otherwise it will affect the processing when reads are annotated to multiple genes.
The GTF file should not contain empty lines.
In the GTF file, mitochondrial gene names must start with "Mt-" or "mt-", otherwise the mito section in the report will all be 0.
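The notes above can be turned into a quick pre-flight check. This is a sketch under stated assumptions, not SeekSoulTools' own validation: `check_gtf` and `REQUIRED` are hypothetical names, records are grouped by their gene_id attribute, and the mitochondrial naming rule is not checked.

```python
import re
from collections import defaultdict
from typing import Iterable, List

ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')

# Attributes each feature type must carry, taken from the notes above.
REQUIRED = {
    "gene": ("gene_id", "gene_type"),
    "transcript": ("transcript_id",),
    "exon": ("exon_id",),
}


def check_gtf(lines: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    features_by_gene = defaultdict(set)
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # header/comment lines are skipped
        if not line.strip():
            problems.append(f"line {number}: empty line")
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            problems.append(f"line {number}: expected 9 columns, found {len(fields)}")
            continue
        feature = fields[2]
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        for key in REQUIRED.get(feature, ()):
            if not attrs.get(key):
                problems.append(f"line {number}: {feature} record lacks {key}")
        if attrs.get("gene_id"):
            features_by_gene[attrs["gene_id"]].add(feature)
    # Every gene needs gene, transcript, and exon records.
    for gene_id, seen in sorted(features_by_gene.items()):
        for feature in ("gene", "transcript", "exon"):
            if feature not in seen:
                problems.append(f"gene {gene_id}: no {feature} record")
    return problems
```

Returning a list of messages rather than raising on the first error lets all problems in a large annotation file be reported in one pass.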
Scenario 1: Building Reference Genome Compatible with Different Single-cell Data Platforms
If you have both 10X Genomics single-cell data and SeekOne™ product single-cell data, it is recommended to build the reference genome with 10X CellRanger. SeekSoulTools is compatible with reference genomes built by CellRanger.
Filter the gene annotation file (GTF file) and build the reference as follows:
/path/to/cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
--attribute=gene_biotype:protein_coding \
--attribute=gene_biotype:lncRNA \
--attribute=gene_biotype:antisense \
--attribute=gene_biotype:IG_LV_gene \
--attribute=gene_biotype:IG_V_gene \
--attribute=gene_biotype:IG_V_pseudogene \
--attribute=gene_biotype:IG_D_gene \
--attribute=gene_biotype:IG_J_gene \
--attribute=gene_biotype:IG_J_pseudogene \
--attribute=gene_biotype:IG_C_gene \
--attribute=gene_biotype:IG_C_pseudogene \
--attribute=gene_biotype:TR_V_gene \
--attribute=gene_biotype:TR_V_pseudogene \
--attribute=gene_biotype:TR_D_gene \
--attribute=gene_biotype:TR_J_gene \
--attribute=gene_biotype:TR_J_pseudogene \
--attribute=gene_biotype:TR_C_gene
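# Build the reference; --genes must point to the filtered GTF written by mkgtf
/path/to/cellranger mkref --genome=GRCh38 --fasta=GRCh38.fa --genes=Homo_sapiens.GRCh38.ensembl.filtered.gtf
# Decompress the GTF stored in the reference directory
cd GRCh38/genes
gunzip -c genes.gtf.gz > genes.gtf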
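To illustrate what the mkgtf filtering step does conceptually, the sketch below keeps only records whose gene_biotype value is in an allowed set. It is an illustration in Python, not cellranger's implementation; `filter_by_biotype` is a hypothetical name, and keeping the "#" header lines is an assumption of this sketch.

```python
import re
from typing import Iterable, Iterator, Set

# Matches the gene_biotype attribute as a whole word, e.g. gene_biotype "lncRNA";
BIOTYPE = re.compile(r'\bgene_biotype "([^"]*)";')


def filter_by_biotype(lines: Iterable[str], allowed: Set[str]) -> Iterator[str]:
    """Yield header lines and records whose gene_biotype is in `allowed`."""
    for line in lines:
        if line.startswith("#"):
            yield line  # assumption: header comments are passed through unchanged
            continue
        match = BIOTYPE.search(line)
        if match and match.group(1) in allowed:
            yield line
```

With `allowed={"protein_coding", "lncRNA"}`, a record annotated as `gene_biotype "snRNA";` is dropped, while protein_coding and lncRNA records pass through.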
Note:
When the reference genome built by cellranger is not compatible with the STAR version used by SeekSoulTools, you can pass the path of cellranger's STAR to SeekSoulTools, for example:
--star_path /path/to/cellranger-5.0.1/lib/bin/STAR
The chromosome name in the fasta file must match the chromosome name in the gtf file. For example, if the chromosome name in the fasta file is chr1, then the chromosome name in the gtf file must also be chr1.
Scenario 2: Only SeekOne™ Product, No Need to Consider Platform Compatibility
The code for building the STAR index is as follows:
/demo/seeksoultools.1.2.0/bin/STAR \
--runMode genomeGenerate \
--runThreadN 16 \
--genomeDir /path/to/star \
--genomeFastaFiles /path/to/genome.fa \
--sjdbGTFfile /path/to/genome.gtf \
--sjdbOverhang 149 \
--limitGenomeGenerateRAM 17179869184
Note:
The chromosome name in the fasta file must match the chromosome name in the gtf file. For example, if the chromosome name in the fasta file is chr1, then the chromosome name in the gtf file must also be chr1.
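Two numbers in the STAR command above are worth unpacking. The STAR manual recommends setting --sjdbOverhang to the read length minus 1, so 149 would correspond to 150-base reads (the read length is an inference here; the source does not state it). --limitGenomeGenerateRAM is given in bytes, and 17179869184 equals 16 × 1024³, that is, 16 GiB. The sketch below derives both values; `star_index_command` and its parameters are hypothetical and only assemble the argument list.

```python
from typing import List


def star_index_command(star: str, genome_dir: str, fasta: str, gtf: str,
                       read_length: int = 150, threads: int = 16,
                       ram_gib: int = 16) -> List[str]:
    """Assemble a STAR genomeGenerate argument list (does not run STAR)."""
    return [
        star,
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        # STAR manual recommendation: read length minus 1
        "--sjdbOverhang", str(read_length - 1),
        # RAM limit in bytes: GiB * 1024^3
        "--limitGenomeGenerateRAM", str(ram_gib * 1024 ** 3),
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; for reads of a different length, only `read_length` needs to change.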