Protein-coding genes in genomes
======

Open reading frames (ORFs) of individual genomes, uniformly predicted using Prodigal (rather than adopted from NCBI).

ORF prediction was performed using Prodigal v2.6.3. Command:

```
prodigal -p single -g <gcode> -i input.fna \
  -f gff -o output.gff -a output.faa -d output.ffn
```

Here `<gcode>` is the genetic code table of the genome (see gcode.map below).

Sequence files:

- all.faa: Translated protein sequences of ORFs.
  - ORF IDs are in the format "<genome ID>_<ORF index>". For example, "G000123456_789" stands for the 789th ORF in genome G000123456.
  - This file is the input for building protein databases (see databases/).
- all.ffn: DNA sequences of protein-coding regions.
- prodigal/: Raw Prodigal ORF prediction results (GFF format).

Mapping files:

- coords.txt: Coordinates of ORFs in their host genome sequences (bp).
  - Format: <ORF index> <start> <end>
  - Coordinates are 1-based. Both starts and ends are inclusive. For example, "1 30" represents the first 30 nucleotides of the genome.
- length.map: Mapping of ORF IDs to lengths (bp).
  - This file is useful for normalizing gene frequencies by gene length.
- nucl.map: Mapping of genome-based ORF IDs (like "G000123456_789") to the original nucleotide sequence-based ORF IDs (like "NC_123456.1_789") assigned by Prodigal.
- gcode.map: Mapping of genome IDs to genetic code tables.
  - Genetic code tables of genomes were determined based on the original NCBI annotation (see ../taxonomy/ncbi/), inspection of the WoL2 phylogeny, and literature review. The resulting tables were used in the individual Prodigal runs (see above).

Taxonomy:

ORFs can inherit the taxonomic assignment of their host genomes. Please explore ../taxonomy to choose the desired taxonomy system. The following command then generates a mapping of ORFs to TaxIDs (using the default taxonomy as an example).

```
join -j1 -t$'\t' --nocheck-order <(xzcat length.txt.xz | cut -f1 |\
  tr '_' '\t') ../taxonomy/taxid.map | sed 's/\t/_/' > taxid.map
```
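
Working with ORF IDs and lengths:

The ORF ID convention and the length normalization mentioned under "Mapping files" can be sketched in shell as follows. This is only an illustration: the function names are hypothetical and not part of this dataset, and the sketch assumes that the length mapping and the user-supplied count table are both uncompressed, tab-separated, two-column files (ORF ID, value).

```shell
# Helpers for ORF IDs of the form "<genome ID>_<ORF index>",
# e.g. "G000123456_789". Function names are hypothetical.

# Print the genome ID: everything before the last underscore.
orf_genome() {
  printf '%s\n' "${1%_*}"
}

# Print the ORF index: everything after the last underscore.
orf_index() {
  printf '%s\n' "${1##*_}"
}

# Divide per-ORF counts by ORF length (bp).
# Assumption: both inputs are tab-separated with two columns.
#   $1: length mapping (ORF ID <tab> length in bp)
#   $2: count table    (ORF ID <tab> count)
# ORFs missing from the length mapping are skipped.
normalize_by_length() {
  awk -F'\t' '
    NR == FNR { len[$1] = $2; next }     # first file: remember lengths
    ($1 in len) && len[$1] > 0 {         # second file: count / length
      printf "%s\t%.6f\n", $1, $2 / len[$1]
    }' "$1" "$2"
}
```

For example, `orf_genome G000123456_789` prints the genome ID and `orf_index G000123456_789` prints the ORF index. Splitting at the last underscore (rather than the first) keeps the helpers correct even if an ID prefix itself contains underscores.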