Protein-coding genes in genomes
======

Open reading frames (ORFs) of individual genomes were predicted uniformly
using Prodigal, rather than adopted from the NCBI annotations.

ORF prediction was performed using Prodigal v2.6.3. Command:

```
prodigal -p single -g <genetic_code> -i input.fna \
  -f gff -o output.gff -a output.faa -d output.ffn
```
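To apply the same command to many genomes, each genome can be paired with its genetic code from gcode.map (listed under "Mapping files"). The following is a minimal sketch, not the pipeline that was actually used. It assumes that gcode.map has two tab-separated columns (genome ID, genetic code table), and the per-genome file names (`<genome ID>.fna` and so on) are hypothetical. It only prints the commands, so they can be inspected before being run.

```shell
# Print one Prodigal command per genome listed in a gcode.map-style file.
# Assumed input layout: genome ID <tab> genetic code table, one genome per line.
# File names such as G000123456.fna are hypothetical placeholders.
print_prodigal_cmds() {
  while IFS="$(printf '\t')" read -r gid gcode; do
    [ -n "$gid" ] || continue   # skip empty lines
    printf 'prodigal -p single -g %s -i %s.fna -f gff -o %s.gff -a %s.faa -d %s.ffn\n' \
      "$gcode" "$gid" "$gid" "$gid" "$gid"
  done < "$1"
}
```

Usage: `print_prodigal_cmds gcode.map` lists the commands; piping the output to `sh` would execute them, provided the input files exist under the assumed names.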

Sequence files:

 - all.faa: Translated protein sequences of ORFs.

   - The ORF IDs are in the format "genome ID <underscore> ORF index". For
     example, "G000123456_789" stands for the 789th ORF in genome G000123456.

   - This file is the input for building protein databases (see databases/).

 - all.ffn: DNA sequences of protein-coding regions.

 - prodigal/: Raw Prodigal ORF prediction results (GFF format).
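
Because the ORF index follows the last underscore, an ORF ID can be split back into its genome ID and ORF index with plain shell tools. The sketch below is illustrative only: the function names are made up for this example, and the counting helper assumes that FASTA headers start with the ORF ID, optionally followed by a space and a description.

```shell
# Split an ORF ID of the form "<genome ID>_<ORF index>" at the last underscore.
orf_genome() { printf '%s\n' "${1%_*}"; }    # G000123456_789 -> G000123456
orf_index()  { printf '%s\n' "${1##*_}"; }   # G000123456_789 -> 789

# Count ORFs per genome from FASTA headers read on standard input.
# Output: genome ID <tab> number of ORFs.
count_orfs_per_genome() {
  grep '^>' | cut -c2- | cut -d' ' -f1 | sed 's/_[^_]*$//' |
    sort | uniq -c | awk '{ print $2 "\t" $1 }'
}
```

Usage: `count_orfs_per_genome < all.faa` (decompress the file first if it is stored compressed).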

Mapping files:

 - coords.txt: Coordinates of ORFs in their host genome sequences (bp).

   - Format: ORF index <tab> start <tab> end

     Coordinates are 1-based, and both start and end are inclusive. For
     example, a start and end of "1 <tab> 30" represent the first 30
     nucleotides of the genome.

 - length.map: Mapping of ORF IDs to lengths (bp).

   - This file is useful for normalizing gene frequencies by gene length.

 - nucl.map: Mapping of genome-based ORF IDs (like "G000123456_789") to the
   original nucleotide-based ORF IDs (like "NC_123456.1_789") assigned by
   Prodigal.

 - gcode.map: Mapping of genome IDs to genetic code tables.

   - Genetic code tables of genomes were determined based on the original NCBI
     annotation (see ../taxonomy/ncbi/), inspection of the WoL2 phylogeny, and
     literature review. The resulting codes were passed to the individual
     Prodigal runs through the -g option (see above).
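
Two of the conventions above can be illustrated with a short worked example. With 1-based, inclusive coordinates, a feature spanning start to end covers end - start + 1 nucleotides, and dividing a per-ORF count by the ORF length gives a length-normalized frequency. The sketch below assumes that length.map has two tab-separated columns (ORF ID, length in bp) and that counts come from a hypothetical two-column table (ORF ID, count). The function names are illustrative, and the absolute difference is taken as a precaution in case start and end appear in either order.

```shell
# Length (bp) of a feature given 1-based, inclusive start and end coordinates.
# The absolute difference is used so the order of start and end does not matter.
span_length() {
  awk -v s="$1" -v e="$2" 'BEGIN { d = e - s; if (d < 0) d = -d; print d + 1 }'
}

# Divide per-ORF counts by ORF length.
# $1: length.map-style file (ORF ID <tab> length in bp; assumed layout)
# $2: hypothetical count table (ORF ID <tab> count)
# Output: ORF ID <tab> count per bp, for ORFs present in both files.
normalize_by_length() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { len[$1] = $2; next }                        # first file: store lengths
    ($1 in len) && len[$1] > 0 { print $1, $2 / len[$1] }   # second file: normalize
  ' "$1" "$2"
}
```

For instance, `span_length 1 30` prints 30, matching the coordinate example given for coords.txt.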

Taxonomy:

Individual ORFs may inherit the taxonomic assignment of their host genome.

Explore ../taxonomy to choose the desired taxonomy system. The following
command then generates a mapping of ORF IDs to TaxIDs, using the default
taxonomy as an example. It reads the ORF IDs from the first column of the
compressed ORF length mapping.

```
join -j1 -t$'\t' --nocheck-order <(xzcat length.txt.xz | cut -f1 |\
  tr '_' '\t') ../taxonomy/taxid.map | sed 's/\t/_/' > taxid.map
```
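
Note that join expects both inputs to be sorted on the join field, and --nocheck-order only suppresses the check. As an alternative sketch, not the original method, the same mapping can be built with an awk lookup table, which does not depend on sort order. It assumes that ../taxonomy/taxid.map has two tab-separated columns (genome ID, TaxID) and that ORF IDs arrive in the first column of standard input; the function name is illustrative.

```shell
# Map ORF IDs (first column of standard input) to the TaxID of their genome.
# $1: genome-to-TaxID mapping (genome ID <tab> TaxID; assumed layout)
# Output: ORF ID <tab> TaxID, for ORFs whose genome is listed in $1.
orfs_to_taxids() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { taxid[$1] = $2; next }   # first file: genome ID -> TaxID
    {
      genome = $1
      sub(/_[^_]*$/, "", genome)         # drop the trailing "_<ORF index>"
      if (genome in taxid) print $1, taxid[genome]
    }
  ' "$1" -
}
```

Usage, mirroring the command above: `xzcat length.txt.xz | orfs_to_taxids ../taxonomy/taxid.map > taxid.map`.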