GTDB taxonomy
======

Taxonomic annotation of WoL2 genomes based on GTDB R207.

 - Source: https://data.gtdb.ecogenomic.org/releases/release207/207.0/

Database files:

 - taxid.map: Genome ID to TaxID mapping.
 - nodes.dmp: NCBI taxdump-style node mapping.
 - names.dmp: NCBI taxdump-style name mapping.
 - lineages.txt: Lineage strings with taxon names.
 - linetids.txt: Lineage strings with TaxIDs.
 - ncbi2gtdb.tsv: Translation table mapping the 1,525 genomes that are absent
   from GTDB R207 to GTDB taxonomy.
 - tax2tree/: Tax2Tree-curated taxonomy.
 - r207.0/: The entire GTDB R207 taxonomy.

Note: The current directory hosts original (uncurated) GTDB annotations.


Dummy taxdump:

NCBI-style taxdump files (taxid.map, nodes.dmp, and names.dmp) were generated
from the GTDB lineages using gtdb_to_taxdump.py so that they can be used by a
wider variety of downstream applications.
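
As a rough illustration of how such files can be consumed, the sketch below
parses taxdump-style text and walks from a TaxID up to the root. It assumes the
usual NCBI layout ("|"-separated fields, with TaxID, parent TaxID, and rank in
nodes.dmp, and TaxID and name in names.dmp). The TaxIDs, rank labels, and names
in the toy data are made up for the example and are not taken from the actual
files.

```python
def parse_dmp(text):
    """Split taxdump-style text into rows of stripped fields."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split("|")]
        # A trailing "|" terminator leaves one empty field at the end.
        if fields[-1] == "":
            fields.pop()
        rows.append(fields)
    return rows


def load_taxonomy(nodes_text, names_text):
    """Build parent, rank, and name lookups keyed by TaxID (as strings)."""
    parent, rank, name = {}, {}, {}
    for f in parse_dmp(nodes_text):
        parent[f[0]], rank[f[0]] = f[1], f[2]
    for f in parse_dmp(names_text):
        # If a name class column is present, keep scientific names only.
        if len(f) > 3 and f[3] != "scientific name":
            continue
        name[f[0]] = f[1]
    return parent, rank, name


def lineage(taxid, parent, name):
    """Return taxon names from just below the root down to taxid."""
    path = []
    # The root is its own parent; stop there (or at an unknown TaxID).
    while parent.get(taxid, taxid) != taxid:
        path.append(name[taxid])
        taxid = parent[taxid]
    return path[::-1]


# Toy data (hypothetical), written in the NCBI "\t|\t" style.
NODES = "1\t|\t1\t|\tno rank\t|\n3\t|\t1\t|\tdomain\t|\n4\t|\t3\t|\tphylum\t|\n"
NAMES = ("1\t|\troot\t|\t\t|\tscientific name\t|\n"
         "3\t|\tBacteria\t|\t\t|\tscientific name\t|\n"
         "4\t|\tActinobacteriota\t|\t\t|\tscientific name\t|\n")

parent, rank, name = load_taxonomy(NODES, NAMES)
print(lineage("4", parent, name))
```

To resolve a genome, look up its TaxID in taxid.map first, then pass that TaxID
to lineage().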

The dummy TaxIDs were assigned through a level-order traversal of the taxonomy
tree: 1 - root, 2 - domain Archaea, 3 - domain Bacteria, then phyla, then
classes, and so on.

Note: These dummy TaxIDs should not be confused with NCBI TaxIDs.
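
The level-order numbering can be sketched as follows. This is a minimal
illustration, not the code of gtdb_to_taxdump.py. The function name
assign_taxids is hypothetical, and visiting siblings in alphabetical order is
an assumption made here so that Archaea precedes Bacteria.

```python
from collections import deque


def assign_taxids(lineages):
    """Number taxa breadth-first over a tree built from lineage lists."""
    # Each node is keyed by its full path from the root, so identical
    # names under different parents stay distinct. The root is ().
    children = {(): []}
    for ranks in lineages:
        for i in range(len(ranks)):
            node, parent = tuple(ranks[:i + 1]), tuple(ranks[:i])
            if node not in children:
                children[node] = []
                children[parent].append(node)

    taxids, queue = {}, deque([()])
    while queue:
        node = queue.popleft()
        taxids[node] = len(taxids) + 1  # next unused ID, starting at 1
        # Assumption: siblings are visited in alphabetical order.
        queue.extend(sorted(children[node]))
    return taxids


# Toy lineages (truncated at phylum for brevity).
taxids = assign_taxids([
    ["d__Bacteria", "p__Proteobacteria"],
    ["d__Archaea", "p__Halobacteriota"],
    ["d__Bacteria", "p__Actinobacteriota"],
])
for path, taxid in sorted(taxids.items(), key=lambda kv: kv[1]):
    print(taxid, path[-1] if path else "root")
```

With this toy input, the root receives 1, the two domains 2 and 3, and the
phyla 4 through 6, so every taxon at one rank is numbered before any taxon at
the next rank.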


Translation:

GTDB R207 contains a total of 317,542 genomes. Among the 15,953 WoL2 genomes,
14,428 (90.4%) were found in this pool. Their taxonomic assignments were
directly adopted.

The remaining 1,525 genomes are not present in GTDB R207. They were classified
by translating their NCBI taxonomic assignments to GTDB using a translation
table provided in the GTDB data release.

Because each NCBI taxon may be mapped to multiple GTDB taxa, the translation
process only considers mappings where at least 95% of the members of an NCBI
taxon are assigned to a single GTDB taxon.

For example (see r207.0/ncbi2gtdb.map):

 - NCBI taxon: c__Acidimicrobiia
 - GTDB taxa:  c__Acidimicrobiia(p__Actinobacteriota) 98.9%,
               c__Vicinamibacteria(p__Acidobacteriota) 0.55%,
               c__Acidobacteriae(p__Acidobacteriota) 0.22%,
               c__Alphaproteobacteria(p__Proteobacteria) 0.22%,
               c__Actinomycetia(p__Actinobacteriota) 0.11%

In this scenario, because 98.9% > 95%, NCBI taxon c__Acidimicrobiia is
translated into GTDB taxon c__Acidimicrobiia.
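
The 95% rule can be sketched as below. The function name translate_taxon and
the dictionary layout are hypothetical and do not reflect the actual format of
the translation table. The percentages are those of the example above.

```python
def translate_taxon(gtdb_percentages, threshold=95.0):
    """Return the dominant GTDB taxon if it reaches the threshold, else None."""
    if not gtdb_percentages:
        return None
    best = max(gtdb_percentages, key=gtdb_percentages.get)
    return best if gtdb_percentages[best] >= threshold else None


# NCBI taxon c__Acidimicrobiia, using the percentages listed above.
acidimicrobiia = {
    "c__Acidimicrobiia": 98.9,
    "c__Vicinamibacteria": 0.55,
    "c__Acidobacteriae": 0.22,
    "c__Alphaproteobacteria": 0.22,
    "c__Actinomycetia": 0.11,
}
print(translate_taxon(acidimicrobiia))

# A made-up split where no GTDB taxon reaches 95%, so nothing is returned.
print(translate_taxon({"c__A": 60.0, "c__B": 40.0}))
```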

Each of the 1,525 genomes was assigned a GTDB taxonomy based on the lowest
rank that could be translated.
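
One way to picture the lowest-rank rule is to walk a genome's NCBI lineage from
its lowest rank upward and stop at the first taxon that has a translation. The
sketch below assumes a dictionary of NCBI taxa that already passed the 95%
filter. The function name, the dictionary, and the lineage shown are all
hypothetical.

```python
def classify_genome(ncbi_lineage, translation):
    """Return the GTDB taxon for the lowest translatable NCBI rank, or None.

    ncbi_lineage is ordered from the highest rank to the lowest rank.
    """
    for taxon in reversed(ncbi_lineage):
        if taxon in translation:
            return translation[taxon]
    return None


# Hypothetical table: only the domain and the class have a translation.
translation = {
    "d__Bacteria": "d__Bacteria",
    "c__Acidimicrobiia": "c__Acidimicrobiia",
}
# Hypothetical NCBI lineage whose order and family are not translatable.
ncbi_lineage = [
    "d__Bacteria", "p__Actinobacteria", "c__Acidimicrobiia",
    "o__ExampleOrder", "f__ExampleFamily",
]
print(classify_genome(ncbi_lineage, translation))
```

Here the family and the order have no entry, so the genome is assigned at the
class level.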