"RS225" reference genome database ====== RS225 is a collection of reference microbial genomes sampled from the NCBI RefSeq genome database, as of 2024-08-01. This time point corresponds to RefSeq release 225. RS225 contains 40,987 genomes from NCBI RefSeq and 11,771 genomes from external sources. The total number of genomes, 52,758, represents an 78% increase from the previous version of the database. Statistics: - Total number of genomes: 52,758 - Total length of genomes (after adding linkers): 170,326,480,530 bp Number of genomes by category: - Archaea: 870 - Bacteria: 32,894 - Fungi: 610 - Protozoa: 93 - Viral: 18,279 - SynDNA Constructs: 12 Number of taxonomic units by rank: - Domains: 4 - Phyla: 109 - Classes: 231 - Orders: 496 - Families: 1,266 - Genera: 7,244 - Species: 32,216 Sampling methods (NCBI genomes): 0. Start with all NCBI RefSeq genomes (n = 388,047). 1. Keep five categories: archaea, bacteria, fungi, protozoa, and viral. 2. Manually curate TaxID mapping to match the NCBI taxonomy database. 3. Drop genomes without species-level assignment. 4. Drop genomes with any of the following words in the organism name: "unknown", "uncultured", "unidentified", "unclassified", "unresolved", "environmental", "synthetic". 5. Keep all reference, representative, and type material genomes. 6. Sample one genome per species with a Latinate species name, if not already. The sampling order is complete genome > scaffolds > contigs. 7. From genus to phylum, sample one genome per taxonomic group, if not already. Sampling methods (non-NCBI genomes): 8. Start with non-NCBI genomes collected from external sources (n = 11,865). 9. Drop genomes without known taxid or taxonomy. 10. If the taxid is available, retrieve NCBI taxonomy using ete3. If the taxid is not available but the GTDB taxonomy is availablem retrieve closest taxid based on the GTDB taxonomy and then retrieve NCBI taxonomy using ete3. 11. Assign genome IDs 'HXXXXXXXXX' for external genomes. 12. 
Combine metadata from NCBI and non-NCBI genomes. Database files: - all.fna: Concatenated genome sequences in multi-FASTA format. Specifically: Nucleotide sequences of each genome were concatenated with a linker of 20 "N"s into one sequence, and named following the genome ID (e.g., G000123456). Sequences of all genomes were then merged into one FASTA file. - assembly.tsv: Genome assembly report retrieved from NCBI RefSeq. TaxIDs were curated to match the taxonomy database. - length.map: Mapping of genomes to total lengths (bp) (after adding linkers). - category.map: Mapping of genomes to organism categories (archaea, bacteria, fungi, protozoa, viral). - taxonomy/: Taxonomic classification of genomes, according to NCBI taxonomy database release 2024-09-01. - bowtie2/: Bowtie2 database of genome sequences. - ncbi/: Raw data files retrieved from NCBI.
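
The construction of all.fna and length.map described above can be illustrated with a minimal Python sketch. It is not the script used to build the database. The function names, the toy sequences, and the 80-character line width are hypothetical, and it assumes the linker is placed only between the sequences of a genome, not at either end.

```python
LINKER = "N" * 20  # linker of 20 "N"s, as stated in the all.fna description


def concatenate_genome(sequences, linker=LINKER):
    """Join the nucleotide sequences of one genome into a single sequence.

    Assumption: the linker goes between sequences only, not at the ends.
    """
    return linker.join(sequences)


def format_fasta_record(genome_id, sequence, width=80):
    """Format one genome as a FASTA record named by its genome ID.

    The line width of 80 is an arbitrary choice for this sketch.
    """
    lines = [f">{genome_id}"]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Toy input (made-up data): genome ID -> sequences of that genome.
genomes = {
    "G000123456": ["ACGTACGT", "GGCC"],
    "G000654321": ["TTTT"],
}

fasta_text = ""   # content that would go into a multi-FASTA file
length_map = {}   # genome ID -> total length in bp, after adding linkers
for genome_id, sequences in genomes.items():
    merged = concatenate_genome(sequences)
    fasta_text += format_fasta_record(genome_id, merged)
    length_map[genome_id] = len(merged)
```

In this toy case, the first genome has a total length of 8 + 20 + 4 = 32 bp, because the linker counts toward the length, consistent with the "after adding linkers" note on length.map.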