The Greengenes Database
Release 13_5
--

The Greengenes Database, a public resource since 2002 (DeSantis 2003, DeSantis 2006, McDonald 2011), is a well-characterized and curated database of near-full-length small subunit ribosomal sequences from the kingdoms Bacteria and Archaea. With every release, a de novo phylogenetic tree is constructed that incorporates the novel branching patterns, branch lengths and candidate divisions of the sequences that have been deposited into public databases since the prior release. The latest versions are available for download at http://greengenes.secondgenome.com.

Greengenes is operated and maintained by an international consortium, The Greengenes Database Consortium, representing academic and biotech interests. The consortium is composed of:

Phil Hugenholtz
    Australian Center for Ecogenomics, University of Queensland
    School of Chemistry and Molecular Biosciences, University of Queensland
    Institute for Molecular Bioscience, University of Queensland
Daniel McDonald
    Dept. Computer Science, University of Colorado
    BioFrontiers Institute, University of Colorado
Todd DeSantis
    Bioinformatics Department, Second Genome, Inc.
Rob Knight
    Dept. Chemistry and Biochemistry, University of Colorado
    Dept. Computer Science, University of Colorado
    BioFrontiers Institute, University of Colorado
    Howard Hughes Medical Institute

For inquiries regarding database build methods, database build software and sequence inclusion/exclusion criteria, please contact Daniel McDonald (daniel.mcdonald@colorado.edu).
For inquiries on taxonomic nomenclature curation or the Arb distribution, please contact Phil Hugenholtz (phugenholtz@gmail.com).
For general inquiries on web tools to interact with The Greengenes Database, please contact the help desk (greengenes@secondgenome.com).

Many improvements and changes have been made to the process by which the database is built in an effort to streamline the release cycle.
For this release, a few major changes have been made:

- Chimera detection now relies on UCHIME (Edgar 2011) using, as a reference, consensus sequences derived from the 94% OTUs from Greengenes 12_10. Previously identified chimeras are still considered chimeric.
- The inference for determining whether a record is a named_isolate, unnamed_isolate, or clone has been dramatically improved. All existing Greengenes records have been updated and now reflect the improved decision status.
- Mappings to the Integrated Microbial Genomes database v400 are now provided. See gg_13_5_img.txt for a mapping between Greengenes IDs and IMG genomes.
- PyNAST-aligned sequences are now provided and are included in the ARB database. Thank you, Les Dethlefsen, for pointing out the need for these.
- Taxon names above the rank of genus with square brackets are names proposed by the Greengenes curators and will not be found in NCBI. Genus names with square brackets are contested names (usually due to polyphyly of the genus), some of which will be found in NCBI.

As always, various taxonomic updates have been made, and over the last few months we have received valuable feedback about the taxonomy. Specifically, we'd like to thank the following people who provided feedback on errors or inconsistencies in the taxonomy:

Alex Probst
Bing Ma
Francesca DeFilippis
Kyle Bittinger
Niels Larsen
Cathy Lozupone

Greengenes is a living project. We sincerely value comments from the community and are working on improving the ways in which we solicit taxonomic feedback.

Methods.

Sequences not previously observed were obtained from Genbank during January of 2013. These records were parsed for viable sequence. All obtained sequences were run through SSU-Align 0.1 (Nawrocki 2009). SSU-Align was run as follows: First, ssu-prep was specified with --dna and set to perform a parallel run.
Second, the alignments were run through ssu-mask, specifying --dna, output as aligned FASTA (--afa), and use of the default SSU-Align masks (-d). Sequences that aligned were then filtered by length, dropping all reads shorter than 1200 nt, tallied after SSU-Align deleted bases not fitting its secondary structure model. The remaining sequences were then inflated to NAST width.

This bag of reads was run through UCHIME. The reference database was composed of consensus sequences from the 94% GG_12_10 OTUs, requiring a minimum of 10 sequences per cluster. Any sequence flagged as chimeric that bridged classes or higher was dropped. Any sequence with >= 1% non-ACGT bases was dropped. Any sequence with >10% of positions that varied on invariant 16S positions was dropped.

OTU picking.

Sequences were sorted so as to increase stability in the picked representative sequences. The following preferences were used:

1) in the previous release
2) correctly named isolate
3) length

OTUs were picked using QIIME (Caporaso, et al. 2010) and UCLUST (Edgar, 2010), where the sequences clustered were unaligned 16S reads. UCLUST was run after sorting by the above criteria.

Initial tree construction was performed using FastTree (Price, et al 2010) with the representative sequences from the 99% OTUs. The parameters specified to FastTree were "-nt -gamma -fastest -no2nd -spr 4", as recommended by FastTree's author, Morgan Price. A donor taxonomy based on the previous Greengenes release was decorated onto the tree. Taxonomy curation was performed on the 99% tree and expanded out to the full set of sequences. Taxonomy verification was performed to ensure that all paths through the tree contained all 7 levels of the taxonomy in proper order, and that the taxonomy itself formed a true hierarchy (custom scripts at the moment; see the note below about the Greengenes source code).

File listing and descriptions.

All files are prefixed with gg_<release>, where <release> is composed of <year>_<month>.
For instance, gg_13_5 refers to the Greengenes release for May 2013.

00README
    - This file
00CHANGELOG
    - Important changes since the last release
00ROADMAP
    - Planned changes and additions including release dates
00STATS
    - Quick stats on the number of sequences included at various similarity levels
gg_13_5_otus_99_annotated.tree.gz
    - 99% OTU rooted tree with taxonomy decorated
    - Phylogenetic reconstruction performed with FastTree (Price, et al 2010)
    - Base taxonomy decorated with tax2tree (McDonald, et al 2011)
gg_13_5_taxonomy.txt.gz
    - Full taxonomy for every sequence in the release.
    - All taxonomy strings are strictly 7 levels and prefixed
gg_13_5.fasta.gz
    - Full release unaligned sequences (no bases dropped).
gg_13_5_ssualign.fasta.gz
    - Full release aligned sequences
    - Sequences were aligned with SSU-Align (Nawrocki 2009). As a consequence of this software, bases corresponding to structural deviations from SSU-Align's model are removed from the sequence. It is not recommended to use this alignment for probe design or any other operation where you need access to all contiguous bases in the sequence.
    - Each sequence is represented with 7,682 characters. Dashes (-) represent either missing data, as on the 5' and 3' termini, or an alignment gap, as is interspersed throughout the sequence.
gg_13_5_pynast.fasta.gz
    - Near full release of aligned sequences. Approximately 1400 sequences that were alignable by SSU-Align failed PyNAST (Caporaso 2010) alignment. The original Greengenes coreset was used, and its out-of-date coverage probably contributed to the alignment failures.
gg_13_5_accessions.txt.gz
    - A mapping from Greengenes IDs to external databases
    - This is primarily Genbank references, but includes a few hundred links to IMG genome IDs, as there was no automatic means to infer some of the NCBI accessions.
gg_13_5_img.txt.gz
    - A mapping specifically between Greengenes IDs and IMG Genomes.
    - Performed by accession, not nearest neighbor
gg_13_5_otus.tgz
    - QIIME (Caporaso, et al 2010) compatible OTUs
    - Sequences are sorted to give seed preference to:
        - sequences in the previous release
        - named isolates
        - longer sequences
    - Clusters are determined using QIIME-wrapped UCLUST (Edgar 2010)
    - Now includes the representative sequences as aligned by SSU-Align, which can be used as a template for PyNAST.
    - Code can be obtained from
        - https://github.com/qiime-dev/nested_reference_otus
        - commit 821a98df6773ea4e4d209af20b9a8cf34d00324
gg_13_5.sql.gz
    - Full Greengenes records
    - This is a mysqldump. The user is 'greengenes' without a password. The database is named 'greengenes'.
    - NOTE: This database currently contains sequence information only for those records included in the release; however, all examined Genbank records that contained alignable 16S are described. This is a work in progress, with additional record data and functionality to be added in subsequent releases.
gg_13_5_chimeras.txt.gz
    - The current chimera blacklist

Source code.

Parts of the Greengenes code base are provided here:

https://github.com/greengenes/Greengenes

However, the provided code is still quite limited. We're continuing to make changes on the backend of Greengenes, and the structure of the code base has not yet been finalized.

References:

Caporaso JG, Kuczynski J, Stombaugh S, Bittinger K, Bushman FD, Costello EK, Fierer N, Gonzalez A, Goodrich JK, Gordon GI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J and Knight R (2010) QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 2010; doi:10.1038/nmeth.f.303
Price, M.N., Dehal, P.S., and Arkin, A.P. (2010) FastTree 2 -- Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE, 5(3):e9490. doi:10.1371/journal.pone.0009490.
McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, Andersen GL, Knight R and Hugenholtz P. (2011). An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. The ISME Journal 6: 610-618. doi:10.1038/ismej.2011.139.
Nawrocki, EP. (2009). Structural RNA Homology Search and Alignment Using Covariance Models. PhD Thesis: Washington University School of Medicine.
Edgar, R.C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics. doi:10.1093/bioinformatics/btq461.
Edgar, RC, Haas, BJ, Clemente, JC, Quince, C, Knight, R (2011) UCHIME improves sensitivity and speed of chimera detection. Bioinformatics. doi:10.1093/bioinformatics/btr381.
DeSantis, T. Z., I. Dubosarskiy, S. R. Murray, and G. L. Andersen (2003), Comprehensive aligned sequence construction for automated design of effective probes (CASCADE-P) using 16S rDNA, Bioinformatics, 19(12), 1461-1468.
DeSantis, T. Z., P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen (2006), Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB, Appl Environ Microbiol, 72(7), 5069-5072.
Caporaso JG, Bittinger K, Bushman FD, DeSantis TZ, Andersen GL, and Knight R (2010), PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26: 266-267.
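
Example: checking taxonomy strings.

The Methods section mentions a verification step ensuring that every taxonomy path contains all 7 levels in proper order. The Python sketch below shows one way a user might run a similar check on downloaded taxonomy strings. It is not the curators' custom script, and it does not test whether the taxonomy forms a true hierarchy. The rank prefixes (k__ through s__), the "; " separator and the tab-separated "ID<TAB>taxonomy" line layout are assumptions made for illustration; this README states only that the strings are strictly 7 levels and prefixed. The function names are hypothetical.

```python
# Assumed rank prefixes, in order from kingdom to species. The README does not
# spell these out, so treat them as an illustrative assumption.
EXPECTED_PREFIXES = ("k__", "p__", "c__", "o__", "f__", "g__", "s__")


def check_taxonomy_string(taxonomy):
    """Return a list of problems found in one taxonomy string (empty if none)."""
    levels = [level.strip() for level in taxonomy.split(";")]
    if len(levels) != len(EXPECTED_PREFIXES):
        return ["expected %d levels, found %d" % (len(EXPECTED_PREFIXES), len(levels))]
    problems = []
    for position, (level, prefix) in enumerate(zip(levels, EXPECTED_PREFIXES), start=1):
        # Each level must carry the prefix expected at its position.
        if not level.startswith(prefix):
            problems.append(
                "level %d should start with %r, found %r" % (position, prefix, level)
            )
    return problems


def check_taxonomy_lines(lines):
    """Map sequence ID -> problems for lines assumed to be 'ID<TAB>taxonomy'."""
    failures = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        seq_id, _, taxonomy = line.partition("\t")
        problems = check_taxonomy_string(taxonomy)
        if problems:
            failures[seq_id] = problems
    return failures
```

Under these assumptions, the taxonomy file could be streamed with Python's gzip.open("gg_13_5_taxonomy.txt.gz", "rt") and the open handle passed directly to check_taxonomy_lines; an empty result would mean that every string passed the level-count and prefix-order checks.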