The Greengenes Database Release 12_10
--
README version 2012-10-18

The Greengenes Database, a public resource since 2002 (DeSantis 2003, 
DeSantis 2006, McDonald 2011), is a well-characterized and curated database of
near-full-length small subunit ribosomal RNA gene sequences from the kingdoms
Bacteria and Archaea. With every release, a de novo phylogenetic tree is
constructed that incorporates the novel branching patterns, branch lengths and
candidate divisions of the sequences that have been deposited into public
databases 
since the prior release.  Latest versions are available for download at 
http://greengenes.secondgenome.com.

Greengenes is operated and maintained by an international consortium, The 
Greengenes Database Consortium, representing academic and biotech interests. 
The consortium is composed of:

Phil Hugenholtz, Australian Center for Ecogenomics University of Queensland,
    School of Chemistry and Molecular Biosciences University of Queensland,
    Institute for Molecular Bioscience University of Queensland
Daniel McDonald, Dept. Computer Science University of Colorado, BioFrontiers 
    Institute University of Colorado
Todd DeSantis, Bioinformatics Department, Second Genome, Inc.
Rob Knight, Dept. Chemistry and Biochemistry University of Colorado, Dept. 
    Computer Science University of Colorado, BioFrontiers Institute University
    of Colorado, Howard Hughes Medical Institute

For inquiries regarding database build methods, database build software and 
sequence inclusion/exclusion criteria, please contact Daniel McDonald 
(daniel.mcdonald@colorado.edu).

For inquiries on taxonomic nomenclature curation or the Arb distribution please
contact Phil Hugenholtz (phugenholtz@gmail.com).

For general inquiries on web tools to interact with The Greengenes Database 
please contact the help desk (greengenes@secondgenome.com).


Many improvements and changes have been made to the process by which the 
database is built in an effort to streamline the release cycle. Notably:

- Chimera detection now relies on Chimera Slayer, Bellerophon and curation by 
    Phil Hugenholtz. The 2-study heuristic has been dropped. The chimeras are
    described in the gg_12_10_chimeras.txt file, which gives the Genbank
    accession, Greengenes ID and a brief justification for each. Furthermore, 
    the reference database used is now based on consensus sequences at the
    94% OTU level from the previous release.
     
- The percent invariant calculation is now performed on a per-kingdom basis. We
    discovered that a sizable set of sequences was being dropped in error when
    kingdom-specific positions were ignored. MSA columns were considered
    invariant if they did not vary in 99% of the sequences from the 
    previous release of Greengenes. Only the characters retained by the 
    SSU-Align method, following masking by SSU-Align, were evaluated.

- Many of the taxonomic group name differences between Silva and Greengenes 
    have been resolved. These groups are now communicating on a regular basis
    to help maintain consistency between the databases.

Our goal, now and going forward, is to release biannually. This release
schedule will allow Greengenes to rapidly absorb novel diversity as our
knowledge of the full tree of life expands.

Methods. Sequences were obtained from Genbank during June and early July 2012.
These records were parsed for viable sequence. All obtained sequences were run
through SSU-Align 0.1 (Nawrocki 2009). SSU-Align was run as follows: first,
ssu-prep was run with -dna and set to perform a parallel run; second, the
alignments were run through ssu-mask, specifying -dna, aligned FASTA output
(--afa) and the default SSU-Align masks (-d). Sequences that aligned were then
filtered by length, dropping all reads shorter than 1200 nt as tallied after
SSU-Align deleted bases not fitting its secondary structure model. The
remaining sequences were then inflated to NAST width. This set of reads was
run through Chimera Slayer and Bellerophon. The reference database used was
composed of consensus sequences from the 94% 4feb2011 Greengenes OTUs,
requiring a minimum of 10 sequences per cluster. Any sequence flagged by both
programs was marked as chimeric. Additionally, any sequence that introduced a
conflict at the class level as identified by Chimera Slayer was dropped, and
any sequence with both a divergence ratio > 1.2 and a class-level divergence
as identified by Bellerophon was dropped. Any sequence with >= 1% non-ACGT
bases was dropped. Any sequence that varied at >10% of the invariant 16S
positions was dropped.
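
As an illustration only, three of the filters described in the Methods (the
1200 nt length cutoff, the >= 1% non-ACGT cutoff, and the rule that a sequence
flagged by both chimera programs is chimeric) could be sketched in Python as
below. This is not the code used to build the release. The function names and
the two boolean flags are hypothetical, the input is assumed to be the
ungapped sequence as it stands after SSU-Align, and the class-level conflict,
divergence ratio and invariant-position filters are left out.

```python
from typing import Optional

MIN_LENGTH = 1200               # reads shorter than 1200 nt are dropped
MAX_NON_ACGT_FRACTION = 0.01    # sequences with >= 1% non-ACGT bases are dropped


def non_acgt_fraction(seq: str) -> float:
    """Return the fraction of characters that are not A, C, G or T."""
    if not seq:
        return 1.0
    non_acgt = sum(1 for base in seq.upper() if base not in "ACGT")
    return non_acgt / len(seq)


def rejection_reason(
    seq: str,
    flagged_by_chimera_slayer: bool,
    flagged_by_bellerophon: bool,
) -> Optional[str]:
    """Return why a sequence would be dropped, or None if it passes.

    `seq` is assumed to be ungapped; the two flags are hypothetical
    summaries of each chimera checker's verdict.
    """
    if len(seq) < MIN_LENGTH:
        return "too short"
    if flagged_by_chimera_slayer and flagged_by_bellerophon:
        return "chimeric"
    if non_acgt_fraction(seq) >= MAX_NON_ACGT_FRACTION:
        return "too many non-ACGT bases"
    return None
```

A sequence flagged by only one of the two programs passes this sketch; in the
release it could still be dropped by the class-level rules described above.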
  
OTU picking. Sequences were sorted so as to increase the stability of the
picked representative sequences. The following preferences were used:

1) in the previous release
2) correctly named isolate
3) length

OTUs were picked using QIIME (Caporaso, et al. 2010) and UCLUST (Edgar, 2010) 
where the sequences clustered were unaligned 16S reads. UCLUST was run after 
sorting by the above criteria.
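
The sort order above can be expressed as a compound sort key. The Python
sketch below is illustrative only and is not taken from the build code; the
CandidateSequence record and its field names are hypothetical, and "length"
is read as a preference for longer sequences, as in the file listing.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CandidateSequence:
    """Hypothetical record holding the three attributes used for sorting."""
    seq_id: str
    in_previous_release: bool
    is_named_isolate: bool
    length: int


def seed_priority(record: CandidateSequence) -> Tuple[bool, bool, int]:
    """Sort key: smaller tuples sort first, so preferred traits map to False
    and longer sequences map to more negative numbers."""
    return (
        not record.in_previous_release,   # 1) in the previous release
        not record.is_named_isolate,      # 2) correctly named isolate
        -record.length,                   # 3) length, longest first
    )


def sort_for_otu_picking(records: List[CandidateSequence]) -> List[CandidateSequence]:
    """Return the records in the order they would be offered as seeds."""
    return sorted(records, key=seed_priority)
```

Because Python's sort is stable, records that tie on all three criteria keep
their input order.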

Initial tree construction was performed using FastTree (Price, et al 2010) with
the representative sequences from the 99% OTUs. The parameters specified to 
FastTree were -nt -gamma -fastest -no2nd -spr 4, as recommended by FastTree's
author, Morgan Price. A donor taxonomy based on the previous Greengenes 
release was decorated onto the tree. Visual inspection of this tree resulted 
in the identification of additional chimeric sequences. These sequences were 
flagged and dropped from the cleansed sequence set. OTUs were re-picked with
the same criteria as before. Trees were then constructed for both the full set
and the 99% OTU set.
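
For readers who want to reuse the FastTree parameters listed above, the
following Python sketch assembles them into an argument list. The executable
name "FastTree" and the alignment file name in the usage note are assumptions,
not details of the release build.

```python
from typing import List

# Parameters quoted in the text above.
FASTTREE_OPTIONS = ["-nt", "-gamma", "-fastest", "-no2nd", "-spr", "4"]


def fasttree_command(alignment_path: str, executable: str = "FastTree") -> List[str]:
    """Build an argument list for FastTree.

    `executable` is an assumed binary name; adjust it to match the local
    installation. The alignment file is passed as the final argument.
    """
    return [executable, *FASTTREE_OPTIONS, alignment_path]
```

The list can be handed to subprocess.run with stdout redirected to a file,
since FastTree writes the tree to standard output, for example
fasttree_command("rep_set_99_aligned.fasta") with a hypothetical input name.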

Taxonomy curation was performed on the 99% tree and expanded out to the full 
set of sequences.

File listing and descriptions. All files are prefixed with gg_<release> where
the <release> is composed of <year>_<month>. For instance, gg_12_10 refers to 
the Greengenes release for October 2012.
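
The naming convention can be checked programmatically. The Python sketch
below recovers the year and month from a file name; the function name is
hypothetical, and reading the two-digit year as 20xx is an assumption based
on the gg_12_10 example.

```python
import re
from typing import Tuple

# gg_<year>_<month>, followed by "_", "." or the end of the name.
_RELEASE_PREFIX = re.compile(r"^gg_(\d{2})_(\d{2})(?=[_.]|$)")


def parse_release(filename: str) -> Tuple[int, int]:
    """Return (year, month) for a Greengenes release file name."""
    match = _RELEASE_PREFIX.match(filename)
    if match is None:
        raise ValueError(f"not a Greengenes release file name: {filename!r}")
    year = 2000 + int(match.group(1))   # assumption: two-digit years are 20xx
    month = int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {filename!r}")
    return year, month
```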

gg_12_10_00README
    - This file
gg_12_10_00CHANGELOG
    - Important changes since the last release
gg_12_10_00ROADMAP
    - Planned changes and additions including release dates
gg_12_10_00STATS
    - Quick stats on the number of sequences included at various similarity 
        levels
gg_12_10_otus_99_annotated.tree.gz
    - 99% OTU rooted tree with taxonomy decorated
    - Phylogenetic reconstruction performed with FastTree (Price, et al 2010)
    - Base taxonomy decorated with tax2tree (McDonald, et al 2011)
gg_12_10_taxonomy.txt.gz
    - Full taxonomy for every tip
    - All taxonomy strings are strictly 7 level and prefixed
gg_12_10.fasta.gz
    - Unaligned sequences (no bases dropped) corresponding to the tips 
        comprising the full tree
gg_12_10_aligned.fasta.gz
    - Aligned sequences corresponding to the tips comprising the full tree
    - Sequences were aligned with SSU-Align (Nawrocki 2009). As a consequence
        of this software, bases corresponding to structural diversions from 
        SSU-Align's model are removed from the sequence. It is not recommended
        to use this alignment for primer or probe design or any other operation
        where you need access to all contiguous bases in the sequence.
    - Each sequence is represented with 7,682 characters.  Dashes (-) represent
        either missing data, as on the 5' and 3' termini, or an alignment gap,
        as is interspersed throughout the sequence.
gg_12_10_rep99_NOT_SAFE_FOR_PRIMER_OR_PROBE_DESIGN.arb.gz
    - Sequences were aligned with SSU-Align (Nawrocki 2009). As a consequence
        of this software, bases corresponding to structural diversions from 
        SSU-Align's model are removed from the sequence. It is not recommended
        to use this arb file for primer or probe design or any other operation
        where you need access to all contiguous bases in the sequence.
    - One representative gene from each 99% similarity OTU is included.
    - Silva taxonomy decorated on to the same topology included.
gg_12_10_genbank.map.gz
    - A mapping from Greengenes IDs to Genbank accessions
gg_12_10_otus.tgz
    - QIIME (Caporaso, et al 2010) compatible OTUs
    - Sequences are sorted to place seed preference on:
        - in the previous release
        - named isolates
        - longer sequences
    - Clusters are determined using QIIME-wrapped UCLUST (Edgar 2010)
    - Code can be obtained from
        - https://github.com/qiime-dev/nested_reference_otus
        - commit 821a98df6773ea4e4d209af20b9a8cf34d00324
gg_12_10.sql.gz
    - Full Greengenes records
    - This is a mysqldump. The user is 'greengenes' without a password. The 
        database is named 'greengenes'
    - NOTE: This database currently contains sequence information only for
        those records included in the release; however, all examined Genbank
        records that contained alignable 16S are described. This is a work in
        progress, with additional record data and functionality to be added in
        subsequent releases.
gg_12_10_chimeras.txt
    - The current chimera blacklist
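
As a sanity check after download, the aligned FASTA file can be read and each
record confirmed to have the stated 7,682 characters. The Python sketch below
is a minimal example using only the standard library; the function names are
hypothetical, and it assumes a plain gzip-compressed FASTA file in which a
sequence may be wrapped over several lines.

```python
import gzip
from typing import Iterable, Iterator, Tuple

ALIGNED_WIDTH = 7682  # characters per aligned sequence, as stated above


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_alignment_width(path: str, expected_width: int = ALIGNED_WIDTH) -> int:
    """Return the number of records, raising ValueError on a width mismatch."""
    count = 0
    with gzip.open(path, "rt") as handle:
        for header, aligned in read_fasta(handle):
            if len(aligned) != expected_width:
                raise ValueError(
                    f"{header}: {len(aligned)} characters, expected {expected_width}"
                )
            count += 1
    return count


def ungapped_length(aligned: str) -> int:
    """Count non-dash characters (dashes are missing data or alignment gaps)."""
    return len(aligned) - aligned.count("-")
```

For example, check_alignment_width("gg_12_10_aligned.fasta.gz") would return
the number of records if every record has the expected width.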
    
References:

Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, 
    Fierer N, Gonzalez A, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, 
    Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M,
    Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, 
    Zaneveld J and Knight R (2010) QIIME allows analysis of high-throughput 
    community sequencing data. Nature Methods, 2010; doi:10.1038/nmeth.f.303

Price, M.N., Dehal, P.S., and Arkin, A.P. (2010) FastTree 2 -- Approximately 
    Maximum-Likelihood Trees for Large Alignments. PLoS ONE, 5(3):e9490. 
    doi:10.1371/journal.pone.0009490.

McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, 
    Andersen GL, Knight R and Hugenholtz P. (2011). An improved Greengenes 
    taxonomy with explicit ranks for ecological and evolutionary analyses of 
    bacteria and archaea. The ISME Journal 6: 610-618. 
    doi:10.1038/ismej.2011.139. 

Nawrocki, EP. (2009). Structural RNA Homology Search and Alignment Using 
    Covariance Models. PhD Thesis: Washington University School of Medicine

Edgar, R.C. (2010). Search and clustering orders of magnitude faster than 
    BLAST. Bioinformatics. doi:10.1093/bioinformatics/btq461

DeSantis, T. Z., I. Dubosarskiy, S. R. Murray, and G. L. Andersen (2003), 
    Comprehensive aligned sequence construction for automated design of 
    effective probes (CASCADE-P) using 16S rDNA, Bioinformatics, 19(12), 
    1461-1468.

DeSantis, T. Z., P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller,
    T. Huber, D. Dalevi, P. Hu, and G. L. Andersen (2006), Greengenes, a 
    chimera-checked 16S rRNA gene database and workbench compatible with ARB,
    Appl Environ Microbiol, 72(7), 5069-5072.